A geometric inductive bias for transformer reasoning
The central object is a fiber bundle \(\pi: E \to M\) where the base manifold \(M\) is a Poincaré ball \(\mathbb{B}^{64}\) with constant curvature \(\kappa = -1\), and each fiber \(F_x\) is a categorical distribution over \(K=16\) sections with \(P=8\) mixture components.
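Numerically, every base point must remain strictly inside the open unit ball for the hyperbolic metric to stay finite. A minimal sketch of the usual boundary clamp (the function name and the `eps` margin are illustrative assumptions, not taken from the codebase):

```python
import math

def project_to_ball(x, eps=1e-5):
    # Clamp a point into the open Poincare ball (|x| < 1), the base
    # manifold used here; eps keeps a safety margin from the boundary.
    norm = math.sqrt(sum(v * v for v in x))
    max_norm = 1.0 - eps
    if norm >= max_norm:
        return [v * max_norm / norm for v in x]
    return list(x)
```

Manifold libraries such as geoopt apply a similar clamp after each Riemannian update for the same reason.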
Hidden states from transformer layer 12 are projected into this bundle. The projection is a residual perturbation clamped to ≤10% of the base hidden state norm, preserving the pretrained distribution.
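The clamp can be sketched as a norm-ratio rescale of the adapter's residual (pure-Python sketch; `max_ratio=0.10` mirrors the ≤10% bound from the text, the rest is assumed):

```python
import math

def clamped_residual(h, delta, max_ratio=0.10):
    # Add the adapter perturbation to the hidden state, rescaling it so
    # its norm never exceeds max_ratio of the hidden state's own norm.
    h_norm = math.sqrt(sum(x * x for x in h))
    d_norm = math.sqrt(sum(x * x for x in delta))
    cap = max_ratio * h_norm
    if d_norm > cap > 0.0:
        delta = [x * cap / d_norm for x in delta]
    return [a + b for a, b in zip(h, delta)]
```

Because the perturbation is additive and bounded, a freshly initialized adapter leaves the pretrained representation almost unchanged.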
The adapter learns a metric tensor \(g_{ij}\) on \(M\), approximated via the Fisher information matrix of the fiber distributions. Curvature is computed via a log-determinant Laplacian estimator:
\[ K \approx -\frac{1}{2} \nabla^2 \log \det g \]
A curvature loss regularizes toward \(\kappa = -1\). The learned curvature at convergence is \(K = -5.63\) — strongly hyperbolic, overshooting the target but geometrically meaningful.
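The estimator can be approximated with central finite differences; a 2-D sketch (the closed-form 2×2 determinant and the step size `h` are illustrative assumptions):

```python
import math

def log_det_2x2(g):
    # Closed-form log-determinant of a 2x2 metric tensor.
    (a, b), (c, d) = g
    return math.log(a * d - b * c)

def curvature_estimate(metric_at, x, h=1e-3):
    # K ~ -1/2 * Laplacian of log det g, via central differences in
    # each coordinate; metric_at(x) returns the 2x2 metric tensor at x.
    f0 = log_det_2x2(metric_at(x))
    lap = 0.0
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        lap += (log_det_2x2(metric_at(xp)) - 2 * f0
                + log_det_2x2(metric_at(xm))) / (h * h)
    return -0.5 * lap
```

For a flat (identity) metric the estimate is exactly zero, which is a useful sanity check before regularizing toward \(\kappa = -1\).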
Fiber state evolves via a Hamiltonian system with symplectic integration and a Lorentz-factor speed limiter (\(c = 5.0\)):
\[ \dot{q} = \frac{\partial H}{\partial p}, \quad \dot{p} = -\frac{\partial H}{\partial q} \]
The speed limiter prevents gradient explosion: \(v \to v / \sqrt{1 + \|v\|^2 / c^2}\). Combined with the SymplecticSPIDER optimizer, this replaces global gradient clipping, which was found to kill fiber gradients because the 7M curvature parameters dominate the global norm.
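A leapfrog sketch for a separable Hamiltonian \(H = \tfrac{1}{2}\|p\|^2 + V(q)\), with the limiter applied at the momentum half-step (the limiter placement and step size are assumptions; the text specifies only the rescaling rule and \(c = 5.0\)):

```python
import math

C = 5.0  # Lorentz-factor speed limit from the text

def limit_speed(v, c=C):
    # Lorentz-factor rescale: v -> v / sqrt(1 + |v|^2 / c^2),
    # so the rescaled norm is always strictly below c.
    gamma = math.sqrt(1.0 + sum(x * x for x in v) / (c * c))
    return [x / gamma for x in v]

def leapfrog_step(q, p, grad_V, dt=0.01):
    # One symplectic (leapfrog) step: qdot = dH/dp = p, pdot = -dH/dq.
    p_half = [pi - 0.5 * dt * gi for pi, gi in zip(p, grad_V(q))]
    p_half = limit_speed(p_half)
    q_new = [qi + dt * pi for qi, pi in zip(q, p_half)]
    p_new = [pi - 0.5 * dt * gi for pi, gi in zip(p_half, grad_V(q_new))]
    return q_new, p_new
```

The limiter is smooth (unlike hard clipping), so small momenta pass through almost unchanged while large ones saturate below \(c\).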
Optimization follows the natural gradient direction on the statistical manifold:
\[ \theta_{t+1} = \theta_t - \eta \, F^{-1} \nabla_\theta \mathcal{L} \]
This respects the Fisher-Rao metric and converges ~30% faster than Euclidean gradient descent.
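The update reduces to ordinary gradient descent when \(F\) is the identity; a 2-parameter sketch with a closed-form inverse (at scale a damped linear solve of \(F x = \nabla_\theta \mathcal{L}\) would replace the explicit inverse; all names here are illustrative):

```python
def natural_gradient_step(theta, grad, fisher, eta=0.1):
    # theta <- theta - eta * F^{-1} grad, for a 2x2 Fisher matrix F.
    (a, b), (c, d) = fisher
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    nat = [inv[i][0] * grad[0] + inv[i][1] * grad[1] for i in range(2)]
    return [t - eta * g for t, g in zip(theta, nat)]
```

Directions in which the Fisher matrix is large (sharply identifiable parameters) get correspondingly smaller steps.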
Adjacent tokens must agree on their fiber distributions. The Jensen-Shannon divergence between neighboring sections is penalized:
\[ \mathcal{L}_{\text{sheaf}} = \frac{1}{N} \sum_{i} \text{JSD}(\sigma_i \| \sigma_{i+1}) \]
This enforces local-to-global semantic coherence — the "gluing condition" of the sheaf.
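The loss above can be sketched directly from the definition of JSD over categorical sections (pure-Python sketch; function names are illustrative):

```python
import math

def kl(p, q):
    # KL divergence for categorical distributions; 0 * log(0) := 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    # Jensen-Shannon divergence: symmetric and bounded by ln 2.
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def sheaf_loss(sections):
    # Mean JSD between each adjacent pair of token sections.
    pairs = list(zip(sections, sections[1:]))
    return sum(jsd(p, q) for p, q in pairs) / len(pairs)
```

Because JSD is bounded, the penalty cannot blow up even when adjacent sections are fully disjoint.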
At inference time, the GSP reads curvature \(K\) and entropy \(S\) from the adapter and modulates generation parameters (temperature, top_p) via a GENERIC-compliant homeostatic controller. No retraining required.
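A sketch of one possible homeostatic rule. Only the inputs (K, S) and outputs (temperature, top_p) come from the text; the gains, targets, and clamp ranges below are invented for illustration:

```python
def modulate(K, S, K_target=-1.0, S_target=1.0,
             base_temp=0.7, base_top_p=0.9):
    # Hypothetical proportional controller: cool the temperature when
    # fiber entropy S runs high, widen the nucleus when curvature K
    # relaxes toward flat; all gains here are illustrative, not measured.
    temp = base_temp + 0.2 * (S_target - S)
    top_p = base_top_p + 0.05 * (K - K_target)
    return (min(max(temp, 0.1), 1.5), min(max(top_p, 0.5), 1.0))
```

Clamping keeps the sampler in a sane operating range even when the telemetry is far from its target, which is the homeostatic part of the design.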
A biologically-inspired memory system replaces the 3-tier IMD system.
Gradio-based inference UI with real-time telemetry: curvature heatmap, entropy gauge, fiber distribution visualization, thought trace, generation stability metrics. OOM-hardened for 5 GiB VRAM GPUs with dynamic token budgeting.
Verification metrics at convergence:

| Metric | Value | Interpretation |
|---|---|---|
| Curvature K | -5.63 | Strongly hyperbolic |
| Entropy S | 0.95 | Below uniform (ln16 ≈ 2.77), sections specialized |
| Jensen-Shannon Div. | 0.424 | Fibers differ across contexts |
| Parallel Transport | 0.041 | Near-zero holonomy — geometric consistency |
| Faithfulness | 6/6 PASS | All verification tests pass |

Benchmark performance is preserved after adapter training:

| Benchmark | Score | Notes |
|---|---|---|
| ARC-Challenge | 54.86% | Identical to base Qwen 2.5-7B |
| TruthfulQA (MC2) | 64.78% | Strong factual grounding |
| Winogrande | 71.03% | Commonsense reasoning intact |
| GSM8K | 75.51% | Multi-step math preserved |
Entropy was frozen at \(S = \ln(16) \approx 2.77\) for 3000 training steps due to 10 coupled gradient blockages (NaNs in arcosh, per-component Poincaré projection, global gradient clipping killing fiber gradients, etc.). After systematic diagnosis and the Phase A fixes, entropy unfroze and now oscillates in \([0.58, 1.04]\) with mean ≈ 0.85.
13 systematic ablations isolating each geometric component. Key findings:

| Component Removed | Accuracy Drop |
|---|---|
| Euclidean Target (κ=0) | -10.9% |
| No Curvature Loss | -9.5% |
| No Natural Gradients | -8.4% |
| No Sheaf Consistency | -5.6% |
| No Bundle Structure | -4.9% |