Parallax: what I got wrong three times
Parallax v3 is the third honest attempt. v1 was a RAG replacement that hit a ceiling no projector architecture could lift. v2 was a dual-process system with a memory bank that worked best when it collapsed to a single vector. v3 is what I have now. These are notes on getting the project wrong, and on learning more from the failures than I would have from any one version being right.
I have restarted Parallax twice. Each time I thought the previous version was the project. Each time I was wrong about the project. The reason I am writing this is that the lessons learned from being wrong feel more honest than the architecture I ended up with - and I want to remember them when I am inevitably wrong about v3.
v1: Parallax-as-RAG-replacement
v1 was an attempt at speculative decoding through learned retrieval. The backbone was a fine-tuned LFM2.5; I gave it a special token that competed with the next-token vocab. When the model emitted that token, an LSH retrieval index produced a candidate span that got verified in parallel by the same backbone. If the verification accepted, the span got rewritten back to memory in its accepted form - a self-refining store.
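The retrieve-then-verify step can be sketched roughly as follows. This is a minimal greedy variant, not the v1 code: `verify_span`, `toy_next_token`, and the dictionary `store` are hypothetical names, the backbone is replaced by a toy function, and the loop checks positions one at a time where a real system would score the whole span in one parallel forward pass.

```python
from typing import Callable, Dict, List, Sequence


def verify_span(
    context: Sequence[int],
    candidate: Sequence[int],
    next_token: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return the longest prefix of `candidate` that the model agrees with."""
    accepted: List[int] = []
    for tok in candidate:
        # Sequential stand-in for a batched verification pass:
        # ask the model what it would emit after context + accepted so far.
        if next_token(list(context) + accepted) != tok:
            break
        accepted.append(tok)
    return accepted


def toy_next_token(prefix: Sequence[int]) -> int:
    # Hypothetical stand-in for the backbone: always predicts last token + 1.
    return prefix[-1] + 1


# Hypothetical memory store: key -> retrieved span (assumption, not the LSH index).
store: Dict[str, List[int]] = {"k": [4, 5, 9, 10]}

accepted = verify_span([1, 2, 3], store["k"], toy_next_token)
if accepted:
    # Write the span back in its accepted form (the self-refining step).
    store["k"] = accepted
print(accepted)
```

With the toy model above, only the first two candidate tokens survive verification, so the stored span shrinks to the part the backbone actually agreed with.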
Stages 0 through 4 validated the components. Stage 5 closed the first gap, getting the backbone to actually emit the token at the right time, via a full finetune with a discrepancy-conditioned soft target. v6.2 and v8.x closed the second gap, retrieval diversity. By the time I got to gap 3 - relevance - I had run the asymmetric projector ablation across half a dozen configurations.
Every one of them topped out at roughly 25% validation WD-1 accuracy. The result that finally killed v1 was a single line in the handoff doc:
"~25% val WD-1 ceiling regardless of architecture. The backbone hidden states are the bottleneck, not the projection."
Translation: there was no architectural fix. The retrieval lane could not be made better than the representation it was reading from. The token-level interface was the fundamental constraint.
What v1 taught me
- Backbone-sidecar entanglement is a load-bearing decision and I had it wrong. If the sidecar's gradient depends on the backbone, the sidecar inherits all of the backbone's blind spots.
- Discrete retrieval is too coarse. Retrieving spans of tokens forces the model to commit to a chunk before it knows what it needs.
- Token-level interfaces are brittle. Anything that has to compete with the next-token softmax is going to lose, because the next-token softmax is what the model was trained to be good at.
- No test-time learning. The system could not adapt the retrieval index from interaction. It was a static lookup.
v2: PFC + SC dual-process
v2 abandoned retrieval and went all in on continuous-latent prediction. It had two cooperating systems: a PFC (the transformer backbone) trained by cross-entropy, and an SC (a bidirectional attention module over a persistent memory bank) trained by the gradient of an explicit cosine energy.
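For readers who want something concrete, here is a minimal sketch of what a cosine energy and its gradient can look like. The form E(z, h) = 1 - cos(z, h) is an assumption chosen for illustration, as are the names `cosine_energy` and `energy_grad`; the exact energy v2 used may have differed.

```python
import math
from typing import List, Sequence


def _dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def cosine_energy(z: Sequence[float], h: Sequence[float]) -> float:
    """Assumed energy: 1 - cos(z, h). Zero when z and h point the same way."""
    return 1.0 - _dot(z, h) / (math.sqrt(_dot(z, z)) * math.sqrt(_dot(h, h)))


def energy_grad(z: Sequence[float], h: Sequence[float]) -> List[float]:
    """Gradient of cosine_energy with respect to z."""
    nz = math.sqrt(_dot(z, z))
    nh = math.sqrt(_dot(h, h))
    cos = _dot(z, h) / (nz * nh)
    # d cos / dz = h / (|z||h|) - cos * z / |z|^2 ; the energy flips the sign.
    return [-(b / (nz * nh) - cos * a / (nz * nz)) for a, b in zip(z, h)]


z = [1.0, 0.0]          # hypothetical SC proposal
h = [0.0, 1.0]          # hypothetical target hidden state
before = cosine_energy(z, h)
grad = energy_grad(z, h)
z_new = [a - 0.5 * g for a, g in zip(z, grad)]  # one gradient step, lr = 0.5
after = cosine_energy(z_new, h)
print(before, after)
```

A single descent step rotates `z` toward `h` and lowers the energy, which is all "trained by the gradient of the energy" needs to mean in this sketch.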
The fusion mechanism went through three iterations. v2.0 was a scalar blend. v2.1 used per-layer fusion gates, which an ablation showed were behaving as adapters: random SC and trained SC produced the same result, which meant the gates were doing all the work and the SC was decorative. v2.2 was a single shared fusion gate doing a full residual lerp between PFC and SC proposals. That one stuck.
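A gated lerp of this kind can be sketched in a few lines. This assumes a scalar sigmoid gate computed from the concatenated proposals; the real v2.2 gate may have been per-dimension or conditioned differently, and `fuse`, `gate_w`, and `gate_b` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fuse(
    pfc: Sequence[float],
    sc: Sequence[float],
    gate_w: Sequence[float],
    gate_b: float,
) -> Tuple[List[float], float]:
    """Blend two proposals with one shared gate g in (0, 1)."""
    features = list(pfc) + list(sc)  # gate sees both proposals (assumption)
    g = sigmoid(sum(w * f for w, f in zip(gate_w, features)) + gate_b)
    # Full lerp: g = 0 returns the PFC proposal, g = 1 returns the SC proposal.
    fused = [(1.0 - g) * p + g * s for p, s in zip(pfc, sc)]
    return fused, g


fused, g = fuse([1.0, 2.0], [3.0, 4.0], gate_w=[0.0] * 4, gate_b=0.0)
print(fused, g)
```

The v2.1 ablation maps onto this sketch directly: swap `sc` for random vectors and see whether the downstream metric moves. If it does not, the gate parameters are doing the work.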
The v2 result that surprised me most
After enough training, every memory embedding in the SC's bank converged to almost the same vector. All 128 of them, within cosine 0.94 of each other. The 'memory bank' was effectively a single learned residual bias, replicated 128 times. I tried to fix it. I added diversity-preserving losses. I forced spectral spread. Every variant that successfully kept the embeddings diverse performed worse.
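The collapse is easy to check for. A minimal diagnostic, assuming the bank is available as a list of vectors, is the minimum pairwise cosine similarity; `min_pairwise_cosine` and the toy banks below are illustrative, not the actual v2 tooling or data.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def min_pairwise_cosine(bank: Sequence[Sequence[float]]) -> float:
    """Smallest cosine similarity over all pairs of memory slots."""
    return min(cosine(u, v) for u, v in combinations(bank, 2))


# Toy banks (made-up numbers): one nearly collapsed, one spread out.
collapsed_bank = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1]]
diverse_bank = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(min_pairwise_cosine(collapsed_bank) > 0.94)
print(min_pairwise_cosine(diverse_bank) > 0.94)
```

If the minimum stays above a threshold such as the 0.94 quoted above, every slot is effectively the same direction and the bank is behaving as one vector.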
What was actually happening: the SC's value was not in storing diverse content. The SC was learning to live on the PFC's cognitive manifold, providing a single well-aligned bias vector that the gate could selectively apply. The gate was doing the content-specific work; the SC was providing one stable thing for it to apply. Diversity moved the SC off-manifold and the gate could not compensate.
The other v2 surprise: in some ablations, having the SC alone (PPL 22.9) matched having the full oracle context provided in the prompt (PPL 23.2). In one held-out persona, oracle context actually hurt: the gate was so calibrated to the SC's bias that injecting fresh in-context information shifted the PFC into a distribution the gate had not seen. The gate's calibration had become more important than the actual evidence.
M29c: the bridge to v3
The bridge was a single experiment, M29c. I replaced the SC's bidirectional-attention update rule with a state-space-style accumulation.
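As a rough illustration of what "state-space-style accumulation" means in general, here is a diagonal linear recurrence, s_t = decay * s_{t-1} + gain * x_t. This is an assumed textbook form, not the M29c update rule; `accumulate`, `decay`, and `gain` are hypothetical names.

```python
from typing import List, Sequence


def accumulate(
    state: Sequence[float],
    inputs: Sequence[Sequence[float]],
    decay: float = 0.9,
    gain: float = 0.1,
) -> List[float]:
    """Fold a sequence of inputs into a fixed-size state, one step at a time."""
    s = list(state)
    for x in inputs:
        # Old content fades geometrically; new content is mixed in.
        s = [decay * si + gain * xi for si, xi in zip(s, x)]
    return s


final = accumulate([0.0], [[1.0], [1.0], [1.0]], decay=0.5, gain=0.5)
print(final)
```

The state has the same size after three inputs as after three thousand, which is the practical difference from attending over a bank of stored entries.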
M29c-DEQ added iterative depth at test time, with the blend gate feeding back. By the time I had run a few dozen variants of that, the system did not look like 'a dual-process model with a memory bank' anymore. It looked like a small recurrent dynamical system being perturbed by associative content. PFC and SC had collapsed into one thing. The 'memory bank' had become 'the persistent state of a contractive map'. The energy formulation had become attractor-flavoured.
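The "persistent state of a contractive map" picture can be made concrete with a toy fixed-point iteration. This sketch assumes an elementwise map s -> decay * tanh(s) + drive with decay < 1, which is a contraction because tanh is 1-Lipschitz. It leaves out the blend-gate feedback, and `step`, `solve_fixed_point`, and `drive` are hypothetical names rather than anything from M29c-DEQ.

```python
import math
from typing import List, Sequence, Tuple


def step(state: Sequence[float], drive: Sequence[float], decay: float) -> List[float]:
    # One application of the assumed contractive map.
    return [decay * math.tanh(s) + d for s, d in zip(state, drive)]


def solve_fixed_point(
    drive: Sequence[float],
    init: Sequence[float],
    decay: float = 0.5,
    tol: float = 1e-10,
    max_iter: int = 200,
) -> Tuple[List[float], int]:
    """Iterate the map until successive states differ by less than tol."""
    state = list(init)
    for i in range(max_iter):
        new_state = step(state, drive, decay)
        if max(abs(a - b) for a, b in zip(new_state, state)) < tol:
            return new_state, i + 1
        state = new_state
    return state, max_iter


drive = [0.3, -1.0]  # stands in for the associative perturbation
a, iters_a = solve_fixed_point(drive, init=[0.0, 0.0])
b, iters_b = solve_fixed_point(drive, init=[5.0, -5.0])
print(a, iters_a, iters_b)
```

Both starting points land on the same state, which is what makes the attractor framing natural: the input selects the fixed point, and extra test-time iterations only refine how closely the state sits on it.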
v3 is what fell out when I stopped trying to make M29c look like the v2 framing and let it be what it was.
What carried through
Looking back, four things survived all three versions:
- Hidden-state space, not token space. Whatever the system is doing, it is doing it on continuous representations, not over discrete vocabulary.
- A continuous test-time learning surface. The system has internal state that is supposed to update during inference, not just during training.
- Prediction over reference geometry, not over tokens. Every loss was about matching some internal representation to another internal representation, even when the outermost loss was cross-entropy.
- Asymmetric gradient between a fast associative module and a slow backbone. The fast part adapts quickly; the slow part does not get destabilised by the fast part.
Those four bullets are the project, more honestly than any one architecture has been. v1, v2, v3 were three different ways of organising those four commitments. The first two were wrong about the organisation. The third one might also be wrong; I will find out.
Why I am writing this
Because I noticed I was about to forget the lessons. v3 is exciting. It is the cleanest version of the project so far. It has the constant-memory result, the K-DOF observation, the modality-agnostic core. All of those are real. None of them mean v3 is the final form. v1 felt right while I was working on it; so did v2. The version that feels right is exactly the version you should be most cautious about.
The thing I want to remember, when v3 hits whatever its ceiling is: most of what taught me anything came from being wrong specifically. Not from the failure - failure on its own teaches you nothing - but from the diagnostic that explained why the thing was wrong in a way the next version could engineer around.