Memory collapse is a feature

Halfway through training a memory-augmented architecture, I watched the entire memory bank collapse to a single vector. I spent a month trying to fix it. Then I looked at the numbers, realised the collapse was actually the optimum, and unlearned a thing I thought I knew about memory.

There is a class of result that I have started to trust more than the ones I designed for. It is the result where the model does something I would have called a bug, and on inspection turns out to be doing the right thing in a way I did not understand.

v2 of Parallax had a memory bank: 128 embeddings of dimension $d_\text{model}$, attended over by the SC's bidirectional attention head, trained only by the gradient of an energy term. The intuition was the obvious one. Different embeddings would specialise to different concepts; the attention head would route to the relevant ones; over training the bank would partition the input distribution into a useful set of stored slots.

What actually happened: after about 1.12M tokens, all 128 embeddings had drifted to within cosine 0.94 of each other. The attention distribution was nearly uniform. Removing any subset of the embeddings had negligible effect on validation perplexity. The memory bank was, in any meaningful sense, a single learned vector replicated 128 times.
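The two symptoms above (near-identical embeddings, near-uniform attention) are easy to measure. The following is a minimal NumPy sketch, not the Parallax code: the function names, the 64-dimensional toy bank and the synthetic attention matrix are all assumptions made for illustration.

```python
import numpy as np

def pairwise_cosine(bank):
    """Cosine similarity between every pair of rows of `bank`, shape (n_slots, d_model)."""
    norms = np.linalg.norm(bank, axis=1, keepdims=True)
    unit = bank / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def collapse_report(bank, attn):
    """Summarise how collapsed a bank is.

    bank: (n_slots, d_model) embeddings.
    attn: (n_queries, n_slots) attention weights, each row summing to 1.
    """
    sims = pairwise_cosine(bank)
    off_diag = sims[~np.eye(len(bank), dtype=bool)]  # drop self-similarities
    entropy = -(attn * np.log(np.clip(attn, 1e-12, None))).sum(axis=1)
    return {
        "min_cosine": float(off_diag.min()),
        "mean_cosine": float(off_diag.mean()),
        # 1.0 means perfectly uniform attention over the slots
        "attn_entropy_ratio": float(entropy.mean() / np.log(attn.shape[1])),
    }

# Toy stand-in for a collapsed bank: one shared vector plus small noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=64)
collapsed_bank = shared + 0.05 * rng.normal(size=(128, 64))
uniform_attn = np.full((4, 128), 1.0 / 128)

report = collapse_report(collapsed_bank, uniform_attn)
```

A minimum pairwise cosine close to 1 together with an entropy ratio close to 1 is the signature described above: many slots, one effective vector.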

My first reaction was that this was a training failure. My second reaction was that this was actually the optimum.

The fixes that did not fix anything

I did the obvious things. Spectral diversity loss. Orthogonality regularisation. Forced top-k routing instead of soft attention. Re-initialisation halfway through training. Each of them succeeded at the proximate goal: the embeddings stayed diverse, the attention distribution stayed bumpy, the bank looked the way I expected a 'memory' to look.

And every one of those variants performed worse on the actual task. Diversity correlated negatively with benefit. The diverse memory banks were less useful than the collapsed one.
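For concreteness, an orthogonality regulariser of the kind listed above can be sketched as the mean squared off-diagonal cosine similarity of the bank. This is one common form, written here in NumPy as an assumption; it is not necessarily the exact loss used in the v2 ablations, and `orthogonality_penalty` is a hypothetical name.

```python
import numpy as np

def orthogonality_penalty(bank):
    """Mean squared cosine similarity between distinct rows of `bank`.

    0.0 for mutually orthogonal rows, 1.0 when every row points the same way.
    """
    n_slots = len(bank)
    norms = np.linalg.norm(bank, axis=1, keepdims=True)
    unit = bank / np.clip(norms, 1e-12, None)
    gram = unit @ unit.T
    off_diag = gram - np.eye(n_slots)  # the diagonal of `gram` is 1 for unit rows
    return float((off_diag ** 2).sum() / (n_slots * (n_slots - 1)))

orthogonal_bank = np.eye(8)          # eight mutually orthogonal slots
collapsed_bank = np.ones((8, 4))     # eight copies of the same vector

penalty_orthogonal = orthogonality_penalty(orthogonal_bank)
penalty_collapsed = orthogonality_penalty(collapsed_bank)
```

Adding a term like this to the training loss pushes the bank away from the collapsed configuration, which is exactly the pressure that turned out to hurt the task.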

What was actually going on

After staring at the ablation table for too long, the picture got clearer. The SC's value to the system was not in the content of its memory embeddings. It was in providing a stable, well-aligned bias vector that the fusion gate could selectively apply.

The fusion gate was the thing doing the content-specific work. The SC was providing one well-tuned thing for the gate to apply. Diversity was actively harmful because diverse embeddings meant the SC could not provide a stable target: its output kept moving off the PFC's cognitive manifold, and the gate, which was trained against a specific SC distribution, could not compensate.

Once you accept that, the memory-bank framing falls apart. There is no 'storage' going on. There is a learnable shared bias that the rest of the system calibrates against, replicated 128 times because that is the shape the architecture happened to expose.
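The "shared bias" reading can be checked directly: when every slot holds the same vector, a softmax read returns that vector whatever the query is. The sketch below assumes a single-head dot-product read in which the bank serves as both keys and values; the actual SC head may differ, and `read_bank` is a hypothetical name.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def read_bank(queries, bank):
    """Single-head dot-product read; `bank` is used as both keys and values (assumption)."""
    scores = queries @ bank.T / np.sqrt(bank.shape[1])
    return softmax(scores) @ bank

rng = np.random.default_rng(0)
shared = rng.normal(size=16)
bank = np.tile(shared, (128, 1))       # fully collapsed bank: 128 copies of one vector
queries = rng.normal(size=(5, 16))     # five unrelated queries

out = read_bank(queries, bank)
query_independent = np.allclose(out, shared)
```

Every row of `out` equals `shared`: identical keys give identical scores, so the weights are uniform, and a convex combination of identical values is that value. Any content-specific behaviour therefore has to come from whatever consumes the read, which in this architecture is the fusion gate.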

The bigger pattern

The reason this matters beyond the v2 ablation table is that I think the same pattern shows up in other places.

  • Mixture-of-experts with routing collapse. We treat it as a failure mode and add load-balancing losses. Sometimes the collapse is the model telling us the experts were not the right unit of specialisation.
  • Attention heads that all attend to roughly the same place. We prune them or add diversity penalties. Sometimes they are the same head because the task only needs one.
  • Curriculum learning that 'fails' because the model figures out the hard examples directly. We blame the curriculum. Sometimes the model is right and the curriculum was wrong about what was hard.

The unifying thing is: when you put a structure into the architecture and the model collapses it, your default move should not be 'force the structure to stay open'. It should be 'check whether the structure was load-bearing in the first place'.

How this fed into v3

Once memory had stopped being a 'storage' problem, the question changed. What is the SC actually doing? It was responding to perturbations from the PFC's hidden state and proposing a residual update that the gate could fold in. The 'memory' framing was wrong; what I had was a small recurrent module being perturbed by an external signal and producing a contracting update.

M29c, which became the bridge to v3, replaced the SC's attention update with a cumulative, decayed, log1p-compressed accumulation rule. The bank shape disappeared. What was left was a contractive map perturbed by raw attention. That is the v3 inner loop. The memory bank had not been the memory bank for a long time; once I let it be what it was, the architecture started writing itself.
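To make "contractive map perturbed by an external signal" concrete, here is one possible reading of a cumulative, decayed, log1p-compressed rule. It is an illustrative assumption, not the M29c update itself: the names `signed_log1p` and `accumulate`, the decay of 0.9, and the choice to compress the incoming signal rather than the running sum are all hypothetical.

```python
import numpy as np

def signed_log1p(x):
    """Compress magnitude with log1p while keeping the sign."""
    return np.sign(x) * np.log1p(np.abs(x))

def accumulate(state, signal, decay=0.9):
    """One step of a decayed, log1p-compressed running sum (hypothetical form)."""
    return decay * state + signed_log1p(signal)

rng = np.random.default_rng(0)
signals = rng.normal(scale=5.0, size=(200, 8))  # stand-in for the raw attention signal

# Two very different starting states driven by the same signal.
state_a = rng.normal(size=8)
state_b = rng.normal(size=8) + 10.0
initial_gap = np.linalg.norm(state_a - state_b)

for signal in signals:
    state_a = accumulate(state_a, signal)
    state_b = accumulate(state_b, signal)

final_gap = np.linalg.norm(state_a - state_b)
```

With a decay below 1, the signal term cancels in the difference between the two trajectories, so the gap shrinks by exactly the decay factor at every step. The state forgets where it started and depends only on the recent, compressed history of the perturbation, which is the sense of "contractive" used here.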

The thing I keep coming back to is: I lost a month trying to fix the collapse, and the moment I stopped trying to fix it, the project moved forward. The collapse was the most informative result of v2. I was not paying attention because it did not match the story I had built for what the system was supposed to be doing.