/aiman.tech
Note 002 · Parallax · K-DOF

The model stops and ponders

Parallax · CWS · K-DOF · Test-time compute

One observation from training the v3 CWS that I keep showing people. The model autonomously escalates compute on hard tokens, without ever being told what 'hard' means. Five to six times more reasoning iterations on harder lookups, paid for at constant memory.

The single most exciting moment from training the v3 CWS, for me, was looking at a per-token plot of K (the number of reasoning iterations the workspace took) and realising the model was doing something I had not asked for.

I had given it a budget of $K_\text{max}=24$ iterations. The halt criterion was a simple raw-attention momentum readout - settle when the change between successive raw-logit tensors saturates. Easy tokens settle fast, hard tokens take longer. Nothing in the loss told the model to think harder on harder tokens. Nothing told it what 'hard' was. I had just exposed the option.

On synthetic BIND/RECALL tasks, the Mamba-substrate run used ${\sim}18$ of the 24 iterations on hard recall tokens, and ${\sim}7.5$ on easy ones. The split was clean. The model had figured out, on its own, that some lookups need more pondering and others do not.

Reading the user-side framing literally: the model stops and ponders harder problems.

Why this is more than a halt criterion

Halt criteria for adaptive-depth networks are not new. Adaptive Computation Time, PonderNet, the dropping-blocks literature all proposed early-exit or variable-depth mechanisms. Mostly they did not catch on, because the gain on standard benchmarks was small and the implementation overhead was real.

What feels different about the CWS version is the architecture wrapping the halt. The workspace is a recurrent dynamical system being perturbed by experts. The number of iterations is the number of times that dynamical system gets to settle. So the halt is not 'should we exit early' - it is 'has the system found a basin yet'.

When that distinction is the architecture rather than a side mechanism, you get a few things for free. The compute scales with the difficulty of the actual perturbation. There is no externalised chain-of-thought to prompt or budget. Test-time compute is a property of the forward pass, not a feature of how you invoke the model.

The constant-memory part

What makes this practical is the training contract. $K-1$ iterations under no_grad, then one live final step that gets all the gradient. Memory cost is essentially flat in $K$ - I have measured ${\sim}95$ MB at $K=8$ and the same at $K=48$ on small experiments.

$$\text{mem}(K) \approx \text{const}, \quad K \in [8, 48]$$
Empirical: training memory does not grow with K. K controls reasoning depth; gradient depth is a separate axis (one live step).
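As a rough illustration of that contract, the sketch below runs $K-1$ steps under `torch.no_grad()` and one final step with autograd enabled. `ToyWorkspace`, its single linear cell, and the tanh update are hypothetical stand-ins, not the CWS itself. The point is only that the autograd graph covers one step regardless of $K$.

```python
import torch
import torch.nn as nn


class ToyWorkspace(nn.Module):
    """Hypothetical recurrent workspace: one shared update cell."""

    def __init__(self, dim: int = 16) -> None:
        super().__init__()
        self.cell = nn.Linear(dim, dim)

    def step(self, state: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # One settle iteration: update the state, perturbed by the input.
        return torch.tanh(self.cell(state) + x)

    def forward(self, x: torch.Tensor, k: int) -> torch.Tensor:
        if k < 1:
            raise ValueError("k must be at least 1")
        state = torch.zeros_like(x)
        # K-1 iterations with autograd off: no activations are kept
        # for backward, so this part costs no extra training memory.
        with torch.no_grad():
            for _ in range(k - 1):
                state = self.step(state, x)
        # One live step: the only iteration that enters the autograd graph.
        return self.step(state, x)


model = ToyWorkspace(dim=16)
x = torch.randn(4, 16)
for k in (8, 48):
    model.zero_grad()
    out = model(x, k)
    out.sum().backward()  # gradient flows through the final step only
    print(k, model.cell.weight.grad.norm().item())
```

The trade-off is visible in the code: the parameters only receive gradient from the last step, so the earlier iterations shape the state that step sees but are not themselves differentiated through.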

That is what makes 'reasoning depth as a separate axis from parameter capacity' more than a slogan. You can keep the parameter count fixed and let the model spend more or less compute per token at inference; you can increase $K_\text{max}$ during training without paying for it in VRAM. The two axes are now independent.

The honest counterpoint

I want to be careful here. On synthetic recall, K-DOF is real and graded. On TinyStories at 6k steps with $K_\text{max}=12$, the same hypothesis fails. K allocation is essentially uniform across surprisal quantiles. At natural-language scale, in this regime, K-DOF looks more like a binary 'is this a recall event?' signal than a continuous reasoning-depth signal.
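The quantile check itself is simple to state. As a sketch, assuming you have logged a per-token surprisal and the number of iterations used for each token, you sort tokens by surprisal, split them into equal-sized bins, and compare the mean K per bin. The function name and the toy numbers below are made up for illustration and are not measurements from any run.

```python
from statistics import mean
from typing import List, Sequence


def mean_k_by_surprisal_quantile(
    surprisal: Sequence[float],
    k_used: Sequence[int],
    n_bins: int = 4,
) -> List[float]:
    """Mean iterations used per surprisal quantile, lowest surprisal first."""
    if len(surprisal) != len(k_used):
        raise ValueError("surprisal and k_used must have the same length")
    if not 1 <= n_bins <= len(surprisal):
        raise ValueError("n_bins must be between 1 and the number of tokens")
    # Token indices ordered from least to most surprising.
    order = sorted(range(len(surprisal)), key=lambda i: surprisal[i])
    n = len(order)
    profile = []
    for b in range(n_bins):
        lo, hi = b * n // n_bins, (b + 1) * n // n_bins
        profile.append(mean(k_used[i] for i in order[lo:hi]))
    return profile


# Toy inputs only: a graded allocation rises across bins,
# a uniform one stays flat.
toy_surprisal = [0.1, 0.2, 3.0, 4.0]
print(mean_k_by_surprisal_quantile(toy_surprisal, [4, 6, 10, 12], n_bins=2))
print(mean_k_by_surprisal_quantile(toy_surprisal, [6, 6, 6, 6], n_bins=2))
```

A graded K-DOF signal would show up as a profile that climbs from the first bin to the last. A flat profile is what 'essentially uniform' means here.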

There is also a darker reading of the same plot. Strange-attractor dynamics - the oscillation, overshoot, recovery - are partly a stress signal. The CWS uses more K when the substrate is undersized for the lookup. Sometimes 'pondering harder' is 'thrashing because the easy answer is not in this representation'. K-DOF rises and the perplexity does not improve. The model burned compute trying.

Why I am still excited

Because if the gradation does generalise - even partially - then we have a way of getting test-time-compute scaling out of the architecture itself, not out of prompt engineering. Chain-of-thought is a beautiful kludge but it is a kludge. It works because the externalised text gives the model more substrate to pass intermediate state through. The CWS lets you do that internally, with the dynamics of a small recurrent state, at constant memory cost.

And because it was not designed in. The K-DOF behaviour fell out of the architecture. That is the kind of result I trust most - the kind I did not have to work for.