/aiman.tech
Note 001 · Architecture

Parallax: cognition without the token clock

Published · 15 min
Parallax · CWS · Attractor dynamics · DEQ · Architecture

What I have been working on for almost two years across three rewrites. A small recurrent core that thinks by settling, language treated as one of several sensory channels, and reasoning depth detached from the token loop. Notes from the v3 codebase and what I am still trying to figure out.

I keep coming back to a feeling I cannot fully justify. That cognition - the kind that does anything genuinely flexible - is closer to a dynamical system finding a basin than to a feed-forward function getting unrolled deeper.

Modern transformers are extraordinary feed-forward functions. Stack a residual block, stack another, route every token through every layer in lockstep. The architecture has one clock - the token clock - and cognition, if there is any, has to fit inside one forward pass per token. That assumption is so deeply baked into how we build models that it is easy to forget it is an assumption.

Parallax is my long bet against that assumption. It is in v3 now. Took me two restarts to figure out what it actually was.

The shape of it

The current system has a small recurrent core - the Central Workspace (CWS), $h \in \mathbb{R}^{d}$ - and a heterogeneous pool of experts. The CWS does not see raw inputs. Tokens are summarised by experts (an attention probe, a Mamba SISO read, an MLP probe), and the workspace reads from those summaries. At each context update the CWS iterates K times, settling, then a thin motor decoder reads the settled state and emits the next token.

Token steps and CWS steps are decoupled. Quiet, predictable context settles in $K=1$. Surprising context burns more. The CWS does $K$ reasoning steps, not token steps.

[Figure: Parallax data flow, with K reasoning steps decoupled from token steps. Token stream (x_{t-2}, x_{t-1}, x_t; sensory observation) → expert pool (Mamba SISO: token-clocked write; attention probe: RoPE, h-clocked read; MLP probe: stateless, h-only; SIREN: periodic prior) → proposals of shape (B, T, S, d) → Central Workspace (h, cum; K reasoning iterations; halts when ‖raw_k − raw_{k-1}‖ saturates) → motor decoder (2 layers, context ≤ 16, tied LM head), reading the settled h → next token x_{t+1}. Token steps: one per context update. K steps: many per context update. Decode: one per token.]
Fig 01 - End-to-end data flow. Tokens fan into the expert pool, experts feed proposals into the CWS K loop, the settled state is read by a thin motor decoder.
[Interactive figure: CWS topology. Legend: workspace, specialist, proposal, broadcast.]
Fig 02 - CWS topology, interactive. Hover any peripheral to highlight its proposal; the gold pulse is the broadcast back to the rest of the pool.

The inner loop

Two things carry across the K loop: the workspace embedding $h$ and a cumulative raw-attention buffer $\text{cum}$. The update is:

$$\text{raw} = \frac{Q(h)\,K(\text{pool})^\top}{\sqrt{d_h}}$$
Q reads from h; K, V read from the pool of expert outputs. raw is the unnormalised attention logits.
$$\text{cum} \leftarrow \delta \cdot \text{cum} + (1-\delta) + \text{raw}$$
Decay applies to the RAW logits, not weights. cum drifts toward 1, perturbed by raw - a contractive map with input perturbation.
$$w = \frac{\log\!\left(1 + \mathrm{silu}(\text{cum})\right)}{\ln(b)}, \quad h \leftarrow O\!\left(\sum_s w_s \, V_s\right)$$
Weights are log1p-compressed, sublinear, NOT softmax-normalised. Slots compete on absolute energy, not partition. O_proj is the closure inside the loop.
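To make "cum drifts toward 1" concrete, here is a short worked derivation. It holds raw fixed at a constant $r$ and assumes $0 \le \delta < 1$. Both are simplifications for illustration only: in the actual loop raw changes with $h$ at every iteration, so this describes the pull of the map, not the full trajectory.

$$\text{cum}^\ast = \delta\,\text{cum}^\ast + (1-\delta) + r \;\;\Longrightarrow\;\; \text{cum}^\ast = 1 + \frac{r}{1-\delta}$$

$$\text{cum}_{k+1} - \text{cum}^\ast = \delta\,\left(\text{cum}_k - \text{cum}^\ast\right)$$

With $r = 0$ the resting point is exactly 1, and the distance to it shrinks by a factor of $\delta$ per iteration. Because raw enters without a $(1-\delta)$ factor, a persistent logit $r$ moves the resting point by $r/(1-\delta)$, so a small sustained input is amplified when $\delta$ is close to 1.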

Three things matter more than they look. First, decay applies to the raw logits, not the weights. Compressing then decaying destroys the dynamics. Second, no softmax: weights are non-normalised, so slots compete on absolute energy rather than distributing a fixed mass. That is what makes the system an attractor instead of a routing distribution. Third, $\text{cum}$ decays toward 1, not 0. The $+(1-\delta)$ term is a pull toward unit attractor mass - without it the contractive map collapses to zero everywhere.
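The three update equations can be sketched in plain Python to keep the arithmetic visible. This is a minimal illustration, not the Parallax code. Everything here is an assumption: the names `cws_step`, `Wq` and `Wo`, treating `keys` and `values` as already-projected pool slots, using `len(h)` for $d_h$, and the defaults `delta=0.9` and `base=e`.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def cws_step(h, cum, keys, values, Wq, Wo, delta=0.9, base=math.e):
    """One K-loop iteration: raw logits -> decayed cum -> compressed weights -> new h.

    keys/values hold one vector per pool slot (assumed already projected).
    Wq and Wo are hypothetical stand-ins for the Q and O projections.
    """
    d = len(h)
    q = matvec(Wq, h)
    # Unnormalised attention logits, one per slot.
    raw = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    # Decay acts on the accumulated raw logits and pulls them toward 1.
    cum = [delta * c + (1.0 - delta) + r for c, r in zip(cum, raw)]
    # log1p compression; there is no softmax, so the weights need not sum to 1.
    w = [math.log1p(silu(c)) / math.log(base) for c in cum]
    mixed = [sum(w[s] * values[s][j] for s in range(len(values))) for j in range(d)]
    return matvec(Wo, mixed), cum, raw


# Toy usage: two slots, d = 3, a few settling iterations.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
values = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
Wq = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
Wo = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]
h, cum = [0.5, 0.5, 0.5], [1.0, 1.0]
for _ in range(4):
    h, cum, raw = cws_step(h, cum, keys, values, Wq, Wo)
```

Since silu is bounded below by roughly -0.28, the argument of `log1p` stays above -1 and the weights are always defined, whatever the sign of `cum`.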

Cognition is not a forward function with extra layers. It is a contractive map being perturbed and pulled back toward a basin.

Why this is not a DEQ

The CWS is conceptually closest to a deep equilibrium model (DEQ) under associative constraints - a Banach-style contraction toward an input-perturbed regime. But the operating point is not a true fixed point. It is a metastable basin in a strange-attractor landscape. The model overshoots, oscillates, and recovers. Successive iterations do not monotonically reduce $\|\Delta h\|$.

So I do not detect equilibrium by chasing the difference between successive hidden states to zero. I use a momentum readout on the raw logits:

$$m_k = \frac{\|\text{raw}_k - \text{raw}_{k-1}\|}{\|\text{raw}_{k-1}\|}$$
The model halts when this raw-attention momentum saturates. Empirically, that aligns with what the system does naturally - settle into a basin and stop refining.
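A small sketch of how such a readout could drive halting. The text does not define "saturates" precisely, so this assumes one plausible reading: halt once $m_k$ stops changing between consecutive iterations, within a tolerance. The names `raw_momentum` and `halt_index`, the `tol` and `eps` values, and that plateau rule are all hypothetical, not the rule used in Parallax.

```python
import math


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def raw_momentum(raw_k, raw_prev, eps=1e-8):
    """m_k = ||raw_k - raw_{k-1}|| / ||raw_{k-1}||, with eps guarding a zero norm."""
    diff = [a - b for a, b in zip(raw_k, raw_prev)]
    return l2_norm(diff) / (l2_norm(raw_prev) + eps)


def halt_index(raw_history, tol=1e-2):
    """Index of the first iterate whose momentum has plateaued.

    Assumed rule: halt when |m_k - m_{k-1}| < tol. If that never happens,
    return the last index, i.e. the whole K budget was spent.
    """
    if not raw_history:
        raise ValueError("raw_history must contain at least one iterate")
    prev_m = None
    for k in range(1, len(raw_history)):
        m = raw_momentum(raw_history[k], raw_history[k - 1])
        if prev_m is not None and abs(m - prev_m) < tol:
            return k
        prev_m = m
    return len(raw_history) - 1


# Toy usage: an oscillating trajectory never plateaus and uses the full budget.
oscillating = [[1.0, 0.0], [2.0, 0.0], [1.0, 0.0], [3.0, 0.0], [1.0, 0.0]]
k_stop = halt_index(oscillating)
```

Testing for a plateau in $m_k$ rather than for $m_k$ falling below a fixed threshold fits the observation that the trajectory is not monotone: an early dip in momentum does not by itself trigger a halt.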

The training contract makes the rest of the architecture pay rent. K-1 frozen iterations under no_grad, then one live final step that receives all the gradient. Reasoning depth and gradient depth decouple. Memory cost is constant in K.

[Figure: training contract, shown for nine steps. Steps 1-8 run under no_grad; step 9 is live and is the only step the gradient from the loss flows through. Time axis: K → K_max. K-1 frozen + 1 live = O(1) memory in K.]
Fig 03 - The training contract. Only the final step participates in autograd; everything before it is frozen. Memory does not grow with K.
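The contract is easy to state in PyTorch. The sketch below assumes a toy recurrent update, `ToyCore`, as a hypothetical stand-in for the CWS step. It is not the Parallax module, and `settle` is an illustrative name. The point is only the control flow: K-1 iterations inside `torch.no_grad()`, then one recorded step.

```python
import torch
import torch.nn as nn


class ToyCore(nn.Module):
    """Hypothetical stand-in for one CWS update; not the Parallax module."""

    def __init__(self, d: int):
        super().__init__()
        self.lin = nn.Linear(d, d)

    def forward(self, h: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.lin(h) + x)


def settle(core: nn.Module, h0: torch.Tensor, x: torch.Tensor, K: int) -> torch.Tensor:
    """Run K iterations; only the last one is recorded by autograd."""
    h = h0
    with torch.no_grad():
        for _ in range(K - 1):
            h = core(h, x)  # frozen: no graph is stored for these steps
    return core(h, x)  # live: the single step that receives gradient


torch.manual_seed(0)
core = ToyCore(d=8)
x = torch.randn(4, 8)
h0 = torch.zeros(4, 8)

out = settle(core, h0, x, K=9)
loss = out.pow(2).mean()
loss.backward()  # activation memory is that of one step, whatever K is
```

In this sketch the parameters receive the gradient of a single step evaluated at the settled state. It is a one-step truncated gradient, not backpropagation through the whole K loop, which is why activation memory stays constant as K grows.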

Strange-attractor dynamics are emergent

The dynamics that I find most interesting - the model overshooting, oscillating, then quenching - are not built into the architecture. They are something the model learns to do when you give it room.

Cap K at 4 during training and the CWS contracts monotonically. Cap K at 24 and it discovers oscillatory, recovery-and-overshoot trajectories. Both setups train, but the K=24 version has $5\text{-}6\times$ richer dynamics. The richness is emergent from K-budget, not from the equations.

I want to be honest about one caveat. Strange-attractor signatures are partly a stress signal. They show up most when the substrate is at the edge of its capacity - the model thrashing because the lookup is hard. They mellow out as the binding circuit consolidates. So 'rich dynamics' does not unconditionally mean 'doing more cognition'. Sometimes it means 'pondering harder because the easy answer is not there'. That is still useful, but it is not the same thing.

Stops and ponders

The single observation I find most exciting is also the simplest. The model autonomously escalates compute when context is hard. With K-max=24, the Mamba-substrate run uses ~18 of those iterations on hard tokens and ~7.5 on easy ones. The halt criterion correctly identifies 'this token is hard' and burns compute trying. The colloquial framing maps onto it cleanly: the model stops and ponders harder problems. I have a separate post just on this finding.

Language is just one expert

There is nothing in the CWS that knows about language. The pool of experts can include any operator that produces a per-slot tensor of shape $(B, T, n_\text{slots}, d)$. Language is one such operator: an expert that runs RoPE attention or Mamba SSM over the token stream and exposes summaries to the workspace. Audio, vision, action are siblings, not afterthoughts. The decoder is a thin motor adapter on the output side.
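The shape contract can be written down as a tiny check. This sketch assumes the pool is formed by concatenating expert outputs along the slot axis. That is an assumption made for illustration, as are the name `pooled_shape` and the example sizes. The text only fixes the per-expert shape.

```python
from typing import Sequence, Tuple

Shape = Tuple[int, int, int, int]  # (B, T, n_slots, d)


def pooled_shape(expert_shapes: Sequence[Shape]) -> Shape:
    """Shape of the pool if expert proposals are concatenated along the slot axis.

    Experts may contribute different numbers of slots, but they must agree
    on batch size B, sequence length T and width d.
    """
    if not expert_shapes:
        raise ValueError("need at least one expert")
    B, T, _, d = expert_shapes[0]
    total_slots = 0
    for b, t, s, dd in expert_shapes:
        if (b, t, dd) != (B, T, d):
            raise ValueError(f"expert shape {(b, t, s, dd)} disagrees on B, T or d")
        total_slots += s
    return (B, T, total_slots, d)


# Hypothetical example: a language expert with 4 slots and an audio expert with 1.
pool = pooled_shape([(2, 16, 4, 64), (2, 16, 1, 64)])
```

Under this reading, adding a modality means adding slots. The workspace only ever sees a `(B, T, S, d)` pool and needs no knowledge of which expert produced which slot.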

I keep the decoder deliberately small - 1-2 layers, context $\le 16$, tied LM head. If you make it bigger, it learns to model n-grams locally and starves the CWS of gradient signal. The design rule is: surface continuity belongs to the decoder; cognition belongs to the workspace.

An honest account of where it stands

On synthetic BIND/RECALL tasks, K-DOF is real and graded. More compute on harder lookups. On TinyStories at 6k steps with K-max=12, the same hypothesis fails - K allocation is essentially uniform across surprisal quantiles. K-DOF in the language regime so far looks more like a binary 'is this a recall event?' signal than a continuous reasoning-depth signal.
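For readers who want to run the same check on their own model, here is one way a "K allocation across surprisal quantiles" comparison could be computed. The function name `mean_k_by_quantile`, the equal-count binning, and the toy numbers are illustrative assumptions. The numbers are not measurements from Parallax.

```python
def mean_k_by_quantile(surprisal, k_used, n_bins=4):
    """Average halting depth K within equal-count bins of per-token surprisal.

    surprisal: per-token surprisal values (e.g. -log p of the observed token).
    k_used:    number of CWS iterations spent on each token.
    Returns one mean per bin, ordered from lowest to highest surprisal.
    """
    if len(surprisal) != len(k_used):
        raise ValueError("surprisal and k_used must have the same length")
    pairs = sorted(zip(surprisal, k_used))
    n = len(pairs)
    means = []
    for b in range(n_bins):
        lo, hi = b * n // n_bins, (b + 1) * n // n_bins
        ks = [k for _, k in pairs[lo:hi]]
        means.append(sum(ks) / len(ks) if ks else float("nan"))
    return means


# Made-up toy values, only to show the call.
toy_means = mean_k_by_quantile([0.1, 0.2, 3.0, 4.0], [2, 4, 10, 12], n_bins=2)
```

A graded signal would show means rising steadily across bins. Uniform allocation shows a flat profile, and a binary recall-event signal shows a step.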

On raw perplexity, the current CWS is outperformed by a vanilla 2-layer transformer at $2\times$ the parameters. So the 'replaces ResNet' framing is too strong on present evidence. The honest claim is: CWS is an alternative architectural primitive with a memory/depth tradeoff residual networks do not have, plus a difficulty-conditioned dynamics signal that might enable test-time-compute scaling without explicit chain-of-thought prompts. Whether that survives 100M+ parameters and 1B+ tokens is the question I cannot answer on a consumer GPU.

Why I keep working on it anyway

Because the alternative is to keep adding layers. Because the only test-time-compute story we have for transformers right now is chain-of-thought, which is a beautiful kludge but a kludge. Because dynamical-system thinking is older than the transformer paradigm and most of the foundational work in cognitive science assumes it; I would like to know what happens if you take that seriously as engineering.

And because every time I have tried to give up on it I find one more thing that surprises me. Memory collapse turning out to be the optimum. K-DOF emerging from training rather than being designed in. The fact that the constant-memory training contract works at all. None of these were predictions. All of them are reasons to keep going.

I will write more as I find more. There are separate notes on the v1 → v2 → v3 journey, on the 'stops and ponders' finding in detail, and on a few of the things that surprised me along the way. Code is private for now. If any of this resonates, the email is in the footer.