/aiman.tech

Parallax

A Central Workspace - a continuous-state dynamical system that settles into a metastable regime in a strange-attractor landscape, perturbed by a heterogeneous pool of experts. Cognition decoupled from the token loop. Reasoning depth at O(1) memory.

PyTorch · CUDA · Custom kernels · CWS · DEQ-flavoured
01

What Parallax actually is

Parallax v3 is a research codebase exploring the Central Workspace (CWS) - a modality-agnostic cognitive core inspired by global workspace theory. The CWS is not a language model. It is a continuous-state dynamical system whose only job is to settle into a metastable regime in a strange-attractor landscape that is shaped by associative perturbations from a heterogeneous expert pool.

Language is treated as a sensory observation. Tokens are summarised by experts at each context update; those summaries perturb the CWS attractor; a thin motor decoder reads the settled CWS state and emits the next token. Token steps and CWS steps are decoupled - the CWS does K reasoning steps, not token steps. Quiet, predictable context settles in K=1; surprising context burns more.

This is the bet. If reasoning is settling - if cognition is the dynamics of a small recurrent state being perturbed and pulled back toward a basin - then you do not need a residual stack of fifty layers to do it. You need a small core that knows how to ponder, and a heterogeneous pool of experts whose only job is to shape the basin.

02

The topology

Fig 02 is interactive. The gold core is the workspace - a small recurrent state that the CWS iterates over for K steps per context update. Around it, the mint specialists are experts: anything producing a per-slot tensor (an attention probe, a Mamba SISO read, an MLP probe, a SIREN). The CWS does not see raw tokens - only expert outputs. Position lives in experts; modality lives in encoders, experts and decoders. The CWS itself is permutation-invariant over slots.
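
Permutation invariance over slots follows from reading the pool with a per-slot weight and a plain sum. The following is a minimal sketch in plain Python, assuming a single head and the same log1p(silu(·)) compression the inner loop uses, applied here directly to one step's logits. The name read_slots, the base value, and the shapes are illustrative and not taken from the Parallax code.

```python
import math
import random

def silu(x):
    return x / (1.0 + math.exp(-x))

def read_slots(q, keys, values, base=math.e):
    """Unnormalised weighted read over slots: h = sum_s w_s * v_s."""
    dh = len(q)
    weights = []
    for k in keys:
        raw = sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dh)
        weights.append(math.log1p(silu(raw)) / math.log(base))
    d = len(values[0])
    # Each weight depends only on its own slot, and the sum ignores order.
    return [math.fsum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]

random.seed(0)
S, d = 5, 4  # hypothetical slot count and width
q = [random.gauss(0, 1) for _ in range(d)]
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(S)]
values = [[random.gauss(0, 1) for _ in range(d)] for _ in range(S)]

perm = list(range(S))
random.shuffle(perm)

h = read_slots(q, keys, values)
h_perm = read_slots(q, [keys[i] for i in perm], [values[i] for i in perm])
```

Shuffling keys and values together leaves h unchanged, which is why any positional information has to arrive inside the expert outputs.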

[Figure: data flow. Token stream (x_t-2, x_t-1, x_t; sensory observation) → expert pool (Mamba SISO: token-clocked write; Attn probe: RoPE, h-clocked read; MLP probe: stateless, h-only; SIREN: periodic prior) → proposals (B,T,S,d) → Central Workspace (h, cum; K reasoning iterations; halt when ‖raw_k − raw_{k-1}‖ saturates) → settled h → motor decoder (2 layers, ctx ≤ 16, tied LM head) → output x_t+1, the next token. Clocks: token steps, one per context update; K steps, many per context update; decode, one per token. K reasoning steps decoupled from token steps.]
Fig 01 - End-to-end data flow. Token steps and K steps run on different clocks.
[Figure: CWS topology, interactive. Legend: Workspace · Specialist · Proposal · Broadcast.]
Fig 02 - CWS topology, interactive. Hover any peripheral to highlight its proposal; the gold pulse is the broadcast back.
03

The inner loop

At each of the K iterations the workspace reads from the experts, integrates, and updates. Two pieces of state carry across the loop: the workspace embedding h and a cumulative raw-attention buffer cum. Pseudo-code, with the load-bearing details:

h_eff = RMSNorm(h)
pool   = RMSNorm(concat(e(h_eff, x?) for e in experts))     # (B,T,S,d)
q      = Q(h_eff).view(H, dh)
k, v   = K(pool), V(pool)
raw    = q · kᵀ / sqrt(dh)                  # UNNORMALISED logits
cum   ← decay·cum + (1 - decay) + raw       # contractive update on raw
weights = log1p(silu(cum)) / ln(base)        # sublinear, ≥ 0 when cum ≥ 0, no softmax
h     ← O_proj( Σ_s weights_s · v_s )       # closure inside loop

Three details matter more than they look. First, decay applies to the raw logits, not the weights - compressing-then-decaying destroys the dynamics. Second, there is no softmax over slots: weights are non-normalised, so slots compete on absolute energy rather than for shares of a partition that sums to one. Third, cum decays toward 1, not 0 - the + (1 − decay) term is a pull toward unit attractor mass, without which the contractive map is degenerate: with raw = 0 the buffer would relax to 0, where log1p(silu(0)) = 0 and every slot weight vanishes.
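
The behaviour of the cum update can be checked numerically. Below is a minimal sketch in plain Python that treats each slot as a scalar and leaves out the projections and normalisation. The decay and base values are illustrative assumptions, not values from the Parallax code. For a constant logit r the recurrence has the fixed point cum* = 1 + r / (1 − decay), which reduces to 1 when r = 0.

```python
import math

def silu(x):
    return x / (1.0 + math.exp(-x))

def step_cum(cum, raw, decay):
    # Decay acts on the raw-logit accumulator; (1 - decay) pulls it toward 1.
    return [decay * c + (1.0 - decay) + r for c, r in zip(cum, raw)]

def compress(cum, base):
    # Sublinear compression with no softmax: each slot is weighted on its own.
    return [math.log1p(silu(c)) / math.log(base) for c in cum]

decay, base = 0.9, 2.0  # illustrative values only

# With no input (raw = 0) the buffer relaxes to 1 from any starting point.
rest = [5.0, -3.0, 0.0]
for _ in range(200):
    rest = step_cum(rest, [0.0, 0.0, 0.0], decay)

# With a constant logit r the buffer settles at 1 + r / (1 - decay).
r = 0.05
driven = [1.0]
for _ in range(500):
    driven = step_cum(driven, [r], decay)

weights_at_rest = compress(rest, base)
weight_at_zero = compress([0.0], base)[0]
```

At rest the three weights are equal and positive but do not sum to one, and a buffer sitting at 0 would give a weight of exactly 0.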

04

Why this is not DEQ

The CWS is conceptually closest to a DEQ under associative constraints (a Banach-style contraction toward an input-perturbed regime), but the operating point is not a true fixed point. It is a metastable basin in a strange-attractor landscape. Equilibrium is detected via a raw-momentum readout - not by chasing the difference between successive hidden states to zero.
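
One way to read "raw-momentum readout" is sketched below in plain Python. It assumes the momentum is ‖raw_k − raw_{k-1}‖ and that "saturates" means this quantity stops changing by more than a tolerance between successive iterations. That reading, the tolerance, and the toy damped-rotation step are all assumptions for illustration; the actual Parallax halt rule is not specified here. The sketch needs at least three iterations before it can halt, so it does not reproduce the K=1 case.

```python
import math

def momentum(raw, prev_raw):
    """Euclidean norm of the change in the raw readout between two iterations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(raw, prev_raw)))

def settle(step_fn, state, k_max=24, tol=1e-3):
    """Iterate step_fn until the momentum stops changing by more than tol,
    or k_max is reached. Returns (state, steps_used)."""
    prev_raw = None
    prev_m = None
    for k in range(1, k_max + 1):
        state, raw = step_fn(state)
        if prev_raw is not None:
            m = momentum(raw, prev_raw)
            if prev_m is not None and abs(m - prev_m) < tol:
                return state, k
            prev_m = m
        prev_raw = raw
    return state, k_max

def make_toy_step(rho=0.7, theta=0.5):
    """Hypothetical stand-in for one workspace iteration: a damped 2-D
    rotation about the origin. The state doubles as the raw readout."""
    c, s = math.cos(theta), math.sin(theta)
    def step_fn(state):
        x, y = state
        new = (rho * (c * x - s * y), rho * (s * x + c * y))
        return new, new
    return step_fn

step_fn = make_toy_step()
_, easy_steps = settle(step_fn, (0.01, 0.0))  # small perturbation
_, hard_steps = settle(step_fn, (1.0, 0.0))   # large perturbation
print(easy_steps, hard_steps)
```

In this toy system a larger perturbation takes more iterations before the momentum flattens, so the same rule spends more steps on it without any per-input budget being set.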

The training contract takes advantage of this. K-1 frozen iterations under no_grad, then one live step that receives all of the gradient. Reasoning depth and gradient depth decouple. Empirically, training memory is constant in K from K=8 to K=48 - the CWS gets you K=48 reasoning depth at the same VRAM cost as a 12-layer transformer. Reasoning depth and parameter capacity become independent axes. That is the headline finding, and the one that earns the "could replace residual networks" claim - if it survives scale.
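
The contract can be illustrated on a scalar recurrence. The sketch below uses plain Python and a hand-written derivative so that it runs without any framework; in PyTorch the frozen phase would sit under torch.no_grad(). The recurrence h ← tanh(w·h + x), the squared-error loss, and all names are assumptions made for illustration, not the Parallax update.

```python
import math

def step(h, w, x):
    # Hypothetical one-parameter workspace update.
    return math.tanh(w * h + x)

def one_live_step_grad(w, x, target, k, h0=0.0):
    """Run k-1 frozen iterations, then one live step.
    Returns (loss, dloss/dw), with the frozen h treated as a constant."""
    # Frozen phase: only the latest h is kept, whatever k is.
    h = h0
    for _ in range(k - 1):
        h = step(h, w, x)
    # Live phase: the only step that contributes to the gradient.
    h_live = step(h, w, x)
    loss = 0.5 * (h_live - target) ** 2
    # d/dw tanh(w*h + x) = (1 - tanh^2) * h, with h held fixed.
    dloss_dw = (h_live - target) * (1.0 - h_live ** 2) * h
    return loss, dloss_dw

loss_8, grad_8 = one_live_step_grad(w=0.8, x=0.3, target=0.5, k=8)
loss_48, grad_48 = one_live_step_grad(w=0.8, x=0.3, target=0.5, k=48)
```

The frozen loop stores a single value of h regardless of k, which is the sense in which memory stays flat as K grows. The price is that the gradient ignores how w shaped the earlier iterations.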

[Figure: training contract, drawn with nine steps. Steps 1 to 8 run under no_grad (the K-1 frozen iterations); step 9 is the one live step, where the gradient flows from the loss. Time axis: K → K_max. K-1 frozen + 1 live = O(1) memory in K.]
Fig 03 - Training contract. Only the final step is live; everything before is frozen under no_grad. Memory does not grow with K.
05

What the dynamics actually look like

Two findings from sweeping over K-budgets are worth carrying around. First: strange-attractor dynamics are emergent from K-budget, not built in. Cap K at 4 and the CWS contracts monotonically; lift K to 24 and it learns to oscillate, overshoot, and quench - the model uses the room when you give it the room.

Second: the model autonomously escalates compute on hard tokens. Mamba-substrate runs with K-max=24 use ~18 of those iterations on hard tokens and ~7.5 on easy ones. The halt criterion correctly identifies "this token is hard" and burns compute trying. The colloquial framing maps cleanly: the model stops and ponders harder problems.

Honest counter-observation: rich settling dynamics can mean "doing genuinely interesting work" or "thrashing because the substrate is undersized for the task". Strange-attractor signatures are partly a stress signal. They are most pronounced during learning; they mellow as the binding circuit consolidates.

06

Where this came from

[Figure: three honest restarts. What carried forward · what was killed.]

v1 · RAG replacement · 2024
Core idea: Speculative decoding through learned LSH retrieval. <RECALL> token competing with the next-token softmax.
What killed it: ~25% val WD-1 ceiling regardless of architecture.
Carried forward: Asymmetric gradient · backbone vs sidecar.

v2 · PFC + SC · 2025
Core idea: Two cooperating systems: transformer PFC trained by CE, attention-over-memory SC trained only by cosine energy ∇.
What killed it: Memory bank collapsed to a single vector. Diversity-preserving fixes all hurt.
Carried forward: cum = δ·cum + new_logits · log1p compression.

v3 · CWS · 2026 · current
Core idea: Continuous-state dynamical system. K reasoning steps. Tokens as sensory observations. Modality-agnostic core.
Open question: TBD - "replaces ResNet" framing too strong on present evidence.
Carried forward: O(1) memory in K · K-DOF emergence.
Fig 04 - Three honest restarts. v1 was a RAG replacement; v2 a PFC + SC dual-process; v3 the CWS as a cognitive core.

v3 is the third honest restart. v1 was a RAG replacement - speculative decoding through a learned LSH retrieval lane and a special <RECALL> token. It hit a structural ceiling that no projector architecture could lift, and the lessons were: the backbone and sidecar were entangled, discrete retrieval was too coarse, the token-level interface was brittle, and there was no test-time learning.

v2 abandoned retrieval entirely. Two cooperating systems - a PFC transformer trained by cross-entropy and an SC bidirectional attention module trained only by an explicit cosine energy between the two. The most surprising v2 result, and the one I still think about: the SC's memory bank collapsed to a single vector under training, with all 128 embeddings at a pairwise cosine similarity of at least 0.94 - and that turned out to be the operating point that worked best. Diversity actively hurt.

v3 took the late-v2 token-level state-space SC update - cum = decay·cum + new_logits, weights via log1p(silu(cum)) - and made it the centre of the system. The PFC went away. Tokens became sensory observations. The decoder shrank to a thin motor adapter. The cosine energy turned into an attractor-flavoured contraction with momentum-saturation halt. By that point the project had stopped being a memory mechanism and started being a candidate replacement for the residual stack.

07

What is honest right now

On synthetic BIND/RECALL, K-DOF is real and graded - more compute spent on harder lookups. On TinyStories at 6k steps with K-max=12 the same hypothesis fails: K allocation is essentially uniform across surprisal quantiles. The most recent honest writeup says "K-DOF is a structural-recognition signal (binary 'is this a recall?'), not a continuous reasoning-depth signal" at this scale.
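
The comparison across surprisal quantiles can be set up as below. This is a minimal sketch in plain Python, assuming per-token surprisal and per-token K are already logged as two equal-length lists. The function name, the bin count, and the toy inputs are hypothetical; the numbers are made up to show the shape of the output and are not results.

```python
import statistics

def mean_k_by_surprisal_quantile(surprisal, k_used, n_bins=4):
    """Sort tokens by surprisal, split them into n_bins equal-count bins,
    and return the mean K spent in each bin, lowest surprisal first."""
    if len(surprisal) != len(k_used):
        raise ValueError("surprisal and k_used must have the same length")
    n = len(surprisal)
    if n < n_bins:
        raise ValueError("need at least one token per bin")
    order = sorted(range(n), key=lambda i: surprisal[i])
    means = []
    for b in range(n_bins):
        lo, hi = b * n // n_bins, (b + 1) * n // n_bins
        means.append(statistics.fmean(k_used[i] for i in order[lo:hi]))
    return means

# Toy inputs only, not measurements.
toy_surprisal = [0.1, 0.2, 0.9, 1.1, 2.0, 2.4, 4.5, 5.0]
toy_graded_k = [1, 1, 2, 2, 4, 4, 8, 8]   # K rises with surprisal
toy_uniform_k = [6, 6, 6, 6, 6, 6, 6, 6]  # K ignores surprisal

graded = mean_k_by_surprisal_quantile(toy_surprisal, toy_graded_k)
uniform = mean_k_by_surprisal_quantile(toy_surprisal, toy_uniform_k)
```

A graded allocation shows up as per-bin means that rise from the first bin to the last; a uniform allocation gives the same mean in every bin.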

On raw perplexity, the current CWS is outperformed by a vanilla 2-layer transformer with double the parameters. So: the "replaces ResNet" framing is too strong on present evidence. The honest claim is that the CWS is an alternative architectural primitive with a memory/depth tradeoff residual networks do not have, plus a difficulty-conditioned dynamics signal that might enable test-time-compute scaling without explicit chain-of-thought prompts. Whether that survives 100M+ parameters and 1B+ tokens is the next question I cannot answer on a consumer GPU.

I am keeping the project public-by-notes for now and the v3 code private until the headline either survives scale or fails it honestly. Either way I will write up what I find.
