
Continual learning is the actual hard part

Published · 2 Mar 2026 · 9 min

Continual learning · Adaptive cognition · Open problems

Static benchmarks made us forget that the interesting question was never whether a model can do a task. It was whether a model can learn a new task without forgetting the old one. Here is why I think this is the cliff modern ML is about to walk off, and what would have to change.

Modern ML is mostly about training a model once, freezing it, and then evaluating it across a battery of tasks. That setup made sense as long as the goal was to ship a useful function. But it has a side effect that we rarely admit out loud. It builds in a deep assumption: the model never has to learn anything else.

If you remove that assumption - if you ask a system to keep learning, on the fly, without losing what it already knew - almost everything we love about modern ML breaks.

The catastrophic-forgetting wall

Train a network on task A. It does well. Now train it on task B without showing it any more of A. Performance on A collapses. Catastrophic forgetting has been documented since the late 1980s, and we have not solved it. We have only learned to avoid it by not asking: pretrain massively, then freeze, then deploy.
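
A toy sketch makes the effect concrete. It assumes a two-weight linear model trained with plain SGD, and two hypothetical tasks (`task_a`, `task_b`) that a single weight setting could solve together. None of these names or numbers come from a real experiment. Trained one after the other, the model ends up fitting B and losing A. Trained on both together, it fits both.

```python
def sgd(w, data, lr=0.1, epochs=500):
    """Plain SGD on squared error for the linear model y = w[0]*x[0] + w[1]*x[1]."""
    w = list(w)  # copy so the caller's weights are not modified
    for _ in range(epochs):
        for x, y in data:
            err = w[0] * x[0] + w[1] * x[1] - y
            w[0] -= lr * 2 * err * x[0]
            w[1] -= lr * 2 * err * x[1]
    return w


def mse(w, data):
    """Mean squared error of the linear model on a list of (x, y) pairs."""
    return sum((w[0] * x[0] + w[1] * x[1] - y) ** 2 for x, y in data) / len(data)


# Hypothetical tasks. The weights (2, -2) solve both, so they do not conflict.
task_a = [((1.0, 0.5), 1.0)]
task_b = [((0.5, 1.0), -1.0)]

# Sequential training: A first, then B with no further exposure to A.
w = sgd([0.0, 0.0], task_a)
err_a_before = mse(w, task_a)   # near zero: A has been learned
w = sgd(w, task_b)
err_a_after = mse(w, task_a)    # large: A has been overwritten
err_b_after = mse(w, task_b)    # near zero: B has been learned

# Joint training on both tasks, for contrast.
w_joint = sgd([0.0, 0.0], task_a + task_b)
err_joint = mse(w_joint, task_a + task_b)

print("A before B:", err_a_before)
print("A after B: ", err_a_after)
print("B after B: ", err_b_after)
print("joint:     ", err_joint)
```

The joint run is there to show that the forgetting in this toy comes from the order of training, not from any conflict between the two tasks.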

This works as long as the world we deploy into looks enough like the pretraining distribution. The moment it does not, you have two options. Refuse to update, and grow stale. Update, and lose what you had.

A static intelligence is a contradiction in terms. The whole point of intelligence is that it adapts.

The benchmarks lie

Most leaderboards measure single-task or fixed-suite performance. They do not measure forgetting. They do not measure how much new information a model can absorb without rewriting old information. They do not measure how the model behaves when its environment shifts.
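
One way to put a number on what leaderboards skip, sketched under assumptions: record an accuracy matrix `R` where `R[i][j]` is the accuracy on task `j` after training on tasks `0..i`, then compare each task's best earlier score with its final score. This follows a definition of average forgetting that is common in the continual-learning literature. The matrix values below are made up for illustration.

```python
def average_accuracy(R):
    """Mean accuracy over all tasks after the final training stage."""
    final = R[-1]
    return sum(final) / len(final)


def average_forgetting(R):
    """Mean drop from each task's best earlier accuracy to its final accuracy.

    Only tasks learned before the final stage are counted, since the last
    task has had no chance to be forgotten yet.
    """
    T = len(R)
    drops = []
    for j in range(T - 1):
        best_earlier = max(R[i][j] for i in range(j, T - 1))
        drops.append(best_earlier - R[-1][j])
    return sum(drops) / len(drops)


# Hypothetical accuracy matrix for three tasks learned in sequence.
R = [
    [0.95, 0.10, 0.12],  # after training on task 0
    [0.60, 0.93, 0.11],  # after training on task 1
    [0.40, 0.70, 0.94],  # after training on task 2
]

print("average accuracy:  ", average_accuracy(R))
print("average forgetting:", average_forgetting(R))
```

A leaderboard that reports only the final row hides the second number entirely.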

If you accept the framing that intelligence is a fixed function, those metrics are fine. If you reject the framing - if you think intelligence has something to do with the ability to keep learning - then they are measuring the wrong thing entirely.

Active and adaptive cognition

The phrase I keep using is active, adaptive cognition. Active because the system is not waiting passively for a query - it is making predictions, testing them, updating its model of the world from the discrepancy. Adaptive because its representations and even its computation can shift in response to new regularities.
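
As a minimal illustration of the predict, test, update loop (not a model of any real system), the sketch below tracks a scalar signal with a delta rule. The estimate is the prediction, the observation tests it, and the estimate moves by a fraction of the discrepancy. The function name `track`, the learning rate and the stream are all assumptions made for the example.

```python
def track(stream, lr=0.2, estimate=0.0):
    """Online delta rule: nudge the estimate toward each observation."""
    history = []
    for obs in stream:
        error = obs - estimate    # discrepancy between prediction and observation
        estimate += lr * error    # update the internal estimate from that discrepancy
        history.append(estimate)
    return history


# Hypothetical environment: the signal sits at 1.0, then shifts to 5.0.
stream = [1.0] * 30 + [5.0] * 30
history = track(stream)

print("estimate before the shift:", history[29])
print("estimate after the shift: ", history[-1])
```

When the stream shifts, the estimate follows within a few dozen steps, with no separate training phase.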

Humans do this without even noticing. Walk into a new room and your spatial model updates within seconds. Pick up an unfamiliar tool and your motor program adapts. Read a book on a topic you knew nothing about and a week later you have a working representation of that topic. None of this involves a backprop pass over a frozen prior.

What changes if we take it seriously

The base architecture has to support persistent state - a workspace, a working memory, something that survives across the equivalent of forward passes. This is the central workspace bet.
Specialists have to be able to update locally without rewriting the rest of the system. Modular plasticity - new competence acquired in one module without leaking destructive gradients into others.
The training loop has to look more like life and less like supervised batch SGD. Streams of experience, replay, surprise-driven reweighting. Closer to something like predictive coding than to the next-token loss.
Evaluation has to measure how the system behaves over a stream, not over a benchmark. How does performance on the first thing it learned survive after seeing a thousand other things? How quickly can it pick up a new regularity?
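
A rough sketch of what stream training with replay and surprise-driven reweighting could look like, together with the retention question asked above. Everything here is an assumption for illustration: a two-weight linear model, two hypothetical tasks, a buffer that stores raw examples, and "surprise" defined as squared prediction error. It is not a description of how Parallax works.

```python
import random


def predict(w, x):
    return w[0] * x[0] + w[1] * x[1]


def sgd_step(w, x, y, lr=0.05):
    """One in-place SGD step on squared error."""
    err = predict(w, x) - y
    w[0] -= lr * 2 * err * x[0]
    w[1] -= lr * 2 * err * x[1]


def mse(w, data):
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)


def sample_by_surprise(buffer, w, rng):
    """Pick a stored example with probability proportional to its current error."""
    # The small floor keeps every stored example reachable once errors shrink.
    weights = [(predict(w, x) - y) ** 2 + 1e-6 for x, y in buffer]
    return rng.choices(buffer, weights=weights, k=1)[0]


def learn_stream(tasks, use_replay, steps_per_task=500, seed=0):
    """Learn tasks one after another, optionally replaying stored examples."""
    rng = random.Random(seed)
    w = [0.0, 0.0]
    buffer = []
    for task in tasks:
        for _ in range(steps_per_task):
            x, y = rng.choice(task)       # new experience from the current task
            sgd_step(w, x, y)
            if use_replay and buffer:     # rehearse something seen earlier
                rx, ry = sample_by_surprise(buffer, w, rng)
                sgd_step(w, rx, ry)
        buffer.extend(task)               # store this task's examples for later replay
    return w


# Hypothetical tasks. The weights (2, -2) solve both.
task_a = [((1.0, 0.5), 1.0), ((2.0, 1.0), 2.0)]
task_b = [((0.5, 1.0), -1.0), ((1.0, 2.0), -2.0)]

w_naive = learn_stream([task_a, task_b], use_replay=False)
w_replay = learn_stream([task_a, task_b], use_replay=True)

# Stream-style evaluation: how does the first task survive the rest of the stream?
print("task A error, no replay:  ", mse(w_naive, task_a))
print("task A error, with replay:", mse(w_replay, task_a))
print("task B error, with replay:", mse(w_replay, task_b))
```

In this toy, the run without replay loses task A after seeing task B, while the run with replay keeps it. The number to watch is the error on the first task at the end of the stream.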

I do not pretend to have solved any of this. I am building Parallax with continual learning as a default rather than a fine-tune, and a lot of the design decisions there - persistent workspace, modular specialists, mediated interaction - are downstream of taking this problem seriously. Whether it works is a question for next year, not this one. But I am increasingly convinced that this is the cliff.