Language is just an interface
We treat language as the substrate of language models. It is not. Language is an interface to an underlying world model that the model has been forced to construct in order to predict the next token. The substrate is the world model. Language is one way to query it.
When GPT-2 came out, the dominant story was that language models were learning statistical regularities of text. When GPT-3 came out, the story shifted to in-context learning. With GPT-4 and onward, the field has been quietly conceding something it took a decade to admit: there is more in there than a model of text.
The cleanest way to make sense of what large language models are doing is to stop treating language as the substrate. It isn't. The substrate is a world model. Language is one of several interfaces into that world model, and it just happens to be the one we trained against.
Why next-token prediction smuggles in a world model
If you want to predict the next token reliably across a corpus that contains physics, dialogue, code, recipes, history, and arguments about each of them, you need more than syntax. You need a representation that lets you simulate consequences. A representation in which 'the bottle fell off the table' is followed plausibly by 'and broke'. A representation in which 'I added two and three and got' is followed by '5'. The way you achieve that, given enough capacity and enough text, is by building structure inside the model that mirrors how the world works.
Language is a shadow of the world. To predict the shadow, you have to model the thing casting it.
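For reference, the next-token objective being discussed is usually written as the negative log-likelihood below. The notation is assumed here for illustration and does not come from any particular model's documentation: \theta stands for the model parameters, x_t for the token at position t, and T for the sequence length.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \dots, x_{t-1}\right)
```

Nothing in this expression mentions the world. Any world model has to emerge only because it lowers the loss across a sufficiently varied corpus, which is the sense in which it is smuggled in.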
We have known this for a while, but the implications are still working their way through the field. If LLMs are world models with a text interface, then several things follow.
Implications
- Multimodal is not 'adding capabilities'. It is shrinking the gap between the model's internal world model and the actual world. The model already has a partial physics module; vision, audio, and action just give it more interfaces into the same internal substrate.
- Hallucination is what happens when the world model is forced to commit through a narrow interface to content it has not modeled well. Improvements in factuality come not from prompting around the issue but from making the world model better.
- Chain-of-thought (CoT) improves reasoning because it lets the model use its world-model capacity sequentially rather than in a single forward pass. The thinking happens inside the model; CoT just gives it room to externalize intermediate steps.
- Pure text training is hitting a ceiling not because there is not enough text but because text under-determines the world. You can read every book about riding a bike and still fall off one. Text data alone cannot fully constrain the world model that the loss is implicitly asking for.
The pivot to general world models
I think this is why every serious lab is converging on something like a general world model. Not 'a bigger LLM'. Not 'a multimodal LLM'. A model that has been trained on a substrate richer than text - actions, video, simulation, embodiment - so that the world model inside is constrained by more than the shadow.
Once you have that, language becomes one query interface. Code becomes another. Action becomes another. The model that responds to each is the same underlying world model, just routed through different heads.
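To make the 'one substrate, several heads' picture concrete, here is a deliberately tiny sketch. Everything in it is hypothetical: the class name SharedWorldModel, the head names, the dimensions, and the random weights are stand-ins chosen for illustration, not a description of any real system or of how any lab builds these models. It only shows the routing: an observation is encoded once into a shared latent state, and each interface is a separate readout of that same state.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random weights as a list of rows (a stand-in for learned parameters)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SharedWorldModel:
    """Hypothetical toy: one shared latent state, several readout heads."""

    def __init__(self, input_dim, latent_dim, head_dims, seed=0):
        rng = random.Random(seed)
        # One encoder builds the shared latent state (the 'substrate').
        self.encoder = make_matrix(latent_dim, input_dim, rng)
        # Each head is a separate linear readout of that same state.
        self.heads = {
            name: make_matrix(dim, latent_dim, rng)
            for name, dim in head_dims.items()
        }

    def encode(self, observation):
        """Map an observation to the shared latent state."""
        return [math.tanh(v) for v in matvec(self.encoder, observation)]

    def query(self, latent, head):
        """Read the shared latent state through one named interface."""
        return matvec(self.heads[head], latent)


# Assumed sizes, chosen only to keep the example small.
model = SharedWorldModel(
    input_dim=4,
    latent_dim=8,
    head_dims={"language": 5, "code": 3, "action": 2},
)
latent = model.encode([0.1, -0.3, 0.7, 0.2])
outputs = {name: model.query(latent, name) for name in model.heads}
```

The point of the sketch is that the three outputs differ only in the readout. The latent state is computed once and is identical for every head, so improving the encoder improves every interface at the same time.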
What this changes for me
Two things. First, when I think about Parallax, the workspace is naturally a world-model substrate - it holds whatever content currently matters, and the specialists update on it. If language is just an interface, the workspace is the place where 'the world right now' lives. Second, I am very interested in the geometry of representations across modalities. If text and video and proprioception are interfaces to a common substrate, that substrate has a shape, and I would like to know what that shape is.
The reason this matters in the long run is alignment. If we keep treating language as the substrate, we keep aligning to the shadow. If we accept that the substrate is a world model, we have to ask how that world model is structured - what it represents, what it leaves out, what biases it has baked in - and work directly on that.
Language was never going to be the end. It was the bootloader.