World models and the text ceiling

Large language models are trained on trillions of tokens of text, and they are very good, but the way they come to be good has always struck me as a little odd when you set it next to how a person learns. A child does not read the internet. It spends its first couple of years bumping into furniture, dropping things and watching them fall, learning that the cup that rolled behind the sofa has not stopped existing, and all of this happens in vision and touch and sound long before it can produce a single grammatical sentence. Text is the thing we reach for last, once everything else is already in place, and it is the most compressed and abstract description of the physical world we have. A model that only ever reads is learning the world through its thinnest possible summary, and world models are the line of work that asks what you would get if you trained on something closer to the raw thing instead.

These are my notes from reading my way into that question. They assume you are comfortable at a high level with a couple of pieces I lean on rather than rebuild: a variational autoencoder, the kind of network that learns to compress an image down to a small vector and reconstruct it, and a recurrent network of the LSTM sort that carries a hidden state forward through time. It also helps to have seen the basic reinforcement learning setup, an agent taking actions in an environment and being scored, since the whole point of a world model is eventually to do without that environment.

The text ceiling

The reason to want something other than text is that text sits at the top of a tower of abstraction whose lower floors the model never visits. When I write that a glass pushed off a table shatters on the floor, that sentence is the final compressed report of a physical process: gravity, acceleration, the brittleness of glass, the way the pieces scatter. A language model learns the report and the statistics of how such reports are worded, and it can become startlingly fluent at producing more of them, but it has only ever seen the summary and never the process the summary is summarising. It knows that “shatters” tends to follow “glass” and “floor” because the corpus says so, not because it has any internal sense of a falling object.

For an enormous amount of what we ask models to do, the summary is genuinely enough, which is exactly why language models took over the way they did. But it leaves a gap that shows up the moment you want a model to act in the world rather than talk about it, to predict what happens next when a robot arm nudges a stack of blocks, and that gap is the thing world models are trying to close by learning the lower floors of the tower directly.

What a world model is trying to be

The idea, stated plainly, is to give a model its own internal simulator of the environment. Rather than feeding it descriptions of what happened, you train it to predict what the world will do next from what it has seen so far, so that it builds up, inside its own weights, a runnable model of how things move and bump and fall. To predict the next frame of the world well, it has no choice but to absorb a working sense of the laws that govern the frames, the rough physics of the scene and the cause and effect that links one moment to the next, because those are the only things that make the future predictable at all.

The objective differs from next-token prediction only in what is being predicted, and that turns out to matter a great deal. Predicting the next token rewards you for modelling language. Predicting the next state of a scene rewards you for modelling the scene, and a model that can hold the scene in its head and roll it forward is doing something much closer to what we mean when we say a person imagines the consequences of an action before taking it.
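
Written side by side, the two objectives have the same shape. The notation here is my own rather than anything from a particular paper: \( x_t \) for tokens, \( s_t \) for scene states, \( a_t \) for actions, and \( \theta \) for the model's parameters.

\[ \max_\theta \sum_t \log p_\theta\!\left(x_{t+1} \mid x_{\le t}\right) \qquad\text{versus}\qquad \max_\theta \sum_t \log p_\theta\!\left(s_{t+1} \mid s_{\le t},\, a_{\le t}\right). \]

The sum is the same in both. What changes is the thing being predicted, and the fact that the agent's own actions enter the conditioning.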

The original three pieces

The paper that put a clean architecture under all of this is the 2018 World Models work by David Ha and Jürgen Schmidhuber, and it is worth walking through because the later systems are recognisably its descendants. They work with agents in two fairly simple environments, a car-racing game and a Doom level, and split the problem of learning an environment into three parts that each do one job.

The first part is vision, and it is a variational autoencoder. Each frame the agent sees is a sizeable image, far too high-dimensional to reason about directly, so the VAE learns to squeeze a frame down to a small latent vector, written \( z_t \), that keeps the features that matter, the track ahead, the walls, the enemy, and discards the pixel-level detail that does not. This is the same compression a person does without noticing when they glance at a road and register “bend coming up” rather than every blade of grass.
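
For anyone who wants the VAE step made concrete, here is a minimal sketch of the part that produces \( z_t \), assuming the usual VAE setup in which the encoder outputs a mean and a log-variance per latent dimension. The encoder network itself is left out, and the numbers standing in for its output are invented for illustration.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Reparameterised sample z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

rng = random.Random(0)
mu = [0.5, -1.0, 0.0]         # hypothetical encoder means for one frame
log_var = [-2.0, -2.0, 0.0]   # hypothetical encoder log-variances
z_t = sample_latent(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

The KL term is what keeps the latent space tidy enough for a later model to predict in it. The reconstruction loss, which is omitted here, is what forces \( z_t \) to keep the features that matter.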

The second part is memory, and it is where the prediction lives. A single latent \( z_t \) tells you what the world looks like right now but nothing about where it is going, so they add a recurrent network with an LSTM at its core that carries a hidden state \( h_t \) summarising everything seen so far. Given the current latent, the action just taken, and that running memory, it predicts the next latent, and the wrinkle that gives the model its name, the MDN-RNN, is that it does not predict a single \( z_{t+1} \) but a probability distribution over it,

\[ P\!\left(z_{t+1} \mid z_t,\, a_t,\, h_t\right), \]

shaped as a mixture of Gaussians, which is what the mixture density network part supplies. Predicting a distribution rather than a point is the honest thing to do, because the future of a real environment is not deterministic: the enemy might appear from the left or the right, and a model forced to commit to one exact next frame would learn a blurred average of the possibilities, whereas a mixture can say there are a few distinct things that might happen next and put weight on each.
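
The blurred-average point is easy to see in one dimension. In notation of my own, a mixture density has the form \( \sum_k \pi_k\, \mathcal{N}(z;\, \mu_k,\, \sigma_k^2) \), with weights \( \pi_k \) that sum to one. The sketch below uses a made-up two-component mixture, one mode for "enemy on the left" and one for "enemy on the right", and compares sampling from it with committing to its mean.

```python
import random

def sample_mixture(weights, means, stds, rng):
    """Draw one value from a 1-D mixture of Gaussians."""
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return rng.gauss(means[k], stds[k])

# Hypothetical mixture: two equally likely futures, far apart.
weights = [0.5, 0.5]
means = [-1.0, 1.0]
stds = [0.1, 0.1]

rng = random.Random(0)
samples = [sample_mixture(weights, means, stds, rng) for _ in range(2000)]

# A single-point prediction that minimises squared error lands on the mean.
point_estimate = sum(w * m for w, m in zip(weights, means))

# Count how many sampled futures fall anywhere near that point.
near_average = sum(abs(s) < 0.5 for s in samples)
```

The point estimate sits at zero, halfway between the two modes, which is a future that essentially never occurs. The samples cluster around the two modes instead.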

The third part is the controller, and it is deliberately tiny. All the hard-won knowledge of the world is already sitting in the vision and memory networks, so the thing that actually chooses actions can be almost trivially small, a single linear map from the current latent and the memory state to an action,

\[ a_t = W_c\,[\,z_t,\, h_t\,] + b_c, \]

with \( W_c \) and \( b_c \) its only parameters. Keeping the controller this small is a choice with a payoff: because it has so few parameters, you can train it with methods that would be hopeless on a large network (the paper uses CMA-ES, an evolution strategy, rather than backpropagation), and more importantly it makes the point that perception and prediction are where the difficulty actually was, with action selection a comparatively easy cap on top.
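
To get a feel for how small this is, here is the linear map written out as a sketch, along with a parameter count. The toy vectors are invented. The sizes in the last line (a 32-dimensional latent, a 256-dimensional memory state, 3 action outputs) are the ones I recall the paper using for its car-racing task, so treat them as an assumption rather than a citation.

```python
def controller_action(W_c, b_c, z_t, h_t):
    """a_t = W_c [z_t, h_t] + b_c, with [z_t, h_t] the concatenated input."""
    x = list(z_t) + list(h_t)
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(W_c, b_c)]

def parameter_count(latent_dim, hidden_dim, action_dim):
    """Weights plus biases of the single linear layer."""
    return (latent_dim + hidden_dim) * action_dim + action_dim

# Toy example: 2-D latent, 1-D memory state, 2 action outputs.
z_t = [1.0, 2.0]
h_t = [0.5]
W_c = [[1.0, 0.0, 2.0],
       [0.0, -1.0, 0.0]]
b_c = [0.1, 0.2]
action = controller_action(W_c, b_c, z_t, h_t)

# Assumed sizes for the car-racing setup (see the note above).
n_params = parameter_count(latent_dim=32, hidden_dim=256, action_dim=3)
```

With those assumed sizes the count comes to 867 parameters, which is small enough that searching over them directly is practical.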

Training inside the dream

Here is the move that makes the whole thing feel like more than a tidy way to factor a network. Once the vision and memory models have learned the environment well enough, the memory model is itself a simulator: it takes a latent and an action and hands back a distribution over the next latent, which is exactly the interface the real environment offered. So you can unplug the agent from the actual game and let it act inside the memory model’s own rollout of the world, sampling a latent from each predicted distribution, feeding it back in as the next input, and letting the controller learn entirely against this internally generated sequence. Ha and Schmidhuber call this training inside the model’s dream, and the striking result of the paper is that a controller trained purely in the dream, never having acted in the real Doom level during learning, transfers back and plays the real level well.
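
The loop itself is short enough to sketch. Everything in the code below is a toy stand-in of my own: the "memory model" is a hand-written function rather than a trained MDN-RNN, the world is a single number, and the reward (stay near zero) is invented. What the sketch does show is the shape of the interface, in which the controller only ever sees latents the model produced.

```python
import random

def toy_memory_model(z, a, h, rng):
    """Hypothetical stand-in for the learned memory model.

    Returns a sampled next latent and an updated hidden state. A real
    model would be a trained network; here the dream is z + a plus noise.
    """
    h_next = 0.9 * h + 0.1 * z             # crude running summary of the past
    z_next = z + a + rng.gauss(0.0, 0.05)  # sample from the predicted distribution
    return z_next, h_next

def linear_controller(params, z, h):
    """Scalar version of a_t = W_c [z_t, h_t] + b_c."""
    w_z, w_h, b = params
    return w_z * z + w_h * h + b

def dream_rollout(params, steps, seed):
    """Run the controller entirely inside the model, with no real environment."""
    rng = random.Random(seed)
    z, h, total = 1.0, 0.0, 0.0
    for _ in range(steps):
        a = linear_controller(params, z, h)
        z, h = toy_memory_model(z, a, h, rng)  # sampled latent is fed back in
        total += -abs(z)                       # toy reward: stay near z = 0
    return total

# Two candidate controllers scored purely on dreamed experience.
idle = dream_rollout((0.0, 0.0, 0.0), steps=50, seed=0)
corrective = dream_rollout((-0.5, 0.0, 0.0), steps=50, seed=0)
```

A search method then only has to prefer the candidate with the higher dreamed return, here the corrective one, and at no point does it need the real environment to make that comparison.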

That severing is the part I keep turning over, because it lines up so neatly with something humans plainly do. When you rehearse a tricky reverse-park in your head before attempting it, you are running a learned simulator of your car and the curb, taking imagined actions and watching imagined consequences, and refining a plan without spending a single real attempt. An agent that has built a good enough model of its world can do the same, practising in imagination where mistakes are free and only then acting for real, and that is a meaningfully different and more sample-efficient way to learn than blundering through the real environment over and over. It is also, for whatever it is worth, a story about intelligence that feels closer to general capability than a system whose entire experience of the world is the next word.

Do they scale, though

The honest counterweight to all this enthusiasm is that the thing language models have going for them, above everything, is that they scaled. Pour in more data and more parameters and they reliably got better, and crucially they got better as foundation models, single pretrained systems you can point at translation, summarisation, coding, and a hundred downstream tasks they were never specifically trained for. The world models I have described are by comparison domain-specific, a racing game here, a Doom level there, each a bespoke simulator of one small world rather than a general engine you can aim anywhere.

So the open question, and it really is open, is whether world models can climb the same scaling curve, growing from these narrow simulators into something general enough to be a foundation for acting in the world the way language models are a foundation for working with text. The 2018 paper was a proof of concept on toy environments, and the interesting development is that the years since have been a steady run of iterations pushing on exactly that question, which is most of what the rest of these notes is about.

LeCun’s case, and the case against it

The loudest voice for the world-model view has been Yann LeCun, who spent years arguing it from inside Meta and contributed the JEPA family of architectures, the Joint Embedding Predictive Architecture, along with its image and video versions I-JEPA and V-JEPA. The core of JEPA is to predict in the abstract latent space rather than in raw pixels, on the reasoning that a model wastes its capacity if it is forced to predict every leaf and texture, and should instead predict the gist of what comes next the way the memory model above predicts a latent rather than a frame. His standing critique of language models is the one these notes opened with, that a system trained only token-to-token has no real model of the physical world underneath, only the surface statistics of how the world gets described. He has since left Meta, around late 2025, reportedly to start a venture built around this thesis, so the work has outgrown its old institutional label.
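
The difference between predicting pixels and predicting the gist can be shown with a deliberately crude sketch. This is not how I-JEPA or V-JEPA are implemented. Their encoders are learned, whereas the `encode` function here is a hand-written placeholder that keeps two coarse averages and throws the texture away. The frames are made up.

```python
def encode(frame):
    """Hypothetical hand-written encoder: keep only coarse summary features."""
    n = len(frame)
    half = n // 2
    return [sum(frame[:half]) / half, sum(frame[half:]) / (n - half)]

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def latent_prediction_loss(predictor, context_frame, target_frame):
    """Compare the predicted embedding with the embedding of the target."""
    predicted = predictor(encode(context_frame))
    return squared_error(predicted, encode(target_frame))

def identity_predictor(embedding):
    """Trivial predictor: assume the gist of the scene stays the same."""
    return embedding

# Two frames with the same coarse layout but different fine texture.
context = [0.0, 1.0, 0.0, 1.0, 5.0, 6.0, 5.0, 6.0]
target = [1.0, 0.0, 1.0, 0.0, 6.0, 5.0, 6.0, 5.0]

latent_loss = latent_prediction_loss(identity_predictor, context, target)
pixel_loss = squared_error(context, target)
```

In latent space the trivial predictor is already perfect, because the gist did not change. In pixel space the same prediction is penalised on every element for texture it had no way of knowing.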

I think the critique is mostly right and also a little unfair to language, and it is worth being honest about both halves. Right, because a model that has only read about falling glasses genuinely does lack the runnable physics a world model is built to acquire, and you can watch that lack surface as confident nonsense about anything spatial or mechanical. Unfair, because language is not only a thin summary of the physical world: it also carries an enormous amount of real structure about that world, the way grammar encodes who did what to whom, the way nouns and verbs and the relations between them mirror objects and the things that happen to them, the accumulated facts and figures of speech that are themselves compressed observations of how the world behaves. A model that has truly absorbed language has absorbed a great deal about the world along with it, just indirectly, and the gap between the two camps is narrower than the sharper rhetoric suggests.

The line is blurring

In practice the two camps have been converging anyway. The first step was multimodal language models, where a language model is given eyes: a vision encoder turns an image into tokens the model can attend to, through cross-attention in some designs and by placing the image tokens directly in the input sequence in others, so that the same system that handles text can now perceive a picture and answer questions about it. That alone closes part of the gap LeCun points at, because such a model is no longer purely token-to-token in the world of text: it is taking in the visual world as well, even if it is not yet predicting how that world evolves.
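
For the cross-attention route, the core operation is small enough to sketch: one query from the text side scores every image token, and the scores decide how much of each image token's value gets mixed in. The embeddings below are invented, and a real model would produce queries, keys and values through learned projections that this sketch leaves out.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, keys, values):
    """One text-side query attending over image-side keys and values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    mixed = [sum(w * v[i] for w, v in zip(weights, values))
             for i in range(len(values[0]))]
    return mixed, weights

# Hypothetical embeddings: one text token, three image patch tokens.
text_query = [1.0, 0.0]
patch_keys = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
patch_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

mixed, weights = cross_attend(text_query, patch_keys, patch_values)
```

The first patch's key points the same way as the query, so it gets most of the weight, and the output is dominated by that patch's value.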

The step beyond that is to let the model act, which is the idea behind vision-language-action models, usually shortened to VLA. Here a vision transformer and a language model are wired together not just to describe a scene but to emit action tokens, discrete commands that drive a body, so the same architecture that learned to caption an image learns to move a robot through the scene it is looking at. This is the lineage behind the humanoid robots now being shown off, including the Neo home humanoid that 1X opened orders for in October 2025. It is worth saying plainly, because the marketing tends not to, that at launch Neo leans heavily on a remote human teleoperating it in a headset for anything non-trivial, learning from those demonstrations rather than acting autonomously, and 1X’s own video-based world model for the robot arrived separately, in early 2026, as the simulator it learns inside. The autonomy is still being built up out of the human’s hands.
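
One simple way to turn a continuous motor command into a discrete token is uniform binning, sketched below. This is an assumption for illustration and not a description of any particular robot's stack. The range of -1 to 1 and the 256 bins are made-up values, and the function names are mine.

```python
def action_to_token(value, low, high, n_bins):
    """Map a continuous command in [low, high] to an integer bin index."""
    clipped = min(max(value, low), high)
    frac = (clipped - low) / (high - low)
    return min(int(frac * n_bins), n_bins - 1)

def token_to_action(token, low, high, n_bins):
    """Map a bin index back to the centre of its bin."""
    width = (high - low) / n_bins
    return low + (token + 0.5) * width

N_BINS = 256  # assumed number of bins per action dimension

token = action_to_token(0.25, -1.0, 1.0, N_BINS)
recovered = token_to_action(token, -1.0, 1.0, N_BINS)
```

The round trip loses at most half a bin width. That is the price of letting a model built to emit tokens emit motion instead.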

The new players, and what they are missing

A recurring complaint about even the multimodal models is that perceiving an image is not the same as understanding the space it depicts, and that they lack genuine spatial awareness, a sense of the three-dimensional scene behind the flat picture. This is the gap Fei-Fei Li set her company World Labs against, and their system Marble, launched commercially in November 2025, generates an actual explorable three-dimensional scene rather than a picture of one, representing it as a cloud of Gaussian splats, little fuzzy particles in space you can move a camera through. With Marble you are looking at a world model’s reconstruction of an environment laid out in real space, which is a real step past a model that only ever produced a 2D frame.
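
I have not seen how Marble renders its scenes, so the following is only the generic idea behind drawing with splats: for each pixel, the splats that cover it are blended front to back, each contributing its colour in proportion to its opacity and to whatever light the nearer splats let through. The `Splat` fields are my own simplification. In a real renderer the per-pixel opacity would come from evaluating a projected Gaussian, a step that is skipped here.

```python
from dataclasses import dataclass

@dataclass
class Splat:
    depth: float   # distance from the camera along the ray
    color: tuple   # (r, g, b), each in [0, 1]
    alpha: float   # opacity this splat contributes at the pixel, in [0, 1]

def composite(splats):
    """Front-to-back alpha compositing of the splats covering one pixel."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for s in sorted(splats, key=lambda s: s.depth):
        for i in range(3):
            pixel[i] += transmittance * s.alpha * s.color[i]
        transmittance *= 1.0 - s.alpha
    return pixel, transmittance

splats = [
    Splat(depth=2.0, color=(0.0, 0.0, 1.0), alpha=1.0),  # opaque blue, far
    Splat(depth=1.0, color=(1.0, 0.0, 0.0), alpha=0.5),  # half-transparent red, near
]
pixel, remaining = composite(splats)
```

Moving the camera changes which splats cover a pixel and in what depth order, which is what makes the scene explorable rather than a fixed picture.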

What Marble does not have is a controller in the sense of the 2018 paper, something that grapples with the physics of the world and acts inside it: it builds the space, but it does not give you an agent that learns to do things there. That side of the problem is what General Intuition, founded by Pim de Witte as a spin-out of the game-clip app Medal, has gone after, training world models and agents to reason about and act inside games and simulations using Medal’s vast library of gameplay video as the raw experience. Between a system that builds the world and a system that learns to act in it, you can see the two halves of the 2018 architecture being pursued by different companies.

Google DeepMind has been pushing on both halves at once. Their SIMA work, the Scalable Instructable Multiworld Agent, is a generalist agent that follows natural-language instructions across many different 3D games, and its second version from November 2025 puts Gemini at the centre as the reasoning core. Alongside it sits Genie 3 from August 2025, a world model that spins up an interactive, navigable, real-time environment from a text prompt, a world you can actually move around inside as it is generated. The pairing is the obvious one: drop the agent into the generated world and let it learn there, which is the dream-training of the original paper scaled up to environments conjured on demand.

NVIDIA has staked out the infrastructure layer with Cosmos, a platform of pretrained world foundation models aimed squarely at physical AI, robotics, and autonomous vehicles. Its pitch is less about a single flashy demo and more about generating large amounts of photorealistic, physics-respecting synthetic data, the augmentation and simulation that downstream models for cars and robots are trained against, which is its own quiet argument that the most immediate use of a world model is as a factory for experience that would be too expensive or dangerous to collect for real.

What they are actually for

Pull back and the same capability is doing the work in several places at once. A world model that can roll an environment forward convincingly is what lets you generate video that holds together over time, train a self-driving car against millions of simulated miles it never had to drive, and bring a factory robot up to competence in simulation before it touches a real production line, all of which are the same trick of practising in a learned world where mistakes are cheap.

The one claim in this neighbourhood worth handling carefully is that a video generator like Sora is itself a world model. OpenAI has framed its video models as world simulators, and the framing is not unreasonable, since to generate a coherent clip you do have to model something about how scenes evolve. But it is contested rather than settled: critics, LeCun among them, point at the physics violations that still slip through, objects that pop into existence or float free of gravity or fail at object permanence, as evidence that a model can produce a plausible-looking video while not actually having a stable model of the world underneath. It is the cleanest illustration of the open question these notes keep circling, which is whether predicting the surface of the world well is the same thing as understanding it, or only a very good imitation of doing so.

Takeaway

The thread running through all of it is the bet that the next big capability comes from models that carry a runnable simulator of the world rather than a description of it, and that you can then have them learn by acting inside that simulator instead of inside reality. The 2018 paper is the clean statement of the idea: vision compresses, memory predicts a distribution over what comes next, a tiny controller acts, and the agent can be severed from the real environment once the simulator is good enough. Everything since has been an attempt to scale that from toy games toward something general, with World Labs building the space, General Intuition and DeepMind building the agents that act in it, NVIDIA building the data factory underneath, and the humanoid companies trying to stand the whole stack up on two legs.

Whether it gets to the generality that made language models a foundation rather than a collection of bespoke simulators is genuinely not yet known, and the Sora debate is a fair warning that fluent prediction can outrun real understanding for a good while before the gap shows. But of the directions on offer it is the one that most resembles how a person actually comes to know the world, by building a model of it and rehearsing inside, and that resemblance is reason enough to keep watching where it goes.