Paper Notes: Scaling Laws for Pre-Training Agents and World Models
To make big investments in scaling robotic learning, we need to understand what the scaling laws for robotics data actually are. This is something I’ve written about before, but a couple of new papers came up on Bluesky the other day, so let’s take a look.
The key result is this: the same types of scaling laws we see in language models also seem to appear in robotics and embodied AI. Now, there are real questions about where all of this robotics data might come from, and how we might get to the same scale as is currently used for flagship large language models like OpenAI’s GPT-4o and Anthropic’s Claude. But there’s a growing body of evidence that, if we do get there, we can hope to see the sort of impressive generalization that comes with those models.
You can find my previous post on scaling laws for embodied AI here:
What do we mean by “Scaling Laws?”

The idea of “scaling laws” is a powerful one: if we increase the amount of data and compute available to solve a task, we should see a predictable increase in performance.
Basically, what I want to know is: if I keep spending money on compute and tokens (data), will the line go up? Will my models get better? How much data vs compute do I need? Where do I spend my limited resources?
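To make that concrete, here’s a minimal sketch of what “fitting a scaling law” usually means in practice (the numbers below are made up for illustration, not taken from any paper): fit a saturating power law to (compute, loss) measurements, then extrapolate to a bigger budget.

```python
# Minimal sketch: fit a saturating power law L(C) = L_inf + a * C^(-b)
# to hypothetical (compute, loss) measurements. All numbers are illustrative.
import numpy as np
from scipy.optimize import curve_fit

def power_law(c, l_inf, a, b):
    return l_inf + a * np.power(c, -b)

compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])   # e.g. compute in PF-days
loss = np.array([3.10, 2.71, 2.42, 2.21, 2.05])     # final training loss

(l_inf, a, b), _ = curve_fit(power_law, compute, loss, p0=[1.5, 1.5, 0.3])
print(f"irreducible loss ~{l_inf:.2f}, exponent ~{b:.2f}")

# "Will the line keep going down if I spend 10x more?"
print(f"predicted loss at 1000 PF-days: {power_law(1000.0, l_inf, a, b):.2f}")
```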
I’ve visited this topic before. But it’s such an important one that I thought it was worth revisiting with this slightly newer paper, not included in the last one.
Specifically here I’m going through this paper from Microsoft Research:
Key Observations
Unlike many of the previous works I explored, this paper looks specifically at training losses, not at downstream performance. It has a much more classic ML feel to it, at least to me, compared to the much more robotics-focused works I looked at previously.
In addition, they look at scaling laws as they apply to world models, which isn’t something we saw before (since world models aren’t really useful for robotics yet).
They observe:
Scaling laws do seem to apply to world models (which, in this context, means action-conditioned video generative models)
The optimal trade-off between model and dataset size is influenced by the number of tokens per observation (the compression rate); see the toy sketch after this list
Scaling laws with behavior cloning are hard to observe with modest compute budgets
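On the second point, here’s a toy illustration of why the compression rate matters (all numbers are hypothetical placeholders of mine, not the paper’s): the tokens-per-observation figure directly sets how many tokens of training data you actually have, and therefore how far a fixed FLOPs budget stretches.

```python
# Toy illustration: the tokenizer's compression rate sets the effective
# dataset size in tokens. All numbers are hypothetical placeholders.
NUM_OBSERVATIONS = 500_000_000    # e.g. video frames in the dataset
MODEL_PARAMS = 1_000_000_000      # a hypothetical 1B-parameter model

for tokens_per_obs in (128, 256, 540):
    dataset_tokens = NUM_OBSERVATIONS * tokens_per_obs
    # Common rough estimate: ~6 * params * tokens FLOPs per pass over the data.
    flops_per_epoch = 6 * MODEL_PARAMS * dataset_tokens
    print(f"{tokens_per_obs:4d} tokens/obs -> "
          f"{dataset_tokens:.2e} tokens, ~{flops_per_epoch:.2e} FLOPs/epoch")
```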
One really interesting observation is more of a meta observation about the field:
Robotics data is generally much, much more limited than LLM data. This means that we often train robotics models very differently, visiting many data points multiple times, leading to overfitting and a lack of generalization. This really isn’t the way LLMs are trained: Llama 3, for example, was trained for only a handful of epochs. The GPT-3 paper showed even less:

So that’s what they’re talking about. This is an LLM-like, not a robotics-like, data regime, where we have a truly massive amount of good quality data.
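A quick back-of-the-envelope makes the difference between the two regimes obvious (illustrative numbers of my own, not from either paper):

```python
# Back-of-the-envelope: how many passes (epochs) a training run makes
# over its dataset. Numbers are illustrative, not from the papers.
def epochs(training_tokens: float, dataset_tokens: float) -> float:
    return training_tokens / dataset_tokens

# LLM-like regime: trillions of unique tokens, each seen roughly once.
print(epochs(training_tokens=15e12, dataset_tokens=15e12))   # ~1 epoch

# Robotics-like regime: a small demonstration dataset revisited many times.
print(epochs(training_tokens=100e9, dataset_tokens=2e9))     # ~50 epochs
```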
Methods
They train two models, a classic token-based architecture and a convolutional, behavior-cloning-style architecture. They’re doing this on Bleeding Edge, a 4v4 multiplayer game:
And yes, they have a TON of data. This is far more robotics data than likely exists in the world, spread out across 7 maps (so we’re going to see some real overfitting, even if the exact same data point is never visited twice).
What do they find?
Well, they find some scaling law coefficients, of course.
Concretely, what they’re finding is the minimum loss achievable under a given FLOPs budget, over all choices of model size and dataset size. They then fit power-law relations predicting these compute-optimal values as a function of compute.
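Written out roughly (this is my Chinchilla-style paraphrase; the paper’s exact notation may differ): for each compute budget $C$, find the model size $N$ and dataset size $D$ that minimize loss $L$, then fit power laws to those compute-optimal points.

$$
N^{*}(C),\; D^{*}(C) \;=\; \operatorname*{arg\,min}_{N,\,D \;:\; \mathrm{FLOPs}(N, D) \,\le\, C} L(N, D)
$$

$$
N^{*}(C) \propto C^{a}, \qquad D^{*}(C) \propto C^{b}
$$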
For the numbers themselves, just go to the paper.
The interesting thing is that world modeling and behavior cloning show very different relationships with dataset size. With behavior cloning, dataset size seemed much more important, and model size had minimal effect:
They provide a second set of experiments using a CNN architecture (above), which goes back to showing the normal model-size-based power-law relationship:
But of course these losses are way higher, and we’re using CNNs instead of a normal tokenization scheme, so what’s even going on here?
It really looks like their behavior cloning policies are saturating. Behavior cloning in these kinds of environments is hard; and, on their own, these behavior cloning losses don’t necessarily mean that much. Errors compound in much more dramatic ways in embodied AI problems.
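The standard way to see why (this is the classic imitation-learning argument along the lines of Ross and Bagnell, not something from this paper): if a cloned policy makes a mistake with probability $\varepsilon$ at each step, it drifts away from the states it was trained on, and over a horizon of $T$ steps the cost can degrade quadratically rather than linearly:

$$
J(\pi_{\mathrm{BC}}) \;\le\; J(\pi^{*}) + O(\varepsilon T^{2})
$$

So a small gap in per-step training loss can still translate into a large gap in actual task performance.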
Final Thoughts
Scaling laws are an interesting topic, but here I want to editorialize a bit. People really want rules to guide the kind of massive spending needed to train huge AI models for robotics and embodied AI; something more compelling than “we’re pretty sure scaling works.”
This, unfortunately, doesn’t really exist. Even a paper like this really doesn’t convince me, especially since they’re only looking at the proxy metric of loss, when real embodied AI performance is not well characterized by loss alone (if you want the analog in LLM-land, embodied AI tasks are more like agentic workflows such as OpenAI’s Operator, and those are notoriously hard to build and train).

The particular data, and the particular data mixture, matter a ton. The number of environments and objects matters a ton. I think coming up with the “recipe” for scaling robotics will not come from a laser focus on dataset losses.
The previous post I did on scaling laws explores this a bit more. I will almost certainly revisit the topic - or go into more depth - but I also write about more interesting stuff, so please consider subscribing:
Future Reading
Another paper in the space that might be worth a look is this one, Preliminary Investigation into Data Scaling Laws for Imitation Learning-Based End-to-End Autonomous Driving:
Might end up in a future blog post.