Generative AI for Creating Robot Data
RoboEngine and one way to get the data we need for scaling robot learning
To scale robot learning, we need very large datasets that do not currently exist, despite massive efforts like Open X-Embodiment. Crucially, scaling also relies on a diversity of data that is deeply challenging to collect in the real world. Instead, as suggested by recent works like RoboEngine [1], we can use generative AI to create that data, as part of a data augmentation pipeline.
One way we can do this is via robot world models, where a neural net actually predicts motions, pixel by pixel:
This sort of approach has been proposed by Wayve and NVIDIA Cosmos, in addition to the humanoid robot company 1X. But video generation often produces non-physical behavior, and it is generally limited to domains where you have lots of data. So, good for cars, not for precise manipulation (yet).
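To make "predicts motions, pixel by pixel" concrete, here is a toy sketch of an action-conditioned world model. Everything below (the architecture, the sizes, the rollout loop) is illustrative only, not Cosmos or any real system:

```python
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    """Toy action-conditioned next-frame predictor (illustrative only)."""
    def __init__(self, action_dim: int = 7, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.action_proj = nn.Linear(action_dim, hidden)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, frame: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # frame: (B, 3, H, W), action: (B, action_dim)
        z = self.encoder(frame)
        a = self.action_proj(action)[:, :, None, None]  # broadcast over space
        return self.decoder(z + a)                      # predicted next frame

# Rollout: feed predictions back in to "imagine" a trajectory.
model = TinyWorldModel()
frame = torch.rand(1, 3, 64, 64)
with torch.no_grad():
    for action in torch.rand(10, 1, 7):
        frame = model(frame, action)
```

The catch is that errors compound as predictions get fed back in, which is exactly where the non-physical behavior tends to show up.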
But let’s not count generative AI out, because there’s plenty of value in using diffusion models to generate data. In this post, I’ll look at a few related works that share a common theme: using diffusion models to augment real-world robot data and increase its diversity [1,2,3,4].
Because, again, for robotics to work, we need that crucial combination of data diversity and data quality that’s so hard to get without either really good real-world data or a really good simulator.
This is what this augmentation pipeline looks like:
Step 1: Collect Some Data
First, you need robot data. ROSIE from Google DeepMind [3] used 130k demonstrations, collected on 13 different robots over 17 months, to train its language-conditioned policy. You, perhaps, can get away with less than that: RoboEngine used 50-100 examples to train individual robot skills [1].
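For concreteness, a demonstration is just a sequence of observation-action pairs plus a task description. A minimal (and entirely hypothetical) schema might look like this; none of these field names come from the papers:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Step:
    rgb: np.ndarray       # (H, W, 3) camera image
    proprio: np.ndarray   # joint positions, gripper state, etc.
    action: np.ndarray    # commanded end-effector or joint targets

@dataclass
class Demonstration:
    task: str             # language instruction, e.g. "pick up the mug"
    steps: list[Step]     # 50-100 demos per skill may be enough [1]
```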
Step 2: Segmentation
All of these approaches rely on running some kind of segmentation to break the scene up into its components. RoboEngine [1] even provides a nice robot segmentation dataset you can use (check Twitter/X).
This is important because we don’t necessarily want to change the robot’s appearance. Robots aren’t well represented in the datasets these generative models were trained on, so the models often make mistakes on them. By masking the robot out of the editable region, we keep robot-specific information intact in the image, which means we can still learn accurate, precise motions.
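In code, the key product of this step is a mask of what the generator is allowed to repaint. The sketch below assumes you already have boolean masks from whatever segmentation model you use (RoboEngine trains its own; a SAM-style model is a common starting point); the dilation margin is my own precaution, not something from the paper:

```python
import numpy as np
from scipy.ndimage import binary_dilation

def editable_region(robot_mask: np.ndarray,
                    object_masks: list[np.ndarray],
                    margin_px: int = 5) -> np.ndarray:
    """Boolean (H, W) mask of pixels the image generator may repaint.

    robot_mask and object_masks come from your segmentation model of
    choice. Everything we want to preserve (robot + task-relevant
    objects) is excluded; only the background is left editable.
    """
    keep = robot_mask.astype(bool).copy()
    for m in object_masks:
        keep |= m.astype(bool)              # preserve task objects too
    keep = binary_dilation(keep, iterations=margin_px)  # safety margin
    return ~keep
```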
Step 3: Data Generation

Now we take however many demonstrations we collected above and use an image generation model to paint only within the masks we’ve found and labeled. RoboEngine fine-tunes a diffusion model specifically to ensure physical feasibility, on top of its improved segmentation model; this is a big change from previous works [2, 3], which were somewhat limited by off-the-shelf image generation models.
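A minimal version of this step, using an off-the-shelf Stable Diffusion inpainting checkpoint from Hugging Face diffusers as a stand-in for RoboEngine’s fine-tuned model, might look like this:

```python
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Off-the-shelf inpainting model as a stand-in; RoboEngine fine-tunes its
# own diffusion model for physically plausible backgrounds.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

def augment_frame(rgb: np.ndarray, editable: np.ndarray, prompt: str) -> np.ndarray:
    """Repaint only the editable (background) pixels of a uint8 RGB frame."""
    image = Image.fromarray(rgb).resize((512, 512))
    # White = repaint, black = keep, matching the pipeline's mask convention.
    mask = Image.fromarray((editable * 255).astype(np.uint8)).resize(
        (512, 512), Image.NEAREST)
    out = pipe(prompt=prompt, image=image, mask_image=mask).images[0]
    return np.array(out.resize(rgb.shape[1::-1]))  # back to original size

# e.g. augment_frame(frame, editable_region(robot_mask, object_masks),
#                    "a cluttered kitchen countertop, photorealistic")
```

Running this over each frame with varied prompts is what produces the scene-level diversity.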
Step 4: Train On Your New Dataset
Now, having followed these steps, you can train a policy on the augmented dataset and evaluate it in new settings. Because of the increased diversity of the training data, the policy should now work across many different scenes.
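Training itself is unchanged; you just mix the augmented frames in with the originals. Here is a bare-bones behavior-cloning sketch, with placeholder data and a placeholder network that does not resemble the papers’ actual policies:

```python
import torch
import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Stand-ins for datasets of (image, action) pairs; in practice these come
# from your original demos and their diffusion-augmented copies.
original_ds  = TensorDataset(torch.rand(100, 3, 64, 64), torch.rand(100, 7))
augmented_ds = TensorDataset(torch.rand(400, 3, 64, 64), torch.rand(400, 7))

train_ds = ConcatDataset([original_ds, augmented_ds])   # mix them together
loader = DataLoader(train_ds, batch_size=64, shuffle=True)

policy = nn.Sequential(              # placeholder for a real visuomotor policy
    nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ReLU(), nn.Linear(256, 7)
)
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

for images, actions in loader:       # one epoch of simple behavior cloning
    loss = nn.functional.mse_loss(policy(images), actions)
    opt.zero_grad()
    loss.backward()
    opt.step()
```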
And so, using a few tricks, you’ve turned your relatively small set of robot data into something that should scale quite a bit better! RoboEngine reported success rates improving from 15% to 60%; Google’s ROSIE reported even more dramatic results in some cases:

Final Thoughts
As someone who believes in the value of good old-fashioned engineering in general-purpose robotics, I find these to be extremely useful tools for creating new robot data.
These are, in a lot of ways, a workaround for the fact that pure video generation isn’t quite physically realistic enough. That does mean there are only some kinds of randomization they can handle, though: they still aren’t going to accurately model cloth physics, for example, or be able to generate (meaningful) data of your robot manipulating new objects. But just scene level diversity is a key part of scaling robot learning, so this is all still valuable.
Data augmentation has been a huge part of computer vision for a long time. It makes sense to use new tools to apply it to embodied AI.
References
[1] Yuan, C., Joshi, S., Zhu, S., Su, H., Zhao, H., & Gao, Y. (2025). RoboEngine: Plug-and-Play Robot Data Augmentation with Semantic Robot Segmentation and Background Generation. arXiv preprint arXiv:2503.18738.
[2] Chen, Z., Kiami, S., Gupta, A., & Kumar, V. (2023). GenAug: Retargeting Behaviors to Unseen Situations via Generative Augmentation. arXiv preprint arXiv:2302.06671.
[3] Yu, T., Xiao, T., Stone, A., Tompson, J., Brohan, A., Wang, S., ... & Xia, F. (2023). Scaling Robot Learning with Semantically Imagined Experience. arXiv preprint arXiv:2302.11550.
[4] Chen, Z., Mandi, Z., Bharadhwaj, H., Sharma, M., Song, S., Gupta, A., & Kumar, V. (2024). Semantically Controllable Augmentations for Generalizable Robot Learning. The International Journal of Robotics Research, 02783649241273686.