Remote Robotic Teleoperation
Far from being an "easy out" for autonomy, teleoperating robots poses a range of unique challenges, but it enables impressive real-world capabilities nonetheless
Imitation learning has powered a huge new wave of robotic operations. But many robotic systems still aren’t fully autonomous. And this makes sense! It’s hard to make any system that works on its own 99% of the time. I’d argue even humans can’t do this; if you’re stuck, you probably talk to a colleague about your problem or go ask for help.
Well, robots can’t do that; unfortunately, they still make terrible colleagues. So instead, many teams are turning to robotic teleoperation: the idea that we can use human operators to provide the intelligence behind a robot doing some real-world task, either to collect data or, increasingly, to solve the problem in the first place.
People tend to think of robot teleoperation as an "easy out," but I wanted to write this article to describe some of the many technical challenges you face when implementing systems for teleoperating robots, from networking to robot control. Teleoperation as a substitute for autonomy requires extremely performant networking and intuitive, powerful control interfaces - which actually raises its own autonomy questions, especially in the domain of humanoid robots.
But first, as always, if you want to read my writing about trends in AI and robotics, please click subscribe below:
Defining Remote Teleoperation
Robots are not yet as skilled as humans; we need ways to teach them visuomotor skills and to use robots to perform complex visuomotor tasks in different environments. Teleoperation, in this context, is the set of technologies we use to control robots remotely in order to perform a wide range of tasks.
Teleoperating robots lets us collect rich data in the real world, instead of relying on simulation. Much of the recent push in robotics has been fueled by some truly impressive results built on high-quality real-world teleoperation and imitation learning.

There are two major appeals to large-scale robot teleoperation:
It allows for collecting the very large amount of data that we expect general-purpose robot policies to need (see the humanoid robotics company 1X, which intends to deploy hundreds of robots in homes in 2025)
At the same time, it allows for large-scale robot deployments that can provide value before these policies are truly ready (see Tesla Robotaxi, Watney Robotics, Reflex Robotics).

Technical Challenges
Teleoperation allows us to bypass some of the problems of building truly autonomous robots, but it’s not trivial either. In particular, for teleop to work, we need:
Extremely high performance networking — since your teleoperators will likely not be in the same location as the demonstration
Intuitive and capable teleoperation systems — often based on commodity VR headsets like the Quest 3 and the Apple Vision Pro. Which also likely means…
Kinematic retargeting from human to robot
Data infrastructure to collect and stream massive amounts of data
Some of these have been addressed before in various ways — Tesla, for example, collects a truly massive amount of data from its fleet of cars for training Autopilot. And the rise of affordable consumer VR hardware has definitely made teleoperation easier (yes, the Apple Vision Pro is affordable in the context of robotics — a $3k headset is nothing when you’re already buying a $70k Unitree G1). See, for example, this demo from Ryan Hoque of the Apple Vision Pro being used for robot teleoperation [1].
But if you want to teleoperate from a continent away — something that’s essential if you want teleoperation as a product — then we have other issues.
Problem 1: Networking is Hard

Presume, for a moment, that you want to start selling “intelligent” robots tomorrow, before VLAs or reinforcement learning or whatever approach pays off and delivers embodied general intelligence. You might imagine paying people to teleoperate robots to perform work, as has been suggested by so many different companies.
One essential problem is simply getting information from one place to another. Dropped packets, jitter, and general network instability can seriously degrade the operator's experience, and engineers work to wring every last bit of performance out of their systems, as teams like Watney Robotics have noted.
Fortunately, this isn’t as much of a deal-breaker as it might first appear.
First, the issue itself is overstated. It turns out humans are pretty smart, and can account for some pretty serious network issues! You can see this in action from the EarthMovers competition at IROS 2024:
FrodoBots pitted human and AI teams against one another, teleoperating small wheeled bots at locations all over the world, from Wuhan, China to Kisumu, Kenya. Humans beat the AI players pretty soundly at a game of navigating to randomly placed "coins" throughout the world, but the point here is that the task was readily achievable despite the distances involved. The data was later used to train LogoNav, a foundation model for robot driving that works in unseen environments.
Similarly, Open TeleVision [2] has demonstrated cross-country teleop, as well as teleop from Munich, Germany to San Diego, California. Web infrastructure is quite powerful these days: you can use the combination of consumer AR/VR headsets with web streaming to enable at least basic teleop from anywhere to anywhere with good internet, right now — although latency won't be the best.
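One common engineering pattern at the command level, regardless of transport, is to sequence-number and timestamp every message, drop anything stale or out of order, and fall back to a safe "hold" command when the stream stutters. Here's a minimal sketch; the class, threshold, and command format are illustrative, not taken from any of the systems above:

```python
import time

STALE_AFTER_S = 0.25  # illustrative staleness threshold; real systems tune this per task


class CommandGate:
    """Accepts timestamped teleop commands, rejects stale or out-of-order ones,
    and falls back to a safe hold command when the stream goes quiet."""

    def __init__(self, hold_command):
        self.hold_command = hold_command
        self.last_seq = -1
        self.last_cmd = hold_command
        self.last_recv_time = 0.0

    def receive(self, seq, sent_time, command, now=None):
        now = time.time() if now is None else now
        if seq <= self.last_seq:             # out-of-order packet: drop it
            return
        if now - sent_time > STALE_AFTER_S:  # too old to act on safely
            return
        self.last_seq = seq
        self.last_cmd = command
        self.last_recv_time = now

    def current(self, now=None):
        now = time.time() if now is None else now
        if now - self.last_recv_time > STALE_AFTER_S:
            return self.hold_command         # stream stalled: hold position
        return self.last_cmd
```

The control loop on the robot calls `current()` at a fixed rate, so a hiccup on the operator's link degrades into a brief freeze rather than the robot acting on a quarter-second-old command.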

Part of the solution here will be good, old-fashioned engineering, as it so often is. But other technologies could help too: for high-value applications like a live televised interview, companies like LTN offer high-performance overlay networking solutions that guarantee low latency.
Frameworks like shared autonomy can help as well: we can train a robot policy to seamlessly predict a human's intended actions from their motions. Changes to teleop strategy help too: we could send high-level goals like "move to this pose," leaving the execution details up to the robot. If I had to guess, this is how the teleoperation systems used by Tesla and Waymo work.
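In its simplest form, shared autonomy is a confidence-weighted blend between what the operator commands and what an assistive policy predicts. A toy sketch, where the function name and blending scheme are illustrative rather than from any published system:

```python
import numpy as np


def blend_action(human_action, policy_action, confidence):
    """Convex blend of operator input and assistive-policy output.

    confidence in [0, 1]: how much we trust the policy's prediction.
    0.0 -> pure teleoperation, 1.0 -> pure autonomy.
    """
    human_action = np.asarray(human_action, dtype=float)
    policy_action = np.asarray(policy_action, dtype=float)
    alpha = float(np.clip(confidence, 0.0, 1.0))
    return (1.0 - alpha) * human_action + alpha * policy_action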
For an illustration, check out this video from Zoox:
(watch the YouTube link for sound and more context)
Instead of using a steering wheel to drive the car remotely, the human operator draws a path for the vehicle to follow. This lets you add extra safeguards, and makes the whole system much more robust to network issues. There are equivalents to this system for robotic manipulation — I've even worked on a couple — although they end up being very complex in their own ways.
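The division of labor is easy to sketch: the operator transmits a sparse path once, and the robot densifies and executes it locally, so a laggy link only delays new goals rather than every control tick. A purely illustrative version:

```python
import numpy as np


def follow_path(start, waypoints, step_size=0.1, tolerance=0.05):
    """Greedy local execution of an operator-drawn path: march toward each
    waypoint in fixed-size steps. This loop runs entirely on the robot, so it
    keeps working even if the network link to the operator stalls mid-path."""
    pos = np.asarray(start, dtype=float)
    trajectory = [pos.copy()]
    for wp in waypoints:
        wp = np.asarray(wp, dtype=float)
        while np.linalg.norm(wp - pos) > tolerance:
            direction = wp - pos
            dist = np.linalg.norm(direction)
            pos = pos + direction / dist * min(step_size, dist)
            trajectory.append(pos.copy())
    return trajectory
```

In a real deployment the on-robot executor would be a proper local planner with obstacle avoidance and speed limits; the point is that only the sparse goals cross the network.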
Problem 2: Robot Control is Hard
So, let’s assume we put all this infrastructure in place; we can get relatively fluid commands from wherever in the world our teleoperators will sit, and stream them to the deployed robots. These commands will be coming from humans, in a variety of shapes and sizes, and sent to robots, with a decidedly different set of shapes, sizes, and mechanical properties.
We now have a second, very serious problem: the robot is not a human. It has different kinematics; its limbs are different lengths, and its joints behave differently. If you simply replayed a human demonstration, as many folks on Bluesky have suggested, you'd see the robot fall over because of these subtle differences. Balancing a humanoid robot through a complex dance routine is hard.
As an illustration, think about how often kids or teenagers trip or fall just because they’ve been growing a few millimeters and their model of their own body is slightly out of date. They fall all the time, even on flat ground! There’s a reason we have this mental image of the awkward teenager. And these issues are substantially simpler than the ones faced by a robot teleoperator: in the case of a growing child, the kinematics are almost identical. In the robot’s case, mechanical properties can be almost entirely different.
Instead, we need to build a model — from data — which can take human actions and retarget them to the robot. These actions can be captured from something like a Quest 3 or Apple Vision Pro headset, giving us hand positions for the robot to track and a head position for where it should look at any given time.
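At its crudest, position retargeting is a frame change plus a workspace scale, since robot arms rarely match human proportions. A toy sketch (the shoulder-anchored scaling is an illustrative simplification; real retargeting also handles orientation, joint limits, and dynamics):

```python
import numpy as np


def retarget_hand(human_hand, human_shoulder, robot_shoulder, scale=0.8):
    """Map a tracked human hand position into the robot's frame:
    express the hand relative to the human's shoulder, shrink the offset
    to the robot's (shorter) reach, then re-anchor at the robot's shoulder."""
    offset = np.asarray(human_hand, float) - np.asarray(human_shoulder, float)
    return np.asarray(robot_shoulder, float) + scale * offset
```

Even this crude mapping makes the key point: the robot tracks a *target* derived from the human, not the human's raw joint angles.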
But this raises a further issue: unless you have whole-body motion capture — like the Tesla operators seem to — then this is an underspecified problem. The headset is telling us where the hands and head should be, but not where the elbows or legs should go!
Thus, we have the very active sub-field of methods for teleoperation on humanoid robots.

Open TeleVision [2] and Mobile TeleVision [3] are examples. They observe that the upper-body control problem can be (at least in part) solved by inverse kinematics: given the 6-DoF spatial pose you need the end effector to reach, compute the joint positions necessary for the robot to achieve it. This doesn't solve the balancing problem, though. The rough strategy, then, is this:
Take a human motion dataset and retarget it to the robot. As mentioned above, this results in infeasible motions; if you executed them on hardware, the robot would fall over.
Train a Predictive Motion Prior (PMP) to predict where the human will reach next.
Separately, use Proximal Policy Optimization (PPO), a reinforcement learning method, to train a policy for the lower body so that it stays stable while performing these motions.
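The inverse-kinematics piece of this recipe can be sketched numerically: iterate damped-least-squares updates on the joint angles of a toy planar two-link arm until the hand reaches the target. Link lengths, damping, and iteration count here are arbitrary; real humanoid stacks use full-body solvers with joint limits and task weights.

```python
import numpy as np

L1, L2 = 0.3, 0.25  # toy link lengths in meters


def fk(q):
    """Forward kinematics of a planar 2-link arm: joint angles -> hand (x, y)."""
    x = L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1])
    y = L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])
    return np.array([x, y])


def jacobian(q):
    """Analytic Jacobian d(hand position)/d(joint angles)."""
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-L1 * s1 - L2 * s12, -L2 * s12],
                     [ L1 * c1 + L2 * c12,  L2 * c12]])


def ik(target, q0, damping=1e-2, iters=100):
    """Damped least-squares IK: repeatedly solve (J^T J + lambda*I) dq = J^T err."""
    q = np.array(q0, dtype=float)
    for _ in range(iters):
        err = np.asarray(target) - fk(q)
        J = jacobian(q)
        dq = np.linalg.solve(J.T @ J + damping * np.eye(2), J.T @ err)
        q += dq
    return q
```

The damping term is what keeps the solver well behaved near singular configurations (an outstretched arm, for instance), at the cost of slightly slower convergence.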
We also see work aimed at full autonomy, like TRILL [4], which uses deep imitation learning to perform mobile manipulation with humanoids, and improved 3D diffusion policies [5]. TWIST [6] is an impressive recent example of this sort of approach, using a large offline human motion dataset and simulation training to learn a dexterous, capable whole-body controller:
But note that while there's a lot of progress in this space, none of this can be considered solved yet. TWIST relies on an expensive motion capture setup [6] and lacks robot feedback; all of these methods remain dependent on the human data available and on the robustness of their underlying RL training. In the end, robot teleoperation can be nearly as challenging as autonomy, in its own way.
Interesting Examples of Teleoperation
Humanoid and self-driving robotics companies (Tesla and Waymo, for example) heavily use teleoperation for data collection and failure recovery
Chinese workers in Xinjiang teleoperating a robot in a mine
Watney Robotics is doing teleop for cleaning and organizing tasks
Reflex Robotics has some great examples of dexterous teleop doing challenging tasks like playing beer pong
Frodobots is doing teleop for robotic gaming
Hello Robot has a teleop platform for in-home autonomy for people living with disabilities
CMU researchers proposed FACTR [7], a low-cost teleop system that provides force feedback and force control. This kind of system could address one of the big limitations of current teleop infrastructure, which largely lacks force feedback. Check out the project website.
HOMIE is a teleop “exoskeleton” [8] for more precise control of a humanoid robot - possibly mitigating some of the issues we see with VR teleop systems. Another similar system is AirExo, a low-cost exoskeleton for in-the-wild robot teleop [9].
DexterityGen (DexGen) from Meta and Berkeley [10] enables human teleoperation of extremely precise, dexterous tasks like using a screwdriver.
Conclusions
Teleoperation is a powerful technology with real challenges, from networking infrastructure to actually controlling robots through difficult tasks. But it has a ton of potential, and very talented people all over the world are working to make it more of a reality every day.
This post barely scratches the surface of the work in teleoperation, its challenges, and how people aim to address them. I hope that this set of references to interesting papers, and the collection of links, is a useful starting point.
As always, if you like this type of content, please subscribe. If you have anything to add, or just want to share your thoughts, please leave a comment!
References
[1] Nechyporenko, N., Hoque, R., Webb, C., Sivapurapu, M., & Zhang, J. (2024). ARMADA: Augmented Reality for Robot Manipulation and Robot-Free Data Acquisition. arXiv preprint arXiv:2412.10631.
[2] Cheng, X., Li, J., Yang, S., Yang, G., & Wang, X. (2024). Open-television: Teleoperation with immersive active visual feedback. arXiv preprint arXiv:2407.01512.
[3] Lu, C., Cheng, X., Li, J., Yang, S., Ji, M., Yuan, C., ... & Wang, X. (2024). Mobile-television: Predictive motion priors for humanoid whole-body control. arXiv preprint arXiv:2412.07773.
[4] Seo, M., Han, S., Sim, K., Bang, S. H., Gonzalez, C., Sentis, L., & Zhu, Y. (2023, December). Deep imitation learning for humanoid loco-manipulation through human teleoperation. In 2023 IEEE-RAS 22nd International Conference on Humanoid Robots (Humanoids) (pp. 1-8). IEEE.
[5] Ze, Y., Chen, Z., Wang, W., Chen, T., He, X., Yuan, Y., ... & Wu, J. (2024). Generalizable humanoid manipulation with improved 3d diffusion policies. arXiv preprint arXiv:2410.10803.
[6] Ze, Y., Chen, Z., Araújo, J. P., Cao, Z. A., Peng, X. B., Wu, J., & Liu, C. K. (2025). TWIST: Teleoperated Whole-Body Imitation System. arXiv preprint arXiv:2505.02833.
[7] Liu, J. J., Li, Y., Shaw, K., Tao, T., Salakhutdinov, R., & Pathak, D. (2025). Factr: Force-attending curriculum training for contact-rich policy learning. arXiv preprint arXiv:2502.17432.
[8] Ben, Q., Jia, F., Zeng, J., Dong, J., Lin, D., & Pang, J. (2025). Homie: Humanoid loco-manipulation with isomorphic exoskeleton cockpit. arXiv preprint arXiv:2502.13013.
[9] Fang, H., Fang, H. S., Wang, Y., Ren, J., Chen, J., Zhang, R., ... & Lu, C. (2024, May). Airexo: Low-cost exoskeletons for learning whole-arm manipulation in the wild. In 2024 IEEE International Conference on Robotics and Automation (ICRA) (pp. 15031-15038). IEEE.
[10] Yin, Z. H., Wang, C., Pineda, L., Hogan, F., Bodduluri, K., Sharma, A., ... & Mukadam, M. (2025). DexterityGen: Foundation Controller for Unprecedented Dexterity. arXiv preprint arXiv:2502.04307.