The primary bottleneck for modern Vision-Language-Action (VLA) models is a catastrophic data deficit. Some developers sidestep it by training on synthetic data, but teams that need real physical demonstrations hit a financial wall: gathering high-quality data currently requires either VR headsets or cumbersome teleoperation rigs that cost as much as a used car.

Researchers Om Mandhane and Bipin Yadav from the VESIT Institute have proposed a more elegant solution: using a standard smartphone as a 6-DoF (six degrees of freedom) controller. Their project, Phone2Act, leverages Google’s ARCore technology to track the device’s spatial movements and stream those coordinates directly to a robot. No proprietary sensors are required—just your phone and a few lines of code.
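The pose that ARCore reports is typically a unit quaternion plus a translation; before a robot can consume it, that pair has to be turned into a homogeneous transform in a common frame. The sketch below shows that conversion under stated assumptions; the function names and frame conventions are illustrative, not taken from the Phone2Act codebase.

```python
# Sketch: turn an ARCore-style pose (unit quaternion + translation)
# into a 4x4 homogeneous transform a robot-side node could consume.
# Names and conventions are illustrative, not from Phone2Act itself.

def quat_to_matrix(qx, qy, qz, qw):
    """3x3 rotation matrix (row-major nested lists) from a unit quaternion."""
    return [
        [1 - 2*(qy*qy + qz*qz), 2*(qx*qy - qz*qw),     2*(qx*qz + qy*qw)],
        [2*(qx*qy + qz*qw),     1 - 2*(qx*qx + qz*qz), 2*(qy*qz - qx*qw)],
        [2*(qx*qz - qy*qw),     2*(qy*qz + qx*qw),     1 - 2*(qx*qx + qy*qy)],
    ]

def pose_to_transform(tx, ty, tz, qx, qy, qz, qw):
    """4x4 homogeneous transform from a translation + unit quaternion."""
    r = quat_to_matrix(qx, qy, qz, qw)
    return [
        r[0] + [tx],
        r[1] + [ty],
        r[2] + [tz],
        [0.0, 0.0, 0.0, 1.0],
    ]
```

Streaming these transforms at the tracking rate (ARCore updates at roughly the camera frame rate) is what lets the phone stand in for a dedicated 6-DoF input device.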

The technical architecture of Phone2Act is built on ROS 2 and a series of bridge nodes, effectively turning the system into a universal remote. The developers successfully adapted the software for both the budget-friendly LeRobot SO-101 and the industrial-grade Dobot CR5 manipulator. One particularly clever design choice is the gripper control: rather than forcing the operator to tap the screen, the system uses the phone's physical volume rockers, letting the operator move the phone freely through space while clicking physical buttons, gamepad-style.
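The bridge logic described above boils down to two pieces of state: a latched reference pose, so that only relative phone motion drives the end-effector, and a gripper flag toggled by volume-key events. A minimal sketch, with hypothetical names (the real Phone2Act bridge runs as ROS 2 nodes):

```python
# Minimal teleop-bridge sketch: phone displacement since "clutch-in"
# drives the end-effector target; volume keys toggle the gripper.
# All class and method names here are hypothetical.

class TeleopBridge:
    def __init__(self, scale=1.0):
        self.scale = scale                 # phone-to-robot motion scaling
        self.ref_phone = None              # phone position at clutch-in
        self.ref_robot = (0.0, 0.0, 0.0)   # robot EE position at clutch-in
        self.gripper_closed = False

    def engage_clutch(self, phone_pos, robot_pos):
        """Latch both frames so only *relative* motion is streamed."""
        self.ref_phone = phone_pos
        self.ref_robot = robot_pos

    def target_from_phone(self, phone_pos):
        """Map phone displacement since clutch-in to a robot EE target."""
        if self.ref_phone is None:
            return self.ref_robot
        return tuple(
            r + self.scale * (p - p0)
            for r, p, p0 in zip(self.ref_robot, phone_pos, self.ref_phone)
        )

    def on_volume_key(self, key):
        """Illustrative mapping: volume-up closes, volume-down opens."""
        if key == "up":
            self.gripper_closed = True
        elif key == "down":
            self.gripper_closed = False
```

The clutch pattern is what makes a handheld controller practical: the operator can reposition their arm without dragging the robot along, the same way a mouse is lifted off the desk.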

The entire session, including RGB camera streams, is captured via Universal Recorder directly in the LeRobot format. This means the data is ready for neural network training immediately, with no additional preprocessing required.
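The exact schema is defined by the LeRobot library, but the gist is a flat, per-frame record that keeps observations and the commanded action time-aligned. The field names below are an assumption for illustration, not the library's exact schema:

```python
# Illustrative per-frame record in the spirit of the LeRobot format:
# synchronized observations plus the action the operator commanded.
# Field names are an assumption, not the library's exact schema.

def make_frame(episode_idx, frame_idx, timestamp, joint_pos, action):
    return {
        "episode_index": episode_idx,
        "frame_index": frame_idx,
        "timestamp": timestamp,          # seconds since episode start
        "observation.state": joint_pos,  # robot joint positions
        "action": action,                # commanded target for this step
    }

# An episode is the ordered list of such frames, recorded at a fixed
# control rate (e.g. 30 Hz to match the RGB stream).
episode = [
    make_frame(0, i, i / 30.0, [0.0] * 6, [0.0] * 6)
    for i in range(3)
]
```

Because the recorder emits this structure directly, an episode captured with a phone is indistinguishable, from the training pipeline's point of view, from one captured with a dedicated teleoperation rig.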

In benchmark tests using the GR00T-N1.5 model, the industrial Dobot CR5 achieved a 90% success rate in pick-and-place tasks after training on just 130 smartphone-recorded episodes. While accuracy is naturally tied to the quality of a specific phone’s IMU sensors—and Phone2Act isn't looking to unseat professional motion-capture systems worth tens of thousands of dollars—the project’s real value lies elsewhere.

Phone2Act represents a major leap for open-source robotics by radically lowering the industry's barrier to entry. When robot training data becomes crowdsourced, the scale of datasets depends on the number of available mobile devices rather than the size of a research lab's budget. It is a classic case of an accessible solution outperforming elite technology through sheer volume.

Tags: Robotics, Open Source AI, Computer Vision, Machine Learning, Phone2Act