Forget fully autonomous AI-controlled robots for now. A study by Nvidia, UC Berkeley, and Stanford, reported by The Decoder, finds that even advanced language models such as Gemini-3-Pro and GPT-5.2 control robots worse than humans do when left to their own devices. Without human-written "building blocks" (predefined commands and abstractions), their reliability drops sharply even on simple manipulation tasks.

The models become effective only when they are given access to ready-made functions, such as "grasp object X and lift it." The AI's task then reduces to sequencing these functions correctly rather than solving every sub-task from scratch. Feeding raw video data directly to the models only degrades performance further.
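To make the idea of sequencing ready-made functions concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `SimulatedRobot` class, the `grasp`, `lift`, and `place` primitives, and the plan format are hypothetical and not taken from the study. The point is only that the model's output shrinks to an ordered list of calls into a human-written library.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class SimulatedRobot:
    """Toy stand-in for a robot (hypothetical): tracks object locations and the gripper."""
    holding: Optional[str] = None
    locations: Dict[str, str] = field(default_factory=lambda: {"cup": "table"})
    log: List[str] = field(default_factory=list)


# Human-written building blocks. Each one checks its own preconditions.
def grasp(robot: SimulatedRobot, obj: str) -> None:
    if robot.holding is not None:
        raise RuntimeError(f"already holding {robot.holding}")
    if obj not in robot.locations:
        raise KeyError(f"unknown object: {obj}")
    robot.holding = obj
    robot.log.append(f"grasp({obj})")


def lift(robot: SimulatedRobot, obj: str) -> None:
    if robot.holding != obj:
        raise RuntimeError(f"cannot lift {obj}: not grasped")
    robot.locations[obj] = "air"
    robot.log.append(f"lift({obj})")


def place(robot: SimulatedRobot, obj: str, target: str) -> None:
    if robot.holding != obj:
        raise RuntimeError(f"cannot place {obj}: not grasped")
    robot.locations[obj] = target
    robot.holding = None
    robot.log.append(f"place({obj}, {target})")


PRIMITIVES: Dict[str, Callable[..., None]] = {"grasp": grasp, "lift": lift, "place": place}

Plan = List[Tuple[str, Tuple[str, ...]]]


def run_plan(robot: SimulatedRobot, plan: Plan) -> SimulatedRobot:
    """Execute a plan that may only reference the predefined primitives."""
    for name, args in plan:
        if name not in PRIMITIVES:
            raise ValueError(f"unknown primitive: {name}")
        PRIMITIVES[name](robot, *args)
    return robot


# A plan of the kind a language model might emit: only the ordering is its job.
plan: Plan = [("grasp", ("cup",)), ("lift", ("cup",)), ("place", ("cup", "shelf"))]
robot = run_plan(SimulatedRobot(), plan)
```

Because each primitive validates its own preconditions, a badly ordered plan fails with an explicit error instead of producing silent misbehavior, which is one plausible reason such abstractions help.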

The researchers hypothesize that the issue lies in insufficient cross-modal alignment: foundation models are rarely trained to handle code and physical command execution at the same time. An intermediate "Visual Difference Module" proves much more effective than raw video input. This module describes the scene, extracts relevant properties, and registers changes after each step, providing structured text that the model uses to generate the next code block.
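The describe-then-diff idea can be sketched as follows. This is not the study's module: the real component presumably derives its descriptions from camera images, while this sketch assumes the scene has already been reduced to a hand-written dictionary of object properties. The names `describe_scene` and `diff_scenes` are hypothetical.

```python
from typing import Dict, List

# Assumed scene format: object name -> {property name: value}
Scene = Dict[str, Dict[str, str]]


def describe_scene(scene: Scene) -> List[str]:
    """Render a scene as sorted, structured text lines."""
    lines = []
    for obj in sorted(scene):
        props = ", ".join(f"{key}={value}" for key, value in sorted(scene[obj].items()))
        lines.append(f"{obj}: {props}")
    return lines


def diff_scenes(before: Scene, after: Scene) -> List[str]:
    """List what changed between two scene snapshots, one line per change."""
    changes = []
    for obj in sorted(set(before) | set(after)):
        if obj not in before:
            changes.append(f"{obj} appeared")
        elif obj not in after:
            changes.append(f"{obj} disappeared")
        else:
            for key in sorted(set(before[obj]) | set(after[obj])):
                old = before[obj].get(key)
                new = after[obj].get(key)
                if old != new:
                    changes.append(f"{obj}.{key}: {old} -> {new}")
    return changes


# Snapshots taken before and after a hypothetical "grasp cup" step.
before: Scene = {
    "cup": {"location": "table", "gripped": "no"},
    "block": {"location": "table"},
}
after: Scene = {
    "cup": {"location": "gripper", "gripped": "yes"},
    "block": {"location": "table"},
}

description = describe_scene(after)
report = diff_scenes(before, after)
```

The resulting `report` is short text that mentions only what changed, so it can be appended to the prompt for the next code block in place of raw frames.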

The key to making AI-controlled robots more reliable is "agent scaffolding": structuring tasks and giving the AI pre-established behavioral patterns. The approach borrows from software development and incorporates reinforcement learning, scaling compute to generate candidate solutions in parallel, self-correction, and automated debugging that accumulates reusable functions. Based on these principles, the CaP-X model was created, enabling a robot to operate according to a given "script" while adapting through AI.
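A rough sketch of such a scaffolding loop is shown below. It is an illustration under stated assumptions, not the CaP-X implementation: `fake_model` is a hypothetical stand-in for a language model, candidates are checked one after another rather than truly in parallel, and reinforcement learning is left out entirely. The loop samples several candidate programs, feeds the first error back for self-correction, and stores a passing solution in a reusable library.

```python
from typing import Dict, List, Optional, Tuple


def fake_model(task: str, feedback: Optional[str], n: int) -> List[str]:
    """Hypothetical stand-in for a language model that returns n candidate programs.

    Without feedback it returns a buggy draft; with feedback it returns a fix.
    """
    if feedback is None:
        return ["def solve(x):\n    return x + 1\n"] * n
    return ["def solve(x):\n    return x * 2\n"] * n


def check(source: str, cases: List[Tuple[int, int]]) -> Optional[str]:
    """Run a candidate against test cases; return an error message or None.

    Note: exec on generated code is unsafe outside a sandbox; this is a toy.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        for arg, expected in cases:
            got = namespace["solve"](arg)
            if got != expected:
                return f"solve({arg}) returned {got}, expected {expected}"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def scaffolded_solve(
    task: str,
    cases: List[Tuple[int, int]],
    library: Dict[str, str],
    rounds: int = 3,
    samples: int = 4,
) -> Optional[str]:
    """Reuse a stored solution if one exists; otherwise sample, test, and retry."""
    if task in library:
        return library[task]
    feedback: Optional[str] = None
    for _ in range(rounds):
        for candidate in fake_model(task, feedback, samples):
            error = check(candidate, cases)
            if error is None:
                library[task] = candidate  # accumulate the reusable function
                return candidate
            feedback = error  # passed to the model in the next round
    return None


library: Dict[str, str] = {}
solution = scaffolded_solve("double", [(2, 4), (3, 6)], library)
```

On a second call with the same task name, the function returns the stored solution without consulting the model, which is the "accumulation of reusable functions" in miniature.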

True AI autonomy in robotics will require not only better models but also significant preparatory work from businesses. Expect hybrid solutions in the coming years, in which AI helps control robots within human-defined structures and patterns rather than taking over control completely. Companies that want a real competitive advantage will therefore need to invest in building "AI infrastructure" and cultivating relevant expertise, rather than merely watching another technological marvel from the sidelines.

That means moving from passive observation to active implementation aimed at tangible business outcomes. The era of truly autonomous AI robots is not yet here; the immediate opportunity lies in augmenting human control with intelligent, structured AI assistance. Your company’s readiness to build this scaffolding will determine its success in this evolving landscape.
