
I recently ran an experiment: I gave my AI agent, OpenClaw, a physical robot arm to control. The results astonished me and challenged my assumptions about what current AI can do.
The agent quickly learned to configure the arm, use its vision system to perceive its surroundings, and grab objects. Most remarkably, OpenClaw even managed to train another AI model to perform specific pick-and-place tasks. Truly general artificial intelligence may still be far off, but these outcomes suggest we’re on the cusp of a significant robotics breakthrough.
Historically, training and controlling robots demanded specialized expertise and considerable time. However, the advent of sophisticated AI models is rapidly simplifying this complex process, making advanced robotics more accessible than ever before.
“AI-powered coding is super exciting because it has the potential to bridge the gap between conventional engineering methods, which are reliable but don’t generalize, and contemporary vision-language-action models, which generalize but are not yet reliable,” explains Ken Goldberg, a leading roboticist at UC Berkeley who champions this innovative approach.
My Hands-On Experiment with OpenClaw and LeRobot 101
To start the experiment, I acquired a prebuilt robot arm known as the LeRobot 101. The hardware is part of an open-source initiative by Hugging Face, designed to lower the barrier to entry for robotics enthusiasts and researchers alike.
The LeRobot 101 typically comes as a pair: a controller arm operated manually by a human and a follower arm, equipped with a camera, that mirrors those movements. This setup is ideal for training AI models, as they can learn to replicate human-demonstrated actions in response to visual cues from the camera.
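To make the idea of learning from demonstrations concrete, here is a deliberately tiny sketch. It is not how LeRobot trains its models: real systems fit neural networks to camera images, while this stand-in simply copies the action from the closest recorded demonstration. The observation format (an object's normalized position in the camera image), the two joint targets, and every name below are assumptions made for illustration.

```python
import math

# Hypothetical demonstrations: (observation, action) pairs.
# observation = (x, y) of the object in the camera image, normalized to 0..1
# action      = (shoulder_deg, elbow_deg) joint targets the human produced
DEMOS = [
    ((0.1, 0.2), (10.0, 40.0)),
    ((0.5, 0.5), (30.0, 60.0)),
    ((0.9, 0.8), (55.0, 85.0)),
]


def nearest_neighbor_policy(observation, demos=DEMOS):
    """Return the demonstrated action whose observation is closest to the new one."""
    def distance(demo):
        demo_observation, _ = demo
        return math.dist(observation, demo_observation)

    _, action = min(demos, key=distance)
    return action


if __name__ == "__main__":
    # A new observation near the second demonstration reuses its action.
    print(nearest_neighbor_policy((0.45, 0.55)))
```

A trained model generalizes between demonstrations rather than copying one of them, but the input and output are the same: what the camera sees goes in, and a motor command comes out.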
Initially, connecting and calibrating the LeRobot proved difficult, and one misstep nearly caused the motors to overheat. With help from OpenClaw and Codex, however, I was soon “vibe coding” a simple program that made the robot close its gripper when it detected a red ball.
Codex worked through the configuration needed to connect to the robot, and together we calibrated its joint positions. It then generated a Python script that used several libraries to identify and grip the target ball. Despite the occasional “hallucinations” that come with vibe coding, especially when dealing with diverse hardware, the results were impressive.
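The generated script is not reproduced here, so what follows is only a guess at its core logic: decide whether enough of the camera frame is strongly red, and close the gripper if so. The thresholds, the function names, and the synthetic test frame are all assumptions, and a real script would more likely lean on an image library than on plain Python lists.

```python
def is_red(pixel):
    """Treat a pixel as red when its red channel clearly dominates green and blue."""
    r, g, b = pixel
    return r > 150 and r - g > 60 and r - b > 60


def red_fraction(frame):
    """Fraction of pixels in a frame (rows of (r, g, b) tuples) that look red."""
    pixels = [pixel for row in frame for pixel in row]
    return sum(is_red(p) for p in pixels) / len(pixels)


def should_close_gripper(frame, min_fraction=0.02):
    """Assumed trigger rule: close once red covers at least 2% of the frame."""
    return red_fraction(frame) >= min_fraction


def make_test_frame(with_ball, size=64, radius=10):
    """Synthetic gray frame, optionally with a red disc in the center."""
    center = size // 2
    frame = []
    for y in range(size):
        row = []
        for x in range(size):
            inside = (y - center) ** 2 + (x - center) ** 2 <= radius ** 2
            row.append((220, 30, 30) if with_ball and inside else (128, 128, 128))
        frame.append(row)
    return frame


if __name__ == "__main__":
    print(should_close_gripper(make_test_frame(with_ball=True)))
    print(should_close_gripper(make_test_frame(with_ball=False)))
```

On real camera frames a rule this simple would be sensitive to lighting and to other red objects in view, which is one reason the next step was to train a model instead.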
Training an AI Robot Arm to See and Grab
While successfully gripping a red ball was a neat trick, it was merely the first step. My next objective was to have OpenClaw assist me in training a more sophisticated model to control the arm autonomously. We explored several training methodologies, with OpenClaw proving adept at guiding me through the iterations and meticulously monitoring the model’s error rate after each run.
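The article does not say which error metric was tracked, so the sketch below assumes one plausible choice: the mean absolute difference between the joint targets a model predicts and the ones a human demonstrated, computed on held-out examples after each training run. The run names and all numbers are made up for illustration.

```python
def mean_action_error(predicted, demonstrated):
    """Mean absolute difference between predicted and demonstrated joint targets."""
    total = 0.0
    count = 0
    for predicted_action, demo_action in zip(predicted, demonstrated):
        for p, d in zip(predicted_action, demo_action):
            total += abs(p - d)
            count += 1
    return total / count


# Hypothetical held-out demonstrations (two joints per action).
DEMONSTRATED = [[0.0, 1.0], [1.0, 0.0]]

# Hypothetical predictions from three successive training runs.
RUNS = {
    "run_1": [[0.5, 0.5], [0.5, 0.5]],
    "run_2": [[0.25, 0.75], [0.75, 0.25]],
    "run_3": [[0.125, 0.875], [0.875, 0.125]],
}

if __name__ == "__main__":
    errors = {name: mean_action_error(pred, DEMONSTRATED) for name, pred in RUNS.items()}
    for name, error in errors.items():
        print(name, error)
    print("best:", min(errors, key=errors.get))
```

Watching this number fall from run to run is a quick sanity check, though a low error on recorded data does not guarantee that the arm will succeed on the real task.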
Through this iterative process, the robot arm steadily improved and eventually learned to pick up various objects. The result showed how agentic assistance can make complex manipulation tasks achievable.
This innovative concept, where AI-powered coding serves as the operational blueprint for robots, was first introduced in a seminal 2022 research paper dubbed “code as policy.” Since then, AI’s prowess in code generation has surged, and the “code as policy” methodology has rapidly gained traction across numerous robotics labs.
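As a rough illustration of the idea, the sketch below shows the kind of program a language model might write when asked to move an object: ordinary code that calls a small set of robot primitives. The `FakeArm` class, its method names, and the heights are all hypothetical. It only records the calls it receives and does not correspond to any real robot API.

```python
class FakeArm:
    """Hypothetical stand-in for a robot API; it only logs the commands it gets."""

    def __init__(self):
        self.log = []

    def move_to(self, x, y, z):
        self.log.append(("move_to", x, y, z))

    def open_gripper(self):
        self.log.append(("open_gripper",))

    def close_gripper(self):
        self.log.append(("close_gripper",))


def pick_and_place(arm, pick_xy, place_xy, hover_z=0.10, grasp_z=0.02):
    """The 'policy': a plain function that sequences the arm's primitives."""
    arm.open_gripper()
    arm.move_to(*pick_xy, hover_z)   # approach from above
    arm.move_to(*pick_xy, grasp_z)   # descend onto the object
    arm.close_gripper()
    arm.move_to(*pick_xy, hover_z)   # lift
    arm.move_to(*place_xy, hover_z)  # carry to the target
    arm.move_to(*place_xy, grasp_z)  # lower
    arm.open_gripper()
    arm.move_to(*place_xy, hover_z)  # retreat


if __name__ == "__main__":
    arm = FakeArm()
    pick_and_place(arm, pick_xy=(0.2, 0.0), place_xy=(0.3, 0.1))
    for command in arm.log:
        print(command)
```

Because the policy is readable code rather than network weights, a person or another agent can inspect it, test it against a fake arm like this one, and edit it before anything moves.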
The “Code as Policy” Revolution and Future Outlook
To further advance this field, Ken Goldberg’s research group, in collaboration with experts from Nvidia, Carnegie Mellon University, and Stanford, recently developed CaP-X. This new benchmark is designed to rigorously measure the robotic coding capabilities of large language models.
Interestingly, CaP-X data reveals that Google’s Gemini model currently outperforms competitors like Claude or ChatGPT in programming robots. This advantage likely stems from Google DeepMind’s focus on training Gemini to be multimodal, granting it a more nuanced understanding of the physical world. Alongside the benchmark, the researchers also created CaP-Gym, a versatile environment that enables coding agents to control both simulated and real robots. Furthermore, they introduced CaP-Agent0, an agentic framework that significantly boosts the performance of these coding models, allowing them to surpass models trained for direct robot movement control on certain manipulation tasks.
Goldberg’s team is actively collaborating with Nvidia to explore the full potential of the “code as policy” approach. Spencer Huang, who has been instrumental in organizing internal hackathons at Nvidia to foster “vibe coding” in robotics, is working on a project with Goldberg. Their goal is to enhance the compatibility of this approach with a broader spectrum of robot software tools.
Huang envisions a future where “nearly anyone can get into robotics, which is the true holy grail.” He believes that enabling people to control robots through simple spoken or typed commands, or even by demonstrating an action, represents the “critical unlock for robots in society.” In that future, sophisticated robotic capabilities would no longer be confined to highly specialized engineers but would be accessible to a much wider audience.
Source: Wired – AI