MolmoMotion: How AI Forecasts 3D Movement with Language

In the rapidly evolving world of artificial intelligence, machines have mastered the art of perceiving motion, accurately tracking objects and points in video. However, perception is inherently retrospective: it tells us what has already happened. The true frontier lies in anticipating the future, allowing systems to predict how objects will move before the motion occurs.

Imagine a robot reaching for a cup, needing to foresee its exact path, or a video generator crafting physically plausible frames by understanding realistic motion. Predicting motion is a hard problem, yet for many real-world applications it is more useful than perception alone. This forward-looking vision inspired the creation of MolmoMotion.

Today, we’re thrilled to introduce MolmoMotion, a groundbreaking language-guided 3D motion forecasting model. Given a single video frame, specified 3D points on an object, and clear written instructions like “Move and rotate the wooden bowl with fruit on the table,” MolmoMotion predicts the future 3D trajectory of those points over several seconds. This innovative approach achieves substantially stronger performance than existing forecasting methods, paving the way for more intelligent and interactive AI systems.

Alongside the model, we are openly releasing two resources. The first is MolmoMotion-1M, the largest collection of 3D point trajectories paired with action descriptions, derived from 1.16 million videos. The second is PointMotionBench, a human-validated benchmark that measures object-centric 3D motion forecasting accuracy across 2,700 diverse video clips. These open resources underscore our commitment to fostering community research and development in this field.

How MolmoMotion Works: Understanding Future Movement

MolmoMotion employs a deliberate and highly efficient method to represent motion: as object-attached 3D points in world space. This representation captures complex movements without the substantial computational cost of rendering full video frames. This unique choice was driven by the need for a general motion representation that is flexible, stable across camera changes, and compact for direct application in downstream systems.

A sparse set of surface points can effectively describe rigid, articulated, and even (within certain limits) deformable motion, without making assumptions about the object type. Because these points exist within a shared world frame, their trajectories remain consistent regardless of camera movement or viewpoint shifts. Their compact nature as explicit 3D trajectories makes them ideal for direct integration with systems such as robotic policies or sophisticated video generation models.
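As a small illustration of why a shared world frame stays stable under camera motion, the sketch below views one fixed point from two camera poses. The camera model (yaw-only rotation plus translation) and every function name here are assumptions made for this example; they are not taken from MolmoMotion.

```python
import math

def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def world_to_camera(point, cam_yaw, cam_position):
    """Express a world-space point in the frame of a camera (hypothetical yaw-only model)."""
    shifted = tuple(p - c for p, c in zip(point, cam_position))
    return rotate_z(shifted, -cam_yaw)

def camera_to_world(point, cam_yaw, cam_position):
    """Inverse of world_to_camera: map a camera-frame point back to world space."""
    rotated = rotate_z(point, cam_yaw)
    return tuple(r + c for r, c in zip(rotated, cam_position))

# A point on a static object, fixed in world space (meters).
world_point = (1.0, 2.0, 0.5)

# Two camera poses: (yaw in radians, position in meters).
poses = [(0.0, (0.0, 0.0, 0.0)), (math.pi / 2, (3.0, -1.0, 0.0))]

# The camera-frame coordinates differ between the two poses...
camera_views = [world_to_camera(world_point, yaw, pos) for yaw, pos in poses]

# ...but mapping each view back with its own pose recovers the same world point.
recovered = [camera_to_world(view, yaw, pos)
             for view, (yaw, pos) in zip(camera_views, poses)]
```

The object did not move, yet its camera-frame coordinates changed with the viewpoint. In world coordinates the point is identical in both cases, which is the stability property described above.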

To forecast these intricate trajectories, MolmoMotion leverages Molmo 2 as its foundational backbone, enabling a seamless connection between language instructions and objects within an image. Starting with a brief video history, a descriptive action, and a set of query points with their initial 3D positions, the model first identifies the referenced object, the specific query points, and the motion described by the instruction. It then meticulously predicts the future 3D trajectory for each individual point, translating linguistic intent into precise spatial predictions.

To train MolmoMotion, we tackled the absence of large-scale, object-grounded 3D point trajectories paired with action descriptions. Existing 3D-track datasets were either too small or domain-limited, while vast internet videos lacked the necessary 3D annotations. Our solution was to build an automatic pipeline capable of extracting rich object-grounded 3D trajectories from unconstrained video at scale.

This pipeline takes an input video and its action description, then produces object-grounded 3D point trajectories in metric world coordinates. Raw tracks are noisy, since depth and tracking errors cause jitter, so the pipeline filters out inconsistent points, smooths the reliable trajectories, and segments clips to keep only the intervals where meaningful object motion occurs. This process yielded MolmoMotion-1M, the largest corpus of its kind, spanning 736 motion types and 5,600 distinct objects.
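The three cleanup steps can be pictured with a toy sketch. The article does not describe the pipeline's actual algorithms, so everything below is an assumption chosen for illustration: a jump threshold for filtering, a moving average for smoothing, and a minimum step size for finding the motion interval. All names and threshold values are hypothetical.

```python
import math

def step_sizes(track):
    """Euclidean distance between consecutive 3D positions of one track."""
    return [math.dist(a, b) for a, b in zip(track, track[1:])]

def is_consistent(track, max_step=0.2):
    """Keep a track only if no frame-to-frame jump exceeds max_step (assumed threshold, meters)."""
    return all(d <= max_step for d in step_sizes(track))

def smooth(track, window=3):
    """Centered moving average; the window shrinks near the ends of the track."""
    half = window // 2
    out = []
    for i in range(len(track)):
        lo, hi = max(0, i - half), min(len(track), i + half + 1)
        chunk = track[lo:hi]
        out.append(tuple(sum(p[k] for p in chunk) / len(chunk) for k in range(3)))
    return out

def motion_interval(track, min_step=0.01):
    """Return (start, end) frame indices from the first to the last step above min_step, or None."""
    moving = [i for i, d in enumerate(step_sizes(track)) if d > min_step]
    if not moving:
        return None
    return moving[0], moving[-1] + 1

# Two toy tracks: one plausible, one with a tracking glitch (a 1 m jump in one frame).
steady = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.05, 0.0, 0.0),
          (0.10, 0.0, 0.0), (0.15, 0.0, 0.0), (0.15, 0.0, 0.0)]
jumpy = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
         (0.1, 0.0, 0.0), (0.1, 0.0, 0.0), (0.1, 0.0, 0.0)]

kept = [t for t in (steady, jumpy) if is_consistent(t)]   # filter
cleaned = [smooth(t) for t in kept]                        # smooth
intervals = [motion_interval(t) for t in cleaned]          # segment
```

Only the steady track survives the filter. A real pipeline would work on many points per object and would need more robust statistics than a fixed threshold.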

Powering the Future: Impact on Robotics and Video Generation

To rigorously evaluate MolmoMotion’s forecasting prowess, we developed PointMotionBench, a human-validated benchmark of held-out 3D trajectories. This comprehensive benchmark includes 2,700 clips across 111 object categories and 61 motion types, encompassing diverse scenarios like indoor manipulation, egocentric hand-object interaction, and dynamic outdoor scenes. Models are tasked with predicting 3D point trajectories given an observation, query points, and an action description, with evaluation based on the accuracy of their predicted paths against actual future motion.
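The article does not name the metric PointMotionBench uses. One common way to score predicted paths against actual motion in trajectory forecasting is average displacement error (ADE), the mean Euclidean distance between predicted and true positions. The sketch below assumes that metric purely for illustration; the function name and data layout are hypothetical.

```python
import math

def average_displacement_error(predicted, actual):
    """Mean Euclidean distance over all query points and future timesteps.

    Both arguments are lists with one entry per query point; each entry is a
    list of (x, y, z) positions, one per future timestep.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must cover the same query points")
    total, count = 0.0, 0
    for pred_track, true_track in zip(predicted, actual):
        if len(pred_track) != len(true_track):
            raise ValueError("each track pair must have the same number of timesteps")
        for p, q in zip(pred_track, true_track):
            total += math.dist(p, q)
            count += 1
    if count == 0:
        raise ValueError("no positions to compare")
    return total / count

# Two query points, two future timesteps each (meters).
actual = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
          [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]]
# The first point is predicted exactly; the second is 0.5 m too high at both steps.
predicted = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
             [(0.0, 1.0, 0.5), (1.0, 1.0, 0.5)]]

ade = average_displacement_error(predicted, actual)  # (0 + 0 + 0.5 + 0.5) / 4 = 0.25
```

Lower is better, and a perfect forecast scores zero.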

Our experiments show that MolmoMotion outperforms every 3D motion forecasting method we tested on PointMotionBench, across a wide range of objects, scenes, and actions. The comparison covers pixel-space video generators, parametric 3D methods, and simple constant-velocity baselines. MolmoMotion forecasts varied motions, from a lint roller moving on cloth to a flamingo dipping its beak while walking, with predicted paths closely adhering to instructions and ground truth.
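For readers unfamiliar with the simplest of these baselines, a constant-velocity forecast extends the most recent observed displacement forward in time. The sketch below shows one straightforward form of it; the benchmark's exact baseline is not specified in the article, so the function name and the choice to use only the last two observations are assumptions.

```python
def constant_velocity_forecast(history, horizon):
    """Extrapolate a 3D point by repeating its last observed per-frame displacement.

    history: list of observed (x, y, z) positions, at least two.
    horizon: number of future timesteps to predict.
    """
    if len(history) < 2:
        raise ValueError("need at least two observations to estimate velocity")
    last, prev = history[-1], history[-2]
    velocity = tuple(a - b for a, b in zip(last, prev))
    return [tuple(p + v * step for p, v in zip(last, velocity))
            for step in range(1, horizon + 1)]

# A point moving 0.5 m per frame along x and 0.25 m per frame along z.
history = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.25)]
forecast = constant_velocity_forecast(history, horizon=3)
```

Such a baseline ignores the instruction and the object entirely, so it cannot anticipate a change of direction, a rotation, or a stop. That makes it a useful floor against which learned forecasters are compared.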

The knowledge MolmoMotion acquires about motion is highly transferable, making it a natural fit for robotics. Whether a human hand or a robot gripper is lifting a cup, the cup’s fundamental path through 3D space remains similar. This makes MolmoMotion invaluable for robot planning, where anticipating object movement is crucial before execution.

After fine-tuning on DROID, a large dataset of real-world robot manipulation videos, MolmoMotion predicts sensible object paths across various objects, camera viewpoints, scenes, and tasks. In simulation, a control policy powered by MolmoMotion succeeded in 76.3% of pick-and-place tasks, compared with 56.0% for a Molmo 2-based policy. It also learned substantially faster, reaching 51% success after just 10,000 training steps, whereas the Molmo 2 version had reached only 19% at that point.

MolmoMotion’s predicted paths can also revolutionize video generation. Instead of relying solely on an image-to-video model to infer motion from a text prompt, feeding in MolmoMotion’s precise predictions allows for generated video that follows requested actions much more closely. This is particularly impactful for small, precise movements that a generic prompt might describe only vaguely.

Empirically, using MolmoMotion to guide a video generator improves motion quality over the base model across all five motion-related metrics we measured. It even surpassed a much larger image-to-video model on four of these five metrics, demonstrating its profound impact on creating more accurate and controllable video content.

Looking Ahead: The Roadblocks and What’s Next

While MolmoMotion is a highly capable model, it’s important to acknowledge its current limitations. During training, it uses eight query points per object, which is sufficient for forecasting useful trajectories but may not densely represent complex surface geometry. This inherently limits the model’s ability to handle highly intricate deformable motion.

We firmly believe that forecasting—the ability to anticipate how objects will move—is as fundamental to advanced machine intelligence as perceiving existing motion. MolmoMotion represents a significant leap forward in this pursuit, offering 3D motion prediction that generalizes across object categories without requiring specific templates. Learned from ordinary video, it stands as the most accurate 3D motion forecaster we’ve measured on PointMotionBench to date.

We anticipate that this technology will unlock applications in robotics, video generation, and beyond. We encourage the community to explore MolmoMotion by downloading the weights, inspecting the training data, and evaluating their own methods on PointMotionBench.

Source: Hugging Face Blog

Kristine Vior

With a deep passion for the intersection of technology and digital media, Kristine leads the editorial vision of HubNextera News. Her expertise lies in deciphering technical roadmaps and translating them into comprehensive news reports for a global audience. Every article is reviewed by Kristine to ensure it meets our standards for original perspective and technical depth.
