Holo3.1: Local AI Agents Just Got a Major Upgrade

When we launched Holo3 last March, developers, enterprises, and partners quickly adopted our state-of-the-art computer-use model, integrating it into browser automation, business software, internal tools, and desktop applications. That adoption underscored the potential of universal computer-use agents, and it also showed us what needed to come next.

Our users, working across diverse workflows, made it clear that raw performance was only one piece of the puzzle. They needed consistent capabilities across desktop and mobile environments, straightforward integration with different agent frameworks, and flexibility in deployment, from cloud inference to fully local execution on end-user devices.

That’s why we’re introducing the Holo3.1 family. This release is designed to improve robustness across the three dimensions that matter most in real-world production: environments, agent frameworks, and deployment targets. For the first time, we’re also releasing quantized checkpoints optimized for local inference, in FP8, Q4 GGUF, and NVFP4 formats.

Holo3.1 moves us closer to our vision of truly universal computer-use agents. These are systems capable of operating across any environment, integrating into any agent stack, and running precisely wherever the workflow demands. It’s about putting powerful automation directly into the hands of users, wherever they are.

Elevating Universal Computer-Use Agents

Built on the Qwen family, Holo3.1 was engineered to improve reliability across the full range of environments where computer-use agents are deployed. While retaining Holo3’s state-of-the-art performance, we focused on a common challenge: performance transfer.

As teams transitioned Holo3 from evaluation to production, a consistent pattern emerged: exceptional performance in one setting didn’t always translate flawlessly to another. Mobile devices, alternative agent harnesses, and varied execution frameworks each introduced unique sources of distribution shift, demanding a more resilient solution.

Holo3.1 expands capabilities beyond traditional browser and desktop control, with major gains in mobile environments. On the AndroidWorld benchmark, the 35B-A3B model improves from 67% to 79.3%. The more compact 4B and 9B variants also improve, from 58% to 72%.

To better support teams integrating Holo into third-party agent stacks, Holo3.1 now natively supports function-calling protocols, building on the structured JSON outputs already available. With this change, function-calling and native execution perform at near parity on OSWorld and on our internal benchmark suite, which covers e-commerce, business software, and collaboration workflows.
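
As a rough illustration of what a function-calling integration can look like on the harness side, the sketch below defines one `click` tool in the JSON-schema style used by common chat-completion APIs, then validates and dispatches a tool call. The tool name, its parameters, and the `dispatch` helper are hypothetical and are not taken from the Holo3.1 documentation; the model's actual action space may differ.

```python
import json

# Hypothetical tool definitions in the JSON-schema style used by common
# function-calling APIs. Names and fields are illustrative assumptions,
# not the documented Holo3.1 action space.
TOOLS = {
    "click": {
        "type": "function",
        "function": {
            "name": "click",
            "description": "Click at pixel coordinates on the current screen.",
            "parameters": {
                "type": "object",
                "properties": {
                    "x": {"type": "integer"},
                    "y": {"type": "integer"},
                },
                "required": ["x", "y"],
            },
        },
    },
}


def click(x: int, y: int) -> str:
    # Stand-in for a real input-injection call (mouse click, screen tap).
    return f"clicked at ({x}, {y})"


HANDLERS = {"click": click}


def dispatch(tool_call: dict) -> str:
    """Validate a model-emitted tool call and route it to a local handler."""
    name = tool_call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    # Arguments usually arrive as a JSON-encoded string.
    args = json.loads(tool_call["arguments"])
    required = TOOLS[name]["function"]["parameters"]["required"]
    missing = [key for key in required if key not in args]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return HANDLERS[name](**args)


example_call = {"name": "click", "arguments": '{"x": 120, "y": 340}'}
print(dispatch(example_call))  # prints: clicked at (120, 340)
```

Validating the call before executing it keeps a malformed or unexpected action from reaching the device, which matters more once the same model is driven through several different harnesses.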

Holo3.1 also delivers a more than 25% improvement over Holo3 when evaluated within our own Holotab product harness. To further enable local and on-device inference, we’re also releasing new, smaller model sizes, including 0.8B, 4B, and 9B, suited to cost-effective, private deployments. These variants complement our larger 35B-A3B model, which continues to offer state-of-the-art performance for the most demanding tasks.

Unleashing Speed: Fast & Local AI Inference

With this release, we’re shipping our first quantized weights for Holo3.1. We’re starting with the 35B-A3B checkpoints, now available in FP8, Q4 GGUF, and NVFP4 formats. These formats are designed to bring the model closer to the user with minimal performance impact.
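
To give a sense of why these formats matter for local hardware, here is a back-of-the-envelope estimate of weight memory per format. It assumes that "35B" in the model name is the total parameter count (with "A3B" denoting roughly 3B active parameters per token, as in Qwen naming), and it uses nominal bit widths. Real checkpoints are somewhat larger because of quantization scales and layers kept in higher precision, so these figures are lower bounds, not measured file sizes.

```python
# Nominal bits per weight for each format (an approximation: real files
# also store scales and may keep some layers in higher precision).
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "NVFP4": 4, "Q4 GGUF": 4}

# Assumed total parameter count, read off the "35B" in the model name.
TOTAL_PARAMS = 35e9


def weight_gigabytes(n_params: float, bits: int) -> float:
    """Approximate weight storage in GB (10^9 bytes)."""
    return n_params * bits / 8 / 1e9


estimates = {
    fmt: weight_gigabytes(TOTAL_PARAMS, bits)
    for fmt, bits in BITS_PER_WEIGHT.items()
}

for fmt, gb in estimates.items():
    print(f"{fmt:8s} ~{gb:.1f} GB")
```

Under these assumptions the weights shrink from about 70 GB in BF16 to about 35 GB in FP8 and about 17.5 GB at 4 bits. In a mixture-of-experts model all experts still have to be resident in memory, so the total count, not the active count, drives the footprint.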

For NVFP4, we used NVIDIA’s Model Optimizer in a W4A16 configuration (4-bit weights, 16-bit activations), enabling fast local inference for computer-use agents with little degradation in model performance. On OSWorld, both FP8 and NVFP4 stay within about two points of the unquantized BF16 checkpoint.

The speedups are substantial. On a DGX Spark, the NVFP4 W4A16 configuration delivers 1.41 times the total token throughput of FP8 and 1.74 times that of BF16, which translates into faster responses on multi-step tasks.
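
The two reported ratios also imply a third. The short calculation below derives the FP8-over-BF16 throughput ratio and the relative time needed to process a fixed number of tokens. It assumes all three figures were measured under the same workload on the same DGX Spark, which the text suggests but does not state outright.

```python
# Reported total-token-throughput ratios for NVFP4 (W4A16) on DGX Spark.
NVFP4_OVER_FP8 = 1.41
NVFP4_OVER_BF16 = 1.74

# Implied FP8 vs BF16 ratio, assuming identical measurement conditions.
fp8_over_bf16 = NVFP4_OVER_BF16 / NVFP4_OVER_FP8

# Time for a fixed token budget scales as the inverse of throughput.
nvfp4_time_vs_bf16 = 1 / NVFP4_OVER_BF16
nvfp4_time_vs_fp8 = 1 / NVFP4_OVER_FP8

print(f"FP8 over BF16 throughput: {fp8_over_bf16:.2f}x")
print(f"NVFP4 time relative to BF16: {nvfp4_time_vs_bf16:.0%}")
print(f"NVFP4 time relative to FP8:  {nvfp4_time_vs_fp8:.0%}")
```

Under that assumption FP8 alone would be roughly 1.23 times faster than BF16, and NVFP4 would need about 57% of the BF16 time for the same token budget.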

We’re also releasing Q4 GGUF checkpoints, aimed at bringing computer-use agents to consumer hardware. The agent can run locally on a Windows or Mac machine, with the model executing either on the same device, including Apple Silicon, or on a DGX Spark within the local network.

This approach keeps execution fully private and local, with no data leaving the user’s network. On Spark, a combination of agent harness optimizations developed with NVIDIA and NVFP4 quantization delivers a compound ~2x end-to-end speedup over the FP8 baseline, cutting average step time from 6.8 seconds to 3.3 seconds. These improvements, alongside higher agent request rates across platforms, will be integrated into an upcoming desktop agent harness.
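
To put the step-time figures in task-level terms, the sketch below computes the end-to-end speedup and the wall-clock time for a multi-step task. The 50-step task length is a hypothetical example chosen for illustration, not a number from the benchmark.

```python
# Reported average step times on DGX Spark, in seconds.
FP8_BASELINE_STEP_S = 6.8
OPTIMIZED_STEP_S = 3.3  # NVFP4 plus harness optimizations

# Hypothetical task length, chosen only for illustration.
TASK_STEPS = 50

speedup = FP8_BASELINE_STEP_S / OPTIMIZED_STEP_S
baseline_task_s = FP8_BASELINE_STEP_S * TASK_STEPS
optimized_task_s = OPTIMIZED_STEP_S * TASK_STEPS
saved_s = baseline_task_s - optimized_task_s

print(f"End-to-end speedup: {speedup:.2f}x")
print(f"{TASK_STEPS}-step task: {baseline_task_s:.0f} s -> {optimized_task_s:.0f} s "
      f"(saves {saved_s:.0f} s)")
```

The ratio works out to about 2.06x, consistent with the "~2x" figure, and a 50-step task would drop from roughly 340 seconds to roughly 165 seconds.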

Get Started with Holo3.1

The Holo3.1 family is now available in four sizes: 0.8B, 4B, 9B, and 35B-A3B. We are also releasing FP8, NVFP4, and Q4 GGUF checkpoints for local and edge deployments.

You can read the technical details on our blog, explore the Holo Models API, or find our collection on Hugging Face. We look forward to seeing what developers build with Holo3.1.


Source: Hugging Face Blog

Kristine Vior

With a deep passion for the intersection of technology and digital media, Kristine leads the editorial vision of HubNextera News. Her expertise lies in deciphering technical roadmaps and translating them into comprehensive news reports for a global audience. Every article is reviewed by Kristine to ensure it meets our standards for original perspective and technical depth.
