Nemotron-Labs Diffusion: Speed-of-Light AI Text Generation

Large language models (LLMs) have revolutionized how we interact with AI, becoming indispensable for tasks ranging from code generation and math problem-solving to summarization and document understanding. While their capabilities are vast, many of these powerful models still generate text one token at a time in an autoregressive (AR) fashion. This traditional method, where each new token depends on its predecessors, has been incredibly successful, yet it carries inherent limitations that impact speed and flexibility.

The autoregressive approach is stable and simple, but it is a bottleneck for latency-sensitive applications. Each token requires a full forward pass through the model, so at small batch sizes the GPU often spends more time moving data through memory than doing computation. Furthermore, once an autoregressive model generates a token, it’s final; there’s no inherent mechanism for revision, so early mistakes can propagate through the rest of the generated text.
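A rough back-of-the-envelope sketch of why small-batch decoding tends to be memory-bound is given below. It assumes 16-bit weights, about 2 FLOPs per parameter per token, and a hypothetical accelerator with round-number specs (1e15 FLOP/s, 4e12 bytes/s) that are not taken from any real GPU. It also ignores KV-cache and activation traffic, so treat the output as an order-of-magnitude illustration only.

```python
def decode_pass_times(params, batch, tokens_per_pass,
                      peak_flops, mem_bandwidth, bytes_per_param=2):
    """Return (memory_time, compute_time) in seconds for one forward pass."""
    weight_bytes = params * bytes_per_param        # every weight is read once per pass
    flops = 2 * params * batch * tokens_per_pass   # ~2 FLOPs per weight per token
    return weight_bytes / mem_bandwidth, flops / peak_flops


# Hypothetical accelerator specs, chosen only to make the arithmetic easy.
PEAK_FLOPS = 1e15
MEM_BANDWIDTH = 4e12

if __name__ == "__main__":
    for tokens in (1, 32):
        t_mem, t_cmp = decode_pass_times(8e9, 1, tokens, PEAK_FLOPS, MEM_BANDWIDTH)
        bound = "memory" if t_mem > t_cmp else "compute"
        print(f"{tokens:>2} tokens/pass: memory {t_mem * 1e3:.2f} ms, "
              f"compute {t_cmp * 1e3:.3f} ms -> {bound}-bound")
```

Under these assumptions a pass that processes 32 tokens is still limited by the same weight-read time as a pass that processes one, which is the slack that parallel decoding tries to use.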

Breaking the Autoregressive Bottleneck in LLMs

Nemotron-Labs Diffusion is a family of language models built to remove that constraint. Rather than emitting one token at a time, these Diffusion Language Models (DLMs) generate multiple tokens in parallel, then iteratively refine them over several steps. This makes better use of modern GPUs, which improves runtime performance, and it adds the ability to revise previously generated tokens.

This generate-and-refine capability suits tasks such as text revision and fill-in-the-middle completion. It also gives developers built-in control over the inference budget: reducing the number of refinement steps reduces the compute spent on each response.
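The generate-and-refine loop with a step budget can be sketched as follows. This is a toy illustration, not the model's actual decoder: `toy_predict` is a hypothetical stand-in for a forward pass, the confidence values are invented, and the rule "commit everything on the last allowed step" is an assumption made so the budget is always respected. Unlike a real diffusion model, the sketch never revises a committed token.

```python
MASK = None  # placeholder for a position that has not been committed yet
TARGET = list("parallel")  # the 8 "tokens" the toy model will produce


def toy_predict(block):
    """Hypothetical forward pass: propose (token, confidence) per masked slot."""
    proposals = {}
    for pos, tok in enumerate(block):
        if tok is not MASK:
            continue
        left_ok = pos == 0 or block[pos - 1] is not MASK
        right_ok = pos == len(block) - 1 or block[pos + 1] is not MASK
        # Even slots are "easy"; odd slots become confident once neighbours exist.
        conf = 0.9 if pos % 2 == 0 or (left_ok and right_ok) else 0.5
        proposals[pos] = (TARGET[pos], conf)
    return proposals


def denoise_block(predict, block_len, threshold, max_steps):
    """Fill one block in at most max_steps passes; return (block, passes_used)."""
    block = [MASK] * block_len
    passes = 0
    for step in range(max_steps):
        if MASK not in block:
            break
        proposals = predict(block)
        passes += 1
        last = step == max_steps - 1
        accepted = [p for p, (_, c) in proposals.items() if last or c >= threshold]
        if not accepted:  # guarantee progress: take the single best proposal
            accepted = [max(proposals, key=lambda p: proposals[p][1])]
        for p in accepted:
            block[p] = proposals[p][0]
    return block, passes


if __name__ == "__main__":
    print(denoise_block(toy_predict, 8, threshold=0.8, max_steps=4))
    print(denoise_block(toy_predict, 8, threshold=0.8, max_steps=1))
```

With a budget of four steps the toy block finishes in two passes; with a budget of one it commits every proposal immediately. In a real model the smaller budget would mean committing low-confidence tokens, which is where the speed and quality trade-off comes from.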

Introducing Nemotron-Labs Diffusion: A New Paradigm for AI Generation

The Nemotron-Labs Diffusion family includes text models at 3B, 8B, and 14B scales, all released under the commercially friendly NVIDIA Nemotron Open Model License. An 8B vision-language model (VLM) is also available for research under the NVIDIA Source Code License.

NVIDIA is making both base models and instruction-tuned chat variants available across the lineup. The training code is being released through the NVIDIA Megatron Bridge framework, so developers can reproduce or extend the recipe themselves.

Unleash Unprecedented Speed with Flexible Generation Modes

Nemotron-Labs Diffusion is built on one design principle: autoregressive and diffusion generation should be capabilities of the same model, not separate model families. The result is three generation modes that developers can switch between, depending on application needs, without significant code changes.

Autoregressive Mode: Functions like a standard left-to-right LLM, ensuring compatibility with existing workflows and serving as a reliable correctness reference.
Diffusion Mode (FastDiffuser): Aimed at raw throughput, this mode generates text block by block, iteratively denoising one 32-token segment at a time. A confidence threshold determines which tokens are committed in each step.
Self-Speculation Mode (LinearSpec): Designed to balance speed and accuracy, this mode drafts a block bidirectionally using diffusion, then verifies it causally with AR decoding. At temperature 0 its output is identical to AR decoding, and it reaches roughly 865 tokens/second on a B200 GPU, about 4x the autoregressive baseline on the same hardware.
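The draft-then-verify idea behind self-speculation can be sketched with a toy greedy model. Everything here is hypothetical: `ar_next` stands in for greedy AR decoding, `toy_draft` stands in for the diffusion drafter and deliberately gets its fourth token wrong, and the verification loop stands in for what a real system would do in a single causal forward pass. The point is only to show why accepting the longest matching prefix and then substituting the AR token gives the same output as plain greedy decoding.

```python
VOCAB = 50


def ar_next(context):
    """Toy deterministic greedy next-token rule (stand-in for AR decoding)."""
    return (31 * sum(context) + len(context)) % VOCAB


def ar_generate(prompt, n):
    out = list(prompt)
    for _ in range(n):
        out.append(ar_next(out))
    return out[len(prompt):]


def toy_draft(context, k):
    """Hypothetical drafter: mostly right, but always wrong at draft index 3."""
    ctx, draft = list(context), []
    for i in range(k):
        tok = ar_next(ctx)
        if i == 3:
            tok = (tok + 1) % VOCAB  # injected drafting error
        draft.append(tok)
        ctx.append(tok)
    return draft


def verify(context, draft):
    """Keep the longest prefix the AR rule agrees with, then add the AR token."""
    ctx, accepted = list(context), []
    for tok in draft:
        expected = ar_next(ctx)
        if tok != expected:
            accepted.append(expected)  # correct the first mismatch and stop
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted


def self_speculative_generate(prompt, n, k):
    out, rounds = list(prompt), 0
    while len(out) - len(prompt) < n:
        out.extend(verify(out, toy_draft(out, k)))
        rounds += 1  # one draft plus one verification
    return out[len(prompt):len(prompt) + n], rounds


if __name__ == "__main__":
    prompt = [1, 2, 3]
    tokens, rounds = self_speculative_generate(prompt, n=12, k=8)
    print(tokens == ar_generate(prompt, 12), rounds)
```

Because every emitted token is one the AR rule would have chosen for its context, the result matches greedy AR output exactly, while each round yields several tokens instead of one.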

The Nemotron-Labs Diffusion 8B model reports a 1.2% higher average accuracy than Qwen3 8B. Measured by token decoding efficiency (tokens per forward pass, or TPF), diffusion mode reaches 2.6x the TPF of AR decoding, linear self-speculation reaches 6x, and quadratic self-speculation reaches 6.4x, all with comparable accuracy across the evaluated tasks. The speedups hold even for single queries or unpredictable batch sizes.
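To make the TPF figures concrete, the arithmetic below converts them into forward-pass counts for a response of a given length, taking AR decoding as exactly one token per pass. It also backs out the AR baseline implied by the "~865 tokens/second, roughly 4x" figure quoted above. The 1,024-token response length is an arbitrary example, and the calculation assumes the quoted ratios hold uniformly, which the article does not state.

```python
# TPF relative to AR decoding (1 token per forward pass), as quoted in the text.
TPF = {
    "autoregressive": 1.0,
    "diffusion": 2.6,
    "linear self-speculation": 6.0,
    "quadratic self-speculation": 6.4,
}


def forward_passes(num_tokens, tokens_per_pass):
    """Average number of forward passes needed to emit num_tokens."""
    return num_tokens / tokens_per_pass


def implied_ar_tokens_per_second(fast_tokens_per_second, speedup):
    """AR throughput implied by a faster mode's throughput and its speedup."""
    return fast_tokens_per_second / speedup


if __name__ == "__main__":
    for mode, tpf in TPF.items():
        print(f"{mode:>27}: {forward_passes(1024, tpf):7.1f} passes for 1024 tokens")
    print(f"implied AR baseline: {implied_ar_tokens_per_second(865, 4):.0f} tokens/s")
```

Note that TPF counts passes rather than seconds. A 6x TPF gain alongside a roughly 4x throughput gain is consistent with each pass in self-speculation mode costing more than a plain AR pass.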

Engineering the Future of Language Models: Training and Deployment

Historically, diffusion language models faced practical hurdles, including lower accuracy, training difficulties, and limited KV caching compatibility. However, recent breakthroughs, particularly the work on Efficient-DLM, demonstrated that pretrained AR models could be converted into efficient DLMs through continued pretraining and a block-wise attention mechanism. This crucial insight helped preserve AR capabilities while enabling KV-cache-friendly parallel decoding.
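One common way to realize block-wise attention is a block-causal mask: tokens attend bidirectionally within their own block and causally to all earlier blocks. The sketch below builds such a mask as an assumption about the general technique; the article does not spell out the exact mask used by Efficient-DLM or Nemotron-Labs Diffusion.

```python
def block_causal_mask(seq_len, block_size):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def causal_mask(seq_len):
    """Standard left-to-right mask, shown for comparison."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


if __name__ == "__main__":
    for row in block_causal_mask(6, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this mask no position ever attends to a later block, so the keys and values of a finished block do not change as new blocks are decoded. That property is what allows them to be stored in a KV cache. With a block size of 1 the mask reduces to the ordinary causal mask.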

Nemotron-Labs Diffusion builds directly on this foundation by adding diffusion capabilities to an existing AR model. The model was trained with a joint AR and diffusion objective, allowing it to retain what it learned as an AR model while gaining parallel drafting. It was pretrained on 1.3 trillion tokens from the NVIDIA Nemotron Pretraining datasets, followed by a supervised fine-tuning phase on 45 billion tokens from the NVIDIA Nemotron Post-training datasets.

Native support for deploying Nemotron-Labs Diffusion models is coming to the main branch of SGLang. Until then, inference support is available through a request on the project's GitHub tracker. With this integration, developers can serve the same checkpoint in all three generation modes by changing a single line in the algorithm configuration.

Nemotron-Labs Diffusion brings diffusion-style generation into a practical, developer-friendly form. With open models, AR compatibility, diffusion decoding, and self-speculative acceleration in a single checkpoint, developers can draft, refine, verify, and accelerate text generation without altering their existing applications. To get started, explore the Nemotron-Labs Diffusion model family, read the technical report, and try the available training recipe.

Source: Hugging Face Blog

Kristine Vior

With a deep passion for the intersection of technology and digital media, Kristine leads the editorial vision of HubNextera News. Her expertise lies in deciphering technical roadmaps and translating them into comprehensive news reports for a global audience. Every article is reviewed by Kristine to ensure it meets our standards for original perspective and technical depth.
