Unlock Peak LLM Inference: Asynchronous Batching Explained

In the evolving world of Large Language Models (LLMs), efficiency is key. While the first post in this series delved into the fundamentals of continuous batching—a technique that dramatically boosts GPU utilization by packing requests tightly and eliminating wasted compute on padding—there’s another significant hurdle we need to overcome: synchronous processing. This often overlooked bottleneck can significantly drag down your inference performance.

An H200 GPU, for instance, costs around $5 an hour on inference endpoints. While this might seem affordable for a short burst, prolonged use quickly adds up, reaching roughly $120 for a full day. To maximize your investment and ensure your GPU is working tirelessly, we must address the issue of synchronous execution, where the CPU and GPU take turns, leading to substantial idle time. This article explores how asynchronous batching can solve this problem, unlocking a powerful performance boost for LLM inference.

The Hidden Cost of Synchronous Batching

By default, continuous batching operates synchronously. This means that while the GPU crunches numbers, the CPU waits. Conversely, when the CPU prepares the next batch—selecting requests, updating the KV cache, admitting new entries, and evicting completed ones—the GPU sits idle. These brief pauses, accumulating over hundreds of steps per second in a continuous batching loop, amount to significant wasted time.

We’ve observed that these idle gaps can account for nearly a quarter of the total runtime. To eliminate this inefficiency and ensure your GPU is always computing, we need to decouple CPU batch preparation from GPU batch computation. The solution? Asynchronous batching, which allows both components to work in parallel, keeping your GPU actively engaged.

How Synchronous Batching Works (and Fails)

Let’s break down the synchronous batching process. First, the CPU prepares a new batch, which involves intricate tasks like selecting requests, updating the KV cache table, and managing finished requests. Once this is complete, the prepared inputs are transferred to the GPU.

Next, the GPU executes its forward pass and samples a new token for each request. The results are then sent back to the CPU, informing it of the newly produced tokens, and the cycle repeats. The critical flaw here is that after the GPU finishes its computation, it goes idle, waiting for the CPU to complete its update step—sampling tokens, updating request states, and rescheduling the next batch. This sequential hand-off creates unnecessary delays.
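To make this hand-off concrete, here is a schematic sketch of one step of such a synchronous loop. The helper names (prepare_batch, sample_next_tokens, update_requests) are hypothetical placeholders for the scheduler logic, not actual transformers APIs:

    # One synchronous continuous-batching step (schematic; helpers are hypothetical).
    while active_requests:
        batch = prepare_batch(active_requests)                 # CPU works, GPU idle
        inputs = {k: v.to("cuda") for k, v in batch.items()}   # H2D transfer
        logits = model(**inputs)                               # GPU works
        next_tokens = sample_next_tokens(logits)
        next_tokens = next_tokens.cpu()                        # D2H transfer, blocks the CPU
        update_requests(active_requests, next_tokens)          # CPU works, GPU idle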

This inherent inefficiency means the CPU and GPU are never performing useful work simultaneously. For a single forward pass, the delay might seem negligible, but in high-throughput continuous batching, these idle periods quickly accumulate, leading to significant throughput loss. Our profiling shows that generating 8,000 tokens with a batch size of 32 using an 8B model results in 24% of the total generation time spent with an idle GPU. This means nearly a quarter of your GPU’s potential is wasted.

Imagine if we could eliminate this CPU overhead entirely: generation time would drop from 300 seconds to 228 seconds, a free 24% reduction with no changes to kernels or models, just smarter hardware coordination. The fundamental idea is simple: prepare batch N+1 on the CPU while batch N is computing on the GPU. But this seemingly simple concept involves several technical challenges, which we will tackle as we build asynchronous batching from the ground up, just as it was implemented in the transformers library.

Achieving Concurrency with CUDA Streams

Our ultimate goal is concurrent execution of CPU and GPU operations. To achieve this, we need a mechanism to categorize and manage operations so that the system understands which ones can run in parallel. This is where CUDA streams come into play. A CUDA stream is essentially an ordered queue of GPU operations—kernel launches, memory copies, synchronization barriers—that execute sequentially within that stream. However, operations in different streams are independent and can run concurrently.

When you launch multiple operations across different streams, they can start almost simultaneously. While there’s a slight CPU launch overhead for each operation (finding the kernel, issuing the call, transferring the command), the core benefit is concurrent GPU execution. Understanding when a stream is “flushed”—meaning all operations within it have completed—will be crucial for tracking progress in asynchronous workflows.
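As a minimal sketch (assuming a CUDA device is available), two independent matrix multiplications launched on two non-default streams can overlap on the GPU, while the CPU regains control as soon as each one is enqueued:

    import torch

    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")

    s1 = torch.cuda.Stream()
    s2 = torch.cuda.Stream()

    with torch.cuda.stream(s1):
        out1 = a @ a   # enqueued on s1; the call returns immediately
    with torch.cuda.stream(s2):
        out2 = b @ b   # enqueued on s2; may run concurrently with s1

    torch.cuda.synchronize()  # wait until both streams are flushed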

The Default Stream vs. Non-Default Streams

If you’ve never explicitly used CUDA streams in PyTorch, you might assume GPU operations are synchronous because the CPU typically waits for the GPU before you can use the results. This perception largely stems from the default stream. Any PyTorch operation without an explicit stream is scheduled on the default stream, which has a critical synchronizing property: it waits for all other streams to be flushed before it begins. Conversely, any other operation, regardless of its stream, waits for the default stream to be flushed before launching.

This synchronizing behavior effectively negates any attempts at concurrency. For example, transferring results from a default stream operation to the CPU will block the CPU until all GPU operations are complete, even if the transfer itself is designed to be non-blocking. To truly unlock concurrency and regain immediate CPU control after launching GPU work, we must exclusively use non-default streams. When a kernel launch or non-blocking memory copy is enqueued on a non-default stream, control returns to the CPU instantly, allowing it to continue processing while the GPU operates in the background.
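For illustration, here is a minimal sketch of this behavior: a copy enqueued on a non-default stream returns control to the CPU immediately, so we must synchronize explicitly before trusting the result (a pinned host buffer is assumed so the device-to-host copy can actually be asynchronous):

    import torch

    src = torch.randn(1024, 1024, device="cuda")
    dst = torch.empty(1024, 1024, pin_memory=True)  # pinned host buffer

    copy_stream = torch.cuda.Stream()
    with torch.cuda.stream(copy_stream):
        dst.copy_(src, non_blocking=True)  # enqueued on copy_stream; CPU keeps going

    # ... the CPU is free to do other work here ...
    copy_stream.synchronize()  # only now is dst guaranteed to hold the data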

From this point forward, we’ll assume all memory transfers between devices are non-blocking and will require explicit synchronization. For continuous batching, we can identify three distinct GPU operations: Host-to-Device (H2D) transfers (CPU to GPU), computation (forward pass), and Device-to-Host (D2H) transfers (GPU to CPU). Each of these requires its own non-default stream: the H2D stream, the compute stream, and the D2H stream.
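In PyTorch terms, creating them is a one-liner each (the variable names below are ours, chosen to match the prose):

    import torch

    h2d_stream = torch.cuda.Stream()      # host-to-device input transfers
    compute_stream = torch.cuda.Stream()  # forward pass and sampling
    d2h_stream = torch.cuda.Stream()      # device-to-host output transfers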

However, simply launching operations on separate streams isn’t enough. If we try to launch H2D, compute, and D2H operations on independent streams, they will all start almost simultaneously. The compute stream might run on old data, and the D2H stream might transfer uncomputed results because streams are independent and don’t automatically wait for each other. We need a mechanism to enforce sequential ordering across these independent streams.

Enforcing Synchronization with CUDA Events

To enforce the correct execution order between our streams, we use CUDA events. A CUDA event is a marker that can be recorded into a stream. When the GPU reaches this marker during execution, it signals that the event is complete. Other streams can then be instructed to wait for this event before proceeding with their next operation.

The two key operations are stream.record(event), which places the marker in a stream at its current position, and stream.wait(event), which blocks a stream from continuing until the event is marked complete. Crucially, wait blocks only the specific stream, not the CPU or other parallel streams. The CPU call returns immediately, allowing the CPU to continue its work while the GPU manages its internal dependencies.
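In PyTorch, the conceptual stream.record(event) and stream.wait(event) above map to torch.cuda.Event; a minimal sketch of chaining two streams might look like this:

    import torch

    h2d_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.Stream()
    h2d_done = torch.cuda.Event()

    cpu_batch = torch.randn(32, 4096, pin_memory=True)  # pinned, so the copy is async
    with torch.cuda.stream(h2d_stream):
        gpu_batch = cpu_batch.to("cuda", non_blocking=True)  # async H2D copy
        h2d_done.record(h2d_stream)        # completes once the copy has finished

    compute_stream.wait_event(h2d_done)    # stalls compute_stream, not the CPU
    with torch.cuda.stream(compute_stream):
        out = gpu_batch * 2                # guaranteed to see the transferred data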

Events in Continuous Batching

Applying this to continuous batching, the solution is straightforward. After enqueueing the H2D transfer, we call h2d_stream.record(h2d_done); this event will complete only when the transfer finishes. Before the forward pass, we call compute_stream.wait(h2d_done), ensuring computation doesn’t start until the data is ready. We repeat this for the D2H transfer: after the forward pass, we call compute_stream.record(compute_done), then instruct d2h_stream.wait(compute_done) before enqueueing the output transfer. This creates a meticulously ordered pipeline:

  • CPU prepares batch N+1.
  • CPU enqueues H2D transfer for batch N+1 on h2d_stream.
  • CPU records h2d_done event on h2d_stream.
  • CPU enqueues compute_stream.wait(h2d_done).
  • CPU enqueues forward pass for batch N+1 on compute_stream.
  • CPU records compute_done event on compute_stream.
  • CPU enqueues d2h_stream.wait(compute_done).
  • CPU enqueues D2H transfer for batch N+1 on d2h_stream.
  • CPU returns immediately to prepare batch N+2.

The CPU enqueues all these operations in rapid succession and then moves on, completely unblocked. The GPU then takes over, executing each stream in order as its dependencies (the events) are met: the CPU merely orchestrates the launches and remains free, while the GPU manages the ordered execution across its dedicated streams, ensuring continuous, efficient processing.
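Putting the pieces together, here is a condensed sketch of one launch step under the assumptions above. The helpers (prepare_batch, model, sample_next_tokens) are hypothetical placeholders for the scheduler and the model, not the actual transformers implementation:

    import torch

    h2d_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.Stream()
    d2h_stream = torch.cuda.Stream()

    def launch_step(batch):
        """Enqueue H2D -> forward -> D2H for one batch, returning without blocking the CPU."""
        h2d_done, compute_done, d2h_done = (torch.cuda.Event() for _ in range(3))

        with torch.cuda.stream(h2d_stream):
            # assumes pinned CPU tensors so the H2D copies are truly asynchronous
            inputs = {k: v.to("cuda", non_blocking=True) for k, v in batch.items()}
            h2d_done.record(h2d_stream)

        compute_stream.wait_event(h2d_done)      # never compute on stale inputs
        with torch.cuda.stream(compute_stream):
            logits = model(**inputs)             # hypothetical model call
            next_tokens = sample_next_tokens(logits)
            compute_done.record(compute_stream)

        d2h_stream.wait_event(compute_done)      # never copy uncomputed results
        with torch.cuda.stream(d2h_stream):
            tokens_cpu = torch.empty(next_tokens.shape, dtype=next_tokens.dtype,
                                     pin_memory=True)
            tokens_cpu.copy_(next_tokens, non_blocking=True)  # async D2H into pinned memory
            d2h_done.record(d2h_stream)

        return tokens_cpu, d2h_done              # CPU is already free to prepare the next batch

Only when the tokens of batch N are actually needed does the CPU call d2h_done.synchronize(), by which point it has typically already prepared and launched batch N+1.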

Source: Hugging Face Blog
