Optimize PyTorch: How to Profile nn.Linear & Fused MLPs

Welcome back to our journey into the fascinating world of PyTorch profiling! In the first installment of this series, we delved into the basics, learning to interpret PyTorch profiler traces by examining a simple torch.add(torch.matmul(x, w), b) operation. We uncovered crucial concepts like the CPU dispatch chain, launch overhead, and the distinction between overhead-bound and compute-bound regimes, along with a peek into torch.compile internals.

Now, we’re taking things a step further. This blog post replaces that handcrafted matmul-add pair with PyTorch’s fundamental building block: nn.Linear (specifically, with bias=True). We’ll then stack three nn.Linear layers (this time with bias=False), together with an activation function, to construct a practical Multilayer Perceptron (MLP) block.

To follow along, you’ll find the scripts—02_linear.py, 03_simple_mlp.py, and 03_kernels_mlp.py—readily available. It’s highly recommended to open them in a separate tab and walk through the code as you read. All our experiments are conducted on an NVIDIA A100-SXM4-80GB GPU, easily accessible through Hugging Face infrastructure via Dev Mode with Spaces or the Hugging Face Jobs pipeline.

From Matmul-Add to nn.Linear: The Foundation

Let’s briefly recap the core idea we’ll be leaning on: nn.Linear is essentially a thin wrapper around the same matrix multiplication and addition operations we explored in Part 1. The key difference is that it owns its weight and bias as parameters and applies them in its forward method, which simplifies model construction significantly.

The underlying mathematical operation performed by nn.Linear with bias=True is y = x @ w.T + b, where x is the input, w is the weight matrix, and b is the bias vector. We can illustrate this with a simple code snippet:

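from torch import nn
linear_layer = nn.Linear(in_dim, out_dim, bias=True)  # weight: [out_dim, in_dim], bias: [out_dim]
y = linear_layer(x)  # x: [batch, in_dim] -> y: [batch, out_dim]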
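To make the formula concrete, here is a minimal sketch that recomputes the layer’s output by hand from its own parameters. The sizes are small illustrative values chosen for this example, not the ones used in 02_linear.py, and everything runs on the CPU.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch, in_dim, out_dim = 4, 8, 16  # illustrative sizes (assumption, not the blog's)
x = torch.randn(batch, in_dim)

linear_layer = nn.Linear(in_dim, out_dim, bias=True)
y = linear_layer(x)

# Recompute the same result by hand from the layer's own parameters.
w, b = linear_layer.weight, linear_layer.bias  # w: [out_dim, in_dim], b: [out_dim]
y_manual = x @ w.T + b

shapes = (tuple(w.shape), tuple(b.shape), tuple(y.shape))
matches = torch.allclose(y, y_manual, atol=1e-5)
print(shapes, matches)
```

Note that the weight is stored as [out_dim, in_dim], which is why the formula needs w.T.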

Running 02_linear.py with specific batch, input, and output dimensions, and then utilizing trace-util, allows us to inspect the profiler trace. We employ a similar schedule setup as before (wait=1, warmup=1, active=3), which is why you’ll observe three “Profile Steps” in both the CPU and GPU lanes of the trace.
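The three “Profile Steps” follow directly from the schedule. As a rough sketch, assuming the script builds its schedule with torch.profiler.schedule, you can evaluate the schedule function step by step and see which steps are recorded:

```python
from torch.profiler import ProfilerAction, schedule

# Same schedule parameters as quoted in the text.
sched = schedule(wait=1, warmup=1, active=3)

# The schedule is a plain callable: step index -> ProfilerAction.
actions = [sched(step) for step in range(5)]
for step, action in enumerate(actions):
    print(step, action.name)

# Steps that end up in the trace are the RECORD / RECORD_AND_SAVE ones.
recorded = [
    a for a in actions
    if a in (ProfilerAction.RECORD, ProfilerAction.RECORD_AND_SAVE)
]
print("recorded steps:", len(recorded))
```

The first step is skipped, the second warms the profiler up, and the remaining three are the ones that show up in the trace.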

The Transpose Mystery and Fused Kernels

Upon closer inspection of the profiler trace for the nn.Linear forward pass, you might notice an aten::t (transpose) operation preceding aten::addmm. nn.Linear stores its weight as [out_dim, in_dim], so it transposes the weight parameter before multiplying it with the input. However, it’s crucial to understand that aten::t does not launch a GPU kernel; it’s a CPU-side metadata operation that rewrites the tensor’s shape and strides, creating a “view” without copying data.
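You can check the “view, not copy” behaviour without a GPU. The sketch below uses a random tensor as a stand-in for the layer’s weight (the sizes are arbitrary) and compares storage pointers and strides before and after the transpose.

```python
import torch

w = torch.randn(16, 8)  # stand-in for nn.Linear.weight: [out_dim, in_dim]
wt = w.t()              # dispatches aten::t

# Same underlying memory: only shape and strides differ.
same_storage = wt.data_ptr() == w.data_ptr()

print("w :", tuple(w.shape), w.stride())
print("wt:", tuple(wt.shape), wt.stride())
print("shares storage:", same_storage, "| contiguous:", wt.is_contiguous())
```

The transposed tensor reports swapped shape and strides while pointing at the same memory, which is all aten::t does.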

Another interesting observation is the absence of a separate aten::add operation for the bias. This is because the bias addition runs as an epilogue of the matrix multiplication kernel: it is folded directly into the GEMM, avoiding a second kernel launch and an extra read and write of the output. aten::linear, called by nn.Linear, dispatches aten::addmm(bias, x, weight.t()), which uses a bias-add variant of the cuBLAS GEMM kernel. The addition happens in the matmul kernel’s writeback phase, so the output is written once with the bias already applied.
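Numerically, the fused call and the two-step version from Part 1 are the same computation. A small CPU sketch with arbitrary sizes and random tensors shows the three spellings agreeing:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 8)        # [batch, in_dim]
weight = torch.randn(16, 8)  # [out_dim, in_dim], as nn.Linear stores it
bias = torch.randn(16)       # [out_dim]

y_linear = F.linear(x, weight, bias)                       # what nn.Linear calls
y_addmm = torch.addmm(bias, x, weight.t())                 # bias + x @ weight.T in one op
y_two_step = torch.add(torch.matmul(x, weight.t()), bias)  # Part 1 style: matmul, then add

print(torch.allclose(y_linear, y_addmm, atol=1e-5))
print(torch.allclose(y_linear, y_two_step, atol=1e-5))
```

This only checks that the results match; whether the bias is actually fused into one GPU kernel is something you confirm in the trace.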

This insight leads to a subtle but important point: the addmm kernel you observed in Part 1 when using --compile is often what eager nn.Linear already utilizes. For a single GEMM-with-bias, torch.compile might have limited opportunities for further fusion, as the operation is already highly optimized. We’ll verify this by examining its impact on a single nn.Linear layer.

The Power of torch.compile on a Single Linear Layer

When comparing the eager and compiled profiler traces for a single nn.Linear layer’s forward pass, you’ll find remarkable similarities on the GPU side. This reinforces the idea that torch.compile often needs more than one operation to perform significant fusion. The key differences typically manifest on the CPU side.

The eager CPU dispatch chain for aten::linear includes aten::t followed by aten::addmm. As discussed, aten::t is a view operation that doesn’t copy data but rather swaps strides metadata. When compiled, Inductor traces through this view chain at compile time, precomputing the resulting strides. It then emits a direct aten::addmm call with these strides hard-coded, eliminating the CPU overhead of dispatching the view at runtime.

This means torch.compile doesn’t remove a GPU kernel; it removes CPU overhead. The GPU still executes the same cutlass_80_wmma_tensorop_bf16_s161616gemm_bf16_32x32_32x1_tn_align8 kernel. The _tn_ token near the end of the kernel name matters here: it indicates a “transposed” layout for the second input (the weight matrix). cuBLAS and CUTLASS precompile various kernel binaries for different input layout combinations, and the dispatcher selects the appropriate one based on input strides. Learning to decipher these kernel names is a highly valuable skill for comparing traces.
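When comparing many traces, it can help to pull the recognizable tokens out of kernel names programmatically. The helper below is a hypothetical convenience function written for this post, not part of PyTorch or CUTLASS. It only extracts tokens by pattern and makes no claim about what each one means beyond what is described above.

```python
import re

def parse_gemm_kernel_name(name: str) -> dict:
    """Extract a few recognizable tokens from a GEMM kernel name (hypothetical helper)."""
    layout = re.search(r"_(nn|nt|tn|tt)(?:_|$)", name)   # layout token such as "tn"
    align = re.search(r"align(\d+)", name)               # alignment token such as "align8"
    tiles = re.findall(r"(?<=_)\d+x\d+(?=_)", name)      # "AxB" tokens such as "32x32"
    return {
        "layout": layout.group(1) if layout else None,
        "align": int(align.group(1)) if align else None,
        "tile_tokens": tiles,
    }

name = "cutlass_80_wmma_tensorop_bf16_s161616gemm_bf16_32x32_32x1_tn_align8"
info = parse_gemm_kernel_name(name)
print(info)
```

Running the same helper over the kernel names from an eager trace and a compiled trace gives a quick way to confirm that both selected the same layout variant.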

Stacking Linears: Profiling the MLP

Now, let’s move to a more complex scenario: a Multilayer Perceptron (MLP). For added interest, we’ll profile a feed-forward network using the GeGLU activation variant, a common choice in modern deep learning. This showcases how individual optimized layers combine within a larger architecture.

Consider the structure of our SimpleGeGLUMLP module:

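self.gate_proj = nn.Linear(dim, hidden, bias=False)
self.up_proj = nn.Linear(dim, hidden, bias=False)
self.down_proj = nn.Linear(hidden, dim, bias=False)

Its forward pass applies F.gelu(g, approximate="tanh") to the gate projection output g, multiplies the result h element-wise with the up projection output u (h * u), and feeds the product to down_proj.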
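Putting those pieces together, the module might look like the sketch below. The three projections and the two pointwise operations come from the description above; the exact forward ordering and the variable names g, u, and h are my reading of it, so treat this as an approximation of 03_simple_mlp.py rather than a copy. The sizes dim=768 and hidden=3072 are inferred from the GEMM shapes discussed later in this post.

```python
import torch
from torch import nn
import torch.nn.functional as F

class SimpleGeGLUMLP(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden, bias=False)
        self.up_proj = nn.Linear(dim, hidden, bias=False)
        self.down_proj = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.gate_proj(x)              # GEMM 1
        u = self.up_proj(x)                # GEMM 2
        h = F.gelu(g, approximate="tanh")  # pointwise 1
        return self.down_proj(h * u)       # pointwise 2 (mul), then GEMM 3

mlp = SimpleGeGLUMLP(dim=768, hidden=3072)
x = torch.randn(2, 768)  # tiny batch so this runs quickly on CPU
with torch.no_grad():
    out = mlp(x)
print(tuple(out.shape))
```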

Before running 03_simple_mlp.py, let’s predict what we should see in the profiler trace. We expect three aten::linear dispatches, one for each nn.Linear layer. Additionally, we anticipate two pointwise kernel launches: one for the GeLU activation and another for the element-wise multiplication. Forming these expectations beforehand is crucial for effective profiling.

Indeed, our intuition holds true. For each forward pass of the MLP, the GPU executes exactly five kernels. You’ll observe “occupancy query” calls in the CPU lane for the linear projection layers, which are cuBLAS sizing the grid before launching the GEMM kernels. Pointwise operations like GeLU and multiplication launch directly without these queries.
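If you don’t have a GPU at hand, you can still check the dispatch-level prediction on the CPU. The sketch below uses a tiny stand-in module (hypothetical name and sizes) and counts operator calls with torch.profiler. It verifies the three aten::linear dispatches and the two pointwise ops, but it says nothing about CUDA kernels or occupancy queries, which only appear in a GPU trace.

```python
import torch
from torch import nn
import torch.nn.functional as F
from torch.profiler import ProfilerActivity, profile

class TinyGeGLUMLP(nn.Module):  # small stand-in for the MLP described above
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden, bias=False)
        self.up_proj = nn.Linear(dim, hidden, bias=False)
        self.down_proj = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        h = F.gelu(self.gate_proj(x), approximate="tanh")
        return self.down_proj(h * self.up_proj(x))

mlp = TinyGeGLUMLP(dim=64, hidden=256).eval()
x = torch.randn(8, 64)

with torch.no_grad():
    mlp(x)  # run once outside the profiler to get one-time setup out of the way
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        mlp(x)  # exactly one profiled forward pass

# Map operator name -> number of calls in the profiled region.
counts = {evt.key: evt.count for evt in prof.key_averages()}
for op_name in ("aten::linear", "aten::gelu", "aten::mul"):
    print(op_name, counts.get(op_name, 0))
```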

Decoding GEMM Kernels and Performance Differences

It’s important to reiterate that operations like aten::t, aten::transpose, aten::reshape, aten::view, and aten::as_strided launch zero GPU kernels. They register 0.000µs of CUDA time because they only modify tensor metadata on the CPU. A quick scan of the profiler table might show several op names per linear layer, but typically only one (the mm or addmm kernel) actually executes on the GPU.

Interestingly, while all three GEMMs in our MLP (gate_proj, up_proj, down_proj) perform the same number of floating-point operations (FLOPs), the down_proj GEMM is about 10% faster. The difference arises because down_proj has a different shape: its output dimension is N=768 instead of 3072 for the other two, with the reduction dimension K swapped accordingly. cuBLAS, in response, selects a different tile size (e.g., 128x256 with a deeper pipeline), which achieves better data reuse for that specific shape.
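The “same FLOPs” claim is easy to check with the usual 2 * M * N * K estimate for a GEMM. In the sketch below, dim and hidden follow the shapes mentioned above, while the token count M is a made-up value purely for illustration, since the equality holds for any M.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """FLOPs for an [m, k] x [k, n] matmul: one multiply and one add per term."""
    return 2 * m * n * k

dim, hidden = 768, 3072
tokens = 4096  # hypothetical M (batch * sequence length); any value works

gate_flops = gemm_flops(tokens, n=hidden, k=dim)  # [M, 768]  -> [M, 3072]
up_flops = gemm_flops(tokens, n=hidden, k=dim)    # [M, 768]  -> [M, 3072]
down_flops = gemm_flops(tokens, n=dim, k=hidden)  # [M, 3072] -> [M, 768]

print(gate_flops, up_flops, down_flops)
print("all equal:", gate_flops == up_flops == down_flops)
```

Swapping N and K leaves the product unchanged, so any runtime difference between the projections comes from how the kernel tiles the problem, not from the amount of arithmetic.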

This explains why the profiler table often displays two distinct GEMM rows: one for the 128x128 tile used by gate_proj and up_proj, and another for the optimized tile chosen by down_proj. Understanding these nuances in kernel selection and performance characteristics is key to advanced PyTorch profiling and optimization.

Source: Hugging Face Blog

Kristine Vior
