Hugging Face study looks at PyTorch gains from fused MLPs

Study finds limited compiler gains for single linear layers

A new Hugging Face analysis of PyTorch performance takes a closer look at when fusion actually helps. The second installment in the company’s profiling series moves from a simple matrix multiply and add example to nn.Linear, then builds a small multilayer perceptron, or MLP, to see how the framework behaves under eager execution and with torch.compile.

The post argues that a single linear layer is already optimized internally. In PyTorch, nn.Linear does not run the matrix multiplication and the bias addition as two separate GPU operations. Instead, it routes through a fused matrix multiplication path that folds the bias into the kernel’s writeback stage. The bias is therefore handled as an epilogue: a small calculation performed at the end of a general matrix multiply (GEMM) kernel, which avoids extra memory traffic.
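The equivalence described here can be checked numerically. The following is a minimal sketch, not the authors' script: the tensor sizes are hypothetical, and it runs on the CPU so that no GPU is required. It compares an nn.Linear forward pass against a single torch.addmm call and against an unfused matmul followed by an add.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
batch, in_features, out_features = 8, 16, 4

layer = nn.Linear(in_features, out_features)
x = torch.randn(batch, in_features)

with torch.no_grad():
    # Regular nn.Linear forward pass.
    y_linear = layer(x)

    # The same math as one call: bias + x @ weight^T.
    y_addmm = torch.addmm(layer.bias, x, layer.weight.t())

    # Unfused reference: a matmul, then a separate bias add.
    y_unfused = x @ layer.weight.t() + layer.bias

same_as_addmm = torch.allclose(y_linear, y_addmm, atol=1e-5)
same_as_unfused = torch.allclose(y_linear, y_unfused, atol=1e-5)
print(same_as_addmm, same_as_unfused)
```

The check only confirms that the three formulations agree numerically. Whether the bias is applied inside the GEMM kernel is something a profiler trace shows, not this comparison.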

The authors use profiler traces on an NVIDIA A100-SXM4-80GB GPU to show how the operation is executed. In the eager path, the trace includes a transpose step on the CPU before the multiply-add call. The writeup emphasizes that this transpose is not a data-moving operation. It changes tensor metadata, such as shape and stride, to present the same weight values in transposed form.
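The claim that a transpose only changes metadata can be illustrated with a small tensor. This sketch uses a hypothetical 3x4 weight matrix on the CPU. It shows that the transposed tensor points at the same memory and differs only in shape and stride.

```python
import torch

# Hypothetical 3x4 "weight" stored row by row.
w = torch.arange(12, dtype=torch.float32).reshape(3, 4)
wt = w.t()  # transposed view, no copy

# Same starting address means both tensors share the underlying data.
shares_storage = wt.data_ptr() == w.data_ptr()

print(w.shape, w.stride())    # original shape and stride
print(wt.shape, wt.stride())  # shape and stride are swapped in the view
print(shares_storage, wt.is_contiguous())

# Writing through the view changes the original tensor,
# because wt[0, 1] and w[1, 0] are the same element in memory.
wt[0, 1] = -1.0
print(w[1, 0].item())
```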

Why torch.compile changes little for one layer

One of the key conclusions is that torch.compile offers little benefit for a lone nn.Linear layer. The compiled and eager traces show the same GPU kernel and the same aten::addmm call. The compiler does add some CPU-side overhead of its own, but there is no extra kernel fusion opportunity because the linear layer is already using a fused implementation.
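A trace comparison of this kind can be reproduced with torch.profiler. The sketch below is an assumption-laden simplification: the helper name recorded_ops is hypothetical, the layer sizes are arbitrary, and it profiles CPU activity only so it runs without a GPU. The authors' traces were collected on an A100 and also show the GPU kernels.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity


def recorded_ops(fn, *args):
    """Run fn once under the profiler and return the set of recorded op names.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    with torch.no_grad(), profile(activities=[ProfilerActivity.CPU]) as prof:
        fn(*args)
    return {event.key for event in prof.key_averages()}


# Arbitrary sizes, chosen only for illustration.
layer = nn.Linear(16, 4)
x = torch.randn(8, 16)

eager_ops = recorded_ops(layer, x)

# List the ATen operators dispatched for one eager forward pass.
print(sorted(op for op in eager_ops if op.startswith("aten::")))
```

To compare against compiled mode, the same helper could be called with torch.compile(layer) after one warm-up call, on a machine where the compiler backend is available. Adding ProfilerActivity.CUDA to the activities list would be needed to see GPU kernels.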

The article frames this as an important reminder for users who may expect torch.compile to speed up every workload automatically. According to the authors, compilation is more likely to help when there are multiple operations that can be combined, rather than a single fused matrix multiplication with bias.

The post also explains what happens to the transpose step in compiled mode. The compiler traces through the view operation ahead of time and computes the resulting layout once, which reduces runtime dispatch work on the CPU. The GPU work stays the same, but the CPU no longer has to repeat the same sequence of metadata operations each time the model runs.

Moving from one layer to an MLP

The next step in the post is a small MLP made from stacked linear layers with an activation between them. That structure gives the compiler more room to optimize, since there are now multiple operations in sequence. The article says this is where a fused MLP implementation becomes more interesting, because adjacent operations can potentially be lowered more efficiently than isolated layers.
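A structure like the one described can be sketched in a few lines. The class name TinyMLP, the layer sizes, and the choice of ReLU are all assumptions made for illustration, since the summary does not state which activation or dimensions the authors use. The sketch also writes the forward pass out op by op, which makes the elementwise activation between the two addmm calls visible as a separate step.

```python
import torch
import torch.nn as nn


class TinyMLP(nn.Module):
    """Two linear layers with an activation in between (hypothetical sizes)."""

    def __init__(self, d_in=16, d_hidden=32, d_out=4):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.act = nn.ReLU()  # assumed activation, not confirmed by the source
        self.fc2 = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))


torch.manual_seed(0)
mlp = TinyMLP()
x = torch.randn(8, 16)

with torch.no_grad():
    y = mlp(x)

    # The same computation written as three explicit steps.
    h = torch.addmm(mlp.fc1.bias, x, mlp.fc1.weight.t())  # linear 1
    h = torch.relu(h)                                     # elementwise activation
    y_manual = torch.addmm(mlp.fc2.bias, h, mlp.fc2.weight.t())  # linear 2

matches = torch.allclose(y, y_manual, atol=1e-5)
print(y.shape, matches)
```

Wrapping the module with torch.compile(mlp) and profiling both versions is one way to check whether the activation step is actually combined with its neighbors on a given setup.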

To study that behavior, the authors provide scripts for the linear layer, a simple MLP, and a kernel-level MLP variant. They also point readers to Hugging Face tools for running the experiments, including Dev Mode with Spaces and the Jobs pipeline.

The broader message of the piece is that profiling matters as much as compilation. By looking at traces rather than assuming a compiler will help, developers can see which parts of a model are already fused, which are only metadata operations, and where extra optimization work is actually available.