Boosting MoE Training Throughput with Advanced Fusion Kernels

Mixture-of-experts (MoE) models have quickly become a foundational component of modern, large-scale AI systems. They are widely adopted because they enable substantially larger model capacity while activating only a subset of parameters for each token, offering an unparalleled approach for scaling performance within a practical compute budget. As model scales continue to grow, the optimization of these blocks becomes critical for maximizing training throughput.

To push these boundaries, we are introducing advanced fused MLP kernels for dense and MoE models, custom-built with the NVIDIA CuTe DSL. By tackling inherent memory and synchronization bottlenecks, these new kernels deliver an impressive 1.3x–2x kernel-level speedup over unfused paths while enabling sync-free MoE execution for full-iteration NVIDIA CUDA graphs.

In the NVIDIA full-stack DeepSeek-V3 pre-training setup, this optimization contributes an 8% end-to-end performance improvement. Similarly, in the GPT-OSS pre-training setup, it contributes a 93% end-to-end performance improvement. Whether you want to cut training times or improve hardware utilization, these kernels are available today in the NVIDIA cuDNN Frontend and can be accessed through NVIDIA Transformer Engine and NVIDIA Megatron-Core.

To understand how, this post walks through the three biggest bottlenecks in modern MoE blocks and shows how we re-engineered the stack through hardware-aware software codesign to keep Tensor Cores continuously fed.

Overcoming training bottlenecks in the MoE block

To maximize the throughput of MoE models, we first had to map out exactly where compute cycles are being spent. When we profiled the execution timeline of a standard training iteration within the MoE block, three system-level bottlenecks stood out:

Activation bottlenecks: Activation functions typically result in memory-bound kernels with large tensor reads and writes, leaving Tensor Cores underutilized during these intervals.
CPU overhead: With routed experts, the number of tokens per expert is only known at run time and is typically computed on the CPU. If the CPU cannot keep up with the GPU, the CPU operations become exposed. This calls for kernels that need no CPU synchronization or intervention.
Quantization cost: Just like activation functions, quantizing tensors from high precision to lower precision results in memory-bound kernels that leave the Tensor Cores idle.

We address these challenges by redesigning the MoE block around custom kernels written in CuTe DSL. The redesign introduces a family of three kernels for sync-free MoE:

GroupGemm + Quantize
GroupGemm + Activation + Quantize/Transpose
GroupGemm + dActivation + Quantize/Transpose

The supported activation functions are SwiGLU, GeGLU, and sReLU, with optional clamping and scaling.

A flowchart illustrating kernel fusions in an MoE block. Shaded grey regions group the forward pass operations (FC1, Activation, and Quantize/Transpose) and the backward pass operations (FC2 dgrad, dActivation, and Quantize/Transpose) into single blocks. Arrows show the interconnected data flow between these fused kernels and the separate, unfused weight gradient blocks. *Figure 1. Fusing operations into a single custom kernel in the forward and backward pass with CuTe DSL*

Optimizing GLU activation functions via fused GEMM epilogues

Gated linear units (GLUs) have become very popular, and most modern models use some GLU variant, such as SwiGLU or GeGLU, as the activation function. These activation functions split the output of the FC1 layer into two chunks and combine them to produce the final GLU output. We implement a fused kernel that merges the GEMM with the corresponding GLU operation in both the forward and backward passes.

\(\text{SwiGLU}(x, W, V, b, c, \beta)=\text{Swish}_{\beta}(xW + b) \otimes (xV + c)\)
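
For reference, the Swish term in the expression above can be expanded with its standard definition, where \(\sigma\) is the logistic sigmoid. The symbol \(z\) is introduced here only for illustration and stands for the pre-activation \(xW + b\):

\(\text{Swish}_{\beta}(z) = z \cdot \sigma(\beta z), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}\)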

GLU activation functions are not trivial to fuse into the GEMM epilogue, because the GLU needs access to two different chunks of the tensor: the input and the gate. Typically, these two chunks are computed by different thread blocks, so combining them requires the kernel to write both outputs to global memory first. To enable the fusion, we repack the weights so that columns of input weights and gate weights are interleaved. This ensures that the same thread block has access to both a half tile-width of the input tensor and the matching half tile-width of the gate tensor, so the two can be combined in the epilogue without a round trip to global memory. The repack can happen once before training starts, during checkpoint loading.
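
The repacking idea can be illustrated with a small pure-Python emulation. This is a minimal sketch under stated assumptions, not the CuTe DSL kernel: the helper names (`interleave_columns`, `fused_glu`, `unfused_glu`), the `half_tile` parameter, and the exact column order inside a tile are hypothetical, and biases are omitted for brevity. Here `w_gate` plays the role of \(W\) and `v_input` the role of \(V\) in the SwiGLU expression.

```python
import math


def swish(z, beta=1.0):
    """Swish_beta(z) = z * sigmoid(beta * z)."""
    return z / (1.0 + math.exp(-beta * z))


def matmul(x, w):
    """Plain (rows x k) @ (k x cols) product on nested lists."""
    return [[sum(row[k] * w[k][j] for k in range(len(w)))
             for j in range(len(w[0]))] for row in x]


def interleave_columns(w_gate, v_input, half_tile):
    """Repack the two weight matrices into one (hypothetical layout).

    Every 2*half_tile-wide column tile of the result holds half_tile gate
    columns followed by the matching half_tile input columns.
    """
    n = len(w_gate[0])
    assert n % half_tile == 0, "column count must be a multiple of half_tile"
    packed = []
    for row_w, row_v in zip(w_gate, v_input):
        row = []
        for start in range(0, n, half_tile):
            row += row_w[start:start + half_tile]
            row += row_v[start:start + half_tile]
        packed.append(row)
    return packed


def fused_glu(x, packed, half_tile, beta=1.0):
    """Emulate one GEMM whose epilogue combines gate and input per tile."""
    acc = matmul(x, packed)  # stands in for the per-tile accumulators
    out = []
    for row in acc:
        out_row = []
        for start in range(0, len(row), 2 * half_tile):
            gate = row[start:start + half_tile]
            inp = row[start + half_tile:start + 2 * half_tile]
            # Both halves live in the same tile, so no global-memory round trip.
            out_row += [swish(g, beta) * u for g, u in zip(gate, inp)]
        out.append(out_row)
    return out


def unfused_glu(x, w_gate, v_input, beta=1.0):
    """Reference path: two separate GEMMs, then an elementwise GLU."""
    gate, inp = matmul(x, w_gate), matmul(x, v_input)
    return [[swish(g, beta) * u for g, u in zip(gr, ur)]
            for gr, ur in zip(gate, inp)]
```

Calling `fused_glu(x, interleave_columns(w, v, half_tile), half_tile)` should produce the same values as `unfused_glu(x, w, v)`, which is the property that lets the repack be done once ahead of time.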

Similarly, in the backward pass, the epilogue reads the GEMM output, computes dSwiGLU, quantizes the result, and writes it back to global memory.
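
The elementwise math of the backward epilogue can be sketched as follows. This is only an illustration of the standard SwiGLU derivative, assuming the forward gate and input pre-activations are available to the epilogue; the function names and scalar formulation are hypothetical, and quantization is left out.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def swiglu(gate, inp, beta=1.0):
    """Forward value: Swish_beta(gate) * inp."""
    return gate * sigmoid(beta * gate) * inp


def dswiglu(gate, inp, dout, beta=1.0):
    """Return (d_gate, d_inp) for out = Swish_beta(gate) * inp.

    dout is the incoming gradient, i.e. the value produced by the
    preceding dgrad GEMM in this sketch.
    """
    s = sigmoid(beta * gate)
    # d/dgate [gate * sigmoid(beta * gate)]
    dswish = s + beta * gate * s * (1.0 - s)
    d_gate = dout * inp * dswish
    d_inp = dout * gate * s
    return d_gate, d_inp
```

A finite-difference comparison against `swiglu` is a convenient way to sanity-check the two returned gradients.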

Diagram showing the SwiGLU fused-kernel data flow. Input and gate weights are swizzled into an interleaved layout so a single GEMM epilogue tile can access both paths, apply Swish to the gate values, and multiply them with the input values to generate the final SwiGLU output. *Figure 2. The input and gate weights are packed so that the thread block has access to both and can compute the SwiGLU output within the CUDA core*

Notably, these fusion patterns do more than eliminate the reads and writes of intermediate tensors. They also maximize utilization by overlapping any remaining memory operations directly with the GEMM itself.

Beyond core activation functions like SwiGLU, GeGLU, and sReLU, these kernels natively handle fused epilogue operations including feature scaling, tensor clamping, and bias vector additions.

Eliminating host-device synchronization and CPU launch overhead

Traditionally, the amount of work a kernel performs is defined by the block count at launch time, which requires shape information to be available on the host. For example, multi-stream grouped GEMM launches \(G\) different GEMMs on separate streams, where \(G\) is the number of groups. Because the number of tokens per group is determined at runtime, the CPU must launch these dynamically sized GEMMs on separate streams to maximize resource utilization.

This leads to two primary issues: First, the number of kernels to be launched scales with the number of local experts; and second, a synchronization point is mandatory to retrieve shape information on the host before kernel launch. To address these challenges, CuTe DSL GroupGEMM kernels track tokens per group within GPU memory itself. This eliminates CPU dependency during iteration and enables CUDA graphs across the entire iteration, effectively removing the CPU bottleneck.
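
The following pure-Python emulation sketches why keeping tokens-per-group on the device removes the host dependency. It is an assumption-laden illustration, not the actual kernel: the function names, the fixed pool of `num_workers` persistent workers, and the static round-robin tile assignment are all hypothetical stand-ins (the real kernels may schedule tiles differently). The point is that the launch parameters never depend on the routing result.

```python
def tiles_for_worker(worker_id, num_workers, tokens_per_group, tile_m):
    """Yield (group, row_start, row_count) tiles owned by one worker.

    worker_id, num_workers and tile_m are launch-time constants.
    tokens_per_group is only read inside the "kernel", so the host never
    needs to know its contents before launching.
    """
    tile_id = 0
    for group, tokens in enumerate(tokens_per_group):
        for row_start in range(0, tokens, tile_m):
            if tile_id % num_workers == worker_id:
                yield group, row_start, min(tile_m, tokens - row_start)
            tile_id += 1


def run_grouped_gemm(num_workers, tokens_per_group, tile_m):
    """Emulate a single fixed-size launch; return rows processed per group."""
    rows_done = [0] * len(tokens_per_group)
    for worker_id in range(num_workers):
        for group, _, rows in tiles_for_worker(
                worker_id, num_workers, tokens_per_group, tile_m):
            rows_done[group] += rows
    return rows_done
```

Running `run_grouped_gemm` with the same `num_workers` and `tile_m` on two different routing outcomes covers every token in both cases, including groups that receive no tokens. A launch with this property can be captured once and replayed.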

Fusing MXFP8 and NVFP4 quantization to reduce exposed memory overhead

Lower-precision recipes such as MXFP8 and NVFP4 are increasingly popular for pretraining, because they provide significant speedups with minimal impact on accuracy. In these low-precision recipes, the activation function is followed by quantization and transpose steps that prepare the operands for the narrow-precision GEMM.

For MXFP8, the quantization kernel reads the output of the activation function (BF16) and writes the MXFP8 output along with a transposed version of the output for the backward pass. Our newly designed kernels fuse this quantization step into the GEMM kernel itself, eliminating the additional read and write of the BF16 tensor. Similarly, for NVFP4, the kernel produces the BF16 output and the per-tensor amax (the maximum absolute value) for the forward pass, and for the backward pass, it calculates the amax of the transposed Hadamard rotation of the output. This eliminates the extra memory pass for the per-tensor amax calculation.
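
To make the quantization step more concrete, here is a minimal sketch of MX-style block scaling, in which 32 consecutive elements share one power-of-two scale and the scaled values must fit the FP8 E4M3 range. The scale-selection rule below (smallest power of two that brings the block amax within range) is one simple choice made for illustration, and the function names are hypothetical. It is not the logic of the fused kernels, and it does not encode actual FP8 bit patterns.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3
BLOCK = 32        # MX formats share one scale across 32 elements


def block_scales(values, block=BLOCK):
    """Return one power-of-two scale per block of consecutive values."""
    scales = []
    for start in range(0, len(values), block):
        amax = max(abs(v) for v in values[start:start + block])
        if amax == 0.0:
            scales.append(1.0)  # all-zero block: any scale works
            continue
        # Illustrative rule: smallest power of two with amax / scale <= 448.
        exponent = math.ceil(math.log2(amax / E4M3_MAX))
        scales.append(2.0 ** exponent)
    return scales


def rowwise_and_columnwise_scales(matrix):
    """Scales for the output and for its transposed copy.

    Assumption: the transposed copy is scaled along the other dimension,
    so both sets of scales can be derived from the same tile of values.
    """
    row_scales = [block_scales(row) for row in matrix]
    col_scales = [block_scales(list(col)) for col in zip(*matrix)]
    return row_scales, col_scales
```

In a fused epilogue, these per-block maxima would be taken over values that are still on chip, which is where the saving of one BF16 write and one BF16 read comes from.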

From kernel-level gains to pretraining speedups

Across unit-level microbenchmarks, these fused kernels accelerate the forward pass by up to 1.3x and the backward pass by up to 2.1x compared to traditional unfused execution paths.

To translate these kernel-level speedups into end-to-end training throughput, the kernels also support features such as:

Dynamic Scheduling to support efficient overlap with other kernels such as communication from expert parallelism, data parallelism, etc.
Configurable Cluster Margin to enable users to reserve a configurable margin of SM resources by limiting the kernel to fewer SMs, which leaves headroom for other kernels to launch and execute concurrently on the GPU.

Beyond the per-kernel speedups, these sync-free kernels enable end-to-end CUDA graphs and efficient overlap with communication kernels, so the speedup at the full application level is much larger. In internal testing, we see up to 8% end-to-end speedup on DeepSeek-V3 and up to 93% end-to-end speedup on GPT-OSS pre-training runs from these optimizations.

We are continually adding new kernels and new features for the existing kernels.

Bar chart comparing fused activation kernels with the unfused baseline kernels from Transformer Engine on NVIDIA GB200 across several activation-function patterns. The fused kernels improve performance in both forward and backward passes, with speedups reaching up to 1.3x for forward and 2.1x for backward. *Figure 3. Speedup on different activation function patterns on GB200. The baseline uses the optimized kernels from Transformer Engine*

How to use CuTe DSL fused kernels to your advantage

These kernels are available to use at different abstraction levels.

cuDNN Frontend (v1.23.0+): The kernels are housed in the cuDNN Frontend library. Users can install the library in their software stack and invoke the kernels directly from there. cuDNN Frontend also provides a wrapper for these kernels, which compiles the kernel on the first invocation and then reuses the cached object for subsequent calls. Users can either invoke the kernel directly or access it through the wrapper API. We are also actively working on bringing ahead-of-time (AOT) compilation support to the library for these kernels, so that they can be compiled into cubins and cached on disk.
Transformer Engine (v2.15+): Users can also use these kernels through Transformer Engine, which exposes these operations through the transformer_engine.pytorch.ops construct. These operations can be combined using the transformer_engine.pytorch.ops.Sequential block, which internally pattern matches the ops to invoke the fused kernel from the cuDNN Frontend library.
Megatron Core (26.04-alpha.rc2+): Users can also use these kernels through Megatron Core, where the features can be enabled by setting the right configuration knobs.

Flow diagram with four stacked NVIDIA software layers: cuDNN Frontend, Transformer Engine, Megatron Core, and Megatron Bridge. Arrows connect the layers and point to code examples showing how to invoke fused grouped MLP kernels through cuDNN, Transformer Engine, or Megatron Core. *Figure 4. Users can choose to integrate these fusion kernels from any of the different abstraction layers in the CUDA stack: cuDNN Frontend, Transformer Engine, or Megatron Core*

What’s next?

We are actively working on multiple new features, such as supporting more fusion patterns, and supporting more frameworks such as JAX.

Several kernel optimizations are also underway, including activation recompute, heuristics to pick the best kernels to compile, ahead-of-time (AOT) compilation to reduce the compile cost, and further reductions in CPU overhead.

If there is an activation function you would like to see supported, we encourage you to adapt the cuDNN kernels and contribute the change through a pull request. Alternatively, please open an issue in cuDNN Frontend so that we can track the feature request.

Community feedback is very welcome!

Getting started

Follow the steps on GitHub to see how to run these kernels.