Mixture-of-experts (MoE) models have quickly become a foundational component of modern, large-scale AI systems. They are widely adopted because they enable substantially larger model capacity while activating only a subset of parameters for each token, which makes it possible to scale performance within a practical compute budget. As model scales continue to grow, optimizing MoE blocks becomes critical for maximizing training throughput.
To push these boundaries, we are introducing advanced fused MLP kernels for dense and MoE models, custom-built with the NVIDIA CuTe DSL. By tackling inherent memory and synchronization bottlenecks, these new kernels deliver an impressive 1.3x–2x kernel-level speedup over unfused paths while enabling sync-free MoE execution for full-iteration NVIDIA CUDA graphs.
In the NVIDIA full-stack DeepSeek-V3 pre-training setup, this optimization contributes an 8% end-to-end performance improvement. Similarly for the GPT-OSS pre-training setup, this optimization contributes a 93% end-to-end performance improvement. Whether you want to slash training times or optimize hardware utilization, these kernels are available today in the NVIDIA cuDNN Frontend and can be seamlessly accessed through NVIDIA Transformer Engine and NVIDIA Megatron-Core.
To understand how, this post takes a systematic look at the three biggest bottlenecks in modern MoE blocks and at how we re-engineered the stack through hardware-aware software codesign to keep Tensor Cores continuously fed.
To maximize the throughput of MoE models, we first had to map out exactly where compute cycles are being spent. When we profiled the execution timeline of a standard training iteration within the MoE block, three system-level bottlenecks stood out:
We address these challenges by redesigning the MoE block around custom kernels written in CuTe DSL, and we introduce a family of three kernels for sync-free MoE:
The supported activation functions are SwiGLU, GeGLU, and sReLU, with optional clamping and scaling.

Gated linear units (GLUs) have become very popular, and most modern models use some GLU variant as the activation function, such as SwiGLU or GeGLU. These activation functions split the output of the FC1 layer into two chunks and combine them elementwise to produce the final GLU output. We implement a fused kernel that merges the GEMM with the corresponding GLU operation in both forward and backward propagation.
\(\text{SwiGLU}(x, W, V, b, c, \beta)=\text{Swish}_{\beta}(xW + b) \otimes (xV + c)\)
GLU activation functions aren’t trivial to fuse into the GEMM epilogue, because the GLU needs two different chunks of the FC1 output: input and gate. Typically, these two chunks are computed by different thread blocks, so combining them requires the kernel to write both to global memory first. To achieve this fusion, we repack the weights so that blocks of input columns and gate columns are interleaved. This ensures that the same thread block holds both a half tile-width of the input tensor and the matching half tile-width of the gate tensor, so the two can be combined in the epilogue without a round trip to global memory. The repack can happen once before training starts, during checkpoint loading.
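The repacking idea can be illustrated with a small pure-Python emulation. This is only a sketch under stated assumptions, not the CuTe DSL kernel: the function names (`repack_glu_weights`, `fused_gemm_swiglu`, `reference_swiglu`) are hypothetical, biases are omitted, \(\beta = 1\), and each loop iteration over a column tile stands in for one thread block. `W` and `V` follow the SwiGLU formula above.

```python
import math

def swish(a, beta=1.0):
    # Swish_beta(a) = a * sigmoid(beta * a)
    return a / (1.0 + math.exp(-beta * a))

def matmul(x, w):
    # x is M x K, w is K x N (lists of rows)
    return [[sum(xi[p] * w[p][j] for p in range(len(w)))
             for j in range(len(w[0]))] for xi in x]

def repack_glu_weights(W, V, half):
    """Interleave column blocks as [W0 | V0 | W1 | V1 | ...], each `half` wide."""
    n = len(W[0])
    assert n % half == 0
    packed = []
    for w_row, v_row in zip(W, V):
        row = []
        for s in range(0, n, half):
            row += w_row[s:s + half] + v_row[s:s + half]
        packed.append(row)
    return packed

def fused_gemm_swiglu(x, packed, half):
    """Tile-wise GEMM whose epilogue combines the two halves of each tile."""
    tile = 2 * half
    y = [[0.0] * (len(packed[0]) // 2) for _ in x]
    for t in range(len(packed[0]) // tile):        # one emulated thread block per tile
        cols = [row[t * tile:(t + 1) * tile] for row in packed]
        acc = matmul(x, cols)                      # accumulator tile, M x tile
        for i, acc_row in enumerate(acc):          # epilogue: no global-memory round trip
            for j in range(half):
                y[i][t * half + j] = swish(acc_row[j]) * acc_row[half + j]
    return y

def reference_swiglu(x, W, V):
    # Unfused path: two full GEMM outputs, then the elementwise combine
    a, g = matmul(x, W), matmul(x, V)
    return [[swish(ai) * gi for ai, gi in zip(ar, gr)] for ar, gr in zip(a, g)]

x = [[1.0, 2.0], [0.5, -1.0]]
W = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
V = [[1.0, -1.0, 2.0, -2.0], [0.5, 0.25, -0.5, -0.25]]
packed = repack_glu_weights(W, V, half=2)
fused = fused_gemm_swiglu(x, packed, half=2)
ref = reference_swiglu(x, W, V)
```

Because every tile of the packed weight matrix already contains matching input and gate columns, the emulated epilogue never needs data produced by another tile, and `fused` matches `ref`.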
Similarly, in the backward pass, the epilogue reads the GEMM output, calculates dSwiGLU, quantizes it, and writes it back to global memory.
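For reference, the SwiGLU gradient that such an epilogue has to evaluate follows from the forward definition by the product rule. The notation here is introduced only for illustration and is an assumption, not taken from the kernels: \(a = xW + b\), \(g = xV + c\), \(y = \text{Swish}_{\beta}(a) \otimes g\), \(dy\) is the incoming gradient, and \(\sigma\) is the logistic sigmoid, so that \(\text{Swish}_{\beta}(a) = a\,\sigma(\beta a)\).

\[
\frac{\partial L}{\partial g} = dy \otimes \text{Swish}_{\beta}(a),
\qquad
\frac{\partial L}{\partial a} = dy \otimes g \otimes \Big(\sigma(\beta a) + \beta a\,\sigma(\beta a)\big(1 - \sigma(\beta a)\big)\Big)
\]

Both terms are elementwise in \(a\), \(g\), and \(dy\), which is why they can be evaluated on the accumulator tile before anything is written out.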

Notably, these fusion patterns don’t just eliminate the reads and writes of intermediate tensors. They also improve utilization by overlapping the remaining memory operations with the GEMM itself.
Beyond core activation functions like SwiGLU, GeGLU, and sReLU, these kernels natively handle fused epilogue operations including feature scaling, tensor clamping, and bias vector additions.
Traditionally, the amount of work a kernel performs is defined by the block count at launch time, which requires shape information to be available on the host. For example, multi-stream grouped GEMM launches \(G\) different GEMMs on separate streams, where \(G\) is the number of groups. Because the number of tokens per group is determined at runtime, the CPU must launch these dynamically sized GEMMs on separate streams to maximize resource utilization.
This leads to two primary issues: First, the number of kernels to be launched scales with the number of local experts; and second, a synchronization point is mandatory to retrieve shape information on the host before kernel launch. To address these challenges, CuTe DSL GroupGEMM kernels track tokens per group within GPU memory itself. This eliminates CPU dependency during iteration and enables CUDA graphs across the entire iteration, effectively removing the CPU bottleneck.
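One way to picture device-side shape tracking is a persistent-style schedule: the launch grid is fixed, and each worker derives its tiles from a tokens-per-group array that never leaves the GPU. The following pure-Python sketch emulates that mapping. It is an illustration under assumptions, not the scheduling used by the CuTe DSL GroupGEMM kernels; `tile_schedule`, `tile_m`, and `num_ctas` are hypothetical names.

```python
def tile_schedule(tokens_per_group, tile_m, num_ctas):
    """Assign (group, tile_in_group) pairs to a fixed number of workers.

    Only `tokens_per_group` varies between iterations. The grid size
    (`num_ctas`) is fixed, so the host never needs to read the shapes.
    """
    # Prefix sum over the number of M-tiles each group needs
    tile_starts = [0]
    for tokens in tokens_per_group:
        tile_starts.append(tile_starts[-1] + (tokens + tile_m - 1) // tile_m)
    total_tiles = tile_starts[-1]

    work = {cta: [] for cta in range(num_ctas)}
    for cta in range(num_ctas):
        group = 0
        # Each worker strides over the flat tile index space
        for tile_id in range(cta, total_tiles, num_ctas):
            # Advance to the group that owns this tile, skipping empty groups
            while tile_id >= tile_starts[group + 1]:
                group += 1
            work[cta].append((group, tile_id - tile_starts[group]))
    return work

# Three experts, one of which received no tokens this iteration
schedule = tile_schedule(tokens_per_group=[5, 0, 9], tile_m=4, num_ctas=2)
```

With this kind of mapping, a single launch covers all groups, an expert with zero tokens simply contributes no tiles, and the same launch configuration can be replayed from a CUDA graph regardless of how tokens are routed.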
Lower-precision recipes such as MXFP8 and NVFP4 are increasingly popular for pretraining because they provide significant speedups with minimal impact on accuracy. In these low-precision recipes, the activation function is followed by quantization and a transpose to prepare operands for the narrow-precision GEMM.
For MXFP8, a standalone quantization kernel reads the output of the activation function (BF16) and writes the MXFP8 output along with a transposed version of the output for the backward pass. Our newly designed kernels fuse this quantization step into the GEMM kernel itself, eliminating the additional read and write of the BF16 tensor. Similarly, for NVFP4, the kernel produces the BF16 output and the per-tensor amax (absolute maximum) in the forward pass, and in the backward pass it calculates the amax of the transposed, Hadamard-rotated output. This eliminates the extra memory pass otherwise needed for the per-tensor amax calculation.
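To make the quantization step more concrete, the sketch below computes MX-style shared block exponents for a tile, once along rows and once along columns. It assumes the block-scaling rule described in the OCP Microscaling (MX) specification: blocks of 32 elements, a power-of-two scale, and E4M3 elements whose largest normal value is \(448 = 1.75 \cdot 2^{8}\). The function names and the handling of all-zero blocks are assumptions made for illustration, and the actual fused kernels may round or clamp differently.

```python
import math

BLOCK = 32       # elements that share one scale in MX formats
E4M3_EMAX = 8    # exponent of the largest E4M3 normal value (448 = 1.75 * 2**8)

def shared_exponent(block):
    """Power-of-two scale exponent for one block: floor(log2(amax)) - emax."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return -127  # assumption: use the smallest E8M0 exponent for all-zero blocks
    # frexp returns amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) == e - 1
    return (math.frexp(amax)[1] - 1) - E4M3_EMAX

def rowwise_and_colwise_exponents(tile):
    """Scale exponents for blocks taken along rows and along columns of `tile`."""
    rows = [[shared_exponent(r[s:s + BLOCK]) for s in range(0, len(r), BLOCK)]
            for r in tile]
    cols = [[shared_exponent(c[s:s + BLOCK]) for s in range(0, len(c), BLOCK)]
            for c in zip(*tile)]
    return rows, cols

# 32 x 32 tile of ones with a single large outlier at position (0, 0)
tile = [[1.0] * 32 for _ in range(32)]
tile[0][0] = 448.0
row_exp, col_exp = rowwise_and_colwise_exponents(tile)
```

In this example the element at row 0, column 1 falls in a row block with exponent 0 but in a column block with exponent -8. Under this scheme the transposed output is therefore not just a layout change of the row-wise output, which is one reason producing both from the same accumulator tile can save a pass over the BF16 tensor.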
Across unit-level microbenchmarks, these fused kernels deliver a substantial speedup—accelerating the forward pass by up to 1.3x and the backward pass by up to 2.1x compared to traditional unfused execution paths.
To translate these speedups into higher end-to-end training throughput, the kernels also support features such as:
In addition to the per-kernel speedups, these sync-free kernels enable end-to-end CUDA graphs and efficient overlap with communication kernels, so the speedup at the full application level is much larger. In internal testing, we see up to 8% end-to-end speedup on DeepSeek-V3 and up to 93% end-to-end speedup on GPT-OSS pre-training runs from these optimizations.
We are constantly adding new kernels and new features to the existing kernels.

These kernels can be used at different abstraction levels.
transformer_engine.pytorch.ops construct. These operations can be combined using the transformer_engine.pytorch.ops.Sequential block, which internally pattern matches the ops to invoke the fused kernel from the cuDNN frontend library.
We are actively working on new features, such as additional fusion patterns and support for more frameworks, including JAX.
Several kernel optimizations are also underway, including activation recompute, heuristics for picking the best kernels to compile, ahead-of-time (AOT) compilation to reduce compile cost, and lower CPU overhead.
If there is an activation function you would like supported, we encourage you to adapt the cuDNN kernels and contribute it through a PR, or to open an issue in cuDNN frontend so we can track the request.
Community feedback is very welcome!
Follow the steps on GitHub to see how to run these kernels.