Over the past decade, deep learning has grown enormously, both in the compute capacity of the hardware and in the ingenuity of the architectures that use it. Look a little closer, though, and the underlying design has stayed consistent in a few key areas. We’ve seen a massive shift from convolutional networks to the Transformer architectures that power today’s large language models, but the way these networks route information from one layer to the next hasn’t changed much.
Recently, researchers at DeepSeek-AI released a paper titled “mHC: Manifold-Constrained Hyper-Connections” (Xie et al., 2025b), which proposes a redesign of this routing system. To appreciate their solution, let’s look at how signal propagation has evolved over the past few generations of models, and why the current methods are hitting a wall.
First, to understand the specific problem the authors are trying to solve, we need to go back to where it all started: the standard residual connection (He et al., 2015). Introduced in 2015 with ResNets, the residual connection is arguably one of the most important architectural design choices, and it appears in nearly every modern deep network.

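Mathematically, it looks like this:

$$x_{l+1} = x_l + F(x_l, W_l)$$

Here x_l is the input to layer l, F is the layer’s computation (for example, an attention or MLP block), and W_l are its weights.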

It simply means that the final output of a layer is the sum of what the layer computed and the input it originally received. The key component here is the bare x_l term in the residual stream, which we call the identity mapping. It’s important because it acts as an uninterrupted pathway for the gradient signal to flow through the entire network from start to finish. This property is what keeps gradients from vanishing or exploding during training, and it lets us train models with hundreds of layers while still ensuring that each layer learns and updates effectively.
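To make the identity mapping concrete, here is a minimal sketch in plain Python. The names `residual_block` and `toy_layer` are hypothetical and chosen only for illustration; `toy_layer` stands in for a real attention or MLP block.

```python
def residual_block(x, layer_fn):
    """Return x + F(x): the layer's output added to its untouched input."""
    fx = layer_fn(x)
    return [xi + fi for xi, fi in zip(x, fx)]


def toy_layer(x):
    # Hypothetical stand-in for an attention or MLP block.
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
for _ in range(3):  # stack three residual blocks
    x = residual_block(x, toy_layer)
print(x)
```

Note that if the layer contributes nothing (it returns all zeros), the block passes its input through unchanged. That is the uninterrupted pathway the identity mapping provides.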
But as models have grown increasingly massive, we’ve started to hit the limits of this straightforward approach.
In a standard transformer model, we can imagine the residual stream as having a fixed width, which we can refer to as dimension C. Every piece of context, memory, and feature representation has to be crammed into this single C-dimensional vector as it moves up the network. As the layers make the information more abstract and expressive, this single x_l stream becomes an information bottleneck.
Typically, if you want to increase the representational capacity of the model, you have to increase the size of the computational layers or add more layers. But doing that also massively increases the compute required to run the model.
To address this limitation, researchers at ByteDance introduced an alternative to the vanilla residual stream, known as Hyper-Connections, or HC (Zhu et al., 2024).

If the normal residual streams are just too “thin”, HC widens them. Instead of relying on a single stream of width C, the idea is to expand the width of the residual stream by a specific factor, let’s say n. So what you now end up with is a wider vector composed of n parallel streams, resulting in a total width of n×C.
But since the actual computational layers of the model, like the Attention and MLP blocks, still expect a standard input with only C dimensions, HC introduces a set of learnable weights to convert the vector between the wide and narrow streams: a pre-mapping that reads the n streams down into a single layer input of width C, a post-mapping that writes the layer’s output back onto the n streams, and a residual mapping matrix that mixes the n parallel streams as the signal moves forward.
Fundamentally, by doing this, HC increases the network’s capacity and makes the residual stream more expressive. The residual mapping matrix lets the residual stream carry not only the unperturbed signal but also interactions between the parallel streams. It allows the model to maintain a much richer internal representation across multiple streams, without increasing the compute cost of the main layers.
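The widen-and-narrow idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the paper’s implementation: the names `h_pre`, `h_post` and `h_res` are assumptions, the weights are fixed numbers rather than learned parameters, and `toy_layer` is a hypothetical stand-in for an attention or MLP block.

```python
def hyper_connection_layer(streams, h_pre, h_post, h_res, layer_fn):
    """One HC-style update on n parallel streams, each a list of width C.

    h_pre  : n weights that read the streams into one layer input
    h_post : n weights that write the layer output back to each stream
    h_res  : n x n matrix that mixes the streams on the residual path
    """
    n, width = len(streams), len(streams[0])

    # Narrow: collapse the n streams into a single C-dimensional layer input.
    layer_in = [sum(h_pre[j] * streams[j][c] for j in range(n))
                for c in range(width)]
    layer_out = layer_fn(layer_in)

    new_streams = []
    for i in range(n):
        # Residual path: stream i becomes a weighted mix of all streams.
        mixed = [sum(h_res[i][j] * streams[j][c] for j in range(n))
                 for c in range(width)]
        # Widen: add the layer output back, scaled per stream.
        new_streams.append([m + h_post[i] * o
                            for m, o in zip(mixed, layer_out)])
    return new_streams


def toy_layer(x):
    # Hypothetical stand-in for an attention or MLP block.
    return [2.0 * xi for xi in x]


streams = [[1.0, 2.0], [3.0, 4.0]]   # n = 2 streams, C = 2
identity = [[1.0, 0.0], [0.0, 1.0]]
out = hyper_connection_layer(streams, [1.0, 0.0], [1.0, 0.0], identity, toy_layer)
print(out)
```

With an identity residual matrix and read/write weights that touch only the first stream, the update reduces to an ordinary residual connection on that stream, while the second stream passes through untouched. The layer itself only ever sees a width-C input, which is why the main compute cost does not grow with n.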
The reality of the situation, however, is that while HC looks great on paper, it introduces a couple of fatal flaws when you try to scale it up to the size of current LLMs. First, the residual mapping matrix is unconstrained, so when it is multiplied across many layers the signal can be amplified or attenuated without bound, which breaks the identity-mapping guarantee and destabilizes training. Second, widening the residual stream by a factor of n forces the memory hardware to read and write significantly more data at every single step. Since memory access, not the actual computation, is often the biggest bottleneck in modern AI training, this extra overhead tanks training throughput and spikes the GPU memory footprint by a substantial margin.
So, the researchers at DeepSeek were left with a very specific problem: how do you keep the expressive, wide streams of the HC paradigm without destroying the mathematical stability of the network, and without saturating GPU memory and I/O?
Let’s have a look at how they solved this.
To solve these two massive issues prevalent in HC, the DeepSeek team proposed a modified framework which they call Manifold-Constrained Hyper-Connections, or mHC.
The solution is broken down into two distinct parts. First, they had to fix the underlying math to stop the signal from exploding/vanishing. Second, they had to do some hardcore systems engineering to make sure the fix could actually run efficiently on modern GPUs. Let’s break down exactly how they did both of these.
The key mathematical insight here was to take that problematic, unconstrained residual mapping matrix and constrain it. To do that, they projected the matrix onto a specific mathematical space known as the Birkhoff polytope.
In simpler terms, they constrained the matrix so that it becomes a doubly stochastic matrix.
If you aren’t familiar with the term, a doubly stochastic matrix is a matrix where all the numbers are non-negative, and every row sums up to exactly 1, and every column also sums up to exactly 1.
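The definition translates directly into a small check. This is an illustrative helper in plain Python; the function name and the tolerance `tol` are assumptions made for the example.

```python
def is_doubly_stochastic(matrix, tol=1e-9):
    """True if all entries are non-negative and every row and column sums to 1."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return False  # the matrix must be square
    if any(v < 0 for row in matrix for v in row):
        return False  # negative entries are not allowed
    rows_ok = all(abs(sum(row) - 1.0) <= tol for row in matrix)
    cols_ok = all(abs(sum(matrix[i][j] for i in range(n)) - 1.0) <= tol
                  for j in range(n))
    return rows_ok and cols_ok


good = [[0.7, 0.3],
        [0.3, 0.7]]
rows_only = [[0.5, 0.5],
             [0.9, 0.1]]   # rows sum to 1, but columns sum to 1.4 and 0.6

print(is_doubly_stochastic(good))
print(is_doubly_stochastic(rows_only))
```

The second matrix shows why both conditions matter: normalizing the rows alone does not make the columns sum to 1.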

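By forcing the residual matrix into this specific format, the authors secured a few highly beneficial mathematical properties. The spectral norm of a doubly stochastic matrix is at most 1, so the mapping cannot blow the signal up. The product of two doubly stochastic matrices is again doubly stochastic, so the guarantee survives when the mappings are stacked across layers. And each output stream is a convex combination of the input streams, so the matrix mixes information across the n parallel streams without artificially amplifying the overall “energy” of the signal.
To actually turn a regular matrix into a doubly stochastic one during training, the researchers used the Sinkhorn-Knopp algorithm (Sinkhorn & Knopp, 1967). During the forward pass, the algorithm first makes all the numbers in the matrix positive, and then iteratively rescales the rows and columns until they all sum to 1.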
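The procedure just described can be sketched as follows. This is a minimal plain-Python illustration, not the paper’s GPU implementation: exponentiation is assumed here as one way to make every entry positive, and the iteration count `num_iters` is an assumed default.

```python
import math


def sinkhorn_knopp(logits, num_iters=20):
    """Approximately map a square matrix to a doubly stochastic one."""
    n = len(logits)
    # Step 1: make every entry strictly positive (assumed: exponentiation).
    m = [[math.exp(v) for v in row] for row in logits]
    for _ in range(num_iters):
        # Step 2: rescale each row so that it sums to 1.
        normalized = []
        for row in m:
            row_sum = sum(row)
            normalized.append([v / row_sum for v in row])
        m = normalized
        # Step 3: rescale each column so that it sums to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


raw = [[0.2, -1.0, 0.5],
       [1.5, 0.0, -0.3],
       [-0.7, 0.8, 0.1]]
ds = sinkhorn_knopp(raw)
print([round(sum(row), 4) for row in ds])                              # row sums
print([round(sum(ds[i][j] for i in range(3)), 4) for j in range(3)])  # column sums
```

Each column rescaling slightly disturbs the row sums and vice versa, which is why the two steps alternate. With a finite number of iterations the result is only approximately doubly stochastic, so the iteration count trades accuracy against compute.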
Solving the math on paper is great, but running all these wide streams and iterative Sinkhorn-Knopp calculations sounds like a nightmare for GPU memory. To get around this, the DeepSeek team implemented some aggressive infrastructure optimizations, built around custom GPU kernels and careful pipeline scheduling.
Ultimately, all this systems engineering pays off. Despite the added math and wider streams, mHC adds only a 6.7% time overhead during training compared to a standard baseline model.
To see whether all the math and systems engineering actually paid off, the DeepSeek team put mHC to the test. They trained several language models based on the DeepSeek-V3 architecture (DeepSeek-AI et al., 2024), scaling all the way up to a 27-billion-parameter model. They compared their new mHC framework directly against a standard residual baseline and the unconstrained, unstable HC paradigm. Let’s take a look at how the experiments played out.
The main motivation behind mHC was to mitigate the erratic training behaviour that was observed in HC due to the unconstrained mapping matrices. As shown below, the standard HC model’s gradient norm (graph b) starts to destabilize with wild swings at around 12k steps, which is exactly the moment where we see the HC and mHC loss plots drift apart (graph a). Because of the smoother and more stable gradient norms with mHC, the model ultimately achieves a lower final training loss when compared to the vanilla HC.

A stable model is only useful if it’s actually smarter. To show this, the authors evaluated the 27B variant across multiple downstream benchmarks, including MATH, MMLU, and reasoning tasks like BBH and DROP. The mHC-enabled model showed consistent gains over the baseline across the board, and it surpassed the unconstrained HC on a majority of benchmarks. The reasoning benchmarks saw a particularly nice gain, suggesting that the wider residual streams are actively contributing to a more expressive model.

An important test for any new deep-learning architectural paradigm is whether it obeys the established scaling laws. Some design choices that work for a 3B-parameter model might fail or backfire at 27B. To check this, the authors plotted the compute scaling curves for 3B, 9B, and 27B parameter models. The graphs below show that the relative loss improvement is maintained across all scales, supporting the claim that mHC is a scalable architectural upgrade.

As a final test, the authors checked one of their central claims directly: that the signal should not explode when the mappings are stacked across many layers. With the standard unconstrained HC, the signal can be amplified by a factor of around 3,000, which throws the gradients off completely during training. To see whether mHC fixes this, DeepSeek tracked the signal propagation dynamics layer by layer, and the results were as expected. Thanks to the doubly stochastic mapping matrices, the signal gain stayed at around 1.6 throughout the model, showing that the signal remains stable even after compounding across many layers.
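The compounding effect is easy to see on a toy example. The sketch below stacks the same 2×2 mixing matrix many times and measures the largest absolute row sum of the composite mapping, used here as an illustrative gain measure. Both matrices are hand-picked for the example and are not taken from the paper, so the numbers do not reproduce the 3,000 or 1.6 figures above.

```python
def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def max_abs_row_sum(m):
    # Illustrative gain measure: the largest absolute row sum.
    return max(sum(abs(v) for v in row) for row in m)


def stacked_gain(layer_matrix, depth):
    """Gain of the composite mapping after applying layer_matrix `depth` times."""
    composite = layer_matrix
    for _ in range(depth - 1):
        composite = matmul(layer_matrix, composite)
    return max_abs_row_sum(composite)


unconstrained = [[1.05, 0.05],
                 [0.05, 1.05]]       # every row sums to 1.1
doubly_stochastic = [[0.95, 0.05],
                     [0.05, 0.95]]   # every row and column sums to 1

for depth in (1, 30, 60):
    print(depth,
          stacked_gain(unconstrained, depth),
          stacked_gain(doubly_stochastic, depth))
```

A row sum only 10% above 1 compounds as 1.1 raised to the depth, which is already in the hundreds after 60 layers. The doubly stochastic matrix keeps its row sums at 1 no matter how many copies are multiplied together, because the product of doubly stochastic matrices is itself doubly stochastic.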

Before wrapping up, let’s discuss some of the flaws of mHC, as every engineering choice involves a trade-off. While mHC is a good answer to the instability of Hyper-Connections, it does come with a few caveats that are worth mentioning.
At the end of the day, the “mHC: Manifold-Constrained Hyper-Connections” paper is a substantial piece of research from DeepSeek. It highlights what it takes to push the boundaries of foundation models today: you need a deep understanding of the mathematics to diagnose the theoretical flaws, and you need hardcore systems engineering to make the solution run on physical silicon and be practically viable.
The standard residual connection has been incredibly useful for the last decade, but as we push into the trillion-parameter era, we need pathways that can carry much richer, wider representations without affecting the stability of the network. DeepSeek has demonstrated one way of achieving wider and more expressive pathways, and in doing so has innovated on a part of the architecture that was long treated as fixed.
As for adoption, will we see mHC accepted and implemented rapidly? Probably not. Because of its heavy reliance on custom GPU kernels and complex pipeline scheduling, it has a steep barrier to entry, and it will likely take some time before it is abstracted into an easy-to-use, plug-and-play module for the wider community. However, DeepSeek has already shown that it works at scale in their own highly competitive roster of models.
Given the clear improvements in reasoning benchmarks and training stability, I fully expect well-resourced AI labs to start experimenting with mHC in their next generation of architectures. It’s a big step forward, and it shows that there is still plenty of room to innovate on the most fundamental building blocks of neural networks.