AI.news
主页教程研究工具模型AI创业讨论新闻每日简报WIKI🚀 创业库★ 投稿
AI+医疗机器人教育金融能源健康娱乐思考

Unlocking Feature Learning in Gated Delta Networks at Scale

arxiv.org
分享

View PDF HTML (experimental)

Abstract:Training and scaling Large Language Models demand enormous computational resources, motivating both efficient sub-quadratic architectures and principled hyperparameter tuning methods. While the Maximal Update Parametrization ($\mu$P) has enabled zero-shot hyperparameter transfer for standard Transformers, its extension to linear models, particularly those with structured state transitions and complicated architectures, remains largely unexplored. By rigorously propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics, we derive the scaling rules for Gated Delta Network. Experiments on language-model pre-training confirm that our configurations enable stable learning-rate transfer across model widths under both AdamW and SGD, whereas standard parametrization fails to transfer, validating the correctness and practical utility of our analysis.

Submission history

From: Yifeng Liu [view email]
[v1] Tue, 2 Jun 2026 08:45:24 UTC (240 KB)