Want to Accelerate LLM Training? Why Not Try Muon?

December 10, 2025

This article is a machine translation from Japanese. It may contain translation errors.

Hello, I’m Tokunaga from PredNext.

Today I’m talking about everyone’s favorite topic: optimizers. That said, it’s nearly impossible to keep up with all of them, so I’ve narrowed it down to a few Muon-related topics.

What is Muon?

First, a quick explanation of Muon. Muon is an optimizer for neural networks that approximately orthogonalizes the moving average of the gradients (the so-called momentum) using a technique called the Newton-Schulz iteration, and uses the result to update the parameters. Aside from the Newton-Schulz-based orthogonalization, Muon behaves like momentum SGD.
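To make that concrete, here is a minimal sketch of a Muon-style update step in PyTorch. It follows the spirit of the widely circulated reference implementation (a quintic Newton-Schulz iteration applied to the momentum of 2D weight matrices), but the exact coefficients, momentum handling, and scaling conventions differ between implementations, so treat it as an illustration rather than the canonical code.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G (push its singular values toward 1).

    The quintic coefficients below are the ones commonly seen in Muon
    implementations; other versions may use different values.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)        # Frobenius-normalize so singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                   # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(param: torch.Tensor, momentum_buf: torch.Tensor,
              lr: float = 0.02, beta: float = 0.95) -> None:
    """One Muon-style step: momentum -> approximate orthogonalization -> update."""
    momentum_buf.mul_(beta).add_(param.grad)     # plain momentum accumulation
    update = newton_schulz(momentum_buf)         # orthogonalize the momentum
    update = update * max(1.0, param.shape[0] / param.shape[1]) ** 0.5  # shape-based scaling
    param.add_(update, alpha=-lr)
```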

In LLM training, Adam and its refinements such as RAdam and AdamW have been the mainstream choices. Many new optimizers are proposed every year, but most never see practical use, and the de facto standard has remained Adam, proposed in 2014, along with its minor improvements.

While countless optimizers have disappeared along with their papers, Muon’s author took a different route: entering Muon in a competition called nanoGPT-speedrun. Perhaps because it drew attention with its overwhelming results there, several large LLMs have adopted Muon for training this year. As far as I know, three models have been trained with Muon: Kimi K2, GLM-4.5, and INTELLECT-3 (INTELLECT-3 uses GLM-4.5 as its pretrained base, so that one follows naturally).

With this many use cases accumulated, we have to acknowledge that Muon is an excellent optimizer. Given advantages such as needing roughly 2/3 as many steps to converge and consuming somewhat less optimizer memory than Adam/AdamW, adoption of Muon should continue to spread. Cases of training ResNet and ViT with Muon are also increasing, so it appears versatile enough to work not only on language data but on image data as well.

Aside: Trinity is also reportedly trained using Muon. The currently available Trinity Mini is a 26B model, so it’s not large enough to be compared with the above three, but Trinity Large, a 420B model, is scheduled to be released in January 2026.

Which Muon Should I Use?

Several implementations of Muon exist: KellerJordan/Muon, the implementation inside modded-nanogpt, the PyTorch implementation, the Optax implementation, and so on. For most use cases, the PyTorch or Optax implementation is the safest choice. The modded-nanogpt implementation has some pitfalls when training large models, such as not including weight decay and only offering \sqrt{\max(1, A/B)} for the scaling-factor calculation. For PyTorch, using adjust_lr_fn="match_rms_adamw" is a recommended first move.
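As a usage sketch: the typical pattern is to hand Muon only the 2D hidden weight matrices and keep AdamW for embeddings, output heads, and 1D parameters. I am assuming here that the PyTorch implementation is exposed as torch.optim.Muon and accepts the adjust_lr_fn keyword mentioned above; the class name, import path, and the other constructor arguments below are assumptions, so check the documentation of your PyTorch version before copying this.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self, vocab: int = 1000, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.hidden = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):
        return self.head(torch.relu(self.hidden(self.embed(x))))

model = TinyModel()

# Split parameters: Muon gets the 2D hidden weights, AdamW gets everything else
# (embeddings, the output head, biases and other 1D parameters).
muon_params, adamw_params = [], []
for name, p in model.named_parameters():
    if p.ndim == 2 and "embed" not in name and "head" not in name:
        muon_params.append(p)
    else:
        adamw_params.append(p)

# Assumed API: the class name and keyword arguments are guesses, apart from
# adjust_lr_fn, which is the option mentioned in the text above.
opt_muon = torch.optim.Muon(muon_params, lr=0.02, weight_decay=0.01,
                            adjust_lr_fn="match_rms_adamw")
opt_adamw = torch.optim.AdamW(adamw_params, lr=3e-4, weight_decay=0.01)

# In the training loop, both optimizers step every iteration:
#   loss.backward(); opt_muon.step(); opt_adamw.step()
#   opt_muon.zero_grad(); opt_adamw.zero_grad()
```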

Muon Pitfalls

As noted in Section 3.5.1, “Ablation Studies on the Interchangeability of Pretrain and SFT Optimizers,” of Muon is Scalable for LLM Training, fine-tuning a model trained with Muon using AdamW, or vice versa, is known not to work very well. For continued pre-training and the like, it is safest to stick with the optimizer used in pre-training, so be careful.

Below I’ve lined up the Muon-related papers from this year that I found interesting, arranged in chronological order.

The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm

By using a technique called Polar Express instead of Newton-Schulz for Muon’s orthogonalization step, convergence can be accelerated. When I read the paper in July it had not yet been adopted in nanoGPT-speedrun, and I was puzzled as to why it wasn’t being used, but when I checked the official repository just now, I was pleased to see it had been adopted on 9/29.

By the way, around the time I wrote a previous article, nanoGPT-speedrun had only just broken 3 minutes, but the record has since shrunk to 2.248 minutes. At this rate it looks like it will break 2 minutes next year; the pace of improvement is so fast my feelings can’t keep up.

MuLoCo: Muon is a practical inner optimizer for DiLoCo

DiLoCo and Muon together. That’s already more than enough to make my day. And on top of that, it even includes everyone’s favorite quantization.

DiLoCo is a distributed training method that uses two optimizers, an outer optimizer and an inner optimizer. For an overview of the method, please see what I wrote on my personal blog before. AdamW is often used as the inner optimizer, but swapping it for Muon accelerates convergence, as most people would expect. Not only that: even when the gradient information shared with the outer optimizer is quantized to 2 bits, the loss barely changes when the inner optimizer is Muon, so communication volume can be reduced dramatically. With AdamW, by contrast, performance degrades considerably. I have no idea why gradient quantization works with Muon, but since it works, we have no choice but to accept it.
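For readers unfamiliar with the structure, here is a rough single-process sketch of one DiLoCo round with a quantized pseudo-gradient, in the spirit of MuLoCo. Everything here (the toy quantizer, the number of inner steps, plain SGD as the outer optimizer instead of the Nesterov momentum DiLoCo normally uses) is illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def quantize_dequantize(t: torch.Tensor, bits: int = 2) -> torch.Tensor:
    """Toy uniform quantizer standing in for the paper's 2-bit compression."""
    levels = 2 ** bits - 1
    lo, hi = t.min(), t.max()
    scale = (hi - lo).clamp_min(1e-12) / levels
    return torch.round((t - lo) / scale) * scale + lo

def diloco_round(global_model, workers, make_inner_opt, data_iters,
                 inner_steps: int = 100, outer_lr: float = 0.7) -> None:
    """One DiLoCo communication round (sketch).

    Each worker copies the global weights, runs `inner_steps` local steps with
    its inner optimizer (Muon in MuLoCo), then contributes a quantized
    pseudo-gradient (initial weights minus final weights). The averaged
    pseudo-gradient is applied by the outer optimizer (plain SGD here).
    """
    global_params = [p.detach().clone() for p in global_model.parameters()]
    pseudo_grads = [torch.zeros_like(p) for p in global_params]

    for worker_model, data in zip(workers, data_iters):
        worker_model.load_state_dict(global_model.state_dict())
        inner_opt = make_inner_opt(worker_model)          # e.g. a Muon optimizer
        for _ in range(inner_steps):
            x, y = next(data)
            loss = F.cross_entropy(worker_model(x), y)
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()
        for acc, p0, p1 in zip(pseudo_grads, global_params,
                               worker_model.parameters()):
            acc += quantize_dequantize(p0 - p1.detach()) / len(workers)

    with torch.no_grad():
        for p, g in zip(global_model.parameters(), pseudo_grads):
            p -= outer_lr * g    # outer update on the averaged pseudo-gradient
```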

Large-scale, high-speed networking is very expensive, so methods like MuLoCo should grow in importance. Some groups, such as Prime Intellect, are already using DiLoCo in practice.

Accelerating Newton-Schulz Iteration for Orthogonalization via Chebyshev-type Polynomials

The Newton-Schulz iteration used in Muon is truncated after a few steps for speed, so it only produces an approximate result. This work tries to improve the approximation accuracy by finding optimal coefficients with the Remez algorithm, where “optimal” means minimizing the worst-case approximation error. With 4 iterations the proposed method clearly accelerates convergence, but at 5 iterations the difference almost disappears. That said, there is also the argument that higher orthogonalization accuracy lets you raise the learning rate, so a hyperparameter search might yield different results.
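To make “worst-case approximation error” concrete: the Newton-Schulz iteration acts on each singular value independently as \sigma \leftarrow a\sigma + b\sigma^3 + c\sigma^5, so you can measure how far the composed polynomial stays from 1 over a range of possible singular values. The sketch below does this for the coefficients commonly used in Muon implementations; the Remez-optimized coefficients from the paper are not reproduced here.

```python
import torch

def ns_poly(sigma: torch.Tensor, coeffs, steps: int) -> torch.Tensor:
    """Apply the quintic Newton-Schulz polynomial `steps` times to singular values."""
    a, b, c = coeffs
    x = sigma
    for _ in range(steps):
        x = a * x + b * x**3 + c * x**5
    return x

# Coefficients commonly used in Muon implementations (values vary across versions).
coeffs = (3.4445, -4.7750, 2.0315)

# After Frobenius normalization the singular values lie in (0, 1]; very small
# ones are ignored since no truncated iteration can push them to 1 anyway.
sigma = torch.linspace(0.05, 1.0, 2000)

for steps in (4, 5, 6):
    err = (ns_poly(sigma, coeffs, steps) - 1.0).abs().max().item()
    print(f"{steps} iterations: worst-case |p(sigma) - 1| = {err:.3f}")
```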

Incidentally, I first learned about the Remez algorithm from a blog article by Professor Okazaki of Tokyo Institute of Technology. Back then many people wrote blog articles, and knowledge picked up from such places still proves useful today, like this. It saddens me that, year by year, more of those articles become unreadable as the blogging services hosting them shut down.

NorMuon: Making Muon more efficient and scalable

Introduces second-order moments to accelerate Muon’s convergence. Since a naive implementation would significantly increase memory consumption, the second-order moments are kept column by column. In experiments, convergence was reportedly about 11% faster for a 1.1B model and about 5% faster for a 5.4B model compared to Muon. What nanoGPT-speedrun currently uses is a combination of this method and Polar Express.
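My loose reading of the mechanism, sketched below: keep an exponential moving average of squared update magnitudes at column granularity (rather than per element, which would roughly double Muon’s optimizer memory) and use it to rescale the orthogonalized update. The variable names and the exact normalization are my guesses from the description above, not the paper’s formulation.

```python
import torch

@torch.no_grad()
def normuon_like_rescale(update: torch.Tensor, col_second_moment: torch.Tensor,
                         beta2: float = 0.95, eps: float = 1e-8) -> torch.Tensor:
    """Rescale an orthogonalized Muon update using column-wise second moments.

    `col_second_moment` holds one value per column, so the extra optimizer
    state is O(columns) rather than O(rows x columns). This is a guess at the
    idea, not NorMuon's exact update rule.
    """
    col_sq = update.pow(2).mean(dim=0)                       # per-column mean square
    col_second_moment.mul_(beta2).add_(col_sq, alpha=1 - beta2)
    return update / (col_second_moment.sqrt() + eps)
```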

Cautious Weight Decay

This paper is not about Muon itself, but it includes experiments using Muon. A paper built on a similar idea is Cautious Optimizers: Improving Training with One Line of Code, in which weights are updated only where the signs of the gradient and the update direction agree. I wrote an explanation of it on my personal blog before, where I also noted that applying it to Muon had no effect in nanoGPT-speedrun.

Cautious Weight Decay applies weight decay only where the signs of the weight and the update direction match. The paper reports results of applying it to Muon: in image classification experiments with ViT-S/16, ResNet-50, and ViT-B/16 and a text task with OLMo 1B, convergence consistently accelerated and final accuracy also improved slightly. It has already been adopted in nanoGPT-speedrun.
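A minimal sketch of the rule as described above: decoupled weight decay is applied only on the coordinates where the sign of the weight agrees with the sign of the update direction. Whether “update direction” here means the raw update or its negation is a sign convention I have not verified against the paper, so check before reusing this.

```python
import torch

@torch.no_grad()
def cautious_wd_step(param: torch.Tensor, update: torch.Tensor,
                     lr: float, wd: float) -> None:
    """Apply an update with per-element (cautious) decoupled weight decay.

    Decay acts only where sign(param) == sign(update); other elements are left
    undecayed. This follows the one-line description in the text above.
    """
    mask = (param.sign() == update.sign()).to(param.dtype)
    param.mul_(1 - lr * wd * mask)       # cautious decoupled weight decay
    param.add_(update, alpha=-lr)        # the usual (Muon, AdamW, ...) update
```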

Beyond the Ideal: Analyzing the Inexact Muon Update

Existing theoretical analyses of Muon assumed SVD for the orthogonalization step, but the actual Muon uses Newton-Schulz, so there is potentially a gap between theory and practice. This paper performs an analysis closer to reality using the notion of a linear minimization oracle (LMO). The result: the more inexact the oracle (the further the Newton-Schulz orthogonalization is from ideal), the smaller the optimal step size, i.e., the smaller the optimal learning rate. Put the other way around, the claim is that the more thoroughly you orthogonalize, the larger the learning rate you can use. Since larger learning rates are known to accelerate convergence and reduce final loss, being able to use a large learning rate is welcome. Experiments with the nanoGPT-speedrun code showed that the optimal learning rate changes with the Newton-Schulz iteration count, and the analytical and experimental results showed the same tendency.
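For reference, my own one-line summary of the LMO view (not taken from the paper): Muon’s ideal update direction is the linear minimization oracle of the unit spectral-norm ball, \mathrm{lmo}(G) = \arg\min_{\|\Delta\|_{\mathrm{op}} \le 1} \langle G, \Delta \rangle = -UV^{T} for the SVD G = U \Sigma V^{T}, and a truncated Newton-Schulz iteration only approximates this -UV^{T}; that approximation gap is exactly the inexactness the paper quantifies.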

MuonAll: Muon Variant for Efficient Finetuning of Large Language Models

Muon is meant for matrices, so the custom is to optimize 1D parameters with AdamW; this work asks what happens if 1D parameters are deliberately diagonalized and optimized with Muon as well. SFT is run on several networks, but sometimes AdamW wins, sometimes Muon, sometimes MuonAll; honestly, no trend is visible. As mentioned above, continuing to train a network that was trained with Adam/AdamW using Muon doesn’t work well, so I don’t think the experimental setup is very good. Still, being able to optimize everything with Muon alone is an interesting idea. I’d also like to see what happens when it is used for pretraining.
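One natural reading of “diagonalized”, sketched below (reusing the newton_schulz sketch from earlier in this article): embed a 1D parameter’s momentum as a diagonal matrix, run it through the same orthogonalization, and read the update back off the diagonal. Note that the exact polar factor of a diagonal matrix is just the elementwise sign, so with perfect orthogonalization this reduces to a sign-of-momentum update; the paper may implement the trick differently.

```python
import torch

def muon_update_for_1d(momentum_1d: torch.Tensor, newton_schulz) -> torch.Tensor:
    """Treat a 1D momentum vector as a diagonal matrix and orthogonalize it.

    With exact orthogonalization this equals sign(momentum_1d) for nonzero
    entries, since the polar factor of diag(v) is diag(sign(v)).
    """
    M = torch.diag(momentum_1d)    # embed the vector as a diagonal matrix
    O = newton_schulz(M)           # same approximate orthogonalization as for 2D weights
    return torch.diagonal(O)       # the 1D update
```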

Turbo-Muon: Accelerating Orthogonality-Based Optimization with Pre-Conditioning

Muon performs roughly 10-15 matrix multiplications during orthogonalization, so its per-step time is known to be longer than AdamW’s. Turbo-Muon speeds Muon up by reducing the number of Newton-Schulz loop iterations. Simply cutting the number of matrix multiplications would leave the orthogonalization insufficient, so the preprocessing is changed to reduce the orthogonalization error.

If the matrix to be orthogonalized is X_0, Muon uses \frac{1}{\|X_0\|} as the scaling factor, where \|X_0\| is the Frobenius norm of X_0, i.e., the square root of the sum of the squares of its elements.

In Turbo-Muon, A_0 = X_0^{T} X_0 is formed and \frac{1}{\sqrt{\|A_0\|}} is used as the scaling factor instead.

The difference is almost just that (in the implementation there is a small trick to avoid redundant recomputation), but with Turbo-Muon the orthogonalization error drops significantly, so the loop count can be reduced accordingly, making it faster than Muon. Training a 1.3B model under nanoGPT-speedrun settings showed roughly an 8-10% speedup over Muon. Note that 1.3B is larger than the usual nanoGPT-speedrun setting; since the effect is small unless the matrices are large, a larger model was probably used.
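Expressed as code, the change to the scaling step looks roughly like the sketch below (the reuse of A_0 inside the Newton-Schulz loop, which is where the actual saving comes from, is not shown):

```python
import torch

def muon_prescale(X0: torch.Tensor) -> torch.Tensor:
    # Standard Muon: divide by the Frobenius norm of X0 itself.
    return X0 / (X0.norm() + 1e-7)

def turbo_muon_prescale(X0: torch.Tensor) -> torch.Tensor:
    # Turbo-Muon, as described above: form A0 = X0^T X0 and divide by
    # sqrt(||A0||_F); A0 can then be reused inside the Newton-Schulz loop.
    A0 = X0.T @ X0
    return X0 / (A0.norm().sqrt() + 1e-7)
```

One way to see why this might help: \sqrt{\|A_0\|} equals (\sum_i \sigma_i^4)^{1/4}, which is never larger than \|X_0\| = (\sum_i \sigma_i^2)^{1/2}, so Turbo-Muon’s scaling leaves the singular values closer to 1 and the truncated iteration has less correction left to do.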

Conclusion

I’ve covered several studies related to Muon. Personally, I found it interesting that increasing orthogonalization accuracy allows increasing the learning rate. Not only is it interesting, but I feel it has potential to be useful in the future.

There should still be room for improvement in optimizers. According to The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton, optimizing with the Gauss-Newton method reaches comparable loss values in roughly 1/16 as many steps as Muon. The Gauss-Newton computation itself is heavy, so the per-step time increases substantially and it is not actually faster, but it gives hope that if better approximation methods are found, faster optimizers can be built. Let’s look forward to optimizer progress in 2026 as well.

Looking for Work

We welcome any projects involving machine learning. If you’re interested, please contact us through our contact form.
