Do You Know About NanoGPT Speedrun?

July 15, 2025

This article is a machine translation from Japanese. It may contain translation errors.

I’m Tokunaga, the CEO of PredNext. The Claude Code article I wrote hoping it would go viral got no traction at all, so while I’d like to write something even more blatantly crowd-pleasing, I have no idea how to make a post go viral. For now, let’s talk about the NanoGPT speedrun.

What is NanoGPT Speedrun?

The NanoGPT speedrun is an unofficial competition to speed up training, based on nanoGPT, a small Transformer implementation released by Andrej Karpathy.

NanoGPT was originally designed for educational purposes: the code is readable, but it isn’t suited to practical large-scale training. That same simplicity makes it easy to modify, however, and Keller Jordan took advantage of this to start the speedrun.

The rules cap compute at 8×NVIDIA H100, and competitors race to reach a validation loss of 3.28 or below on the FineWeb dataset as quickly as possible. A training run that took 45 minutes when the competition started in May 2024 has been accelerated dramatically in just over a year; the record as of July 13, 2025 is 2.863 minutes.

Every step of that journey is fascinating, but this time I’d like to focus on an optimizer called Muon.

What is Muon?

Muon is an optimization method developed by Keller Jordan, who organizes the speedrun. It maintains a moving average of the gradients (i.e., momentum), approximately orthogonalizes that momentum with a technique called Newton-Schulz iteration, and uses the result as the parameter update.
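
For concreteness, here is a minimal PyTorch sketch of that procedure: a momentum buffer, a few Newton-Schulz iterations to approximately orthogonalize it, and a scaled update. The quintic coefficients are, to the best of my knowledge, the ones from Jordan’s public implementation; everything else (function names, the lack of Nesterov momentum and of any distributed logic) is simplified for illustration and is not the actual speedrun code.

```python
import torch

def newton_schulz_orthogonalize(M: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map M to the nearest (semi-)orthogonal matrix via a
    quintic Newton-Schulz iteration (coefficients as published by Jordan,
    to the best of my knowledge)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M.to(torch.bfloat16)          # the real implementation also runs this in bf16
    transposed = X.shape[0] > X.shape[1]
    if transposed:                     # work in the wide orientation so X @ X.T is small
        X = X.T
    X = X / (X.norm() + 1e-7)          # Frobenius norm <= 1 implies spectral norm <= 1
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(M.dtype)

def muon_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    """One Muon-style update for a single 2D weight (simplified sketch;
    the real optimizer uses Nesterov-style momentum, omitted here)."""
    momentum_buf.mul_(beta).add_(grad)                  # accumulate momentum
    update = newton_schulz_orthogonalize(momentum_buf)  # orthogonalize the momentum
    # scale so the step size is roughly independent of the matrix shape
    update = update * max(1.0, param.shape[0] / param.shape[1]) ** 0.5
    param.data.add_(update, alpha=-lr)
```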

Because it works by orthogonalizing matrices, Muon can naturally only be applied to layers whose parameters are matrices (linear and convolutional layers). For all other parameters, the recommendation is to use AdamW.
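
In practice this means maintaining two optimizers over disjoint parameter groups. A rough sketch of how that split might look, assuming a hypothetical `Muon` optimizer class with the usual torch.optim interface (the name-based filtering of embeddings and the output head is my own illustrative convention, not a fixed rule):

```python
import torch

def build_optimizers(model: torch.nn.Module, muon_cls, lr_muon=0.02, lr_adamw=3e-4):
    """Split parameters: hidden weight matrices -> Muon, everything else -> AdamW.

    `muon_cls` is assumed to be a Muon implementation with the standard
    torch.optim interface; embeddings and the output head are conventionally
    kept on AdamW even though they are matrices.
    """
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        is_hidden_matrix = p.ndim >= 2 and "embed" not in name and "lm_head" not in name
        (muon_params if is_hidden_matrix else adamw_params).append(p)
    return [
        muon_cls(muon_params, lr=lr_muon, momentum=0.95),
        torch.optim.AdamW(adamw_params, lr=lr_adamw, betas=(0.9, 0.95), weight_decay=0.0),
    ]
```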

As for why orthogonalization accelerates convergence, Jordan’s blog suggests that it amplifies the scale of “rare directions” in the update that carry important training signal. The orthogonalized update has every column at norm 1 (all of its singular values are 1), so no direction is drowned out by the dominant ones, and this is thought to be what matters.
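
One way to see this effect is to compare the singular values of an update matrix before and after orthogonalization: directions with tiny singular values, which would barely move the weights otherwise, end up at the same unit scale as the dominant ones. A small self-contained demonstration (using an exact SVD rather than Newton-Schulz, purely for illustration):

```python
import torch

torch.manual_seed(0)

# Build a 64x64 "momentum" matrix dominated by a few directions:
# a strong rank-2 component plus faint components in all other directions.
U = torch.linalg.qr(torch.randn(64, 64)).Q
V = torch.linalg.qr(torch.randn(64, 64)).Q
s = torch.full((64,), 0.01)
s[:2] = 10.0                      # two dominant directions, the rest "rare"
G = U @ torch.diag(s) @ V.T

# Orthogonalize by replacing G with U V^T (what Newton-Schulz approximates).
Ug, Sg, Vgh = torch.linalg.svd(G)
G_orth = Ug @ Vgh

print(torch.linalg.svdvals(G)[:4])       # ~[10, 10, 0.01, 0.01]
print(torch.linalg.svdvals(G_orth)[:4])  # ~[1, 1, 1, 1]: rare directions amplified
```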

What’s exciting about Muon is that, less than a year after it was proposed, [2502.16982] Muon is Scalable for LLM Training demonstrated its practicality on 3B/16B-parameter MoE models, and it was then adopted as the optimizer for Kimi K2, a 32B/1T-parameter MoE model released in July 2025 (ref: Kimi K2: Open Agentic Intelligence). Considering how many optimizers have faded away without ever becoming mainstream, this is a remarkable achievement.

While Kimi K2 hasn’t attracted as much attention as DeepSeek-R1, it seems to be rated quite highly by people who have actually used it. A common impression is that it’s on par with Claude 3.5 Sonnet, or perhaps a bit better. Since Claude 3.5 Sonnet is already quite practical as a coding model, Kimi K2 can likewise be said to have reached a practical level as a coding LLM.

There are also reports that fine-tuning a Muon-pretrained model with AdamW, or vice versa, doesn’t work very well (see Section 3.5.1, “Ablation Studies on the Interchangeability of Pretrain and SFT Optimizers”). Even so, if convergence is nearly 50% faster than with AdamW, that can’t be ignored. Developers will have to at least consider Muon, and cases of adopting it will likely keep increasing.

So that was an introduction to the NanoGPT speedrun, and to the fact that genuinely new techniques are emerging from it.

Unfortunately, the NanoGPT speedrun has received almost no attention in Japan. Unlike Kaggle there’s no prize money, and you need to procure 8 H100s just to benchmark a run, which is probably the bottleneck. A server with 8×H100 is almost certainly not something you have at home, so you’ll have to rent one temporarily from a service like Modal; 8×H100 on Modal costs about $32 per hour. Note that although the speedrun itself finishes in about 3 minutes, torch.compile takes a little over 10 minutes as preparation, so one experiment takes roughly 15 minutes in total, i.e. around $8 per run. Even so, I hope more people will take on the challenge.

However, as the slowing pace of record updates suggests, breaking this speedrun’s record is genuinely difficult. Don’t expect to set a new record with just a bit of tuning.

Bonus

Naturally, I took a crack at the speedrun myself. I tried several approaches, but none of them worked out. For the record, here is what didn’t work:

  • Making Muon a Cautious Optimizer. In hindsight this was probably doomed, since orthogonalization changes the direction of the update relative to the gradient and momentum (a minimal sketch of the masking idea appears after this list).
  • Using DiLoCo. It was fine on 1 GPU, but got slower as the number of GPUs increased. Muon runs several matrix multiplications during each parameter update, and when multiple GPUs are available the speedrun code cleverly distributes these matrix multiplications across GPUs layer by layer. That interacts badly with DiLoCo, whose inner optimizer is not allowed to communicate: convergence per step was faster, but the time per step grew, so total convergence performance got worse.
  • Using Dynamic Tanh for normalization. The naive PyTorch version was very slow, so I wrote a Triton kernel, but it still couldn’t beat RMSNorm on either accuracy or speed. Accuracy might have improved with more serious hyperparameter tuning, but every run costs money, so I couldn’t explore further (the naive PyTorch form is also sketched after this list).
  • Changing the mix of long- and short-window attention layers. Naively changing the combination reliably slows convergence. Whoever found the current combination either ran an enormous number of experiments or got very lucky; either way, impressive.
  • Converting the MLP layers to fp8. It only got slower, perhaps because the dtype conversion to fp8 costs more time than it saves.
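
Since two of these bullets refer to specific mechanisms, here are minimal sketches of what I mean. Both are illustrative reconstructions under my own assumptions, not the code I actually ran. First, the Cautious-Optimizer idea applied to a Muon-style update: keep only the entries whose sign agrees with the raw gradient, then rescale (the rescaling by the mask’s mean follows the Cautious Optimizers recipe, as far as I recall).

```python
import torch

def cautious_mask(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Zero out update entries whose sign disagrees with the raw gradient,
    then rescale so the overall step size stays comparable. Applied to an
    already-orthogonalized Muon update, many entries flip sign relative to
    the gradient, so the mask discards a large fraction of the update."""
    mask = (update * grad > 0).to(update.dtype)
    mask = mask / mask.mean().clamp(min=1e-3)   # rescale to compensate for zeroed entries
    return update * mask
```

Second, Dynamic Tanh as a drop-in replacement for the normalization layer, in the naive PyTorch form that turned out to be slow:

```python
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    """Dynamic Tanh (DyT): y = gamma * tanh(alpha * x) + beta, a
    normalization-free replacement for LayerNorm/RMSNorm."""
    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))            # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))            # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```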

Summary

The NanoGPT speedrun is wonderful. As an implementation of LLM training that fits on a single server, it is so refined that further speedup is genuinely hard. You could take a shot at the record yourself, or use the code as a foundation for tackling other problems.

Looking for Work

PredNext is currently accepting a few more project requests. Our specialties are AI-related technologies centered on natural language processing and image processing, with a particular focus on model compression and optimization. If you’re interested, please reach out through our contact form.
