Between Diffusion Language Models and Autoregressive Models
This article is a machine translation from Japanese. It may contain translation errors.
Hello, I’m Tokunaga from PredNext.
My previous explanation of diffusion language models covered only the basics, so today I'd like to pick up some recent topics that didn't make it in last time.
Diffusion Language Models Can Use the Same Data Repeatedly for Training
For autoregressive models, it's generally said that the same data can be used for training about 4 times (ref: Scaling Data-Constrained Language Models). What, then, would this number be for diffusion language models?
In Gao et al.'s What Makes Diffusion Language Models Super Data Learners?, they show experimentally that diffusion language models can reuse the same data at least about 100 times for training. As a reason for this high repetition tolerance, they speculate that diffusion language model training can be viewed as applying dropout to the inputs, and that this regularization effect is what is at work.
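To make the analogy concrete, here is a minimal PyTorch sketch (my own toy illustration, not the paper's code) contrasting the two objectives: the autoregressive loss sees the identical clean sequence on every epoch, while each masked-diffusion step draws a fresh random corruption of the same sequence, which is the dropout-like effect Gao et al. point to.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, MASK_ID = 1000, 0

# Toy stand-in for a transformer; the objectives, not the architecture,
# are the point here.
model = nn.Sequential(nn.Embedding(VOCAB, 64), nn.Linear(64, VOCAB))
tokens = torch.randint(1, VOCAB, (8, 32))  # toy batch of sequences

def ar_loss(tokens):
    # Autoregressive objective: every epoch trains on the identical
    # clean input, predicting the next token at each position.
    logits = model(tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, VOCAB),
                           tokens[:, 1:].reshape(-1))

def masked_diffusion_loss(tokens):
    # One masked-diffusion step: draw a masking ratio, corrupt the
    # input, and score only the masked positions. Every pass over the
    # data sees a different corruption -- the input-dropout-like
    # regularization the paper describes.
    ratio = torch.empty(()).uniform_(0.1, 1.0)  # keep the mask non-empty
    mask = torch.rand(tokens.shape) < ratio
    corrupted = tokens.masked_fill(mask, MASK_ID)
    logits = model(corrupted)
    return F.cross_entropy(logits[mask], tokens[mask])

print(ar_loss(tokens), masked_diffusion_loss(tokens))
```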
To test this hypothesis further, they introduce various forms of dropout on the autoregressive side, apply stronger weight decay, and examine how validation loss and downstream task accuracy change. Using Qwen3-0.6B as the architecture, they show that while the plain autoregressive model's validation loss starts rising partway through training, the more strongly regularized runs keep reducing validation loss and also improve downstream task accuracy. It's particularly interesting that simply setting weight decay to a relatively large 0.5 is enough to prevent overfitting.
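For reference, the recipe really is that simple; a hedged sketch (only the weight decay value comes from the paper, the other hyperparameters and the model are my placeholders):

```python
import torch.nn as nn
from torch.optim import AdamW

model = nn.Linear(64, 64)  # placeholder standing in for Qwen3-0.6B

# Typical LLM pretraining uses weight_decay around 0.1; the paper's
# finding is that simply raising it to 0.5 keeps validation loss
# falling under heavy data repetition. lr/betas here are placeholders.
optimizer = AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95),
                  weight_decay=0.5)
```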
On the other hand, in Prabhudesai et al.'s Diffusion Beats Autoregressive in Data-Constrained Settings, they train an autoregressive model with dropout for 50 epochs, similar to Gao et al., but report that its validation loss remained high compared to a diffusion language model trained for 500 epochs, and overfitting could not be prevented. This seems to contradict Gao et al.'s report, but looking at their Figure 8, the autoregressive model's validation loss does not appear to keep rising, and it seems plausible that, if it too were trained for 500 epochs, its validation loss would come down to a similar level. Code is available, so I'd like to reproduce this, but since a single experiment costs quite a bit of money, we haven't been able to do so at our company…
Additionally, Diffusion Language Models are Super Data Learners conducts experiments on repetition tolerance under multiple settings.
In the first experiment, the total amount of training data is held fixed while only the amount of unique data is varied: the less unique data there is, the more often each sample is repeated. In this setting, with many repetitions, the autoregressive model's validation loss began rising relatively early, while the diffusion language model's validation loss kept falling even after reusing the same data about 200 times, and its final validation loss was lower as well. With few repetitions, the autoregressive model's validation loss also kept falling, and in fact stayed consistently below that of the diffusion language model.
Next, they compare models scaled from 1B up to 8B while keeping the number of data repetitions fixed at about 100. For autoregressive models, the larger the model, the larger the rise in validation loss; no such trend appeared for diffusion language models, whose validation loss kept decreasing within a reasonable range at every model size. In other words, as a model's degrees of freedom increase, autoregressive models become more prone to overfitting, whereas diffusion language models showed no such tendency, at least within the range tested.
From these papers, the following can be roughly understood:
1. If you need to reuse the same data dozens of times, diffusion language models are advantageous (final validation loss is lower, and downstream task accuracy is better)
2. Under other conditions (i.e., sufficient data but insufficient compute), autoregressive models are advantageous
3. With clever regularization, autoregressive models can also substantially increase the number of times data can be reused
Points 1 and 2 are, in a sense, what you might expect, but point 3 is actually quite a significant finding. Many people have probably heard somewhere that "data for AI training will run out within a few years." For example, in Will we run out of data to train large language models?, published by Epoch AI, they predict that data depletion will occur by around 2027 even assuming about 4 repetitions, and a perspective piece likely inspired by this work has also appeared in Nature. If data can instead be reused about 100 times, the depletion point shifts back considerably: going from 4 to 100 reuses is a 25x gain in effective data, so even if data requirements grow at 2.4x per year, depletion is delayed by about 3.5 years.
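A quick back-of-the-envelope check of that number (my arithmetic, not the article's):

```python
import math

# Effective-data gain from raising reuse from ~4x to ~100x, and the
# delay it buys if data demand grows 2.4x per year.
gain = 100 / 4                        # 25x more effective data
delay = math.log(gain) / math.log(2.4)
print(f"{delay:.1f} years")           # ~3.7, i.e. "about 3.5 years"
```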
Also, while these papers only verify repetition up to about 200 times, if the number of repetitions could be pushed to about 1,000, data quality would become more important than quantity. The importance of data quality has long been talked about, but it would likely be emphasized even more.
These studies have received little attention so far, perhaps because they haven't yet passed peer review, but if Gao et al.'s claims hold up, they have real potential to become important work that overturns conventional wisdom.
Combining Autoregressive and Diffusion Language Models
An interesting recent application of diffusion language models is combining them with autoregressive models.
TiDAR: Think in Diffusion, Talk in Autoregression is an attempt to get better results by combining the two, rather than using a diffusion language model on its own. Starting from a trained autoregressive model, they continue training it as a diffusion language model with those parameters as the initialization, and by using the resulting model as the draft model in speculative decoding, they report roughly a 4-5x decoding speedup.
In typical speculative decoding, the draft model is usually much smaller than the target model. TiDAR's distinctive move is that, by using a diffusion language model, the draft model can be the same size as the target model; the resulting higher match rate makes decoding faster than conventional methods. The implementation has other interesting aspects as well, such as ways of making draft generation itself more speculative.
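To fix intuition, here is a minimal greedy sketch of the draft-then-verify loop (my simplification under argmax decoding; `draft_block` and `verify_logits` are hypothetical stand-ins, not TiDAR's actual API):

```python
import torch

V = 1000  # toy vocabulary size

def draft_block(prefix, k):
    # Hypothetical diffusion drafter: proposes k tokens in one shot.
    return torch.randint(0, V, (k,))

def verify_logits(tokens):
    # Hypothetical target model: one forward pass over the candidate.
    return torch.randn(len(tokens), V)

def speculative_step(prefix, k=8):
    """Propose k tokens at once; keep the longest verified prefix."""
    draft = draft_block(prefix, k)
    candidate = torch.cat([prefix, draft])
    logits = verify_logits(candidate)
    # Target's greedy pick at each drafted position, plus one bonus slot.
    checks = logits[len(prefix) - 1:].argmax(-1)   # shape (k + 1,)
    agree = (checks[:-1] == draft).long()
    n_ok = int(agree.cumprod(0).sum())             # length of accepted run
    # Keep the agreed tokens, then the target's own next token: either
    # its correction at the first mismatch, or a free bonus token.
    return torch.cat([prefix, draft[:n_ok], checks[n_ok:n_ok + 1]])

prefix = torch.randint(0, V, (5,))
print(speculative_step(prefix))
```

With random stand-ins like these the match rate is of course near zero; the point of TiDAR's same-size drafter is to make the accepted run as long as possible per target-model pass.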
Looking at the experimental results in Table 2, the TiDAR model obtained by further training Qwen2.5 1.5B loses almost no downstream task performance, and on several tasks (HumanEval, HumanEval+, MBPP+) it actually improves. The TiDAR model based on Qwen3 8B, however, degrades on every downstream task, so further research progress is awaited here.
Why Do Diffusion Language Models Often Show Slightly Underwhelming Results?
Looking at TiDAR's results as well, there are many cases where a diffusion language model's downstream task performance drops slightly compared to the autoregressive baseline, and it's difficult to say definitively that diffusion language models are superior.
I personally think this is mainly a problem with training methods. LLMs require fine-tuning via SFT and RL after pretraining, but diffusion language models like LLaDA and Dream 7B perform only relatively simple SFT. Qwen2.5, which serves as the base for both, runs SFT on over one million examples and then layers multiple stages of reinforcement learning on top. That likely contributes substantially to downstream task performance, and if comparable fine-tuning including reinforcement learning were applied, I expect diffusion language models have ample room to show higher performance.
It wouldn’t be surprising if several studies that properly verify this appear within about the next year.
Summary
In this article, I introduced several topics around diffusion language models that weren't covered in the previous article. I hope it has sharpened readers' picture of diffusion language models a little and sparked greater interest.
Looking for Work
As of December 2025, we still have some availability in our schedule. We welcome any machine learning projects, not limited to diffusion language models. If you’re interested, please contact us through our contact form. We look forward to hearing from you.