Do AI Systems Dream of Diffusion Language Models?
This article is a machine translation from Japanese. It may contain translation errors.
Hello, I’m Tokunaga from PredNext. I originally intended to update about once a week, but updates have been significantly delayed. If I focus too much on quality with my current schedule, I’ll only be able to write 1–2 articles a year, so from now on I’ll try to update quickly without being too particular about the details.
Anyway, in 2025, diffusion language models such as Mercury Coder, Gemini Diffusion, and Dream 7B have been in the spotlight. It’s still hard to point to practical deployments, but interest in them is clearly growing.
The architecture behind current LLMs, the Transformer (more precisely, the autoregressive Transformer), has a fundamental execution-efficiency problem: hardware utilization tends to be low, especially when generating a response to a single prompt. Diffusion language models are attracting attention as a way to address this problem.
I’ve been getting more questions about diffusion language models recently. However, when I try to explain their benefits to people who aren’t very familiar with hardware, the explanation turns out to be surprisingly difficult. So in this article, I’ll explain why diffusion language models are attracting attention, aimed at readers who don’t know much about computer hardware, using the keyword B/F. When thinking about LLM inference, computer utilization efficiency is an unavoidable topic. This content is intended especially for those who say, “I’m not very familiar with hardware.” If you already know some hardware, it may feel a bit basic, but I’d appreciate it if you could read it as a review.
Why Is Transformer Inference Slow?
As of 2025, almost all practical LLMs are based on an architecture called the Transformer. The inference method used by current Transformers is autoregressive generation (the autoregressive model). When generating text, an autoregressive model predicts the next token (a word, symbol, etc.), feeds that prediction back in as the next input, predicts the next token again, and so on. To generate the token at time step t, you need the output token from the previous time step t-1. This is an obvious constraint, but it poses the troublesome problem that computation at step t cannot begin until computation at step t-1 has finished, which makes parallelization difficult.
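To make the dependency concrete, here is a minimal sketch of the autoregressive decoding loop; `model`, `prompt_tokens`, and `eos_id` are hypothetical placeholders, not any particular library’s API:

```python
# Minimal sketch of an autoregressive decoding loop; "model" is a hypothetical
# callable that maps the tokens generated so far to the next token ID.
def generate(model, prompt_tokens, max_new_tokens, eos_id):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # The whole model is evaluated (and its weights read from DRAM)
        # to produce just one next token.
        next_token = model(tokens)   # token at step t depends on all tokens before t
        tokens.append(next_token)    # the prediction becomes part of the next input
        if next_token == eos_id:
            break
    return tokens
```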
Despite such problems, autoregressive models have an overwhelming advantage: high output quality. This is a really big advantage, and it can’t be emphasized enough.
Now, returning to the problem: in autoregressive models, every time a token is generated, the large model weights must be loaded from memory (DRAM) into the processor (GPU, TPU, etc.). With model sizes growing ever larger, this is a major drawback. For example, to generate one token with the Llama 3.3 70B model, even at 8-bit quantization, about 70 GB of weights must be loaded each time; to generate 10 tokens, 10 times that, about 700 GB, must be read. It may be hard to believe, but this is what happens: the same data is loaded from memory to the processor over and over, used briefly for a calculation, and immediately discarded (that is, overwritten by the data needed for the next calculation). For reference, the NVIDIA H100’s memory bandwidth is 3.35 to 3.9 TB/s depending on the variant, so it’s easy to see that model loading becomes the bottleneck.
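As a rough back-of-envelope check using the numbers above (batch size 1, weights only; real workloads also move KV-cache data), memory traffic alone already caps the token rate:

```python
# Rough back-of-envelope: how fast can generation possibly go if every token
# requires reading all 70 GB of weights? (Ignores KV-cache traffic and batching.)
model_bytes = 70e9        # 70B parameters at 8-bit quantization = ~70 GB per token
bandwidth_h100 = 3.35e12  # ~3.35 TB/s H100 memory bandwidth (lower end)

seconds_per_token = model_bytes / bandwidth_h100   # ~0.021 s
tokens_per_second = 1 / seconds_per_token          # ~48 tokens/s upper bound

print(f"~{seconds_per_token * 1e3:.0f} ms/token, at most ~{tokens_per_second:.0f} tokens/s")
```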
In technical terms, autoregressive model-type Transformers have a high B/F (Bytes per FLOP) value during inference, and cannot be executed efficiently on modern computers where processors and memory are separated. Now, the technical term B/F has suddenly appeared. Let’s explain B/F in the next section.
High B/F Computation Is Painful
B/F is an abbreviation for Bytes per FLOP, a value representing how many bytes of data need to be read and written to perform one operation. Here, FLOP (FLoating point OPeration) refers to one operation such as floating-point addition or multiplication. Expressed as an equation, B/F = transferred bytes / FLOP. For example, if B/F is 1, the data read and written to perform one calculation (e.g., addition) is 1 byte. Incidentally, the data counted in B/F is the amount of data flowing between the processor and memory. In other words, if the processor has a cache and reuses data, the reused data is not counted in B/F.
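As a concrete illustration using this article’s numbers (a rough, weights-only estimate for batch-size-1 decoding; the hardware figures in the final comment are approximate values for a recent datacenter GPU, similar to the H200 numbers used later in this article):

```python
# Rough B/F of one decoding step for a 70B model (weights only, batch size 1).
params = 70e9
bytes_moved = params * 1    # 8-bit weights: ~1 byte read per parameter
flops = params * 2          # ~2 FLOP per parameter (one multiply + one add)

bf = bytes_moved / flops
print(f"B/F of one decode step ≈ {bf}")   # 0.5 bytes per FLOP

# For comparison, a datacenter GPU with ~4.8 TB/s of bandwidth and ~2000 TFLOP/s
# of fp8 compute can only sustain about 4.8e12 / 2000e12 ≈ 0.0024 B/F at peak.
```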
Processors and memory (i.e., DRAM) are made with different manufacturing processes, so they inevitably end up as separate chips. It isn’t strictly impossible to build them into the same chip (compute-in-memory is a real research field), but the drawback is that compute performance drops considerably. As far as I know, as of 2025 there is no way to co-locate the processor and memory on the same chip without degrading processor performance.
Here’s the essence of the problem: exchanging data between separate chips is extremely difficult. Within the same chip, wiring to move data can be arranged at high density, but when wiring between chips, it’s impossible to wire at the same density as within chips, inevitably resulting in a significantly smaller number of wires. Therefore, computers have always been fundamentally bad at high B/F computation.
This is a bit of a historical aside, but as for how long DRAM has been used, DRAM was conceived in 1966, and Intel started manufacturing DRAM in 1970, so DRAM seems to have spread in the mid-1970s. And before DRAM spread, magnetic core memory was used, so computers must have been bad at high B/F computation for more than 50 years. That’s why B/F has long been an extremely important metric for software execution efficiency.
Having explained this much, I can finally continue the story from the earlier section. Inference with autoregressive Transformers has a very high B/F because the model must be loaded every time a single token is generated, and the bottleneck is memory bandwidth. Expanding memory bandwidth therefore improves processor utilization proportionally, but expanding memory bandwidth is extremely costly. NVIDIA’s datacenter GPUs achieve wide bandwidth by using a special kind of DRAM called HBM (High Bandwidth Memory), but HBM chips are considerably more expensive than ordinary DRAM. With HBM, the processor and memory are placed in very close physical proximity and connected chip-to-chip at extraordinarily high wiring density, which requires a technology called through-silicon vias and a component called a silicon interposer. HBM itself is expensive, and the components needed to connect it are expensive too. And even with HBM, the B/F of autoregressive inference is too high, and memory bandwidth remains the bottleneck. The recent surge in SK Hynix’s and Micron’s stock prices is because they manufacture this HBM.
How Diffusion Models Work
Let’s return to diffusion language models. What’s great about them is that their B/F is far lower than that of autoregressive models.
It’s difficult to explain the details of diffusion language models right away, so let’s first explain diffusion models, which are the basis of diffusion language models. Diffusion models consist of the following two processes:
- Diffusion process (adding noise): Gradually add noise to original data (images, etc.), eventually making it completely noise.
- Reverse process (removing noise): Use a neural network to slightly remove noise from noisy data.
When you feed noise into an image-generation diffusion model with a well-trained reverse process and apply the denoising step repeatedly, you can pull out remarkably good images. At first it’s hard to believe that images can emerge from noise, but image-generation models like Stable Diffusion produce photorealistic images in exactly this way.
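Conceptually, generation is just a loop that starts from pure noise and repeatedly applies the learned denoiser. The sketch below is purely illustrative; `denoiser` is a hypothetical stand-in for a trained network, not a real model:

```python
import numpy as np

def denoiser(x, t):
    # Hypothetical stand-in for a trained neural network that, given noisy data x
    # at noise level t, returns a slightly less noisy estimate.
    return 0.98 * x

# Reverse process: start from pure Gaussian noise and denoise step by step.
x = np.random.randn(64, 64, 3)     # an "image" that is pure noise
for t in reversed(range(1000)):    # many small denoising steps
    x = denoiser(x, t)
# With a real trained denoiser, x would now resemble a natural image.
```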
Applying Diffusion Models to LLMs
Applying this idea of removing noise to LLMs is the diffusion language model. Since inputs and outputs are discrete values, they’re also called discrete diffusion language models. (The concept of continuous diffusion language models also exists, but won’t be covered in this article.)
In discrete diffusion language models, the diffusion process and reverse process are as follows:
- Diffusion process: Randomly replace some of the tokens in a sentence with a special token (hereafter written as [MASK]).
- Reverse process: Feed a sentence containing [MASK] tokens into a model such as a Transformer and have it predict the tokens at the [MASK] positions.
The “adding noise” and “removing noise” processes from diffusion models have just changed to “masking” and “removing masks.”
During training, the model is trained to succeed at the reverse process. There are various tricks to stabilize training, but this is the basic approach.
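To make training data, the diffusion process simply masks tokens at random; here is a minimal sketch (the mask ratio, token IDs, and MASK_ID are arbitrary illustrations, not taken from any specific model):

```python
import random

MASK_ID = 0  # hypothetical ID reserved for the [MASK] token

def forward_mask(token_ids, mask_ratio):
    """Diffusion process for text: replace each token with [MASK] with
    probability mask_ratio; the original tokens become prediction targets."""
    noisy, targets = [], []
    for tok in token_ids:
        if random.random() < mask_ratio:
            noisy.append(MASK_ID)
            targets.append(tok)      # the model is trained to recover this token
        else:
            noisy.append(tok)
            targets.append(None)     # unmasked positions are not predicted
    return noisy, targets

# Example: mask roughly half of a toy token-ID sequence.
print(forward_mask([11, 42, 7, 99, 23, 5], mask_ratio=0.5))
```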
The characteristic of diffusion language models is that multiple [MASK] tokens can be input in the reverse process. Therefore, by inputting multiple [MASK] tokens at once, multiple tokens can be generated.
As hinted above, the network architecture being trained is an existing one such as the Transformer. Despite the name, there is no new architecture called a “diffusion language model”; ideas borrowed from diffusion models are simply incorporated into the training and inference procedures. The concept that contrasts with diffusion language models is the autoregressive model, not a network architecture like the Transformer.
How to Infer with Diffusion Language Models
The general flow when generating text using a diffusion language model is roughly as follows:
1. Prepare a [MASK]-token sequence of the desired generation length.
2. Feed that token sequence into a reverse-process network such as a Transformer.
3. From the output sequence, adopt only the tokens whose output value exceeds a threshold. (Besides thresholding, there are also methods such as forcibly adopting the top N tokens.)
4. Positions that were not adopted are set back to [MASK] tokens and fed into the reverse-process network again.
5. Repeat steps 2-4, gradually replacing [MASK] tokens with more plausible words until none remain (a minimal code sketch of this loop follows below).
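Here is that loop in code. `reverse_model` is a hypothetical function returning a (token, confidence) pair for every position; real implementations differ in how they decide which positions to unmask at each step:

```python
MASK_ID = 0  # hypothetical [MASK] token ID

def diffusion_generate(reverse_model, length, threshold, max_iters=100):
    # 1. Start from an all-[MASK] sequence of the desired length.
    tokens = [MASK_ID] * length
    for _ in range(max_iters):
        if MASK_ID not in tokens:
            break                                  # 5. done once nothing is masked
        # 2. One reverse-process pass predicts every position at once
        #    (one model load from DRAM, no matter how many tokens get filled in).
        predictions = reverse_model(tokens)        # list of (token_id, confidence)
        for i, (tok, conf) in enumerate(predictions):
            if tokens[i] == MASK_ID and conf >= threshold:
                tokens[i] = tok                    # 3. adopt confident predictions
            # 4. low-confidence positions simply stay [MASK] for the next pass
    return tokens
```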
A diffusion language model can generate multiple tokens per application of the reverse process, so the number of model loads can be reduced; in other words, the B/F can be reduced. For example, if a 1000-token output is completed in 100 iterations, the model is loaded from DRAM only 100 times, which is far more efficient than the 1000 loads an autoregressive model would need.
Also, as a secondary effect, since multiple tokens are processed in parallel at once, computation is centered on matrix multiplication, efficiently utilizing GPUs’ TensorCores, etc.
Practicality of Diffusion Language Models
Are diffusion language models currently practical? Let’s consider two points: output quality and speed.
Regarding quality, models above a certain level have been trained recently, and proper evaluations have been conducted. In Large Language Diffusion Models (LLaDA), experiments using several downstream tasks showed that 8B-class diffusion language models achieve quality similar to autoregressive models of equivalent scale.
Regarding speed, diffusion language models are not always faster than autoregressive models: as the number of iterations grows, they become slower. Where the break-even point lies depends on hardware performance (compute throughput and memory bandwidth) and model size, so let’s assume a 70B model running on an NVIDIA H200, with an output length of 1024 tokens.
First, let’s think about autoregressive model computation time. According to Boost Llama 3.3 70B Inference Throughput 3x with NVIDIA TensorRT-LLM Speculative Decoding, the inference performance of Llama 3.3 70B on H200 is about 51 tokens/s (TPS). Assuming an output of 1024 tokens, output takes about 20 seconds.
Since the model is loaded about 51 times per second, 51 × 70 GB ≈ 3.6 TB/s of data must be read. Since the NVIDIA H200’s memory bandwidth is 4.8 TB/s, it’s clear that a large fraction of the memory bandwidth goes to model loading. Considering that memory bandwidth is also needed for reading and writing the KV cache, it’s fair to say the bottleneck is memory bandwidth.
For diffusion language models, the bottleneck is the processor. The compute needed to output one token with a 70B model is about 140 GFLOP (a rough estimate of about 2 FLOP per parameter), so the compute for outputting 1024 tokens is about 140 TFLOP. This is the compute for one pass of the reverse process, and since the NVIDIA H200’s fp8 dense matrix-multiplication performance is about 2000 TFLOP/s, this reverse pass could in theory run about 14 times per second. In practice it’s hard to keep the arithmetic units fully busy for various reasons, so at around 70% efficiency we can estimate roughly 10 iterations per second. Since the autoregressive model takes 20 seconds for the output, the break-even number of iterations is about 200.
Two hundred iterations sounds like plenty of headroom, but according to Boost Llama 3.3 70B Inference Throughput 3x with NVIDIA TensorRT-LLM Speculative Decoding, autoregressive models can be sped up about 3x by introducing speculative decoding. If we further require the diffusion model to be roughly 3x faster than that accelerated baseline to make switching worthwhile, the allowable iteration count becomes 200 / (3 × 3) ≈ 22. Looked at this way, it gradually becomes less clear whether diffusion language models are really fast.
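Putting the whole back-of-envelope estimate in one place (all figures are the rough numbers quoted above, not measurements; the 3 × 3 factor encodes the assumption just described):

```python
# Back-of-envelope break-even estimate for a 70B model on an H200,
# using only the rough figures quoted above (not measurements).
out_tokens = 1024
ar_tps = 51                                 # reported Llama 3.3 70B tokens/s on H200
ar_time = out_tokens / ar_tps               # ~20 s for the full output
ar_traffic = ar_tps * 70e9                  # ~3.6 TB/s of weight loading

flops_per_pass = 2 * 70e9 * out_tokens      # ~140 TFLOP per reverse-process pass
peak_fp8 = 2000e12                          # ~2000 TFLOP/s fp8 dense
iters_per_sec = 0.7 * peak_fp8 / flops_per_pass   # ~10 passes/s at 70% efficiency

break_even_iters = iters_per_sec * ar_time        # ~200 iterations
vs_spec_decode = break_even_iters / (3 * 3)       # ~22 with 3x spec decoding and 3x margin

print(f"AR: {ar_time:.0f} s, {ar_traffic / 1e12:.1f} TB/s of weight traffic")
print(f"break-even: ~{break_even_iters:.0f} iterations; "
      f"target vs. speculative decoding: ~{vs_spec_decode:.0f}")
```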
What is the actual number of iterations for diffusion language models? In Dream 7B: Diffusion Large Language Models, they report that Dream 7B, a diffusion language model, surpassed Qwen 2.5 7B in both accuracy and speed when the number of iterations was set to 5-20. However, verification was only performed on one type of task: Countdown. In Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing, they report that Dream 7B with their proposed D2F method applied surpassed LLaMA3-Instruct-8B in both speed and score on two types of tasks: GSM8k-4-shot and MBPP-3-shot. However, it lost significantly in score to Qwen2.5-Base-7B.
Since scores change greatly depending on the computing resources available for training, and large-scale experiments cost enormous amounts of money these days, fair comparison of diffusion language models and autoregressive models isn’t simple. However, at least at this point, the situation is such that we can’t say “they greatly surpass autoregressive models in both output quality and speed.”
Summary
The three main points of this article are as follows:
- Autoregressive models have high B/F, and memory bandwidth tends to be the bottleneck on modern GPUs
- Diffusion language models can generate multiple tokens in one inference, so there’s room to reduce the number of model loads and reduce B/F
- However, as the number of iterations grows, computation time increases, so it can’t be said definitively that they always beat autoregressive models accelerated with speculative decoding and similar techniques
As stated above, it can’t be conclusively determined which is superior, autoregressive models or diffusion language models. Diffusion language models also still have unresolved questions, such as how to handle routing when adopting Mixture of Experts. With various issues remaining, I can’t say definitively that diffusion language models will become the mainstream inference method of the next generation, but it’s also an undeniable fact that autoregressive inference is fundamentally inefficient. For my part, I wrote this article betting that diffusion language models will take off. I’m looking forward to rereading this article in a few years.
Since this article became quite long with just the explanation of principles, I’ll introduce interesting recent research related to diffusion language models in the next article. I’ll do my best to get the next article out soon. See you again soon.
Looking for Work
As of November 2025, we’re still accepting some work. We welcome any high-difficulty machine learning projects, not limited to diffusion language models. If you’re interested, please contact us through our contact form.
References
- Large Language Diffusion Models
- Dream 7B: Diffusion Large Language Models
- Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing
Papers I Plan to Read
- Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
- DiffCoder: Enhancing Large Language Model on API Invocation via Analogical Code Exercises
- LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs
- DIFFA: Large Language Diffusion Models Can Listen and Understand
- Any-Order GPT as Masked Diffusion Model: Decoupling Formulation and Architecture
- Jailbreaking Large Language Diffusion Models: Revealing Hidden Safety Flaws in Diffusion-Based Text Generation
- A Survey on Diffusion Language Models
- Constrained Decoding of Diffusion LLMs with Context-Free Grammars
- LLaDA-MedV: Exploring Large Language Diffusion Models for Biomedical Image Understanding
- Diffusion Language Models Know the Answer Before Decoding
- dParallel: Learnable Parallel Decoding for dLLMs
- Fast-dLLM v2: Efficient Block-Diffusion LLM