Diffusion LLMs Arrive: LLaDA, Mercury, and the End of Left-to-Right Generation
Diffusion-based LLMs like LLaDA and Mercury generate text in parallel rather than left-to-right. The 2026 production picture.
The Departure From Autoregressive Generation
Almost every LLM since 2018 has been autoregressive: generate one token, attend to all prior tokens, generate the next. Diffusion LLMs flip this: start from a noisy, masked sequence and progressively denoise it in parallel. By the time the iterative denoising completes, you have the full output.
LLaDA (Renmin/Tsinghua, 2024) and Mercury (Inception Labs, 2025-2026) shipped public models that operate this way. Their production use is growing in 2026. This piece walks through how they work and where they fit.
How a Diffusion LLM Generates
```mermaid
flowchart LR
Start[Fully masked output] --> Step1[Step 1: predict 30% of tokens]
Step1 --> Step2[Step 2: predict another 30%]
Step2 --> Step3[Step 3: predict remaining]
Step3 --> Final[Final output]
```
A diffusion LLM starts with all positions masked. Across N denoising steps, it predicts subsets of positions; at each step, multiple tokens are filled in simultaneously. Total compute is similar to autoregressive generation, but the work parallelizes across positions.
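The loop above can be sketched in a few lines. This is a toy illustration, not any model's actual sampler: `predict_fn` stands in for the model's single parallel forward pass, and positions are chosen at random here, whereas real samplers typically unmask the highest-confidence predictions first.

```python
import random

def denoise_step(tokens, mask, predict_fn, frac):
    """Fill in a fraction of the still-masked positions in one parallel step."""
    masked = [i for i, m in enumerate(mask) if m]
    k = max(1, int(len(mask) * frac))
    # Toy policy: random positions. Real samplers pick by model confidence.
    chosen = random.sample(masked, min(k, len(masked)))
    preds = predict_fn(tokens, mask)      # one forward pass predicts ALL positions
    for i in chosen:
        tokens[i] = preds[i]
        mask[i] = False                   # this position is now fixed
    return tokens, mask

def generate(length, predict_fn, frac=0.3):
    tokens = [None] * length              # start fully masked
    mask = [True] * length
    while any(mask):
        tokens, mask = denoise_step(tokens, mask, predict_fn, frac)
    return tokens
```

With `frac=0.3` and no confidence ranking, a 100-token output resolves in four steps instead of 100 sequential ones; the per-step forward pass is where the parallelism lives.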
Why This Matters
- Parallel generation: many tokens can be generated in one step, reducing wall-clock latency
- Bidirectional context: the model conditions on tokens being generated in both directions, not just left
- Editing flexibility: changing a generated word naturally re-runs the diffusion conditional on the edit
The first point is the biggest production win. Mercury and LLaDA report 2-5x throughput improvements at comparable quality on certain tasks.
What They're Good At
- Long-form generation where many tokens are routine
- Code generation (Mercury Coder reports very strong throughput numbers)
- Editable / controllable outputs (you can edit a generated token and re-diffuse around it)
- Constrained outputs where bidirectional context helps
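The editing point above has a natural mechanical sketch: keep the edited token fixed, re-mask a window around it, and denoise only that window, so the model conditions on context from both sides. Everything here (the window policy, `predict_fn`, the deterministic fill) is an illustrative assumption, not a real API:

```python
def rediffuse_around_edit(tokens, edit_pos, new_token, predict_fn, window=2):
    """Apply an edit, then re-mask and re-denoise only its neighborhood."""
    tokens = list(tokens)
    tokens[edit_pos] = new_token              # the user's edit stays fixed
    mask = [False] * len(tokens)
    lo = max(0, edit_pos - window)
    hi = min(len(tokens), edit_pos + window + 1)
    for i in range(lo, hi):
        if i != edit_pos:
            mask[i] = True                    # re-mask neighbors of the edit
            tokens[i] = None
    while any(mask):
        preds = predict_fn(tokens, mask)      # sees context on BOTH sides of the edit
        for i, m in enumerate(mask):
            if m:
                tokens[i] = preds[i]
                mask[i] = False
    return tokens
```

An autoregressive model can only regenerate everything to the right of an edit; here the untouched prefix and suffix both constrain the re-diffused span.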
Where They Underperform
- Very high-quality reasoning (autoregressive frontier still leads)
- Complex tool use (less ecosystem maturity)
- Streaming output (diffusion does not naturally stream the way autoregressive does)
Mercury and LLaDA Specifics
Mercury (Inception Labs)
Inception Labs's Mercury family includes:
- Mercury Coder: code-focused diffusion LLM, claims 5-10x throughput at comparable benchmarks
- Mercury Chat: general-purpose diffusion LLM for chat workloads
- Public API access since late 2025
LLaDA
LLaDA was the first major open-weights diffusion LLM. It demonstrated parity with similarly-sized autoregressive models on standard benchmarks. Open-weights, mid-sized parameter counts. Several research groups have built on it in 2025-26.
When You Might Use One
```mermaid
flowchart TD
Q1{High-throughput<br/>long-form generation?} -->|Yes| Diff[Try diffusion]
Q1 -->|No| Q2{Streaming UI<br/>required?}
Q2 -->|Yes| AR[Stay autoregressive]
Q2 -->|No| Q3{Editable<br/>structured output?}
Q3 -->|Yes| Diff2[Diffusion fits]
Q3 -->|No| AR2[Autoregressive likely]
```
For most agent and chat workloads in 2026, autoregressive is still the right choice. For code generation at scale and certain document-generation workloads, diffusion is competitive on throughput.
Open Questions
Three things diffusion LLMs have not yet resolved:
- Reasoning depth: top-of-leaderboard reasoning benchmarks are still autoregressive
- Tool use: ecosystem is less mature; native tool calling is an active research area
- Cost economics at small batch: diffusion's parallel advantage shrinks when batch size is small
The expected 2026-2027 picture: diffusion captures specific high-throughput workloads while autoregressive remains the default for general agents and chat.
Adopting Cautiously
If you are evaluating diffusion LLMs for production in 2026:
- Benchmark on your actual task, not just public benchmarks
- Measure end-to-end latency including any pipeline differences (no streaming)
- Verify the ecosystem support for your stack (frameworks, observability)
- Have an autoregressive fallback for tasks where diffusion underperforms
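A minimal harness for the first two checklist items: measure end-to-end wall-clock latency per prompt, on your own prompts, with no credit for streaming. The `generate_fn` callables are placeholders for your own model wrappers:

```python
import statistics
import time

def bench(generate_fn, prompts, runs=3):
    """End-to-end wall-clock latency per prompt (no streaming credit)."""
    samples = []
    for _ in range(runs):
        for p in prompts:
            t0 = time.perf_counter()
            generate_fn(p)                    # full request/response cycle
            samples.append(time.perf_counter() - t0)
    return {"p50": statistics.median(samples), "max": max(samples)}

# Usage sketch: compare your autoregressive baseline and the diffusion model
# on identical prompts, then decide per task:
#   results = {name: bench(fn, my_prompts) for name, fn in models.items()}
```

Because diffusion returns the full output at once, p50 end-to-end latency is the fair comparison; time-to-first-token metrics will systematically favor the streaming autoregressive baseline.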
Sources
- LLaDA paper — https://arxiv.org/abs/2502.09992
- Inception Labs Mercury — https://www.inceptionlabs.ai
- "Diffusion language models" survey 2024 — https://arxiv.org/abs/2401.07953
- "Discrete diffusion" Sahoo et al. — https://arxiv.org/abs/2406.03736
- "DiffuSeq" Gong et al. — https://arxiv.org/abs/2210.08933