FP4 Training: DeepSeek V4, NVIDIA Blackwell, and the End of FP16
FP4 training was a research curiosity in 2024. By 2026 it ships in production frontier models. What changed and what tradeoffs remain.
The Headline
DeepSeek V4 (March 2026) is the first publicly described frontier model trained substantially in FP4. NVIDIA Blackwell's tensor cores run FP4 at twice the throughput of FP8 and four times that of BF16. The arithmetic of training cost finally pushed the industry past 16-bit formats as the default for new pretraining.
This piece walks through what FP4 training actually means, how teams are doing it without quality regressions, and what is still a moving target.
Mixed-Precision Training Refresher
```mermaid
flowchart LR
  Fwd[Forward pass<br/>FP4 weights/activations] --> Loss
  Loss --> Bwd[Backward pass<br/>FP4 gradients]
  Bwd --> Master[FP32 master weights<br/>updated by optimizer]
  Master --> CastF[Cast back to FP4 for next step]
```
You do not train end-to-end in FP4. The standard recipe in 2026:
- Forward and backward pass: FP4 (specifically MXFP4 with E2M1 elements and E8M0 block scales)
- Activations and gradients: MXFP6 or MXFP8 in critical layers
- Master weights: still kept in FP32 or BF16 by the optimizer
- Optimizer state: BF16 or FP8 with stochastic rounding
The result: about 2x the throughput of FP8, roughly 4x BF16, while staying within 0.5 percent of BF16 quality on standard benchmarks.
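As a concrete illustration, here is a minimal PyTorch sketch of one such step, simulating the MXFP4 cast in software. `quantize_mxfp4_sim`, its block size, and the straight-through estimator standing in for the backward cast are illustrative assumptions; real training runs the cast inside Blackwell tensor-core kernels, and it covers activations and gradients too, not just the weights shown here.

```python
import torch

# Magnitudes representable by an E2M1 element (1 sign, 2 exponent, 1 mantissa bit)
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_sim(t: torch.Tensor, block: int = 32) -> torch.Tensor:
    """Simulated MXFP4 cast: one shared power-of-two (E8M0) scale per block,
    elements rounded to the nearest E2M1 value. Returns dequantized FP32."""
    flat = t.reshape(-1, block)                      # assumes numel % block == 0
    amax = flat.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    scale = torch.exp2(torch.ceil(torch.log2(amax / 6.0)))  # block max maps to <= 6.0
    scaled = (flat / scale).clamp(-6.0, 6.0)
    idx = (scaled.abs().unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)
    return (E2M1_GRID[idx] * scaled.sign() * scale).reshape(t.shape)

# One training step: FP32 master weights, FP4 values in the matmul.
master_w = torch.randn(256, 256, requires_grad=True)     # FP32 master weights
opt = torch.optim.AdamW([master_w], lr=1e-4)

x = torch.randn(32, 256)
w4 = quantize_mxfp4_sim(master_w.detach())               # cast for this step's matmul
w4 = master_w + (w4 - master_w).detach()                 # straight-through gradient
loss = (x @ w4.T).pow(2).mean()                          # stand-in loss
loss.backward()                                          # grads accumulate on master_w
opt.step(); opt.zero_grad()
```

The straight-through trick lets the forward pass see quantized weights while gradients flow to the FP32 master copy, which is the essence of the recipe above.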
Why This Required New Tricks
Naive FP4 training diverges. Activations and gradients have wide dynamic ranges that 4 bits cannot represent. The patterns that made it work in 2025-2026:
- Microscaling block sizes tuned per tensor: not all tensors tolerate the same block size. DeepSeek V4 uses block sizes from 16 to 128 depending on tensor type.
- Stochastic rounding in the FP4 cast prevents systematic drift (sketched after this list)
- Selective higher-precision layers: embeddings, layer norms, and the final classifier head stay BF16
- Loss scaling adapted for FP4 dynamic range — a refinement of the older FP16 loss-scaling trick
- Outlier handling: per-tensor outlier clipping or dedicated higher-precision storage for known outlier dimensions
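Of these, stochastic rounding is the easiest to show concretely. The sketch below rounds onto the signed E2M1 grid; `stochastic_round_to_grid` is an illustrative name, and production kernels do this inside the hardware cast instruction:

```python
import torch

def stochastic_round_to_grid(x: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
    """Round each element to one of its two neighbouring grid points, choosing
    the upper one with probability proportional to proximity, so the rounding
    is unbiased: E[round(x)] == x. `grid` must be sorted ascending."""
    x = x.clamp(grid[0].item(), grid[-1].item())
    hi_idx = torch.searchsorted(grid, x).clamp(1, len(grid) - 1)
    lo, hi = grid[hi_idx - 1], grid[hi_idx]
    p_up = (x - lo) / (hi - lo)          # closer to hi -> more likely to round up
    return torch.where(torch.rand_like(x) < p_up, hi, lo)

# Signed E2M1 values. Nearest rounding of 1.2 always lands on 1.0;
# the stochastic version is correct in expectation.
e2m1 = torch.tensor([-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
                      0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
x = torch.full((100_000,), 1.2)
print(stochastic_round_to_grid(x, e2m1).mean())   # ~1.2, not 1.0 or 1.5
```

Nearest rounding biases every cast of a value like 1.2 toward 1.0, and that bias compounds over millions of optimizer steps; the unbiased stochastic version is what prevents the systematic drift.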
The DeepSeek V4 Recipe
DeepSeek V4 published technical details in their Q1 2026 paper. Key points:
- Pretraining done substantially in FP4 (with critical components in higher precision)
- ~14 trillion tokens of training data
- Mixture-of-Experts with FP4 expert weights
- Multi-token prediction objective (related to but different from speculative decoding)
- Total training compute reported substantially below comparable Llama-class models
Independent reproductions of parts of the recipe by Tsinghua and HuggingFace teams have validated that FP4 training is broadly reproducible — not a one-off.
Hardware
```mermaid
flowchart TB
  H100[H100 BF16/FP8] --> Old[Older training]
  H200[H200 FP8 native] --> Mid[2024-2025 mainstream]
  B200[Blackwell B200<br/>FP4 native] --> New[2026 frontier]
  MI355[AMD MI355X<br/>FP4 native] --> NewAMD[2026 alternative]
```
Blackwell's FP4 tensor cores are the production hardware enabling this in 2026. AMD's MI355X added FP4 support and is closing the gap. Older H100 fleets cannot do FP4 natively — they emulate it slowly. The capex shift toward Blackwell is partly motivated by FP4 economics.
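If you want to check what your fleet can do, compute capability is a reasonable proxy. The sketch below assumes, as a heuristic rather than an official contract, that Blackwell-class parts report CUDA compute capability 10.x while Hopper parts report 9.x:

```python
import torch

# Heuristic: Blackwell (B100/B200) reports compute capability 10.x,
# Hopper (H100/H200) reports 9.x. Verify against your own driver/CUDA
# stack before relying on it.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    name = torch.cuda.get_device_name()
    if (major, minor) >= (10, 0):
        print(f"{name} (sm_{major}{minor}): native FP4 tensor cores expected")
    else:
        print(f"{name} (sm_{major}{minor}): FP4 would be emulated, slowly")
else:
    print("No CUDA device visible")
```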
What Still Doesn't Fit
- Very small models: under ~3B parameters, FP4 training quality regressions are larger relative to BF16; the dollar savings are also smaller
- Tasks with extreme tail dependence: math benchmarks and hard reasoning still show ~1-point regressions in some FP4 training runs; for the highest-quality math models, BF16 weights are still preferred
- RL fine-tuning: PPO and GRPO fine-tunes are sensitive; many teams keep RLHF in BF16 even when pretraining was FP4
What This Means for Practitioners
If you are pretraining a frontier model in 2026, FP4 is the default path on Blackwell hardware. If you are fine-tuning or doing post-training, the choice depends on framework support — most frameworks (Megatron-LM, NeMo, TorchTitan) support FP4 mixed-precision; some (smaller research libraries) do not yet.
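If your framework does not expose FP4 yet, the "selective higher-precision layers" pattern from earlier can be prototyped in plain PyTorch. The sketch below reuses the `quantize_mxfp4_sim` helper from the first snippet; `FP4Linear`, `fp4ify`, and the skip substrings are illustrative names, not any framework's API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FP4Linear(nn.Linear):
    """nn.Linear whose weight passes through the simulated MXFP4 cast on the
    forward pass. The underlying parameter stays full precision, so the
    optimizer still updates master weights. Assumes weight.numel() is a
    multiple of the quantizer's block size."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w4 = quantize_mxfp4_sim(self.weight.detach())    # from the first sketch
        w4 = self.weight + (w4 - self.weight).detach()   # straight-through
        return F.linear(x, w4, self.bias)

def fp4ify(model: nn.Module, skip=("embed", "norm", "lm_head")) -> nn.Module:
    """Swap nn.Linear -> FP4Linear everywhere except modules whose name
    contains a `skip` substring; embeddings, norms, and the head stay in
    higher precision. Match the substrings to your own model's naming."""
    for name, child in model.named_children():
        if any(s in name for s in skip):
            continue
        if isinstance(child, nn.Linear):
            fp4 = FP4Linear(child.in_features, child.out_features,
                            bias=child.bias is not None)
            fp4.load_state_dict(child.state_dict())
            setattr(model, name, fp4)
        else:
            fp4ify(child, skip)
    return model
```

This is a prototyping aid for measuring quality sensitivity layer by layer, not a performance win: without native kernels the simulated cast adds overhead rather than removing it.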
For inference, FP4 weights are essentially free quality-wise for chat and agentic workloads. They are now the default in production.
Sources
- DeepSeek V4 technical report — https://github.com/deepseek-ai
- "FP4 training in practice," NVIDIA Developer Blog — https://developer.nvidia.com/blog
- OCP Microscaling Formats (MX) specification — https://www.opencompute.org
- "Microscaling Data Formats for Deep Learning," Rouhani et al. — https://arxiv.org/abs/2310.10537
- "FP8 Formats for Deep Learning," Micikevicius et al. — https://arxiv.org/abs/2209.05433