
FP4 Training: DeepSeek V4, NVIDIA Blackwell, and the End of FP16

FP4 training was a research curiosity in 2024. By 2026 it ships in production frontier models. What changed and what tradeoffs remain.

The Headline

DeepSeek V4 (March 2026) is the first publicly described frontier model trained substantially in FP4. NVIDIA Blackwell's tensor cores accelerate FP4 at twice the rate of FP8 and four times BF16. The arithmetic of training cost finally pushed the industry past FP16 as the default for new pretraining.

This piece walks through what FP4 training actually means, how teams are doing it without quality regressions, and what is still a moving target.

Mixed-Precision Training Refresher

```mermaid
flowchart LR
    Fwd[Forward pass<br/>FP4 weights/activations] --> Loss
    Loss --> Bwd[Backward pass<br/>FP4 gradients]
    Bwd --> Master[FP32 master weights<br/>updated by optimizer]
    Master --> CastF[Cast back to FP4 for next step]
```
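The loop in the diagram can be sketched in a few lines. This is a toy illustration, not production code: `cast_fp4` here just snaps to a coarse grid to stand in for a real MXFP4 cast, and the "model" is a single scalar weight trained on squared error.

```python
# Toy sketch of the mixed-precision loop: compute in low precision,
# accumulate updates in a full-precision master weight.

def cast_fp4(w):
    # Stand-in for a real MXFP4 cast: snap to multiples of 0.5.
    return round(w * 2) / 2

def train_step(master_w, x, target, lr=0.1):
    w4 = cast_fp4(master_w)            # low-precision copy used for compute
    pred = w4 * x                      # "FP4" forward pass
    grad = 2 * (pred - target) * x     # backward pass (squared-error gradient)
    return master_w - lr * grad        # optimizer updates the FP32 master weight

w = 0.3
for _ in range(50):
    w = train_step(w, x=1.0, target=1.0)
# The quantized weight reaches the target even though each individual
# cast is coarse: small gradient steps accumulate in the master copy.
```

Note the asymmetry: the coarse cast is used for compute, while gradient updates accumulate in the master weight. Discarding the full-precision copy would make updates smaller than the grid spacing vanish, stalling training.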

You do not train end-to-end in FP4. The standard recipe in 2026:

  • Forward and backward pass: FP4 (specifically MXFP4 with E2M1 elements and E8M0 block scales)
  • Activations and gradients: MXFP6 or MXFP8 in critical layers
  • Master weights: still kept in FP32 or BF16 by the optimizer
  • Optimizer state: BF16 or FP8 with stochastic rounding
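As a concrete illustration of the first bullet, here is a minimal pure-Python sketch of MXFP4-style block quantization: E2M1 elements (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) sharing one power-of-two, E8M0-style scale per block. The scale rule below follows the common convention of aligning the block's maximum exponent with the element format's maximum exponent; real kernels do this in hardware over fixed-size blocks, not Python lists.

```python
import math

# Non-negative magnitudes representable by an E2M1 (FP4) element.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = 6.0  # 1.5 * 2**2, so the element format's max exponent is 2

def nearest_fp4(x):
    """Round-to-nearest onto the FP4 grid, clipping to +/-6."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP4_MAX)
    return sign * min(FP4_GRID, key=lambda g: abs(g - mag))

def quantize_block(block):
    """Quantize one block to FP4 elements plus a shared power-of-two scale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block), 1.0
    # E8M0-style scale: a pure power of two aligning amax with FP4_MAX.
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    return [nearest_fp4(x / scale) for x in block], scale

def dequantize(elems, scale):
    return [e * scale for e in elems]
```

For example, `quantize_block([0.7, -1.2, 0.1, 3.9])` yields elements `[1.5, -2.0, 0.0, 6.0]` with scale `0.5`; the 3.9 falls past the scaled range and gets clipped, which is the kind of error the outlier-handling tricks below exist to contain.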

The result: about 2x the throughput of FP8, roughly 4x BF16, while staying within 0.5 percent of BF16 quality on standard benchmarks.

Why This Required New Tricks

Naive FP4 training diverges. Activations and gradients have wide dynamic ranges that 4 bits cannot represent. The patterns that made it work in 2025-2026:

  • Microscaling block sizes tuned per tensor: not all tensors tolerate the same block size. DeepSeek V4 uses block sizes from 16 to 128 depending on tensor type.
  • Stochastic rounding in the FP4 cast prevents systematic drift
  • Selective higher-precision layers: embeddings, layer norms, and the final classifier head stay BF16
  • Loss scaling adapted for FP4 dynamic range — a refinement of the older FP16 loss-scaling trick
  • Outlier handling: per-tensor outlier clipping or dedicated higher-precision storage for known outlier dimensions
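The stochastic-rounding bullet is easy to make concrete. Instead of always rounding to the nearest FP4 value, the cast rounds up with probability proportional to the distance from the lower neighbor, so the quantized value equals the input in expectation and the rounding error carries no systematic bias. A minimal sketch over the E2M1 magnitude grid (this is the general technique, not DeepSeek's kernel):

```python
import random

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def stochastic_round_fp4(x, rng=random):
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), 6.0)            # clip to the FP4 max magnitude
    lo = max(g for g in FP4_GRID if g <= mag)
    hi = min(g for g in FP4_GRID if g >= mag)
    if lo == hi:                      # already exactly representable
        return sign * lo
    # Round up with probability (mag - lo) / (hi - lo), so that
    # E[result] == x and errors cancel over many updates.
    p_up = (mag - lo) / (hi - lo)
    return sign * (hi if rng.random() < p_up else lo)
```

Averaging `stochastic_round_fp4(1.1)` over many draws recovers roughly 1.1, whereas round-to-nearest would return 1.0 every time; that deterministic bias is exactly the systematic drift the trick prevents.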

The DeepSeek V4 Recipe

DeepSeek published V4's technical details in a Q1 2026 paper. Key points:

  • Pretraining done substantially in FP4 (with critical components in higher precision)
  • ~14 trillion tokens of training data
  • Mixture-of-Experts with FP4 expert weights
  • Multi-token prediction objective (related to but different from speculative decoding)
  • Total training compute reported substantially below comparable Llama-class models

Independent reproductions of parts of the recipe by Tsinghua and HuggingFace teams have validated that FP4 training is broadly reproducible — not a one-off.


Hardware

```mermaid
flowchart TB
    H100[H100 BF16/FP8] --> Old[Older training]
    H200[H200 FP8 native] --> Mid[2024-2025 mainstream]
    B200[Blackwell B200<br/>FP4 native] --> New[2026 frontier]
    MI355[AMD MI355X<br/>FP4 native] --> NewAMD[2026 alternative]
```

Blackwell's FP4 tensor cores are the production hardware enabling this in 2026. AMD's MI355X added FP4 support and is closing the gap. Older H100 fleets cannot do FP4 natively — they emulate it slowly. The capex shift toward Blackwell is partly motivated by FP4 economics.

What Still Doesn't Fit

  • Very small models: under ~3B parameters, FP4 training quality regressions are larger relative to BF16; the dollar savings are also smaller
  • Tasks with extreme tail dependence: math benchmarks and hard reasoning still show ~1 point regressions in some FP4 trainings; for the highest-quality math models, BF16 weights are still preferred
  • RL fine-tuning: PPO and GRPO fine-tunes are sensitive; many teams keep RLHF in BF16 even when pretraining was FP4

What This Means for Practitioners

If you are pretraining a frontier model in 2026, FP4 is the default path on Blackwell hardware. If you are fine-tuning or doing post-training, the choice depends on framework support — most frameworks (Megatron-LM, NeMo, TorchTitan) support FP4 mixed-precision; some (smaller research libraries) do not yet.

For inference, FP4 weights are essentially free quality-wise for chat and agentic workloads. They are now the default in production.
