KV-Cache Offloading Strategies: CPU, GPU, and NVMe Tradeoffs in 2026
The KV-cache is the dominant memory cost in long-context inference. These are the 2026 offloading strategies that make 1M-token serving practical.
Why KV-Cache Dominates Memory
When an LLM generates a token, it needs to attend to every prior token. The keys and values for those prior tokens are cached so they do not need to be recomputed. At long contexts, the cache becomes the dominant memory consumer — often more than the model weights themselves.
For a Llama-class 70B model at 128K context with batch size 1, the KV-cache runs roughly 16-20 GB depending on KV head count and cache dtype. At 1M context the same cache scales linearly to roughly 130-160 GB. Offloading is no longer optional.
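The footprint follows directly from the model shape. A minimal sketch of the arithmetic, assuming Llama-70B-like dimensions (80 layers, 8 grouped-query KV heads, head dim 128); swap in your own config:

```python
def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 80,      # Llama-70B-class assumption
                   num_kv_heads: int = 8,     # grouped-query attention
                   head_dim: int = 128,
                   bytes_per_elem: int = 1):  # 1 = FP8, 2 = BF16
    # Keys and values are both cached at every layer, hence the factor of 2.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return num_tokens * per_token

print(kv_cache_bytes(128 * 1024) / 2**30)   # ~20 GiB at 128K in FP8
print(kv_cache_bytes(1024 * 1024) / 2**30)  # ~160 GiB at 1M in FP8
```

A BF16 KV-cache doubles both figures, which is why serving stacks quantize the cache even when weights stay in BF16.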
Where KV-Cache Can Live
```mermaid
flowchart LR
    Hot[Hot tier: GPU HBM<br/>fastest, smallest] --> Active[Active tokens]
    Warm[Warm tier: CPU RAM<br/>fast over PCIe / CXL] --> Recent[Recent tokens]
    Cold[Cold tier: NVMe<br/>slow, large] --> Old[Old tokens]
    NVL[Cross-GPU NVLink<br/>distributed cache] --> Shared[Shared cache]
```
Four locations, very different speeds and costs (the sketch after this list turns them into transfer times):
- GPU HBM: ~3 TB/s, 80-192 GB per card, expensive
- CPU RAM via PCIe: ~64 GB/s, hundreds of GB, cheap
- CPU RAM via CXL or Grace-Hopper coherent link: ~900 GB/s, hundreds of GB, available on specific hardware
- NVMe: ~14 GB/s, tens of TB, very cheap
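A back-of-the-envelope sketch to build intuition, assuming the 18 GB working set from the Bandwidth Math section below and the peak bandwidths quoted above:

```python
# Rough time to page an 18 GB KV working set over each link at peak
# bandwidth; sustained throughput is lower in practice.
tiers_gbps = {
    "CXL / coherent link": 900,
    "PCIe Gen5 x16": 64,
    "NVMe": 14,
}
working_set_gb = 18  # the 70B / 128K figure from Bandwidth Math below
for tier, gbps in tiers_gbps.items():
    print(f"{tier:>20}: {working_set_gb / gbps * 1000:7.1f} ms")
# CXL: ~20 ms, PCIe: ~281 ms, NVMe: ~1286 ms
```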
The Strategies That Work
Static Layer Sharding
Half the layers' KV stays on GPU, half on CPU. Predictable, simple, but every token incurs PCIe traffic. Used in older offloading schemes.
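A minimal sketch of the split, assuming a hypothetical per-layer cache and PyTorch tensors (not any specific framework's API):

```python
import torch

NUM_LAYERS = 80
# Static assignment: the first half of the layers keep KV on GPU,
# the second half on CPU. The split never changes at runtime.
kv_device = ["cuda" if i < NUM_LAYERS // 2 else "cpu" for i in range(NUM_LAYERS)]

def attend(layer: int, query, kv_cache):
    k, v = kv_cache[layer]
    if kv_device[layer] == "cpu":
        # The fixed cost of static sharding: this layer's KV crosses
        # PCIe on every single decode step.
        k, v = k.to("cuda", non_blocking=True), v.to("cuda", non_blocking=True)
    return torch.nn.functional.scaled_dot_product_attention(query, k, v)
```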
Token-Aged Eviction
Recent tokens stay hot; older tokens migrate to CPU. Combined with prefix-attention patterns, this saves substantial bandwidth: most attention mass concentrates on recent tokens, so evicted blocks are rarely fetched back.
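A sketch of the policy, assuming paged KV blocks stamped with the last decode step that attended to them (the block structure and age threshold are illustrative):

```python
from dataclasses import dataclass

@dataclass
class KVBlock:
    block_id: int
    last_step: int       # decode step that last attended to this block
    device: str = "cuda"

def evict_aged(blocks: list[KVBlock], step: int, max_age: int = 4096) -> None:
    # Blocks no recent token attends to migrate down to CPU RAM. Because
    # attention mass concentrates on recent tokens, they are rarely
    # re-fetched, so the PCIe cost is paid once instead of every step.
    for blk in blocks:
        if blk.device == "cuda" and step - blk.last_step > max_age:
            blk.device = "cpu"  # stand-in for the actual DMA transfer
```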
Streaming KV with Prefix Caching
For multi-turn conversations, the system prompt and conversation prefix are cached on disk and copied to GPU on demand. This is the foundation of "prompt caching" — Anthropic, OpenAI, and Google all implement variants. The cost reduction for repeated long prompts is 5-10x.
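Underneath, this is a content-addressed lookup: hash the prompt prefix and reuse its stored KV on a hit. A minimal sketch (the hashing scheme and in-memory store are illustrative, not any provider's implementation):

```python
import hashlib

# hash of token prefix -> serialized KV blocks (on NVMe in a real system)
prefix_store: dict[str, bytes] = {}

def prefix_key(token_ids: list[int]) -> str:
    return hashlib.sha256(repr(token_ids).encode()).hexdigest()

def kv_for_prefix(token_ids: list[int], compute_prefill) -> bytes:
    key = prefix_key(token_ids)
    if key not in prefix_store:
        prefix_store[key] = compute_prefill(token_ids)  # miss: prefill once
    return prefix_store[key]  # hit: the long prefill is skipped entirely
```

Production implementations hash at block granularity rather than whole prompts, which is how vLLM's prefix caching and SGLang's RadixAttention share partial prefixes across requests.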
Distributed KV across NVLink
For multi-GPU deployments, NVLink-connected GPUs can share a single logical KV-cache. NVIDIA's NVLink Switch on Blackwell lets 72 GPUs pool roughly 13 TB of HBM for KV at interconnect speeds well above PCIe. This is the path for million-token serving at scale.
A Modern Inference Server Stack
```mermaid
flowchart TB
    Req[Request] --> S[Scheduler]
    S --> Loc[Locate prefix in cache]
    Loc -->|Hit, on GPU| Run[Run]
    Loc -->|Hit, on CPU| Move1[Move to GPU]
    Loc -->|Hit, on NVMe| Move2[Stream from NVMe]
    Loc -->|Miss| Comp[Compute prefix]
    Move1 --> Run
    Move2 --> Run
    Comp --> Run
```
vLLM with PagedAttention, SGLang with RadixAttention, and TensorRT-LLM all implement this hierarchy in 2026. The gains over flat KV management are large: 3-10x throughput on conversational workloads.
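In code, the scheduler's decision tree above reduces to a tiered lookup with promotion on hit. A sketch with hypothetical tier stores, ordered hot to cold:

```python
TIER_ORDER = ["gpu", "cpu", "nvme"]  # hot to cold, matching the diagram

def schedule(prefix_hash, stores, compute_prefix, run):
    for tier in TIER_ORDER:
        kv = stores[tier].get(prefix_hash)
        if kv is not None:
            if tier != "gpu":
                kv = stores["gpu"].admit(kv)  # promote before running
            return run(kv)                    # one of the three hit paths
    # Miss on every tier: compute the prefix and admit it hot.
    return run(stores["gpu"].admit(compute_prefix()))
```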
Bandwidth Math
A 70B model at 128K context needs roughly 18 GB of KV per request with an 8-bit KV-cache (roughly double in BF16). Eight concurrent users therefore need 144 GB of KV, already over a single H100's 80 GB. The choices:
- 2x H100 with tensor parallelism (cost: 2 H100s)
- 1x H200 (141 GB) with offloading
- 1x H100 with CPU offload over PCIe
PCIe Gen 5 x16 gives 64 GB/s. A 144 GB working set needs to be carefully sliced; full-cache transfers per token are not feasible. Token-aged eviction is what makes this practical.
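Worked out per token (the decode rate is an assumption for illustration):

```python
pcie_gbps = 64           # PCIe Gen5 x16, peak
decode_tok_per_s = 50    # assumed aggregate decode rate across the batch
working_set_gb = 144

budget_gb_per_token = pcie_gbps / decode_tok_per_s   # ~1.3 GB per token
overshoot = working_set_gb / budget_gb_per_token     # ~112x over budget
print(f"{budget_gb_per_token:.2f} GB/token PCIe budget; "
      f"moving the full cache overshoots it {overshoot:.0f}x")
# Token-aged eviction fits the budget by moving only a thin cold slice per step.
```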
What This Looks Like in vLLM 2026
vLLM 0.7+ ships:
- PagedAttention as the GPU primitive (paged blocks instead of contiguous tensors)
- CPU offload of older blocks
- Disk offload (experimental as of early 2026)
- Prefix caching with deduplication across requests
For most deployments, enabling --enable-prefix-caching and sizing the CPU swap space is the entire performance optimization; the system handles the rest.
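For example, via vLLM's Python entry point (kwargs as of recent vLLM releases; verify against your version's docs):

```python
from vllm import LLM, SamplingParams

# Prefix caching plus CPU swap sizing is the whole tuning story here.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    enable_prefix_caching=True,  # dedupe shared prompt prefixes across requests
    swap_space=16,               # GiB of CPU RAM for swapped-out KV blocks
    gpu_memory_utilization=0.90,
)
outputs = llm.generate(["Summarize the KV-cache tiers."],
                       SamplingParams(max_tokens=64))
```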
Where Offloading Hurts
- Very high concurrency: many users with many small KVs thrash the tiers; static partitioning beats hierarchical management
- Latency-sensitive workloads: any cache miss adds tens to hundreds of milliseconds; for sub-second SLAs you keep KV resident
- Streaming voice agents: voice latency budgets are too tight for any cold-fetch path; size HBM to keep all active KV resident
Sources
- vLLM PagedAttention — https://blog.vllm.ai
- SGLang RadixAttention — https://lmsys.org/blog
- TensorRT-LLM KV-cache management — https://nvidia.github.io/TensorRT-LLM
- "Efficient KV cache offloading" 2025 paper — https://arxiv.org/abs/2403.06504
- NVIDIA Blackwell NVLink Switch — https://www.nvidia.com/en-us/data-center/blackwell