
KV-Cache Offloading Strategies: CPU, GPU, and NVMe Tradeoffs in 2026

KV-cache is the dominant memory cost in long-context inference. The 2026 offloading strategies that make 1M-token serving practical.

Why KV-Cache Dominates Memory

When an LLM generates a token, it needs to attend to every prior token. The keys and values for those prior tokens are cached so they do not need to be recomputed. At long contexts, the cache becomes the dominant memory consumer — often more than the model weights themselves.

For a Llama-class 70B model at 128K context with batch size 1, the KV-cache is roughly 16-20 GB depending on KV head count and cache dtype. At 1M context the same cache scales past 130 GB. Offloading is no longer optional.
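The per-token cost is easy to estimate from the attention geometry. The sketch below assumes a Llama-3-70B-style layout (80 layers, 8 KV heads, head dimension 128); the exact figures shift with the model's KV head count and the cache dtype, which is why the range above is wide.

    # Back-of-envelope KV-cache sizing. The geometry below is an assumption
    # (Llama-3-70B-style GQA), not a universal constant.
    LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

    def kv_bytes(tokens: int, dtype_bytes: int) -> int:
        # 2x for keys and values, per layer, per KV head, per head dimension.
        return 2 * LAYERS * KV_HEADS * HEAD_DIM * dtype_bytes * tokens

    for ctx in (128 * 1024, 1024 * 1024):
        for name, width in (("bf16", 2), ("fp8", 1)):
            print(f"{ctx:>9} tokens, {name}: {kv_bytes(ctx, width) / 1e9:6.1f} GB")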

Where KV-Cache Can Live

flowchart LR
    Hot[Hot tier: GPU HBM<br/>fastest, smallest] --> Active[Active tokens]
    Warm[Warm tier: CPU RAM<br/>fast over PCIe / CXL] --> Recent[Recent tokens]
    Cold[Cold tier: NVMe<br/>slow, large] --> Old[Old tokens]
    NVL[Cross-GPU NVLink<br/>distributed cache] --> Shared[Shared cache]

Four locations, very different speeds and costs:

  • GPU HBM: ~3 TB/s, 80-192 GB per card, expensive
  • CPU RAM via PCIe: ~64 GB/s, hundreds of GB, cheap
  • CPU RAM via CXL or Grace-Hopper coherent link: ~900 GB/s, hundreds of GB, available on specific hardware
  • NVMe: ~14 GB/s, tens of TB, very cheap
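To make the gap concrete, here is the rough time to pull an 18 GB KV-cache (the 70B / 128K figure used in the bandwidth section below) through each tier, ignoring overlap with compute and software overhead; the bandwidth numbers are the ones listed above.

    # Naive fetch-time estimate for an 18 GB KV-cache through each tier.
    CACHE_GB = 18
    TIERS_GBPS = {"GPU HBM": 3000, "CXL / NVLink-C2C": 900,
                  "PCIe Gen 5 x16": 64, "NVMe": 14}
    for tier, gbps in TIERS_GBPS.items():
        print(f"{tier:>17}: {CACHE_GB / gbps * 1000:7.1f} ms")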

The Strategies That Work

Static Layer Sharding

Half the layers' KV stays on GPU, half on CPU. Predictable, simple, but every token incurs PCIe traffic. Used in older offloading schemes.
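A minimal sketch of the idea, assuming a PyTorch-style cache where each layer owns one [2, tokens, heads, dim] tensor; the names, shapes, and split point are illustrative, not any framework's actual API.

    import torch

    # Static layer sharding: KV for the first half of the layers lives on the
    # GPU, the second half in pinned CPU memory. Illustrative shapes only.
    N_LAYERS, MAX_TOKENS, KV_HEADS, HEAD_DIM = 80, 8192, 8, 128
    SPLIT = N_LAYERS // 2

    def alloc_kv(layer: int) -> torch.Tensor:
        device = "cuda" if layer < SPLIT else "cpu"
        t = torch.empty(2, MAX_TOKENS, KV_HEADS, HEAD_DIM,
                        dtype=torch.bfloat16, device=device)
        return t.pin_memory() if device == "cpu" else t  # pinning speeds up PCIe copies

    kv = [alloc_kv(layer) for layer in range(N_LAYERS)]

    def kv_for_attention(layer: int, n_tokens: int) -> torch.Tensor:
        blk = kv[layer][:, :n_tokens]
        # CPU-resident layers pay a PCIe copy on every decode step.
        return blk.to("cuda", non_blocking=True) if blk.device.type == "cpu" else blk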


Token-Aged Eviction

Recent tokens stay hot; older tokens migrate to CPU. Combined with attention patterns that keep a small prefix plus a recency window on the GPU, this saves substantial bandwidth, since most attention mass falls on recent tokens.
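A toy version of the policy, operating on fixed-size KV blocks rather than individual tokens; the block size, budget, and simple FIFO-by-age queue are assumptions for illustration, not how vLLM or any specific server implements eviction.

    from collections import deque

    # Token-aged eviction over fixed-size KV blocks: when the GPU block budget
    # is exceeded, the oldest blocks migrate to CPU RAM.
    BLOCK_TOKENS, GPU_BLOCK_BUDGET = 16, 4096

    gpu_blocks: deque[int] = deque()   # block ids, oldest first
    cpu_blocks: set[int] = set()

    def on_block_filled(block_id: int) -> None:
        gpu_blocks.append(block_id)
        while len(gpu_blocks) > GPU_BLOCK_BUDGET:
            oldest = gpu_blocks.popleft()   # oldest tokens leave HBM first
            cpu_blocks.add(oldest)          # the actual copy over PCIe happens here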

Streaming KV with Prefix Caching

For multi-turn conversations, the system prompt and conversation prefix are cached on disk and copied to GPU on demand. This is the foundation of "prompt caching" — Anthropic, OpenAI, and Google all implement variants. The cost reduction for repeated long prompts is 5-10x.
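A minimal sketch of disk-backed prefix caching keyed on a hash of the prompt tokens. The cache directory, serialization format, and the compute_kv / attach_kv hooks are placeholders, not how any particular provider implements prompt caching.

    import hashlib, os, pickle

    CACHE_DIR = "/var/cache/kv-prefix"   # illustrative path
    os.makedirs(CACHE_DIR, exist_ok=True)

    def prefix_key(token_ids: list[int]) -> str:
        return hashlib.sha256(repr(token_ids).encode()).hexdigest()

    def load_or_build(token_ids, compute_kv, attach_kv):
        path = os.path.join(CACHE_DIR, prefix_key(token_ids))
        if os.path.exists(path):              # hit: stream cached KV to the GPU
            with open(path, "rb") as f:
                attach_kv(pickle.load(f))
        else:                                 # miss: prefill once, persist for reuse
            kv = compute_kv(token_ids)
            with open(path, "wb") as f:
                pickle.dump(kv, f)
            attach_kv(kv)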

Cross-GPU KV Sharing over NVLink

For multi-GPU deployments, NVLink-connected GPUs can share a single logical KV-cache. NVIDIA's NVLink Switch on Blackwell lets 72 GPUs share a 13 TB pool of KV at near-HBM speeds. This is the path for million-token serving at scale.

A Modern Inference Server Stack

flowchart TB
    Req[Request] --> S[Scheduler]
    S --> Loc[Locate prefix in cache]
    Loc -->|Hit, on GPU| Run[Run]
    Loc -->|Hit, on CPU| Move1[Move to GPU]
    Loc -->|Hit, on NVMe| Move2[Stream from NVMe]
    Loc -->|Miss| Comp[Compute prefix]
    Move1 --> Run
    Move2 --> Run
    Comp --> Run

vLLM with PagedAttention, SGLang with RadixAttention, and TensorRT-LLM all implement this hierarchy in 2026. The gains over flat KV management are large: 3-10x throughput on conversational workloads.
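A self-contained sketch of the lookup order in the diagram; the tier names, index, and fetch callables are illustrative, not any server's internal API.

    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass
    class CacheHit:
        tier: str        # "gpu" | "cpu" | "nvme"
        handle: object   # resident blocks, or a file path for NVMe

    index: dict[str, CacheHit] = {}

    def locate(prefix_hash: str,
               compute_prefix: Callable[[], object],
               copy_to_gpu: Callable[[object], object],
               stream_from_nvme: Callable[[object], object]) -> object:
        hit: Optional[CacheHit] = index.get(prefix_hash)
        if hit is None:
            return compute_prefix()            # miss: pay the full prefill once
        if hit.tier == "gpu":
            return hit.handle                  # already resident, zero copy
        if hit.tier == "cpu":
            return copy_to_gpu(hit.handle)     # PCIe or coherent-link copy
        return stream_from_nvme(hit.handle)    # slowest hit, usually still beats recompute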


Bandwidth Math

A 70B BF16 model at 128K context per request needs roughly 18 GB of KV. To run 8 concurrent users you need 144 GB of KV — already over a single H100's 80 GB. The choices:

  • 2x H100 with tensor parallelism (cost: 2 H100s)
  • 1x H200 (141 GB) with offloading
  • 1x H100 with CPU offload over PCIe

PCIe Gen 5 x16 gives roughly 64 GB/s per direction. A 144 GB working set has to be carefully sliced; full-cache transfers per token are not feasible. Token-aged eviction is what makes this practical, as the arithmetic below shows.
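The arithmetic, with an assumed decode rate of 30 tokens per second per user; adjust for your own workload.

    # PCIe budget check for the 8-user example above. The decode rate is an
    # assumption; everything else comes from the numbers in this section.
    PCIE_GBPS = 64
    USERS, TOKENS_PER_S = 8, 30
    KV_PER_USER_GB = 18

    per_token_budget = PCIE_GBPS / (USERS * TOKENS_PER_S)   # GB movable per generated token
    overshoot = KV_PER_USER_GB / per_token_budget           # how far a full-cache pull overshoots
    print(f"{per_token_budget:.2f} GB/token budget; a full refetch is {overshoot:.0f}x over it")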

What This Looks Like in vLLM 2026

vLLM 0.7+ ships:

  • PagedAttention as the GPU primitive (paged blocks instead of contiguous tensors)
  • CPU offload of older blocks
  • Disk offload (experimental as of early 2026)
  • Prefix caching with deduplication across requests

For most deployments, turning on --enable-prefix-caching and sizing the CPU swap space is the entire tuning exercise. The system handles the rest.
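For reference, a minimal launch sketch through vLLM's Python API; the argument names reflect recent vLLM releases and may differ in yours, so check the version you deploy.

    from vllm import LLM

    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",  # example model id
        enable_prefix_caching=True,    # deduplicate shared prompt prefixes across requests
        swap_space=16,                 # GiB of CPU RAM per GPU for offloaded KV blocks
        gpu_memory_utilization=0.90,   # leave headroom for activations
        tensor_parallel_size=2,        # a 70B model needs more than one 80 GB GPU
    )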

Where Offloading Hurts

  • Very high concurrency: many users with many small KVs thrash; static partitioning beats hierarchical
  • Latency-sensitive workloads: any cache miss adds tens to hundreds of milliseconds; for sub-second SLAs you keep KV resident
  • Streaming voice agents: voice latency budgets are too tight for any cold-fetch path; size HBM to keep all active KV resident
