LLM Inference Optimization: Quantization, Speculative Decoding, and Beyond
A technical guide to modern LLM inference optimization techniques — quantization, speculative decoding, KV-cache optimization, continuous batching, and PagedAttention. Make models faster and cheaper.
Why Inference Optimization Matters
Training a large language model is a one-time cost. Inference — serving predictions to users — is the ongoing expense that determines whether a model is economically viable in production. A model that costs $10 million to train but $0.001 per query can generate billions of responses profitably. The same model at $0.10 per query may be commercially unviable.
Inference optimization is the discipline of making models faster, cheaper, and more memory-efficient without sacrificing output quality. Here are the techniques that matter most in 2026.
Quantization: Trading Precision for Speed
Quantization reduces the numerical precision of model weights from 16-bit or 32-bit floating point to lower bit widths (8-bit, 4-bit, or even 2-bit integers).
Why it works: Most model weights cluster in a narrow range around zero. Storing a weight as a full 16-bit float versus an 8-bit integer with a shared scale factor (so 0.0234375 round-trips to roughly 0.023) makes a negligible difference to output quality but halves memory usage.
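A minimal sketch of what a quantizer actually does, here symmetric per-tensor INT8 in NumPy (the function names are illustrative, not from any particular library):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantization: store int8 values plus one FP scale."""
    scale = np.abs(weights).max() / 127.0                      # map the largest weight to +/-127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32) * 0.02     # typical small-magnitude weights
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

print("memory (FP16 -> INT8):", w.nbytes // 2, "->", q.nbytes, "bytes")  # 2x reduction
print("mean abs error:", np.abs(w - w_hat).mean())                       # tiny reconstruction error
```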
Common quantization methods:
| Method | Bits | Quality Loss | Speed Gain | Memory Reduction |
|---|---|---|---|---|
| FP16 (baseline) | 16 | None | 1x | 1x |
| INT8 (W8A8) | 8 | Minimal | 1.5-2x | 2x |
| GPTQ (W4A16) | 4 | Small | 2-3x | 4x |
| AWQ | 4 | Small | 2-3x | 4x |
| GGUF Q4_K_M | 4 | Small | 2-3x | 4x |
| QuIP# | 2 | Moderate | 4-5x | 8x |
Practical example: A 70B parameter model requires ~140GB in FP16, needing 2x A100 80GB GPUs. With 4-bit quantization the weights shrink to roughly 35GB and fit on a single A100; a consumer RTX 4090 (24GB) still requires more aggressive 2-3 bit quantization or CPU offloading.
```bash
# Quantizing with llama.cpp
./quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

```bash
# Serving with vLLM and AWQ quantization
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-3.3-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 1
```
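A quick back-of-the-envelope check of those footprints (weights only; a real deployment also needs headroom for the KV-cache and activations):

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory footprint, ignoring KV-cache and activations."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"70B model @ {bits}-bit: {weight_memory_gb(70, bits):.0f} GB")
# 16-bit: 140 GB, 8-bit: 70 GB, 4-bit: 35 GB
```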
Speculative Decoding: Draft and Verify
LLM inference is bottlenecked by sequential token generation — each token requires a full forward pass. Speculative decoding breaks this bottleneck by using a small, fast "draft" model to generate candidate tokens, then verifying them in parallel with the large model.
```mermaid
flowchart TD
    HUB(("Why Inference<br/>Optimization Matters"))
    HUB --> L0["Quantization: Trading<br/>Precision for Speed"]
    style L0 fill:#e0e7ff,stroke:#6366f1,color:#1e293b
    HUB --> L1["Speculative Decoding: Draft<br/>and Verify"]
    style L1 fill:#e0e7ff,stroke:#6366f1,color:#1e293b
    HUB --> L2["KV-Cache Optimization"]
    style L2 fill:#e0e7ff,stroke:#6366f1,color:#1e293b
    HUB --> L3["PagedAttention and vLLM"]
    style L3 fill:#e0e7ff,stroke:#6366f1,color:#1e293b
    HUB --> L4["Continuous Batching"]
    style L4 fill:#e0e7ff,stroke:#6366f1,color:#1e293b
    HUB --> L5["Putting It All Together"]
    style L5 fill:#e0e7ff,stroke:#6366f1,color:#1e293b
    style HUB fill:#4f46e5,stroke:#4338ca,color:#fff
```
How it works:
- The draft model (e.g., Llama 3.1 8B or Llama 3.2 1B) generates K candidate tokens quickly
- The target model (e.g., Llama 3.3 70B) verifies all K tokens in a single forward pass
- Accepted tokens are kept; the first rejected token is replaced with the target model's choice
- The process repeats
Speedup: When the draft model's predictions match the target model's (which happens 70-90% of the time for well-chosen pairs), you get K tokens for roughly the cost of one forward pass of the large model. Typical speedups are 2-3x.
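You rarely need to write the draft-and-verify loop yourself. As a sketch, Hugging Face transformers exposes it through the assistant_model argument of generate; the model pairing below is an assumption, chosen only because the two models share a tokenizer:

```python
# Speculative (assisted) decoding sketch with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_name = "meta-llama/Llama-3.3-70B-Instruct"   # large target model
draft_name = "meta-llama/Llama-3.2-1B-Instruct"     # small draft model (placeholder choice)

tokenizer = AutoTokenizer.from_pretrained(target_name)
target = AutoModelForCausalLM.from_pretrained(target_name, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_name, device_map="auto")

inputs = tokenizer("Explain speculative decoding in one paragraph.", return_tensors="pt").to(target.device)

# The draft model proposes candidate tokens; the target model verifies them in a single
# forward pass per round, keeping the accepted prefix and correcting the first mismatch.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```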
KV-Cache Optimization
During autoregressive generation, the Key-Value cache stores computed attention states for all previous tokens. This cache grows linearly with sequence length and can consume more memory than the model weights for long contexts.
Techniques:
- Multi-Query Attention (MQA): Use a single key/value head shared by all query heads, shrinking the KV-cache by a factor equal to the head count (typically 8-32x)
- Grouped-Query Attention (GQA): A middle ground that shares KV heads within groups of query heads rather than using one head for all of them
- KV-cache quantization: Compress cached key/value tensors to INT8 or FP8, halving cache memory relative to FP16
- Sliding window attention: Restrict attention to a fixed window of recent tokens (optionally keeping a few anchor tokens), capping cache size regardless of sequence length
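The memory these techniques recover is easy to estimate. A minimal sizing sketch, assuming the standard per-layer key/value layout and Llama-3-70B-like shapes (batch size and context length below are arbitrary choices for illustration):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """KV-cache size: 2 (keys and values) * layers * KV heads * head_dim * tokens * batch * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value / 1e9

# Llama-3-70B-style shapes: 80 layers, 64 query heads, 8 KV heads (GQA), head_dim 128.
print(f"MHA, FP16 cache (64 KV heads): {kv_cache_gb(80, 64, 128, 32768, 8):.0f} GB")
print(f"GQA, FP16 cache ( 8 KV heads): {kv_cache_gb(80, 8, 128, 32768, 8):.0f} GB")
print(f"GQA, INT8 cache ( 8 KV heads): {kv_cache_gb(80, 8, 128, 32768, 8, bytes_per_value=1):.0f} GB")
```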
PagedAttention and vLLM
PagedAttention, the innovation behind vLLM, manages KV-cache memory the way operating systems manage virtual memory — in non-contiguous pages.
Problem solved: Traditional KV-cache allocation pre-allocates memory based on maximum sequence length, wasting memory for shorter sequences. With batch sizes of 100+ concurrent requests, this waste becomes the primary bottleneck.
How PagedAttention helps:
- Allocates KV-cache in small blocks (pages) on demand
- Eliminates memory waste from pre-allocation
- Enables sharing KV-cache pages across requests using the same prefix (prompt caching)
- Increases throughput by 2-4x compared to naive implementations
```python
# vLLM automatically uses PagedAttention
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=2,
    max_model_len=32768,
    gpu_memory_utilization=0.90,
)

outputs = llm.generate(
    prompts=["Explain quantum computing" for _ in range(100)],
    sampling_params=SamplingParams(temperature=0.7, max_tokens=512),
)
```
Continuous Batching
Traditional static batching waits for a full batch before processing and waits for the longest sequence to finish before returning any results. Continuous batching (also called iteration-level batching) inserts new requests and returns completed requests at every generation step.
Impact: Reduces average latency by 50-80% under load and increases throughput by 2-3x compared to static batching. All modern serving frameworks (vLLM, TGI, TensorRT-LLM) implement continuous batching by default.
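A toy scheduler loop illustrates the idea; real engines such as vLLM and TGI do this once per decoding step with far more bookkeeping, and all names below are illustrative:

```python
from collections import deque
import random

random.seed(0)

# Each request needs a different number of decode steps (tokens) to finish.
waiting = deque({"id": i, "remaining": random.randint(8, 512)} for i in range(20))
running, max_batch_size, step = [], 8, 0

while waiting or running:
    # Iteration-level scheduling: admit new requests whenever a slot frees up...
    while waiting and len(running) < max_batch_size:
        running.append(waiting.popleft())
    # ...generate exactly one token for every running request this step...
    for request in running:
        request["remaining"] -= 1
    step += 1
    # ...and return finished requests immediately instead of waiting for the whole batch.
    for request in [r for r in running if r["remaining"] == 0]:
        print(f"step {step}: request {request['id']} done")
    running = [r for r in running if r["remaining"] > 0]
```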
Putting It All Together
A production-optimized inference stack combines multiple techniques:
```
Request → Continuous Batching Engine
          ├── PagedAttention (memory efficiency)
          ├── Quantized Model (INT8/INT4)
          ├── GQA/MQA (reduced KV-cache)
          ├── Speculative Decoding (speed)
          └── Prefix Caching (shared prompts)
```
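As one concrete sketch of that stack, the vLLM configuration below combines AWQ-quantized weights with prefix caching on top of the PagedAttention and continuous batching the engine applies by default (the model repo is the same placeholder used earlier; speculative-decoding arguments are omitted because their names vary across vLLM releases):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-3.3-70B-AWQ",  # 4-bit AWQ weights (placeholder repo, as above)
    quantization="awq",                  # quantized weights: smaller, faster memory-bound decode
    enable_prefix_caching=True,          # reuse KV-cache pages across requests sharing a prompt prefix
    gpu_memory_utilization=0.90,         # leave headroom for activations
    max_model_len=32768,
)
# PagedAttention and continuous batching are always on. Speculative decoding can also be
# configured on the engine, but its argument names have changed across versions; check the docs.

outputs = llm.generate(
    prompts=["Summarize the benefits of PagedAttention."],
    sampling_params=SamplingParams(temperature=0.7, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```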
The compound effect of these optimizations is dramatic: a well-optimized serving stack can serve 10-50x more requests per GPU compared to a naive implementation, reducing per-query costs proportionally.
Sources: vLLM — PagedAttention Paper, Hugging Face — Quantization Guide, DeepSpeed — Inference Optimization
```mermaid
flowchart LR
    IN(["Input prompt"])
    subgraph PRE["Pre processing"]
        TOK["Tokenize"]
        EMB["Embed"]
    end
    subgraph CORE["Model Core"]
        ATTN["Self attention layers"]
        MLP["Feed forward layers"]
    end
    subgraph POST["Post processing"]
        SAMP["Sampling"]
        DETOK["Detokenize"]
    end
    OUT(["Generated text"])
    IN --> TOK --> EMB --> ATTN --> MLP --> SAMP --> DETOK --> OUT
    style IN fill:#f1f5f9,stroke:#64748b,color:#0f172a
    style CORE fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style OUT fill:#059669,stroke:#047857,color:#fff
```