Chunking Strategies Compared: Recursive, Semantic, Late, and Contextual Chunking
How you chunk decides what your RAG retrieves. The 2026 chunking strategies — recursive, semantic, late, contextual — benchmarked side-by-side.
Why Chunking Decides Recall
Retrieval quality starts with chunking. Chunks are what gets indexed, and a chunk is what gets retrieved. Chunks that are too small lose context; chunks that are too large dilute their embeddings; chunks split mid-sentence cripple recall.
The 2026 chunking landscape has four main approaches. They differ in cost, complexity, and where they win.
The Four Approaches
```mermaid
flowchart LR
    Doc[Document] --> R[Recursive<br/>character / token]
    Doc --> S[Semantic<br/>break on topic shifts]
    Doc --> L[Late chunking<br/>embed long, chunk after]
    Doc --> C[Contextual chunking<br/>prepend doc summary]
```
Recursive Chunking
The default in LangChain and LlamaIndex. Walk the text by separators (paragraph → sentence → word) recursively until the chunk is below a target size. Cheap, deterministic, language-agnostic.
- Pros: predictable, fast, easy
- Cons: blind to semantics; can split related ideas
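A minimal sketch with LangChain's `RecursiveCharacterTextSplitter`; the chunk size, overlap, and separators here are illustrative values to tune:

```python
# pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

document_text = open("doc.txt", encoding="utf-8").read()  # any plain-text document

splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,     # target size in characters; pass length_function to size by tokens
    chunk_overlap=60,   # roughly 15% overlap between adjacent chunks
    separators=["\n\n", "\n", ". ", " ", ""],  # paragraph, line, sentence, word, character
)
chunks = splitter.split_text(document_text)
```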
Semantic Chunking
Embed each sentence, find topic-shift points (where similarity drops), break there. Chunks align with topical boundaries.
- Pros: keeps coherent ideas together
- Cons: more expensive (embedding per sentence at index time); break-detection is sensitive to threshold
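A minimal sketch of the break-detection step, assuming an `embed(sentences)` function backed by your embedding model that returns one vector per sentence; the 0.75 threshold is an illustrative value to tune per corpus:

```python
import numpy as np

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.75) -> list[str]:
    """Group consecutive sentences; start a new chunk when adjacent similarity drops."""
    vecs = embed(sentences)                                    # shape: (n_sentences, dim)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # normalize for cosine similarity
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = float(vecs[i - 1] @ vecs[i])                     # similarity of adjacent sentences
        if sim < threshold:                                    # topic shift: close the chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```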
Late Chunking
Embed the entire document at once with a long-context embedding model (e.g. jina-embeddings-v3 or BGE-M3), then pool the resulting token-level vectors within each chunk boundary into chunk embeddings. Each chunk's vector carries context from the whole document because the token embeddings were computed over the full text.
- Pros: each chunk's embedding sees the whole document; context-aware vectors
- Cons: requires a long-context embedding model; more compute up front
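A minimal sketch using plain Hugging Face `transformers` with BGE-M3: encode the whole document once, then mean-pool the token vectors that fall inside each chunk's character span. Jina's reference recipe differs in details (pooling, documents longer than the context window), so treat this as an approximation:

```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL = "BAAI/bge-m3"  # any long-context embedding model with token-level outputs
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def late_chunk(document: str, spans: list[tuple[int, int]]) -> torch.Tensor:
    """spans: (start_char, end_char) chunk boundaries produced by any splitter."""
    enc = tok(document, return_tensors="pt", return_offsets_mapping=True,
              truncation=True, max_length=8192)
    offsets = enc.pop("offset_mapping")[0]              # per-token (start_char, end_char)
    with torch.no_grad():
        token_vecs = model(**enc).last_hidden_state[0]  # contextualized over the whole document
    chunk_vecs = []
    for start, end in spans:
        inside = (offsets[:, 0] >= start) & (offsets[:, 1] <= end) & (offsets[:, 1] > offsets[:, 0])
        chunk_vecs.append(token_vecs[inside].mean(dim=0))  # pool the tokens inside the span
    return torch.stack(chunk_vecs)
```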
Contextual Chunking (Anthropic)
Anthropic's late-2024 technique: for each chunk, use an LLM (with the full document in the prompt) to generate 1-2 sentences of context explaining where the chunk fits in the document, prepend that context to the chunk, and embed the augmented chunk. Big recall gains; the cost is one LLM call per chunk at index time.
- Pros: best recall on benchmark tasks; addresses the "chunk lost its parent context" problem
- Cons: expensive at index time (LLM call per chunk)
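A minimal sketch of the index-time augmentation, assuming an `llm(prompt)` callable that returns a string; the prompt wording is illustrative, not Anthropic's published prompt:

```python
def contextualize(document: str, chunks: list[str], llm) -> list[str]:
    """Prepend LLM-generated situating context to each chunk before embedding."""
    augmented = []
    for chunk in chunks:
        prompt = (
            "Here is a document:\n<document>\n" + document + "\n</document>\n\n"
            "Here is a chunk from that document:\n<chunk>\n" + chunk + "\n</chunk>\n\n"
            "Write 1-2 sentences situating this chunk within the document to improve "
            "search retrieval of the chunk. Answer with only that context."
        )
        context = llm(prompt)                             # one LLM call per chunk at index time
        augmented.append(context.strip() + "\n\n" + chunk)
    return augmented

# Embed the augmented chunks; keep the raw chunk text for generation.
```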
Benchmark Numbers
On a standard mixed corpus, 2025-2026 numbers:
| Strategy | Recall@5 | Index cost (rel.) | Query latency |
|---|---|---|---|
| Recursive | 71% | 1x | fast |
| Semantic | 76% | 3x | fast |
| Late | 78% | 5x | fast |
| Contextual | 84% | 30x | fast |
| Contextual + RRF (BM25 + dense) | 91% | 30x | fast |
Contextual chunking is the recall champion. The 30x index-time cost is acceptable for static or slow-changing corpora; not great for high-velocity ones.
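The last table row fuses BM25 and dense rankings with reciprocal rank fusion (RRF). A minimal sketch of the fusion step; k = 60 is the conventional constant:

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranked in rankings:                              # e.g. [bm25_ids, dense_ids]
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused_top5 = rrf([bm25_top_ids, dense_top_ids])[:5]   # Recall@5 over the fused list
```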
How to Choose
```mermaid
flowchart TD
    Q1{Corpus updates<br/>frequently?} -->|Yes| Q2{Recall critical?}
    Q1 -->|No| Q3{Recall critical?}
    Q2 -->|Yes| Sem[Semantic + late]
    Q2 -->|No| Rec[Recursive]
    Q3 -->|Yes| Con[Contextual]
    Q3 -->|No| Late[Late chunking]
```
For most teams in 2026:
- High-velocity corpus + cost-sensitive: recursive
- High-velocity corpus + recall-critical: semantic + late hybrid
- Static corpus + recall-critical: contextual
- Static corpus + cost-sensitive: late chunking
Chunk Size
Chunk size matters as much as strategy. The 2026 rule of thumb:
- 200-400 tokens for fact-heavy queries (precise retrieval)
- 800-1200 tokens for synthesis queries (more context per chunk)
- Always with 10-20 percent overlap
Smaller chunks give more precise matches; larger chunks carry more context per retrieval but dilute the embedding. The right size is workload-specific; benchmark on real queries.
Special Document Types
Different docs need different chunking:
- Code: respect class and function boundaries; use AST-aware chunkers (LlamaIndex, Tree-sitter)
- Markdown: chunk by headers, then by paragraphs (see the sketch after this list)
- PDFs with tables: do not chunk through tables; treat tables as atomic units
- Long-form narrative: late or contextual chunking outperforms naive recursive
- Transcripts: speaker-turn chunking with overlap
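For the Markdown case, a minimal sketch with LangChain's `MarkdownHeaderTextSplitter` followed by a size-based pass; the header levels and sizes are illustrative:

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter

markdown_text = open("README.md", encoding="utf-8").read()

header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
size_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)

sections = header_splitter.split_text(markdown_text)  # one section per header, with header metadata
chunks = size_splitter.split_documents(sections)      # then enforce the size budget within sections
```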
Implementation Notes
- Always store the original chunk text alongside the embedding (a minimal chunk record is sketched after this list)
- Store doc-level metadata (title, date, source) on every chunk
- Track chunk position in the doc so you can fetch neighbors when needed
- Re-chunk periodically when your strategy changes; keep both versions during the transition
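A minimal record structure covering these notes; field names are illustrative, not tied to any particular vector store:

```python
from dataclasses import dataclass, field

@dataclass
class ChunkRecord:
    chunk_id: str
    doc_id: str
    position: int                 # index within the document, for neighbor fetches
    text: str                     # original chunk text, stored verbatim
    embedding: list[float]
    metadata: dict = field(default_factory=dict)  # title, date, source, strategy version

def neighbors(records: list[ChunkRecord], rec: ChunkRecord, window: int = 1) -> list[ChunkRecord]:
    """Fetch adjacent chunks from the same document to widen context at answer time."""
    return [r for r in records
            if r.doc_id == rec.doc_id and r is not rec
            and abs(r.position - rec.position) <= window]
```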
Sources
- Anthropic contextual retrieval — https://www.anthropic.com/news/contextual-retrieval
- Jina late chunking — https://jina.ai/news/late-chunking
- "Semantic chunking" LlamaIndex — https://docs.llamaindex.ai
- "BGE-M3" paper — https://arxiv.org/abs/2402.03216
- "Chunking strategies for RAG" — https://www.pinecone.io/learn