LLM-Powered Data Extraction and Document Processing: Patterns That Work in 2026
Practical architectures for using LLMs to extract structured data from unstructured documents, covering schema design, chunking strategies, and production reliability patterns.
From Unstructured to Structured at Scale
Every enterprise sits on mountains of unstructured data: contracts, invoices, medical records, research papers, emails, support tickets. Extracting structured information from these documents has traditionally required custom NLP pipelines, regex patterns, and domain-specific models for each document type.
LLMs have changed this. A single model can extract structured data from virtually any document type with minimal customization. But doing this reliably at scale requires careful architecture.
The Basic Extraction Pattern
At its simplest, LLM-based extraction involves sending a document with a schema and asking the model to populate it:
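import anthropic
from pydantic import BaseModel, Field
from typing import Optional

# LineItem is defined first because InvoiceData refers to it
class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    total: float

class InvoiceData(BaseModel):
    vendor_name: str
    invoice_number: str
    invoice_date: str = Field(description="ISO 8601 format")
    due_date: Optional[str] = None
    line_items: list[LineItem]
    subtotal: float
    tax: float
    total: float
    currency: str = Field(default="USD")

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# With Anthropic's tool use: forcing the tool makes the model answer in the schema's shape
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,  # required by the Messages API
    system="Extract invoice data from the provided document. "
           "Return ONLY data explicitly stated in the document.",
    messages=[{"role": "user", "content": document_text}],
    tool_choice={"type": "tool", "name": "extract_invoice"},
    tools=[{
        "name": "extract_invoice",
        "description": "Extract structured invoice data",
        "input_schema": InvoiceData.model_json_schema()
    }]
)

# The forced tool call carries the extracted fields; validate them against the schema
tool_use = next(block for block in response.content if block.type == "tool_use")
invoice = InvoiceData.model_validate(tool_use.input)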
Chunking Strategies for Long Documents
Documents that exceed the model's context window (or are too expensive to process whole) need chunking. But naive chunking breaks extraction because relevant information may span chunk boundaries.
Sliding Window with Overlap:
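def chunk_document(text, chunk_size=3000, overlap=500):
    # overlap must be smaller than chunk_size, otherwise the window never advances
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # last chunk reached; avoid emitting a redundant tail chunk
        start = end - overlap
    return chunks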
Section-Aware Chunking: Parse the document structure first (headings, tables, paragraphs) and chunk at logical boundaries. This preserves the semantic integrity of each chunk.
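As a rough illustration of section-aware chunking, the sketch below splits on headings and then packs whole sections into chunks. It assumes Markdown-style headings ("# Title"); the names `split_into_sections` and `chunk_by_section` are hypothetical, and real documents (PDFs, scanned contracts) would need a proper layout parser instead of a regex.

```python
import re

# Assumption: a heading is a line starting with one to six '#' characters.
HEADING_RE = re.compile(r"^#{1,6}\s+\S")

def split_into_sections(text: str) -> list[str]:
    """Split text into sections, starting a new section at each heading line."""
    sections: list[list[str]] = [[]]
    for line in text.splitlines():
        if HEADING_RE.match(line) and sections[-1]:
            sections.append([])
        sections[-1].append(line)
    joined = ("\n".join(lines).strip() for lines in sections)
    return [section for section in joined if section]

def chunk_by_section(text: str, max_chars: int = 3000) -> list[str]:
    """Pack whole sections into chunks of at most max_chars where possible.

    A single section longer than max_chars is emitted as its own oversized
    chunk; a fuller pipeline would split it further, for example by paragraph.
    """
    chunks: list[str] = []
    current = ""
    for section in split_into_sections(text):
        candidate = f"{current}\n\n{section}" if current else section
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = section
    if current:
        chunks.append(current)
    return chunks
```

Because chunk boundaries always fall between sections, a heading stays with the text it introduces.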
Two-Pass Extraction: First pass identifies which sections contain relevant information. Second pass extracts from only those sections.
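The two-pass idea can be sketched with the two passes injected as callables. Everything here is illustrative: `two_pass_extract`, `keyword_filter`, and `stub_extract` are hypothetical names, and in a real pipeline both callables would presumably wrap model calls (a cheap one for relevance, a stronger one for extraction).

```python
from typing import Callable

def two_pass_extract(
    sections: list[str],
    is_relevant: Callable[[str], bool],
    extract: Callable[[str], dict],
) -> list[dict]:
    """Filter sections with a cheap relevance check, then extract from the hits."""
    # Pass 1: keep only the sections flagged as relevant
    relevant = [section for section in sections if is_relevant(section)]
    # Pass 2: run the (more expensive) extraction on the survivors only
    return [extract(section) for section in relevant]

# Stand-ins so the sketch runs without a model; both are assumptions.
def keyword_filter(section: str) -> bool:
    lowered = section.lower()
    return any(keyword in lowered for keyword in ("invoice", "total", "amount due"))

def stub_extract(section: str) -> dict:
    return {"source_text": section}

results = two_pass_extract(
    ["Terms and conditions apply.", "Invoice total: 120.00 USD"],
    keyword_filter,
    stub_extract,
)
```

Only the second section reaches the extraction step, which is where the cost saving comes from.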
Handling Multi-Page Documents
For complex documents like contracts or medical records:
- Page-level extraction: Extract data from each page independently
- Merge and deduplicate: Combine results across pages, resolving conflicts
- Cross-reference validation: Check extracted values for consistency (e.g., do the line items sum to the subtotal, and do subtotal plus tax equal the total?)
import asyncio
from pydantic import BaseModel

# extract_page, merge_extractions, validate_extraction, and resolve_conflicts
# are application-specific helpers that are not shown here.
async def extract_from_document(pages: list[str], schema: type[BaseModel]):
    # Extract from each page in parallel
    page_results = await asyncio.gather(*[
        extract_page(page, schema) for page in pages
    ])

    # Merge results with conflict resolution
    merged = merge_extractions(page_results, strategy="highest_confidence")

    # Validate consistency
    validation_errors = validate_extraction(merged)
    if validation_errors:
        # Re-extract with targeted prompts for inconsistent fields
        merged = await resolve_conflicts(merged, validation_errors, pages)

    return merged
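One possible shape for the merge step is sketched below. It assumes each page result is a dict mapping a field name to `{"value": ..., "confidence": ...}`; that layout, and this body for `merge_extractions`, are illustrative assumptions and not something the pipeline above prescribes.

```python
def merge_extractions(page_results: list[dict], strategy: str = "highest_confidence") -> dict:
    """Merge per-page results, keeping the highest-confidence candidate per field."""
    if strategy != "highest_confidence":
        raise ValueError(f"Unsupported strategy: {strategy}")
    merged: dict = {}
    for page in page_results:
        for field, candidate in page.items():
            if candidate.get("value") is None:
                continue  # this page did not contain the field
            best = merged.get(field)
            # On a confidence tie the earlier page wins
            if best is None or candidate["confidence"] > best["confidence"]:
                merged[field] = candidate
    return merged

pages = [
    {"total": {"value": 100.0, "confidence": 0.6},
     "vendor_name": {"value": "Acme", "confidence": 0.9}},
    {"total": {"value": 120.0, "confidence": 0.95},
     "vendor_name": {"value": None, "confidence": 0.0}},
]
merged = merge_extractions(pages)
```

Here the second page's total replaces the first page's lower-confidence value, while the vendor name survives from the first page because the second page had none.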
Quality Assurance Patterns
Confidence Scoring
Ask the model to rate its confidence for each extracted field:
class ExtractedField(BaseModel):
    value: str
    confidence: float = Field(ge=0, le=1, description="Extraction confidence 0-1")
    source_text: str = Field(description="Exact text from document supporting this value")
Route low-confidence extractions to human review.
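A minimal routing sketch follows. A plain dataclass stands in for the `ExtractedField` model so the example has no dependencies; `FieldResult`, `route`, and the 0.8 threshold are assumptions for illustration, and the cutoff would need tuning against labeled data.

```python
from dataclasses import dataclass

@dataclass
class FieldResult:
    value: str
    confidence: float

REVIEW_THRESHOLD = 0.8  # assumed cutoff, not a recommended value

def route(fields: dict[str, FieldResult],
          threshold: float = REVIEW_THRESHOLD) -> tuple[str, list[str]]:
    """Return the routing decision and the names of fields below the threshold."""
    low = sorted(name for name, field in fields.items() if field.confidence < threshold)
    decision = "human_review" if low else "auto_accept"
    return decision, low

decision, flagged = route({
    "total": FieldResult("120.00", 0.95),
    "due_date": FieldResult("2026-01-31", 0.55),
})
```

Returning the flagged field names lets a reviewer jump straight to the uncertain values instead of rechecking the whole document.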
Dual Extraction
Run extraction twice (potentially with different models or prompts) and compare results. Disagreements flag potential errors:
- Both agree: high confidence, auto-accept
- One extraction has the field, other does not: medium confidence, review if critical
- Both have different values: low confidence, always route to human review
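The three outcomes above can be expressed as a small comparison function. This is a sketch under the assumption that each extraction is a flat dict of field name to value; `compare_extractions` and its labels are hypothetical, and a real comparison would likely normalize values (dates, number formats, whitespace) before testing equality.

```python
def compare_extractions(a: dict, b: dict) -> dict[str, str]:
    """Label each field 'accept', 'review_if_critical', or 'human_review'."""
    decisions: dict[str, str] = {}
    for field in sorted(a.keys() | b.keys()):
        value_a, value_b = a.get(field), b.get(field)
        if value_a is None and value_b is None:
            continue  # neither run found the field
        if value_a is None or value_b is None:
            decisions[field] = "review_if_critical"  # only one run has it
        elif value_a == value_b:
            decisions[field] = "accept"  # both runs agree
        else:
            decisions[field] = "human_review"  # runs disagree
    return decisions

run_a = {"vendor_name": "Acme", "total": 120.0, "due_date": "2026-01-31"}
run_b = {"vendor_name": "Acme", "total": 102.0}
decisions = compare_extractions(run_a, run_b)
```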
Schema Validation
Use Pydantic validators to catch impossible values:
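from pydantic import BaseModel, model_validator

class InvoiceData(BaseModel):
    total: float
    line_items: list[LineItem]

    # A model-level validator runs after all fields are parsed, so line_items
    # is always available. (A field validator on 'total' would run before
    # line_items is populated and the check would be skipped.)
    @model_validator(mode="after")
    def total_matches_line_items(self):
        # In a schema with subtotal and tax, compare against subtotal instead
        expected = sum(item.total for item in self.line_items)
        if abs(self.total - expected) > 0.01:
            raise ValueError(f"Total {self.total} doesn't match sum of line items {expected}")
        return self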
Production Architecture
A production document processing pipeline typically looks like:
Document Upload -> OCR (if scanned) -> Text Extraction
  -> Classification (what type of document?)
  -> Schema Selection (which extraction schema to use?)
  -> Chunking -> Parallel Extraction -> Merge -> Validation
  -> Confidence Routing:
       High confidence -> Auto-accept -> Database
       Low confidence -> Human Review Queue -> Database
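The classification and schema-selection stages might be wired together with a simple registry, as in the sketch below. The registry contents, the keyword lists, and the function names are all hypothetical; keyword counting is only a stand-in for what would more plausibly be a cheap model call, and the registry values would be Pydantic model classes in practice.

```python
from typing import Optional

# Hypothetical registry: document type -> field names of its extraction schema.
SCHEMAS: dict[str, list[str]] = {
    "invoice": ["vendor_name", "invoice_number", "total"],
    "contract": ["parties", "effective_date", "term"],
}

# Keyword stand-in for the classification step (an assumption, not a real classifier).
KEYWORDS: dict[str, tuple[str, ...]] = {
    "invoice": ("invoice", "amount due"),
    "contract": ("agreement", "hereinafter"),
}

def classify(text: str) -> str:
    """Return the best-matching document type, or 'unknown' if nothing matches."""
    lowered = text.lower()
    scores = {doc_type: sum(lowered.count(k) for k in keywords)
              for doc_type, keywords in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

def select_schema(text: str) -> tuple[str, Optional[list[str]]]:
    """Classify the document and look up its schema; None means no schema applies."""
    doc_type = classify(text)
    return doc_type, SCHEMAS.get(doc_type)
```

Documents that come back as "unknown" have no schema, so they can be sent to the human review queue instead of being forced through the wrong extractor.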
Cost Optimization
Document extraction can be expensive at scale. Optimize by:
- Using cheaper models (Haiku, GPT-4o mini) for classification and simple extractions
- Reserving expensive models for complex documents or low-confidence re-extraction
- Caching extraction results for identical documents (hash-based dedup)
- Batch processing during off-peak hours for non-urgent documents
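Hash-based deduplication can be sketched as a small in-memory cache keyed on a SHA-256 digest of the document text. `ExtractionCache` is a hypothetical class, and including a schema version in the key is an assumption so that results are not reused after the schema changes; a production system would back this with a database or key-value store.

```python
import hashlib
from typing import Callable

class ExtractionCache:
    """Cache extraction results keyed by a hash of schema version + document text."""

    def __init__(self, extract: Callable[[str], dict]):
        self._extract = extract
        self._store: dict[str, dict] = {}
        self.hits = 0

    @staticmethod
    def key(document_text: str, schema_version: str) -> str:
        payload = f"{schema_version}\n{document_text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get_or_extract(self, document_text: str, schema_version: str = "v1") -> dict:
        cache_key = self.key(document_text, schema_version)
        if cache_key in self._store:
            self.hits += 1  # identical document already processed
            return self._store[cache_key]
        result = self._extract(document_text)
        self._store[cache_key] = result
        return result
```

Only byte-identical text hits the cache; two scans of the same paper invoice will usually differ after OCR and be extracted separately.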