AI Agent for Tax Preparation: Document Collection, Categorization, and Form Filling
Learn to build an AI agent that collects tax documents, classifies them by type, extracts key financial data, and maps values to the correct tax form fields.
Why Tax Preparation Is Ripe for AI Agents
Tax preparation involves a predictable but tedious workflow: gather documents, classify them, extract data, apply tax rules, and fill forms. Each step follows clear rules, making it well-suited for an AI agent. The challenge lies in the variety of document formats (W-2s, 1099s, receipts, brokerage statements) and the complexity of tax code rules. An agent can handle the mechanical work while flagging edge cases for human review.
Agent Architecture
The tax prep agent has four stages:
flowchart LR
PDF(["PDF or image"])
OCR["OCR plus layout<br/>LayoutLM or Donut"]
DETECT["Table detector<br/>bounding boxes"]
STRUCT["Cell structure<br/>rows and columns"]
LLM["LLM normalization<br/>headers and types"]
VAL["Schema validation<br/>Pydantic"]
DB[(Structured store)]
OUT(["Clean rows"])
PDF --> OCR --> DETECT --> STRUCT --> LLM --> VAL --> DB --> OUT
style LLM fill:#4f46e5,stroke:#4338ca,color:#fff
style VAL fill:#f59e0b,stroke:#d97706,color:#1f2937
style OUT fill:#059669,stroke:#047857,color:#fff
- Document Ingestion — accept files and extract text with OCR
- Document Classification — identify the type of each document
- Data Extraction — pull key financial figures from each document
- Form Mapping — apply tax rules and map values to form fields
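The four stages can be read as a chain of functions in which each stage consumes the previous stage's output. The skeleton below is only a sketch under that assumption: every stage function is a hypothetical stub standing in for the real OCR and LLM calls, and `run_pipeline` is an illustrative name.

```python
from typing import Any, Callable

# Hypothetical stubs: each one stands in for a real stage implementation.
def ingest(path: str) -> dict[str, Any]:
    return {"source": path, "text": f"<text of {path}>"}

def classify(doc: dict[str, Any]) -> dict[str, Any]:
    return {**doc, "type": "W-2"}  # a real classifier would inspect doc["text"]

def extract(doc: dict[str, Any]) -> dict[str, Any]:
    return {**doc, "data": {"wages": 0.0}}  # placeholder values

def map_to_form(doc: dict[str, Any]) -> dict[str, Any]:
    return {**doc, "lines": {"1a": doc["data"]["wages"]}}

STAGES: list[Callable[[Any], Any]] = [ingest, classify, extract, map_to_form]

def run_pipeline(path: str) -> dict[str, Any]:
    """Pass one document through every stage in order."""
    result: Any = path
    for stage in STAGES:
        result = stage(result)
    return result

result = run_pipeline("w2.pdf")
print(sorted(result))
```

Keeping the stages in a list makes it easy to insert a validation or review step between any two of them later.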
Step 1: Document Ingestion and OCR
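Many tax documents arrive as scanned PDFs or photos. We read embedded text with pdfplumber where it exists and fall back to Tesseract OCR (via pytesseract) for images and scanned pages.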
import pytesseract
from PIL import Image
from pathlib import Path
import pdfplumber


def ingest_document(file_path: str) -> str:
    """Extract text from various document formats."""
    path = Path(file_path)
    suffix = path.suffix.lower()

    if suffix in (".png", ".jpg", ".jpeg", ".tiff"):
        image = Image.open(path)
        return pytesseract.image_to_string(image)

    elif suffix == ".pdf":
        with pdfplumber.open(path) as pdf:
            text = ""
            for page in pdf.pages:
                page_text = page.extract_text()
                if page_text:
                    text += page_text + "\n"
                else:
                    # Fallback to OCR for scanned pages
                    img = page.to_image(resolution=300)
                    text += pytesseract.image_to_string(
                        img.original
                    ) + "\n"
            return text

    elif suffix == ".txt":
        return path.read_text()

    raise ValueError(f"Unsupported format: {suffix}")
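OCR output from a poor scan can be mostly noise, and passing it downstream wastes an LLM call and risks wrong figures. One possible guard, sketched here with a hypothetical helper `looks_like_usable_text` and thresholds that are assumptions needing tuning on real scans, is to measure what share of the characters are alphanumeric or whitespace before accepting the text.

```python
def looks_like_usable_text(
    text: str, min_ratio: float = 0.8, min_chars: int = 50
) -> bool:
    """Heuristic: accept text that is long enough and mostly alphanumeric.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    clean = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return clean / len(stripped) >= min_ratio


good = "Wages, tips, other compensation 52000.00 Federal income tax withheld 6100.00"
noisy = "~~|| ^^ ## %% @@ ~~ || ^^ ## %% @@ ~~ || ^^ ## %% @@ ~~ || ^^"
print(looks_like_usable_text(good))
print(looks_like_usable_text(noisy))
```

Documents that fail the check could be re-scanned or sent straight to human review instead of being classified.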
Step 2: Document Classification
The agent classifies each document into tax form categories.
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()


class DocumentClassification(BaseModel):
    document_type: str  # short code only: "W-2", "1099-INT", "1099-DIV", etc.
    tax_year: int
    issuer: str
    confidence: float
    recipient_name: str


DOCUMENT_TYPES = [
    "W-2 (Wage and Tax Statement)",
    "1099-INT (Interest Income)",
    "1099-DIV (Dividends and Distributions)",
    "1099-B (Broker Transactions)",
    "1099-MISC (Miscellaneous Income)",
    "1099-NEC (Nonemployee Compensation)",
    "1098 (Mortgage Interest)",
    "1098-T (Tuition Statement)",
    "Receipt (Deductible Expense)",
    "K-1 (Partner/Shareholder Income)",
    "Other / Unknown",
]


def classify_document(text: str) -> DocumentClassification:
    """Classify a tax document by type."""
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify this tax document. Identify the form type, "
                    "tax year, issuer, and recipient.\n\n"
                    f"Valid types: {', '.join(DOCUMENT_TYPES)}\n\n"
                    # The extraction step looks schemas up by short code,
                    # so ask for the code without its description.
                    "Return document_type as the short code only, "
                    "for example 'W-2' or '1099-INT'."
                ),
            },
            # Only the first 3000 characters are sent to keep the prompt small
            {"role": "user", "content": text[:3000]},
        ],
        response_format=DocumentClassification,
    )
    return response.choices[0].message.parsed
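The list of valid types carries a description in parentheses, while the extraction step keys on the short code, so it can help to normalize whatever label the model returns and to route low-confidence results to a person. The following is a minimal sketch: `normalize_doc_type`, `needs_review`, and the 0.85 threshold are assumptions for illustration and are not part of the OpenAI SDK.

```python
import re

KNOWN_CODES = {
    "W-2", "1099-INT", "1099-DIV", "1099-B", "1099-MISC",
    "1099-NEC", "1098", "1098-T", "K-1", "Receipt",
}


def normalize_doc_type(label: str) -> str:
    """Reduce 'W-2 (Wage and Tax Statement)' to 'W-2'.

    Labels that do not match a known code are mapped to 'Other'.
    """
    code = re.sub(r"\s*\(.*\)\s*$", "", label).strip()
    return code if code in KNOWN_CODES else "Other"


def needs_review(doc_type: str, confidence: float, threshold: float = 0.85) -> bool:
    """Send unknown or low-confidence classifications to a human."""
    return doc_type == "Other" or confidence < threshold


print(normalize_doc_type("W-2 (Wage and Tax Statement)"))
print(needs_review("W-2", 0.5))
```

Note that a model-reported confidence score is not calibrated, so the threshold is best treated as a coarse filter rather than a probability.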
Step 3: Data Extraction by Document Type
Each document type has specific fields to extract. We use type-specific schemas.
class W2Data(BaseModel):
    employer_name: str
    employer_ein: str
    wages: float  # Box 1
    federal_tax_withheld: float  # Box 2
    social_security_wages: float  # Box 3
    social_security_tax: float  # Box 4
    medicare_wages: float  # Box 5
    medicare_tax: float  # Box 6
    state: str
    state_wages: float  # Box 16
    state_tax_withheld: float  # Box 17


class Form1099INT(BaseModel):
    payer_name: str
    interest_income: float  # Box 1
    early_withdrawal_penalty: float  # Box 2
    us_savings_bond_interest: float  # Box 3
    federal_tax_withheld: float  # Box 4


EXTRACTION_SCHEMAS = {
    "W-2": W2Data,
    "1099-INT": Form1099INT,
    # Add more schemas for each document type
}


def extract_data(text: str, doc_type: str) -> BaseModel:
    """Extract structured data based on document type."""
    schema = EXTRACTION_SCHEMAS.get(doc_type)
    if not schema:
        raise ValueError(f"No extraction schema for: {doc_type}")

    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    f"Extract all fields for a {doc_type} form. "
                    "Use 0.0 for any numeric field and an empty string "
                    "for any text field not found in the document."
                ),
            },
            {"role": "user", "content": text},
        ],
        response_format=schema,
    )
    return response.choices[0].message.parsed
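The schemas store amounts as `float`, which is convenient but can accumulate rounding error when many amounts are summed. A possible alternative, shown as a sketch with a hypothetical `parse_money` helper, is to convert the raw strings printed on a form into `Decimal` values rounded to cents.

```python
from decimal import Decimal, ROUND_HALF_UP, InvalidOperation


def parse_money(raw: str) -> Decimal:
    """Convert strings like '$52,000.50' or '(120.00)' to a Decimal in cents.

    Parentheses are treated as a negative amount. That is an accounting
    convention assumed here, not something taken from a specific form.
    """
    text = raw.strip()
    negative = text.startswith("(") and text.endswith(")")
    cleaned = text.strip("()").replace("$", "").replace(",", "").strip()
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        raise ValueError(f"Not a monetary amount: {raw!r}")
    if negative:
        value = -value
    return value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(parse_money("$52,000.50"))
print(parse_money("(120.00)"))
```

Raising on unparseable input, rather than defaulting to zero, keeps a misread box from silently disappearing from the return.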
Step 4: Tax Rule Application and Form Mapping
After extraction, the agent applies tax rules to map values onto the correct lines of the tax return.
from dataclasses import dataclass, field


@dataclass
class TaxFormLine:
    form: str  # e.g., "1040"
    line: str  # e.g., "1a"
    description: str
    value: float = 0.0


@dataclass
class TaxReturn:
    tax_year: int
    filing_status: str
    lines: dict[str, TaxFormLine] = field(default_factory=dict)

    def add_to_line(self, line_key: str, amount: float):
        # Fail loudly rather than silently dropping an amount
        if line_key not in self.lines:
            raise KeyError(f"Unknown form line: {line_key}")
        self.lines[line_key].value += amount

    def get_line(self, line_key: str) -> float:
        return self.lines.get(line_key, TaxFormLine("", "", "")).value


def build_1040(extracted_docs: list[dict]) -> TaxReturn:
    """Map extracted document data to Form 1040 lines."""
    tax_return = TaxReturn(
        tax_year=2025,
        filing_status="single",
        lines={
            "1a": TaxFormLine("1040", "1a", "Wages", 0.0),
            "2b": TaxFormLine("1040", "2b", "Taxable Interest", 0.0),
            "3b": TaxFormLine("1040", "3b", "Ordinary Dividends", 0.0),
            "25a": TaxFormLine("1040", "25a", "W-2 Withholding", 0.0),
            "25b": TaxFormLine("1040", "25b", "1099 Withholding", 0.0),
        },
    )

    for doc in extracted_docs:
        doc_type = doc["type"]
        data = doc["data"]

        if doc_type == "W-2":
            tax_return.add_to_line("1a", data.wages)
            tax_return.add_to_line("25a", data.federal_tax_withheld)

        elif doc_type == "1099-INT":
            tax_return.add_to_line("2b", data.interest_income)
            # Withholding reported on Form 1099 belongs on line 25b, not 25a
            tax_return.add_to_line("25b", data.federal_tax_withheld)

    return tax_return
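As more document types are added, an `if`/`elif` chain grows quickly. One way to keep the mapping auditable, sketched below under the assumption that extracted data is available as plain dictionaries, is a lookup table from (document type, field) to a form line. `LINE_MAP` and `apply_mapping` are hypothetical names; the line keys mirror Form 1040, where line 25a holds W-2 withholding and line 25b holds Form 1099 withholding.

```python
from collections import defaultdict

# (document type, extracted field) -> Form 1040 line
LINE_MAP = {
    ("W-2", "wages"): "1a",
    ("W-2", "federal_tax_withheld"): "25a",
    ("1099-INT", "interest_income"): "2b",
    ("1099-INT", "federal_tax_withheld"): "25b",
}


def apply_mapping(docs: list[dict]) -> dict[str, float]:
    """Sum every mapped field into its form line; unmapped fields are skipped."""
    totals: dict[str, float] = defaultdict(float)
    for doc in docs:
        for field_name, amount in doc["data"].items():
            line = LINE_MAP.get((doc["type"], field_name))
            if line is not None:
                totals[line] += amount
    return dict(totals)


docs = [
    {"type": "W-2", "data": {"wages": 50000.0, "federal_tax_withheld": 6000.0}},
    {"type": "W-2", "data": {"wages": 12000.0, "federal_tax_withheld": 900.0}},
    {"type": "1099-INT", "data": {"interest_income": 150.0, "federal_tax_withheld": 0.0}},
]
totals = apply_mapping(docs)
print(totals)
```

With the rules held in data, a reviewer can check the whole mapping at a glance, and adding a form type becomes a table edit rather than a new branch.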
Full Pipeline
def prepare_taxes(document_paths: list[str]) -> TaxReturn:
    """Run the full tax preparation pipeline."""
    extracted_docs = []

    for path in document_paths:
        text = ingest_document(path)
        classification = classify_document(text)
        data = extract_data(text, classification.document_type)
        extracted_docs.append({
            "type": classification.document_type,
            "data": data,
            "source": path,
        })

    return build_1040(extracted_docs)


tax_return = prepare_taxes(["w2_2025.pdf", "1099_int.pdf"])
for key, line in tax_return.lines.items():
    print(f"Line {line.line} ({line.description}): ${line.value:,.2f}")
FAQ
How does the agent handle discrepancies between documents?
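The pipeline shown here does not check for discrepancies on its own; that requires a validation step after extraction. Such a step can flag inconsistencies, for example total W-2 wages across multiple employers that seem unreasonably high, or withholding amounts that do not match expected rates, and produce a discrepancy report for human review rather than making assumptions.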
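A discrepancy check of this kind can start with simple arithmetic on a single W-2. The sketch below is one assumption about what such a check might look like: `check_w2` is a hypothetical helper that takes the extracted fields as a dictionary, and the one-dollar tolerance is arbitrary. It relies on the employee Social Security rate of 6.2% and Medicare rate of 1.45%; Medicare is only checked for being too low, because Additional Medicare Tax can push Box 6 above 1.45% for high earners.

```python
SS_RATE = 0.062         # employee Social Security rate
MEDICARE_RATE = 0.0145  # employee Medicare rate, excluding Additional Medicare Tax


def check_w2(w2: dict, tolerance: float = 1.00) -> list[str]:
    """Return human-readable warnings; an empty list means nothing was flagged."""
    warnings = []

    expected_ss = w2["social_security_wages"] * SS_RATE
    if abs(w2["social_security_tax"] - expected_ss) > tolerance:
        warnings.append(
            f"Box 4 is {w2['social_security_tax']:.2f}, expected about {expected_ss:.2f}"
        )

    expected_medicare = w2["medicare_wages"] * MEDICARE_RATE
    if w2["medicare_tax"] < expected_medicare - tolerance:
        warnings.append(
            f"Box 6 is {w2['medicare_tax']:.2f}, expected at least {expected_medicare:.2f}"
        )

    if w2["federal_tax_withheld"] > w2["wages"]:
        warnings.append("Box 2 withholding exceeds Box 1 wages")

    return warnings


ok = {
    "wages": 50000.0, "federal_tax_withheld": 6000.0,
    "social_security_wages": 50000.0, "social_security_tax": 3100.0,
    "medicare_wages": 50000.0, "medicare_tax": 725.0,
}
bad = dict(ok, social_security_tax=310.0)  # likely a dropped digit from OCR
print(check_w2(ok))
print(check_w2(bad))
```

A failed check does not say which box is wrong, only that the pair is inconsistent, which is why the result is a report for a person rather than an automatic correction.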
Can this approach handle business tax returns (Schedule C, partnerships)?
Yes, but business returns are more complex. You would extend the extraction schemas for Schedule C, K-1 forms, and depreciation schedules. The tax rule engine needs additional logic for business deductions, self-employment tax, and estimated tax payments.
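To give a sense of the additional logic, here is a rough sketch of a self-employment tax calculation. It assumes the standard structure of 92.35% of net profit taxed at 12.4% for Social Security (up to the annual wage base) plus 2.9% for Medicare, with no tax due when net earnings are under $400. The function name `self_employment_tax` is hypothetical, and `ss_wage_base` is deliberately a required parameter because the cap changes every year; the values in the usage lines are placeholders, not real limits. The sketch ignores Additional Medicare Tax and any W-2 wages that already count toward the wage base.

```python
def self_employment_tax(net_profit: float, ss_wage_base: float) -> float:
    """Estimate self-employment tax from Schedule C net profit.

    Simplified: no Additional Medicare Tax, no offset for W-2 wages.
    """
    net_earnings = net_profit * 0.9235
    if net_earnings < 400:
        return 0.0
    ss_part = min(net_earnings, ss_wage_base) * 0.124   # Social Security portion
    medicare_part = net_earnings * 0.029                # Medicare portion, uncapped
    return round(ss_part + medicare_part, 2)


# Placeholder wage bases chosen only to exercise both branches of the cap
print(self_employment_tax(100000.0, ss_wage_base=1_000_000.0))
print(self_employment_tax(100000.0, ss_wage_base=50_000.0))
```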
What about state tax returns?
State returns require state-specific rules. The agent can be extended with a state module that takes the federal return as input, applies state-specific adjustments (state-specific deductions, different tax brackets), and generates the appropriate state form. Each state would have its own rule configuration.
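A per-state rule configuration could be as small as a deduction and a bracket table. The sketch below assumes the state module receives federal adjusted gross income as its input; `StateConfig`, `state_tax`, and every number in `EXAMPLE_STATE` are invented for illustration and do not describe any real state.

```python
from dataclasses import dataclass


@dataclass
class StateConfig:
    name: str
    standard_deduction: float
    # (upper bound of bracket, marginal rate), in ascending order
    brackets: list[tuple[float, float]]


def state_tax(federal_agi: float, config: StateConfig) -> float:
    """Apply the state deduction, then tax each slice at its marginal rate."""
    taxable = max(0.0, federal_agi - config.standard_deduction)
    tax, lower = 0.0, 0.0
    for upper, rate in config.brackets:
        if taxable <= lower:
            break
        tax += (min(taxable, upper) - lower) * rate
        lower = upper
    return round(tax, 2)


# Fictional state with made-up figures
EXAMPLE_STATE = StateConfig(
    name="Example State (fictional)",
    standard_deduction=5000.0,
    brackets=[(10000.0, 0.02), (50000.0, 0.04), (float("inf"), 0.06)],
)

print(state_tax(65000.0, EXAMPLE_STATE))
```

Because the rules live in a data object, supporting another state means supplying another `StateConfig` rather than writing new calculation code, at least for states whose rules fit this simple shape.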