
Building Document Processing Agents: PDF, Email, and Spreadsheet Automation

Technical guide to building AI agents that automate document processing — PDF parsing and extraction, email classification and routing, and spreadsheet analysis with reporting.

The Case for Document Processing Agents

Every enterprise runs on documents. Invoices arrive as PDFs. Contracts land in email attachments. Financial reports live in spreadsheets. Teams spend thousands of hours per year manually extracting data from these documents, classifying them, routing them to the right people, and entering the results into downstream systems.

Document processing agents automate this entire pipeline. Unlike simple OCR tools or rule-based extractors, agents understand context, handle edge cases, and adapt to format variations without reprogramming. An agent processing invoices does not just extract the total — it validates line items against purchase orders, flags discrepancies, and routes exceptions to the right approver.
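That validation step can be sketched as a plain function. This is an illustrative sketch, not a prescribed implementation: the field names (`total`, `sku`, `approved_total`) are hypothetical stand-ins for whatever your extraction schema produces.

```python
# Hypothetical post-extraction check: compare an extracted invoice against
# its purchase order and collect discrepancies instead of auto-approving.
def validate_invoice(invoice: dict, po: dict, tolerance: float = 0.01) -> list[str]:
    issues = []
    # Totals must agree within a small tolerance (rounding, currency math)
    if abs(invoice["total"] - po["approved_total"]) > tolerance:
        issues.append(
            f"Total {invoice['total']} does not match PO amount {po['approved_total']}"
        )
    # Every invoiced line item should appear on the purchase order
    po_skus = {item["sku"] for item in po["line_items"]}
    for item in invoice["line_items"]:
        if item["sku"] not in po_skus:
            issues.append(f"Line item {item['sku']} not on PO")
    return issues

issues = validate_invoice(
    {"total": 1650.0, "line_items": [{"sku": "A-100"}, {"sku": "B-200"}]},
    {"approved_total": 1500.0, "line_items": [{"sku": "A-100"}]},
)
print(issues)  # two discrepancies: mismatched total, unexpected line item
```

Anything this function returns becomes an exception for a human approver rather than a silent pass-through.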

PDF Parsing and Extraction

PDFs are the most challenging document format because they encode visual layout rather than semantic structure. A table in a PDF is just a collection of text fragments positioned at specific coordinates — there is no table element. Modern PDF processing combines layout analysis with LLM-based extraction to handle this.

import fitz  # PyMuPDF
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI
from pathlib import Path

class InvoiceData(BaseModel):
    vendor_name: str
    invoice_number: str
    invoice_date: str
    due_date: str
    line_items: list[dict] = Field(
        description="List of {description, quantity, unit_price, total}"
    )
    subtotal: float
    tax: float
    total: float
    payment_terms: str | None = None

class PDFProcessor:
    def __init__(self):
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0)

    def extract_text_with_layout(self, pdf_path: str) -> str:
        doc = fitz.open(pdf_path)
        full_text = []
        for page_num, page in enumerate(doc):
            blocks = page.get_text("blocks")
            blocks.sort(key=lambda b: (b[1], b[0]))  # sort by y, then x
            page_text = []
            for block in blocks:
                text = block[4].strip()
                if text:
                    page_text.append(text)
            full_text.append(
                f"=== Page {page_num + 1} ===\n" + "\n".join(page_text)
            )
        doc.close()
        return "\n\n".join(full_text)

    def extract_tables(self, pdf_path: str) -> list[list[list[str]]]:
        doc = fitz.open(pdf_path)
        tables = []
        for page in doc:
            tabs = page.find_tables()
            for tab in tabs:
                table_data = tab.extract()
                if table_data:
                    tables.append(table_data)
        doc.close()
        return tables

    async def extract_invoice(self, pdf_path: str) -> InvoiceData:
        text = self.extract_text_with_layout(pdf_path)
        tables = self.extract_tables(pdf_path)

        prompt = f"""Extract invoice data from this PDF content.

Text content:
{text}

Tables found:
{tables}

Extract all fields precisely. For line items, include every row
from the invoice table. Calculate and verify the total matches
the sum of line items plus tax."""

        extractor = self.llm.with_structured_output(InvoiceData)
        return await extractor.ainvoke(prompt)

For handling scanned PDFs (image-based), add an OCR layer before extraction:

import pytesseract
from pdf2image import convert_from_path

class ScannedPDFProcessor(PDFProcessor):
    def extract_text_with_layout(self, pdf_path: str) -> str:
        # First try direct text extraction
        text = super().extract_text_with_layout(pdf_path)
        if len(text.strip()) > 100:
            return text

        # Fall back to OCR for scanned documents
        images = convert_from_path(pdf_path, dpi=300)
        ocr_texts = []
        for i, image in enumerate(images):
            ocr_text = pytesseract.image_to_string(image)
            ocr_texts.append(f"=== Page {i + 1} ===\n{ocr_text}")
        return "\n\n".join(ocr_texts)

Email Classification and Routing Agent

Email processing agents need to classify incoming messages, extract actionable information, and route them to the right team or workflow. The agent architecture uses a classifier stage followed by specialized extractors for each email type.

from enum import Enum
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI
import imaplib
import email
from email.header import decode_header

class EmailCategory(str, Enum):
    INVOICE = "invoice"
    SUPPORT_REQUEST = "support_request"
    SALES_INQUIRY = "sales_inquiry"
    COMPLIANCE = "compliance"
    INTERNAL = "internal"
    SPAM = "spam"

class ClassifiedEmail(BaseModel):
    category: EmailCategory
    priority: str = Field(description="high, medium, or low")
    summary: str = Field(description="One-sentence summary")
    action_required: str = Field(description="What action is needed")
    route_to: str = Field(description="Team or person to route to")

class EmailAgent:
    def __init__(self):
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0)
        self.routing_rules = {
            EmailCategory.INVOICE: "[email protected]",
            EmailCategory.SUPPORT_REQUEST: "support-queue",
            EmailCategory.SALES_INQUIRY: "sales-team",
            EmailCategory.COMPLIANCE: "[email protected]",
            EmailCategory.INTERNAL: "auto-archive",
            EmailCategory.SPAM: "trash",
        }

    async def classify(
        self, subject: str, body: str, sender: str
    ) -> ClassifiedEmail:
        prompt = f"""Classify this email and determine routing.

From: {sender}
Subject: {subject}
Body: {body[:2000]}

Categories: invoice, support_request, sales_inquiry,
compliance, internal, spam

Priority rules:
- high: legal/compliance, payment issues, outages
- medium: support requests, sales with budget mentioned
- low: general inquiries, internal updates

Set route_to to "auto" unless the email names a specific
team or person; "auto" is resolved against the routing table."""

        classifier = self.llm.with_structured_output(ClassifiedEmail)
        result = await classifier.ainvoke(prompt)

        # Apply routing rules
        if result.route_to == "auto":
            result.route_to = self.routing_rules.get(
                result.category, "general-inbox"
            )
        return result

    async def process_inbox(self, imap_config: dict) -> list[ClassifiedEmail]:
        mail = imaplib.IMAP4_SSL(imap_config["host"])
        mail.login(imap_config["user"], imap_config["password"])
        mail.select("inbox")

        _, messages = mail.search(None, "UNSEEN")
        results = []

        for msg_id in messages[0].split():
            _, data = mail.fetch(msg_id, "(RFC822)")
            msg = email.message_from_bytes(data[0][1])

            subject, enc = decode_header(msg.get("Subject", ""))[0]
            if isinstance(subject, bytes):
                subject = subject.decode(enc or "utf-8", errors="replace")
            sender = msg.get("From", "")
            body = self._get_body(msg)

            classified = await self.classify(subject, body, sender)
            results.append(classified)

        mail.logout()
        return results

    def _get_body(self, msg) -> str:
        if msg.is_multipart():
            for part in msg.walk():
                if part.get_content_type() == "text/plain":
                    return part.get_payload(decode=True).decode(
                        errors="replace"
                    )
            return ""  # multipart message with no text/plain part
        payload = msg.get_payload(decode=True)
        return payload.decode(errors="replace") if payload else ""
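The routing fallback is worth isolating. This standalone sketch shows the same decision the agent makes after classification; the two-category enum and queue names are illustrative, not the full routing table above.

```python
from enum import Enum

# Trimmed-down routing fallback: when the classifier answers "auto",
# resolve the destination from a per-category table; otherwise trust the
# specific route the classifier extracted from the email.
class EmailCategory(str, Enum):
    INVOICE = "invoice"
    SPAM = "spam"

ROUTING_RULES = {
    EmailCategory.INVOICE: "ap-queue",
    EmailCategory.SPAM: "trash",
}

def resolve_route(category: EmailCategory, llm_route: str) -> str:
    if llm_route == "auto":
        return ROUTING_RULES.get(category, "general-inbox")
    return llm_route

print(resolve_route(EmailCategory.INVOICE, "auto"))       # ap-queue
print(resolve_route(EmailCategory.SPAM, "security-team"))  # security-team
```

Keeping the table in code rather than in the prompt means routing changes never require re-testing the classifier.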

Spreadsheet Analysis Agent

Spreadsheet agents read, analyze, and generate reports from Excel and CSV files. The key challenge is understanding the structure of arbitrary spreadsheets — column meanings, data types, relationships between sheets, and implicit business rules.

import pandas as pd
from langchain.tools import tool

class SpreadsheetAgent:
    def __init__(self):
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0)
        self.loaded_data: dict[str, pd.DataFrame] = {}

    def load_file(self, path: str) -> dict[str, pd.DataFrame]:
        if path.endswith(".csv"):
            df = pd.read_csv(path)
            self.loaded_data["Sheet1"] = df
        else:
            xls = pd.ExcelFile(path)
            for sheet in xls.sheet_names:
                self.loaded_data[sheet] = pd.read_excel(xls, sheet)
        return self.loaded_data

    def get_schema(self) -> str:
        schema_parts = []
        for name, df in self.loaded_data.items():
            schema_parts.append(f"Sheet: {name}")
            schema_parts.append(f"  Rows: {len(df)}")
            schema_parts.append("  Columns:")
            for col in df.columns:
                dtype = str(df[col].dtype)
                non_null = df[col].dropna()
                sample = str(non_null.iloc[0]) if len(non_null) > 0 else "N/A"
                nulls = df[col].isnull().sum()
                schema_parts.append(
                    f"    - {col} ({dtype}, nulls: {nulls}, sample: {sample})"
                )
        return "\n".join(schema_parts)

    async def analyze(self, question: str) -> str:
        schema = self.get_schema()
        prompt = f"""You are a data analyst. Given this spreadsheet schema,
write Python pandas code to answer the question.

Schema:
{schema}

Question: {question}

Return ONLY executable Python code that uses the variable 'df'
(for single sheet) or 'sheets' dict (for multi-sheet).
Print the result."""

        response = await self.llm.ainvoke(prompt)
        code = self._extract_code(response.content)

        # Execute in a restricted namespace. Note this is NOT a true
        # sandbox -- run untrusted generated code in a subprocess or
        # container in production.
        local_vars = {"pd": pd}
        if len(self.loaded_data) == 1:
            local_vars["df"] = next(iter(self.loaded_data.values()))
        else:
            local_vars["sheets"] = self.loaded_data

        import io, contextlib
        # An empty __builtins__ would break print(); allow a minimal set
        safe_builtins = {"print": print, "len": len, "range": range,
                         "min": min, "max": max, "sum": sum, "sorted": sorted}
        output = io.StringIO()
        with contextlib.redirect_stdout(output):
            exec(code, {"__builtins__": safe_builtins}, local_vars)
        return output.getvalue()

    def _extract_code(self, text: str) -> str:
        # LLMs typically wrap generated code in ```python ... ``` fences
        if "```" in text:
            blocks = text.split("```")
            if len(blocks) >= 3:
                code_block = blocks[1]
                if code_block.startswith("python"):
                    code_block = code_block[len("python"):]
                return code_block.strip()
        return text.strip()
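To see what the schema step actually feeds the analyst LLM, here is a trimmed-down, standalone version of `get_schema` run over an in-memory CSV. The column names are illustrative.

```python
import io
import pandas as pd

# Minimal standalone version of the schema-summary step: the analyst LLM
# only ever sees this text description, never the raw file.
csv_data = io.StringIO(
    "date,amount,vendor\n2024-01-05,120.50,Acme\n2024-01-09,89.99,Globex\n"
)
df = pd.read_csv(csv_data)

def describe(df: pd.DataFrame) -> str:
    lines = [f"Rows: {len(df)}", "Columns:"]
    for col in df.columns:
        non_null = df[col].dropna()
        sample = str(non_null.iloc[0]) if len(non_null) > 0 else "N/A"
        lines.append(
            f"- {col} ({df[col].dtype}, nulls: {int(df[col].isna().sum())}, "
            f"sample: {sample})"
        )
    return "\n".join(lines)

print(describe(df))
```

Including dtypes, null counts, and a sample value per column is what lets the model write correct pandas code on the first attempt for most questions.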

Orchestrating the Full Pipeline

In production, these processors work together. An email arrives with a PDF attachment. The email agent classifies it as an invoice, the PDF processor extracts structured data, the spreadsheet agent updates the accounts payable tracker, and the system sends a notification to the approver.


class DocumentPipelineAgent:
    def __init__(self):
        self.email_agent = EmailAgent()
        self.pdf_processor = PDFProcessor()
        self.spreadsheet_agent = SpreadsheetAgent()

    async def process_email_with_attachments(
        self, subject: str, body: str, sender: str,
        attachments: list[tuple[str, bytes]]
    ) -> dict:
        # Step 1: Classify the email
        classification = await self.email_agent.classify(
            subject, body, sender
        )

        results = {"classification": classification, "extractions": []}

        # Step 2: Process attachments based on classification
        for filename, content in attachments:
            if filename.endswith(".pdf"):
                if classification.category == EmailCategory.INVOICE:
                    invoice = await self.pdf_processor.extract_invoice(
                        self._save_temp(filename, content)
                    )
                    results["extractions"].append({
                        "type": "invoice",
                        "data": invoice.model_dump()
                    })

            elif filename.endswith((".xlsx", ".csv")):
                path = self._save_temp(filename, content)
                self.spreadsheet_agent.load_file(path)
                summary = await self.spreadsheet_agent.analyze(
                    "Provide a summary of key metrics"
                )
                results["extractions"].append({
                    "type": "spreadsheet_summary",
                    "data": summary
                })

        return results

    def _save_temp(self, filename: str, content: bytes) -> str:
        # Write the attachment to a temp file so the file-based
        # processors can open it by path
        import os, tempfile
        path = os.path.join(tempfile.gettempdir(), filename)
        with open(path, "wb") as f:
            f.write(content)
        return path
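The attachment dispatch the pipeline performs inline can also be isolated into a small lookup, which makes adding new document types a one-line change. Handler names here are illustrative.

```python
from pathlib import Path

# Illustrative extension-based dispatch: the same decision the pipeline
# above makes with if/elif, expressed as data instead of control flow.
HANDLERS = {
    ".pdf": "pdf_processor",
    ".xlsx": "spreadsheet_agent",
    ".csv": "spreadsheet_agent",
}

def pick_handler(filename: str) -> str:
    # Unknown formats fall through to a manual-review queue
    return HANDLERS.get(Path(filename).suffix.lower(), "manual_review")

print(pick_handler("Q3_invoice.PDF"))  # pdf_processor
print(pick_handler("notes.docx"))      # manual_review
```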

FAQ

How do I handle PDFs with complex layouts like multi-column text or nested tables?

For complex layouts, use a layout analysis model like LayoutLM or Docling before text extraction. These models detect regions (headers, paragraphs, tables, figures) and their reading order. PyMuPDF's block-level extraction preserves some layout, but for truly complex documents (academic papers, financial statements with nested tables), you need a dedicated layout parser. The LLM extraction step then works with properly ordered text rather than a jumbled mix of columns.

What is the accuracy of LLM-based document extraction compared to template-based approaches?

Template-based extraction (defining exact regions for each field) achieves 98-99% accuracy on documents that match the template. LLM-based extraction typically achieves 92-96% accuracy but works across format variations without template creation. The recommended production approach is hybrid: use templates for high-volume, standardized documents (like invoices from your top 10 vendors) and LLM extraction for everything else. Always include a confidence score and route low-confidence extractions to human review.
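The confidence gate at the end of that hybrid approach is simple to implement. This is a sketch only: the threshold value and the `confidence` field name are assumptions to tune against your own review data.

```python
# Illustrative confidence gate: extractions below the threshold go to a
# human reviewer instead of flowing straight into downstream systems.
REVIEW_THRESHOLD = 0.85  # hypothetical cutoff, calibrate per document type

def route_extraction(extraction: dict) -> str:
    if extraction["confidence"] >= REVIEW_THRESHOLD:
        return "auto_approve"
    return "human_review"

print(route_extraction({"total": 1200.0, "confidence": 0.97}))  # auto_approve
print(route_extraction({"total": 1200.0, "confidence": 0.61}))  # human_review
```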

How should I handle sensitive data in document processing pipelines?

Never send unredacted documents to external LLM APIs if they contain PII, PHI, or financial account numbers. Use on-premise models (Llama, Mistral) or Azure OpenAI with data processing agreements for sensitive documents. Implement a pre-processing step that detects and masks sensitive fields before LLM processing, then re-injects the original values into the structured output. Log extracted data to encrypted storage only and implement access controls on the extraction results.
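The mask-then-re-inject step can be sketched with a regex for account-like digit runs. This is deliberately simplified: production PII detection should use a dedicated detector (e.g. a NER model), not a single pattern.

```python
import re

# Illustrative masking step: replace long digit runs with placeholder
# tokens before the text reaches an external LLM, and keep a mapping so
# the original values can be re-injected into the structured output.
def mask(text: str) -> tuple[str, dict[str, str]]:
    mapping: dict[str, str] = {}

    def repl(match: re.Match) -> str:
        key = f"<ACCT_{len(mapping)}>"
        mapping[key] = match.group(0)
        return key

    # 8+ consecutive digits as a crude stand-in for account numbers
    return re.sub(r"\b\d{8,}\b", repl, text), mapping

masked, mapping = mask("Wire to account 123456789012 by Friday.")
print(masked)  # Wire to account <ACCT_0> by Friday.

# After LLM extraction, swap the placeholders back for the real values
restored = masked
for key, value in mapping.items():
    restored = restored.replace(key, value)
```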


#DocumentProcessing #PDFExtraction #EmailAutomation #SpreadsheetAI #Automation #AIAgents #OCR #DataExtraction
