Learn Agentic AI

Building a Resume Parser with Structured Outputs: End-to-End Tutorial

Build a complete resume parsing pipeline from PDF to structured data. Covers PDF text extraction, schema design for work experience and education, LLM extraction, validation, and output formatting.

Why Build a Resume Parser?

Resume parsing is a classic structured extraction problem. Resumes contain predictable data types (names, dates, companies, skills) but wildly inconsistent formatting. Traditional regex-based parsers break on every new resume template. LLM-based parsers handle any format because they understand the content semantically, not syntactically.

In this tutorial, you will build a complete pipeline: PDF input, text extraction, LLM-powered structured extraction, validation, and clean JSON output.

Step 1: Define the Schema

Start by modeling what a parsed resume looks like:

from pydantic import BaseModel, Field, field_validator
from typing import List, Optional

class ContactInfo(BaseModel):
    full_name: str
    email: Optional[str] = None
    phone: Optional[str] = None
    location: Optional[str] = Field(default=None, description="City, State or City, Country")
    linkedin_url: Optional[str] = None
    portfolio_url: Optional[str] = None

class WorkExperience(BaseModel):
    company: str
    title: str
    start_date: Optional[str] = Field(default=None, description="YYYY-MM format")
    end_date: Optional[str] = Field(default=None, description="YYYY-MM or 'Present'")
    location: Optional[str] = None
    description: Optional[str] = None
    achievements: List[str] = Field(default_factory=list)

class Education(BaseModel):
    institution: str
    degree: Optional[str] = None
    field_of_study: Optional[str] = None
    start_date: Optional[str] = None
    end_date: Optional[str] = None
    gpa: Optional[float] = Field(default=None, ge=0.0, le=4.0)

class ParsedResume(BaseModel):
    contact: ContactInfo
    summary: Optional[str] = Field(default=None, description="Professional summary or objective")
    work_experience: List[WorkExperience]
    education: List[Education]
    skills: List[str]
    certifications: List[str] = Field(default_factory=list)
    languages: List[str] = Field(default_factory=list)

Design choices matter here. Optional fields with None defaults give the model a legitimate way to return null instead of inventing values for missing fields. The YYYY-MM format for dates handles the common resume pattern where exact days are not listed.
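You can see this behavior directly. A minimal sketch (the MiniContact model is illustrative only, not part of the pipeline):

```python
from typing import Optional
from pydantic import BaseModel

class MiniContact(BaseModel):
    full_name: str                 # required: validation fails if absent
    email: Optional[str] = None    # optional: absent fields stay None

# A resume with no email address parses cleanly instead of forcing a value
contact = MiniContact.model_validate({"full_name": "Jane Doe"})
print(contact.email)  # None
```

The same principle scales up to the full ParsedResume schema: every field the resume might omit is Optional.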


Step 2: Extract Text from PDF

Use PyMuPDF (fitz) for reliable text extraction:

pip install pymupdf
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract text from a PDF file, preserving basic structure."""
    doc = fitz.open(pdf_path)
    pages = []
    for page in doc:
        text = page.get_text("text")
        pages.append(text)
    doc.close()
    return "\n\n".join(pages)

# Usage
resume_text = extract_text_from_pdf("resume.pdf")
print(f"Extracted {len(resume_text)} characters")

PyMuPDF handles most PDF formats, including those with columns, tables, and embedded fonts. For scanned PDFs (images), you would need OCR — add pytesseract as a preprocessing step.
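A hedged sketch of that OCR fallback (assumes pytesseract and Pillow are installed alongside the Tesseract binary; the 300 DPI rendering and 50-character threshold are assumptions):

```python
def needs_ocr(text: str, min_chars: int = 50) -> bool:
    """Heuristic: a scanned PDF yields little or no extractable text."""
    return len(text.strip()) < min_chars

def extract_with_ocr(pdf_path: str) -> str:
    """OCR fallback for scanned PDFs. Imports are lazy so the main
    pipeline does not require pytesseract to be installed."""
    import io
    import fitz
    import pytesseract
    from PIL import Image

    doc = fitz.open(pdf_path)
    pages = []
    for page in doc:
        pix = page.get_pixmap(dpi=300)  # render the page as an image
        img = Image.open(io.BytesIO(pix.tobytes("png")))
        pages.append(pytesseract.image_to_string(img))
    doc.close()
    return "\n\n".join(pages)
```

Call needs_ocr on the output of extract_text_from_pdf and fall back to extract_with_ocr only when it returns True, since OCR is far slower than direct extraction.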

Step 3: LLM Extraction

Send the extracted text to the LLM with your schema:

import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

def parse_resume(resume_text: str) -> ParsedResume:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=ParsedResume,
        max_retries=3,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an expert resume parser. Extract structured data "
                    "from the resume text. Rules:\n"
                    "- Only extract information explicitly stated in the resume\n"
                    "- Use null for fields not present in the text\n"
                    "- List achievements as separate bullet points\n"
                    "- Normalize dates to YYYY-MM format when possible\n"
                    "- List skills as individual items, not comma-separated strings"
                )
            },
            {"role": "user", "content": resume_text}
        ],
    )

Step 4: Add Validation

Add validators that catch common LLM extraction errors:

from pydantic import model_validator
import re

class ValidatedResume(ParsedResume):

    @model_validator(mode="after")
    def validate_work_dates(self) -> "ValidatedResume":
        """Ensure work experience dates are chronologically valid."""
        date_pattern = re.compile(r"^\d{4}-(0[1-9]|1[0-2])$")

        for job in self.work_experience:
            if job.start_date and not date_pattern.match(job.start_date):
                if job.start_date.lower() != "present":
                    raise ValueError(
                        f"Invalid start_date format: '{job.start_date}' for {job.company}"
                    )
            if job.end_date and job.end_date.lower() != "present":
                if not date_pattern.match(job.end_date):
                    raise ValueError(
                        f"Invalid end_date format: '{job.end_date}' for {job.company}"
                    )
        return self

    @field_validator("skills")
    @classmethod
    def deduplicate_skills(cls, v: List[str]) -> List[str]:
        """Remove duplicate skills (case-insensitive)."""
        seen = set()
        unique = []
        for skill in v:
            normalized = skill.lower().strip()
            if normalized not in seen:
                seen.add(normalized)
                unique.append(skill.strip())
        return unique

To get this behavior, pass ValidatedResume (rather than ParsedResume) as the response_model in parse_resume. When Instructor detects a validation error, it automatically retries the LLM call with the error message appended. The model sees "Invalid start_date format: 'March 2022'" and corrects it to "2022-03" on the next attempt.
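Retries cost latency and tokens, so you can also normalize the easy cases deterministically before validation ever runs. A sketch (the month table and regex are assumptions that cover only the common "Month YYYY" pattern):

```python
import re

MONTHS = {
    "jan": "01", "feb": "02", "mar": "03", "apr": "04",
    "may": "05", "jun": "06", "jul": "07", "aug": "08",
    "sep": "09", "oct": "10", "nov": "11", "dec": "12",
}

def normalize_date(raw: str) -> str:
    """Map 'March 2022' or 'Mar 2022' to '2022-03'; pass anything else through."""
    m = re.match(r"^([A-Za-z]+)\.?\s+(\d{4})$", raw.strip())
    if m:
        month = MONTHS.get(m.group(1).lower()[:3])
        if month:
            return f"{m.group(2)}-{month}"
    return raw
```

Running this over start_date and end_date before validation means the retry loop is reserved for genuinely ambiguous dates.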


Step 5: Output Formatting

Convert the parsed resume to your target format:

import json

def resume_to_json(parsed: ParsedResume) -> str:
    """Export parsed resume as formatted JSON."""
    return parsed.model_dump_json(indent=2, exclude_none=True)

def resume_to_csv_row(parsed: ParsedResume) -> dict:
    """Flatten resume for CSV/spreadsheet export."""
    return {
        "name": parsed.contact.full_name,
        "email": parsed.contact.email,
        "phone": parsed.contact.phone,
        "location": parsed.contact.location,
        "latest_company": parsed.work_experience[0].company if parsed.work_experience else None,
        "latest_title": parsed.work_experience[0].title if parsed.work_experience else None,
        "years_experience": len(parsed.work_experience),
        "highest_degree": parsed.education[0].degree if parsed.education else None,
        "skills": ", ".join(parsed.skills),
        "num_certifications": len(parsed.certifications),
    }
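Note that len(work_experience) counts positions held, not years worked. If you want an actual years-of-experience estimate from the parsed dates, a rough sketch (treats 'Present' as today, skips unparseable dates, and double-counts overlapping jobs):

```python
import re
from datetime import date

def months_between(start: str, end: str) -> int:
    """Months between two YYYY-MM strings; 'Present' means today.
    Returns 0 when either date cannot be parsed."""
    def parse(s: str):
        if s.lower() == "present":
            today = date.today()
            return today.year, today.month
        m = re.match(r"^(\d{4})-(\d{2})$", s)
        return (int(m.group(1)), int(m.group(2))) if m else None

    a, b = parse(start), parse(end)
    if a is None or b is None:
        return 0
    return max(0, (b[0] - a[0]) * 12 + (b[1] - a[1]))

def approximate_years(spans: list[tuple[str, str]]) -> float:
    """Total years across (start_date, end_date) pairs, rounded to one decimal."""
    return round(sum(months_between(s, e) for s, e in spans) / 12, 1)
```

Feed it [(job.start_date, job.end_date) for job in parsed.work_experience if job.start_date and job.end_date] from the parsed resume.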

Complete Pipeline

def process_resume(pdf_path: str) -> dict:
    """End-to-end resume processing pipeline."""
    # Extract text
    text = extract_text_from_pdf(pdf_path)

    if len(text.strip()) < 50:
        raise ValueError("PDF appears empty or unreadable. Try OCR.")

    # Parse with LLM
    parsed = parse_resume(text)

    # Return structured output
    return {
        "parsed": parsed.model_dump(exclude_none=True),
        "json": resume_to_json(parsed),
        "csv_row": resume_to_csv_row(parsed),
    }

result = process_resume("candidate_resume.pdf")
print(json.dumps(result["parsed"], indent=2))

FAQ

How accurate is LLM-based resume parsing compared to commercial parsers?

In tests on diverse resume formats, GPT-4o achieves 90-95% field-level accuracy on standard fields like name, email, and company names. Commercial parsers like Sovren or Textkernel achieve similar accuracy on standard formats but struggle more with creative or non-standard layouts where LLMs excel.

How do I handle multi-page resumes?

PyMuPDF concatenates all pages automatically. For resumes over 4 pages, the full text may exceed the model's optimal extraction context. In that case, extract contact info and summary from page 1, work experience from middle pages, and education/skills from the final section — then merge the results.
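The split-and-merge approach can be sketched with a heading-based splitter; the heading list is a heuristic assumption and will miss unusual section titles:

```python
import re

# Common resume section headings; extend this list for your corpus.
SECTION_HEADINGS = re.compile(
    r"^\s*(work experience|experience|employment|education|skills|"
    r"certifications|projects|summary|objective)\s*:?\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def split_sections(text: str) -> list[str]:
    """Split resume text at recognized headings so each chunk can be
    extracted separately and the results merged afterwards."""
    starts = [m.start() for m in SECTION_HEADINGS.finditer(text)]
    if not starts:
        return [text]
    bounds = [0] + starts + [len(text)]
    chunks = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [c for c in chunks if c]
```

Each chunk can then go through parse_resume with a reduced schema, and the partial results merged into one ParsedResume.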

What about data privacy when sending resumes to OpenAI?

Resumes contain sensitive personal information. OpenAI does not use API data for training by default, but confirm that policy satisfies your compliance requirements. For strict privacy requirements, run a local model via Ollama or vLLM, both of which expose OpenAI-compatible endpoints that Instructor can target. This keeps all data on your infrastructure.
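A sketch of that local setup, assuming Ollama's OpenAI-compatible endpoint on its default port and a model you have already pulled (the model name, port, and JSON mode are all assumptions; it reuses the ParsedResume schema from Step 1):

```python
import instructor
from openai import OpenAI

# Point the OpenAI client at a local Ollama server instead of api.openai.com.
# The api_key value is a placeholder; Ollama ignores it.
local_client = instructor.from_openai(
    OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"),
    mode=instructor.Mode.JSON,
)

def parse_resume_local(resume_text: str) -> ParsedResume:
    return local_client.chat.completions.create(
        model="llama3.1",  # any local model capable of JSON output
        response_model=ParsedResume,
        messages=[{"role": "user", "content": resume_text}],
    )
```

Expect lower extraction accuracy from small local models than from gpt-4o; the validation and retry logic from Step 4 becomes even more important here.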


#ResumeParser #DataExtraction #PDF #StructuredOutputs #Tutorial #AgenticAI #LearnAI #AIEngineering
