Building a Resume Parser with Structured Outputs: End-to-End Tutorial
Build a complete resume parsing pipeline from PDF to structured data. Covers PDF text extraction, schema design for work experience and education, LLM extraction, validation, and output formatting.
Why Build a Resume Parser?
Resume parsing is a classic structured extraction problem. Resumes contain predictable data types (names, dates, companies, skills) but wildly inconsistent formatting. Traditional regex-based parsers break on every new resume template. LLM-based parsers handle any format because they understand the content semantically, not syntactically.
In this tutorial, you will build a complete pipeline: PDF input, text extraction, LLM-powered structured extraction, validation, and clean JSON output.
Step 1: Define the Schema
Start by modeling what a parsed resume looks like:
from pydantic import BaseModel, Field, field_validator
from typing import List, Optional
from datetime import date
class ContactInfo(BaseModel):
    full_name: str
    email: Optional[str] = None
    phone: Optional[str] = None
    location: Optional[str] = Field(default=None, description="City, State or City, Country")
    linkedin_url: Optional[str] = None
    portfolio_url: Optional[str] = None

class WorkExperience(BaseModel):
    company: str
    title: str
    start_date: Optional[str] = Field(default=None, description="YYYY-MM format")
    end_date: Optional[str] = Field(default=None, description="YYYY-MM or 'Present'")
    location: Optional[str] = None
    description: Optional[str] = None
    achievements: List[str] = Field(default_factory=list)

class Education(BaseModel):
    institution: str
    degree: Optional[str] = None
    field_of_study: Optional[str] = None
    start_date: Optional[str] = None
    end_date: Optional[str] = None
    gpa: Optional[float] = Field(default=None, ge=0.0, le=4.0)

class ParsedResume(BaseModel):
    contact: ContactInfo
    summary: Optional[str] = Field(default=None, description="Professional summary or objective")
    work_experience: List[WorkExperience]
    education: List[Education]
    skills: List[str]
    certifications: List[str] = Field(default_factory=list)
    languages: List[str] = Field(default_factory=list)
Design choices matter here. Using Optional with None defaults means the model will not hallucinate values for missing fields. The YYYY-MM format for dates handles the common resume pattern where exact days are not listed.
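You can verify this behavior in isolation. The sketch below uses a trimmed-down model (not the full schema above) to show that omitted optional fields stay None and disappear from exclude_none output:

```python
from typing import Optional
from pydantic import BaseModel

class Job(BaseModel):
    company: str                    # required
    end_date: Optional[str] = None  # optional: stays None when absent

# Validate a dict that omits end_date, as an LLM would for a current job
job = Job.model_validate({"company": "Acme"})
print(job.end_date)                            # None
print(job.model_dump_json(exclude_none=True))  # {"company":"Acme"}
```

Because the JSON schema sent to the model marks these fields as nullable, the model has a legitimate "I don't know" option instead of being pushed to invent a value.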
Step 2: Extract Text from PDF
Use PyMuPDF (fitz) for reliable text extraction:
pip install pymupdf
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract text from a PDF file, preserving basic structure."""
    doc = fitz.open(pdf_path)
    pages = []
    for page in doc:
        text = page.get_text("text")
        pages.append(text)
    doc.close()
    return "\n\n".join(pages)
# Usage
resume_text = extract_text_from_pdf("resume.pdf")
print(f"Extracted {len(resume_text)} characters")
PyMuPDF handles most PDF formats, including those with columns, tables, and embedded fonts. For scanned PDFs (images), you would need OCR — add pytesseract as a preprocessing step.
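Extracted text often carries layout noise: runs of spaces, tabs, and blank lines. An optional normalization pass (a stdlib-only sketch, not a PyMuPDF feature) tidies the text before the LLM call:

```python
import re

def clean_resume_text(text: str) -> str:
    """Collapse extraction noise while keeping paragraph breaks."""
    text = text.replace("\r\n", "\n")       # normalize line endings
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # cap consecutive blank lines
    return text.strip()

print(clean_resume_text("John  Doe\n\n\n\nEngineer\t at  Acme"))
# John Doe
#
# Engineer at Acme
```

Cleaner input also trims the token count slightly, which adds up if you process resumes in bulk.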
Step 3: LLM Extraction
Send the extracted text to the LLM with your schema:
import instructor
from openai import OpenAI
client = instructor.from_openai(OpenAI())
def parse_resume(resume_text: str) -> ParsedResume:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=ParsedResume,
        max_retries=3,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an expert resume parser. Extract structured data "
                    "from the resume text. Rules:\n"
                    "- Only extract information explicitly stated in the resume\n"
                    "- Use null for fields not present in the text\n"
                    "- List achievements as separate bullet points\n"
                    "- Normalize dates to YYYY-MM format when possible\n"
                    "- List skills as individual items, not comma-separated strings"
                ),
            },
            {"role": "user", "content": resume_text},
        ],
    )
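Very long resumes can blow past a sensible context budget. A rough character-based guard (the limit here is an assumption; tune it for your model) keeps the call cheap and predictable by cutting at a paragraph boundary and keeping the start of the document, where contact info usually lives:

```python
MAX_CHARS = 24_000  # assumed budget; tune for your model's context window

def truncate_resume(text: str, max_chars: int = MAX_CHARS) -> str:
    """Trim oversized resume text at the last paragraph break that
    fits within the budget, keeping the beginning of the document."""
    if len(text) <= max_chars:
        return text
    cut = text.rfind("\n\n", 0, max_chars)  # last paragraph break in budget
    return text[:cut] if cut > 0 else text[:max_chars]
```

Call it on the extracted text before passing it to parse_resume.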
Step 4: Add Validation
Add validators that catch common LLM extraction errors:
from pydantic import model_validator
import re
class ValidatedResume(ParsedResume):
    @model_validator(mode="after")
    def validate_work_dates(self) -> "ValidatedResume":
        """Ensure work experience dates follow the YYYY-MM convention."""
        date_pattern = re.compile(r"^\d{4}-(0[1-9]|1[0-2])$")
        for job in self.work_experience:
            # Per the schema, start_date must be YYYY-MM; only end_date
            # may be "Present"
            if job.start_date and not date_pattern.match(job.start_date):
                raise ValueError(
                    f"Invalid start_date format: '{job.start_date}' for {job.company}"
                )
            if job.end_date and job.end_date.lower() != "present":
                if not date_pattern.match(job.end_date):
                    raise ValueError(
                        f"Invalid end_date format: '{job.end_date}' for {job.company}"
                    )
        return self

    @field_validator("skills")
    @classmethod
    def deduplicate_skills(cls, v: List[str]) -> List[str]:
        """Remove duplicate skills (case-insensitive)."""
        seen = set()
        unique = []
        for skill in v:
            normalized = skill.lower().strip()
            if normalized not in seen:
                seen.add(normalized)
                unique.append(skill.strip())
        return unique
To activate these checks, pass ValidatedResume instead of ParsedResume as the response_model. When Instructor detects a validation error, it automatically retries the LLM call with the error message appended: the model sees "Invalid start_date format: 'March 2022'" and corrects it to "2022-03" on the next attempt.
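To see the error side of that loop in isolation, here is a standalone miniature (a trimmed Job model, not the full ValidatedResume) showing the message Pydantic produces, which Instructor then feeds back to the model:

```python
import re
from typing import Optional
from pydantic import BaseModel, ValidationError, field_validator

DATE_PATTERN = re.compile(r"^\d{4}-(0[1-9]|1[0-2])$")

class Job(BaseModel):
    company: str
    start_date: Optional[str] = None

    @field_validator("start_date")
    @classmethod
    def check_start_date(cls, v: Optional[str]) -> Optional[str]:
        if v is not None and not DATE_PATTERN.match(v):
            raise ValueError(f"Invalid start_date format: '{v}'")
        return v

try:
    Job(company="Acme", start_date="March 2022")
except ValidationError as exc:
    error_text = str(exc)  # the message Instructor would resend to the model
print(error_text)
```

The retry prompt contains exactly this text, which is why writing specific, actionable error messages in validators directly improves correction rates.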
Step 5: Output Formatting
Convert the parsed resume to your target format:
import json

def resume_to_json(parsed: ParsedResume) -> str:
    """Export parsed resume as formatted JSON."""
    return parsed.model_dump_json(indent=2, exclude_none=True)

def resume_to_csv_row(parsed: ParsedResume) -> dict:
    """Flatten resume for CSV/spreadsheet export.

    Assumes work experience and education are listed in reverse-chronological
    order, so index 0 is the most recent entry.
    """
    return {
        "name": parsed.contact.full_name,
        "email": parsed.contact.email,
        "phone": parsed.contact.phone,
        "location": parsed.contact.location,
        "latest_company": parsed.work_experience[0].company if parsed.work_experience else None,
        "latest_title": parsed.work_experience[0].title if parsed.work_experience else None,
        "num_positions": len(parsed.work_experience),
        "highest_degree": parsed.education[0].degree if parsed.education else None,
        "skills": ", ".join(parsed.skills),
        "num_certifications": len(parsed.certifications),
    }
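To go from flattened rows to an actual CSV file, the stdlib csv module is enough. This helper is an addition to the tutorial code; it assumes every row shares the same keys, which resume_to_csv_row guarantees:

```python
import csv
import io

def rows_to_csv(rows: list) -> str:
    """Serialize flattened resume rows to CSV text. All rows are assumed
    to have identical keys (as produced by resume_to_csv_row)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

print(rows_to_csv([{"name": "Ada Lovelace", "email": "ada@example.com"}]))
```

Using DictWriter keeps the column order tied to the dict keys, so the header always matches the flattening function.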
Complete Pipeline
def process_resume(pdf_path: str) -> dict:
    """End-to-end resume processing pipeline."""
    # Extract text
    text = extract_text_from_pdf(pdf_path)
    if len(text.strip()) < 50:
        raise ValueError("PDF appears empty or unreadable. Try OCR.")
    # Parse with LLM
    parsed = parse_resume(text)
    # Return structured output
    return {
        "parsed": parsed.model_dump(exclude_none=True),
        "json": resume_to_json(parsed),
        "csv_row": resume_to_csv_row(parsed),
    }

result = process_resume("candidate_resume.pdf")
print(json.dumps(result["parsed"], indent=2))
FAQ
How accurate is LLM-based resume parsing compared to commercial parsers?
In informal tests on diverse resume formats, GPT-4o reaches roughly 90-95% field-level accuracy on standard fields like name, email, and company names. Commercial parsers such as Sovren or Textkernel achieve similar accuracy on standard formats but struggle more with creative or non-standard layouts, where LLMs excel.
How do I handle multi-page resumes?
PyMuPDF concatenates all pages automatically. For resumes over 4 pages, the full text may exceed the model's optimal extraction context. In that case, extract contact info and summary from page 1, work experience from middle pages, and education/skills from the final section — then merge the results.
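One way to implement that split is to cut the text on common section headings before making a per-chunk LLM call. The heading list below is illustrative, not exhaustive, and real resumes will need more variants:

```python
import re

SECTION_HEADINGS = re.compile(
    r"^(experience|work experience|education|skills|certifications)\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def split_sections(text: str) -> dict:
    """Split resume text on section headings; everything before the first
    heading (usually contact info and summary) goes under 'header'."""
    matches = list(SECTION_HEADINGS.finditer(text))
    if not matches:
        return {"header": text}
    sections = {"header": text[:matches[0].start()].strip()}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group(1).lower()] = text[m.end():end].strip()
    return sections
```

Each section can then be sent to the LLM with only the relevant sub-model (e.g. List[WorkExperience]) as the response_model, and the results merged into one ParsedResume.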
What about data privacy when sending resumes to OpenAI?
Resumes contain sensitive personal information. Use OpenAI's API data usage policy (API data is not used for training by default). For strict privacy requirements, run a local model via Ollama or vLLM with Instructor's OpenAI-compatible mode. This keeps all data on your infrastructure.
#ResumeParser #DataExtraction #PDF #StructuredOutputs #Tutorial #AgenticAI #LearnAI #AIEngineering