
Handwriting Recognition with AI Agents: Processing Handwritten Forms and Notes

Build an AI agent pipeline for handwriting recognition that processes handwritten forms and notes, extracts field values with confidence scoring, and routes low-confidence results to human reviewers for correction.

The Handwriting Problem

Despite decades of digitization, handwritten documents remain everywhere: patient intake forms, field inspection reports, warehouse inventory sheets, insurance claims, and school exams. These documents contain critical information locked in a format that traditional OCR struggles with.

Handwriting recognition (HTR — Handwritten Text Recognition) differs from printed text OCR in fundamental ways. Characters are connected, spacing is irregular, the same person writes the same letter differently depending on context, and individual writing styles vary enormously. Modern deep learning approaches have made HTR dramatically more capable, but building a production pipeline still requires careful engineering around confidence scoring, field extraction, and human review routing.

Setting Up the HTR Pipeline

pip install pytesseract opencv-python-headless Pillow torch torchvision transformers openai pydantic

Preprocessing Handwritten Documents

Handwritten forms need more aggressive preprocessing than printed documents:

flowchart LR
    IMG(["Scanned form"])
    PRE["Preprocess<br/>line removal, binarize"]
    SEG["Segmentation<br/>lines and words"]
    REC["Recognition<br/>Tesseract / vision LLM"]
    ROUTE["Confidence<br/>routing"]
    AUTO(["Auto-accept"])
    HUMAN(["Human review"])
    IMG --> PRE --> SEG --> REC --> ROUTE
    ROUTE --> AUTO
    ROUTE --> HUMAN
    style REC fill:#4f46e5,stroke:#4338ca,color:#fff
    style ROUTE fill:#f59e0b,stroke:#d97706,color:#1f2937
    style AUTO fill:#059669,stroke:#047857,color:#fff
import cv2
import numpy as np

def preprocess_handwriting(image_path: str) -> np.ndarray:
    """Preprocess handwritten document for recognition."""
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Remove ruled lines (common in forms)
    horizontal_kernel = cv2.getStructuringElement(
        cv2.MORPH_RECT, (40, 1)
    )
    detected_lines = cv2.morphologyEx(
        gray, cv2.MORPH_OPEN, horizontal_kernel
    )
    # Subtract lines from image
    clean = cv2.subtract(gray, detected_lines)

    # Adaptive binarization works better for variable ink density
    binary = cv2.adaptiveThreshold(
        clean, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY_INV, 21, 10
    )

    # Remove small noise blobs
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (2, 2))
    cleaned = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

    return cleaned
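The adaptive-threshold choice above is worth seeing in isolation. A toy NumPy sketch (entirely synthetic data, not a real scan) shows why a single global threshold fails when background brightness varies across the page while a per-column local threshold catches ink at both ends:

```python
import numpy as np

# A toy strip whose background ramps from light (200) to dark (80),
# with "ink" drawn 60 units darker than the local background.
bg = np.linspace(200, 80, 100)
strip = np.tile(bg, (20, 1))
strip[8:12, 10:20] -= 60   # ink on the light end
strip[8:12, 80:90] -= 60   # ink on the dark end

# A single global threshold misses the light-end ink and
# swallows the dark-end background:
global_mask = strip < 100

# A crude local threshold compares each pixel to its column mean,
# the same idea cv2.adaptiveThreshold applies per neighborhood:
local_mask = strip < (strip.mean(axis=0) - 30)

print(bool(global_mask[10, 15]), bool(global_mask[10, 85]))  # → False True
print(bool(local_mask[10, 15]), bool(local_mask[10, 85]))    # → True True
```

The global mask only finds ink where the page happens to be dark; the local mask finds it in both places, which is exactly the "variable ink density" situation the pipeline's adaptive binarization handles.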

Line and Word Segmentation

Before recognition, segment the document into individual lines and words:

from dataclasses import dataclass

@dataclass
class TextLine:
    image: np.ndarray
    bbox: tuple  # (x, y, w, h)
    line_number: int

@dataclass
class Word:
    image: np.ndarray
    bbox: tuple
    line_number: int
    word_index: int

def segment_lines(binary_image: np.ndarray) -> list[TextLine]:
    """Segment handwritten text into individual lines."""
    # Horizontal projection to find line boundaries
    h_projection = np.sum(binary_image, axis=1)

    lines = []
    in_line = False
    start = 0
    line_num = 0

    for y, val in enumerate(h_projection):
        if not in_line and val > 0:
            start = y
            in_line = True
        elif in_line and val == 0:
            if y - start > 10:  # Minimum line height
                line_img = binary_image[start:y, :]
                x_nonzero = np.where(np.sum(line_img, axis=0) > 0)[0]
                if len(x_nonzero) > 0:
                    x_start = x_nonzero[0]
                    x_end = x_nonzero[-1]
                    lines.append(TextLine(
                        image=line_img[:, x_start:x_end + 1],
                        bbox=(x_start, start, x_end - x_start, y - start),
                        line_number=line_num,
                    ))
                    line_num += 1
            in_line = False

    # Flush a final line that runs to the bottom edge of the image
    if in_line and len(h_projection) - start > 10:
        line_img = binary_image[start:, :]
        x_nonzero = np.where(np.sum(line_img, axis=0) > 0)[0]
        if len(x_nonzero) > 0:
            lines.append(TextLine(
                image=line_img[:, x_nonzero[0]:x_nonzero[-1] + 1],
                bbox=(int(x_nonzero[0]), start,
                      int(x_nonzero[-1] - x_nonzero[0]),
                      len(h_projection) - start),
                line_number=line_num,
            ))

    return lines

def segment_words(line: TextLine) -> list[Word]:
    """Segment a text line into individual words."""
    v_projection = np.sum(line.image, axis=0)

    # Measure every ink gap first: intra-word gaps are small,
    # inter-word gaps are larger, so a multiple of the median gap
    # makes a robust word-boundary threshold.
    gaps = []
    current_gap = 0
    for val in v_projection:
        if val == 0:
            current_gap += 1
        else:
            if current_gap > 0:
                gaps.append(current_gap)
            current_gap = 0

    gap_threshold = 10  # Fallback when the line has no internal gaps
    if gaps:
        gap_threshold = max(np.median(gaps) * 1.5, 10)

    words = []
    word_idx = 0
    in_word = False
    start = 0
    last_ink = 0
    gap_run = 0

    def emit(end: int) -> None:
        nonlocal word_idx
        if end - start > 5:  # Minimum word width
            words.append(Word(
                image=line.image[:, start:end],
                bbox=(
                    line.bbox[0] + start,
                    line.bbox[1],
                    end - start,
                    line.bbox[3],
                ),
                line_number=line.line_number,
                word_index=word_idx,
            ))
            word_idx += 1

    for x, val in enumerate(v_projection):
        if val > 0:
            if not in_word:
                start = x
                in_word = True
            elif gap_run > gap_threshold:
                # The gap just crossed was wide enough to end a word;
                # small gaps simply keep accumulating into the same word
                emit(last_ink + 1)
                start = x
            gap_run = 0
            last_ink = x
        elif in_word:
            gap_run += 1

    if in_word:
        emit(last_ink + 1)  # Flush the final word on the line

    return words
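The projection logic that drives both segmenters can be checked on a synthetic page. This standalone sketch applies the same horizontal-projection idea to a toy binary image with two ink bands standing in for two handwritten lines:

```python
import numpy as np

# Synthetic "page": two ink bands separated by blank rows
page = np.zeros((60, 100), dtype=np.uint8)
page[5:20, 10:90] = 255   # line 1
page[35:52, 5:70] = 255   # line 2

# Sum ink per row; zero rows are the gaps between lines
h_projection = np.sum(page, axis=1)

# Collect (start, end) row ranges where the projection is non-zero
bands = []
in_band, start = False, 0
for y, val in enumerate(h_projection):
    if not in_band and val > 0:
        start, in_band = y, True
    elif in_band and val == 0:
        bands.append((start, y))
        in_band = False
if in_band:
    bands.append((start, len(h_projection)))

print(bands)  # → [(5, 20), (35, 52)]
```

The recovered bands match the rows where ink was drawn, which is all `segment_lines` does before cropping each band to its ink extent.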

Multi-Engine Recognition with Confidence

Use multiple recognition approaches and compare results for higher accuracy:

import pytesseract
from PIL import Image

@dataclass
class RecognitionResult:
    text: str
    confidence: float
    engine: str

def recognize_with_tesseract(
    word_image: np.ndarray,
) -> RecognitionResult:
    """Recognize handwriting using Tesseract's LSTM engine."""
    pil_img = Image.fromarray(word_image)

    # PSM 8 = single word, OEM 1 = LSTM engine
    data = pytesseract.image_to_data(
        pil_img,
        config="--psm 8 --oem 1",
        output_type=pytesseract.Output.DICT,
    )

    # conf values arrive as strings (integer or float formatted,
    # depending on pytesseract version); parse with float() to handle both
    words = [t for t, c in zip(data["text"], data["conf"])
             if t.strip() and float(c) > 0]
    confs = [float(c) / 100.0 for t, c in zip(data["text"], data["conf"])
             if t.strip() and float(c) > 0]

    text = " ".join(words) if words else ""
    conf = sum(confs) / len(confs) if confs else 0.0

    return RecognitionResult(
        text=text, confidence=conf, engine="tesseract"
    )

def recognize_with_vision_llm(
    word_image: np.ndarray,
) -> RecognitionResult:
    """Use a vision LLM for difficult handwriting."""
    import base64
    import io

    from openai import OpenAI

    pil_img = Image.fromarray(word_image)
    buffer = io.BytesIO()
    pil_img.save(buffer, format="PNG")
    b64_image = base64.b64encode(buffer.getvalue()).decode()

    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": [
                {"type": "text", "text": (
                    "Read the handwritten text in this image. "
                    "Return ONLY the text, nothing else."
                )},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{b64_image}"
                }},
            ]},
        ],
    )

    return RecognitionResult(
        text=response.choices[0].message.content.strip(),
        # The API returns no per-token confidence, so assign a fixed prior
        confidence=0.85,
        engine="gpt-4o-vision",
    )

Confidence-Based Routing

Route results based on confidence to either automated processing or human review:

from enum import Enum

class ReviewDecision(Enum):
    AUTO_ACCEPT = "auto_accept"
    HUMAN_REVIEW = "human_review"
    REJECT = "reject"

def decide_review_route(
    results: list[RecognitionResult],
    high_threshold: float = 0.85,
    low_threshold: float = 0.4,
) -> dict:
    """Decide whether to auto-accept, route for review, or reject."""
    best = max(results, key=lambda r: r.confidence)

    # Check agreement between engines
    # (vacuously true when only one engine ran)
    texts = [r.text.lower().strip() for r in results if r.text]
    agreement = len(set(texts)) == 1 if texts else False

    if best.confidence >= high_threshold and agreement:
        return {
            "decision": ReviewDecision.AUTO_ACCEPT,
            "text": best.text,
            "confidence": best.confidence,
            "reason": "High confidence with engine agreement",
        }
    elif best.confidence < low_threshold:
        return {
            "decision": ReviewDecision.REJECT,
            "text": best.text,
            "confidence": best.confidence,
            "reason": "Confidence too low for reliable extraction",
        }
    else:
        return {
            "decision": ReviewDecision.HUMAN_REVIEW,
            "text": best.text,
            "confidence": best.confidence,
            "alternatives": [r.text for r in results],
            "reason": "Moderate confidence — needs human verification",
        }
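Exact string equality is a strict notion of agreement; engines often differ only in trailing whitespace or a single character. A softer alternative, sketched here as a hypothetical engine_agreement helper (not part of the pipeline above), scores near-matches with the standard library's difflib:

```python
from difflib import SequenceMatcher

def engine_agreement(texts: list[str]) -> float:
    """Worst pairwise similarity of engine outputs; 1.0 = exact agreement."""
    cleaned = [t.lower().strip() for t in texts if t and t.strip()]
    if len(cleaned) < 2:
        return 1.0  # Single engine: agreement is vacuous
    scores = []
    for i in range(len(cleaned)):
        for j in range(i + 1, len(cleaned)):
            scores.append(
                SequenceMatcher(None, cleaned[i], cleaned[j]).ratio()
            )
    return min(scores)

print(engine_agreement(["Smith", "smith "]))  # → 1.0
print(engine_agreement(["Smith", "Smyth"]))   # → 0.8
```

A routing rule like `engine_agreement(texts) > 0.9` would then treat case and spacing differences as agreement while still flagging genuine disagreements for review.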

Form Field Extraction

For structured forms, map recognized text to specific fields:


def extract_form_fields(
    image_path: str,
    field_definitions: list[dict],
) -> dict:
    """Extract named fields from a handwritten form."""
    preprocessed = preprocess_handwriting(image_path)
    results = {}

    for field_def in field_definitions:
        x, y, w, h = field_def["bbox"]
        field_image = preprocessed[y:y+h, x:x+w]

        tesseract_result = recognize_with_tesseract(field_image)

        if tesseract_result.confidence < 0.6:
            vision_result = recognize_with_vision_llm(field_image)
            route = decide_review_route([tesseract_result, vision_result])
        else:
            route = decide_review_route([tesseract_result])

        results[field_def["name"]] = {
            "value": route["text"],
            "confidence": route["confidence"],
            "review_status": route["decision"].value,
        }

    return results
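Extracted values still need per-field sanity checks before they reach a database. A minimal stdlib sketch, with hypothetical rules for illustrative field names like date_of_birth and phone (real forms would carry their own rules in the field definitions):

```python
import re
from datetime import datetime

def validate_field(name: str, value: str) -> tuple[bool, str]:
    """Return (is_valid, normalized_value) for a recognized field."""
    if name == "date_of_birth":
        try:
            # Require a parseable MM/DD/YYYY date
            datetime.strptime(value.strip(), "%m/%d/%Y")
            return True, value.strip()
        except ValueError:
            return False, value
    if name == "phone":
        # Normalize to bare digits and require a 10-digit number
        digits = re.sub(r"\D", "", value)
        return len(digits) == 10, digits
    # Default: any non-blank value passes
    return bool(value.strip()), value.strip()

print(validate_field("date_of_birth", "03/14/1985"))  # → (True, '03/14/1985')
print(validate_field("phone", "(555) 867-5309"))      # → (True, '5558675309')
```

A validation failure on a field that was auto-accepted is a strong signal to downgrade it to human review, since it means recognition confidence and semantic plausibility disagree.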

FAQ

How accurate is modern handwriting recognition?

On clean, legible handwriting, modern HTR systems achieve 85-95% character-level accuracy and 75-90% word-level accuracy. Accuracy drops significantly with cursive writing, poor ink quality, or unusual handwriting styles. The key to production reliability is confidence scoring combined with human review for uncertain results rather than trying to achieve perfect automated accuracy.
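Character-level accuracy is usually reported via its complement, the character error rate (CER): edit distance between the recognized text and the ground truth, divided by the reference length. A self-contained sketch of the standard dynamic-programming formulation:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # Single-row DP over the edit-distance table
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,      # deletion
                dp[j - 1] + 1,  # insertion
                prev + (reference[i - 1] != hypothesis[j - 1]),  # substitution
            )
            prev = cur
    return dp[n] / max(m, 1)

print(cer("handwriting", "handwr1ting"))  # → 1 substitution / 11 chars ≈ 0.0909
print(cer("form", "from"))                # → 0.5
```

A 90% character accuracy corresponds to a CER of 0.10; tracking CER per writer or per form type is a practical way to decide where the review thresholds should sit.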

Should I use Tesseract or a deep learning model for handwriting?

Tesseract LSTM (OEM 1) handles neat handwriting reasonably well and runs locally without GPU. For messy or cursive handwriting, deep learning models like TrOCR (from Microsoft) or vision LLMs significantly outperform Tesseract. The best production approach uses Tesseract as a fast first pass and escalates to a vision LLM only when Tesseract confidence is low.

How do I handle checkboxes and filled circles on handwritten forms?

Checkboxes and radio buttons need a different detection approach than text. Look for the pre-printed checkbox outline using template matching, then analyze the fill level inside the boundary. A filled ratio above 30-40% typically indicates a checked box. For ambiguous cases, use the same human review routing as low-confidence text.
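The fill-ratio heuristic from the answer above is a few lines of NumPy. A sketch, assuming the checkbox interior has already been located (e.g. via template matching) and binarized so that 0 is background and 255 is ink:

```python
import numpy as np

def checkbox_filled(binary_roi: np.ndarray, threshold: float = 0.35) -> bool:
    """Decide whether a checkbox region is checked by its ink fill ratio."""
    fill_ratio = np.count_nonzero(binary_roi) / binary_roi.size
    return fill_ratio >= threshold

# Synthetic 20x20 checkbox interiors
empty = np.zeros((20, 20), dtype=np.uint8)
checked = empty.copy()
checked[4:16, 4:16] = 255  # a rough pen mark covering 36% of the box

print(checkbox_filled(empty))    # → False
print(checkbox_filled(checked))  # → True
```

Boxes whose fill ratio lands near the threshold (say 0.25 to 0.45) are the ambiguous cases that should go through the same human review routing as low-confidence text.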


#HandwritingRecognition #HTR #FormProcessing #OCR #HumanInTheLoop #DocumentAI #AgenticAI #Python

