
Building an Image Analysis Agent: OCR, Object Detection, and Visual QA

Build a Python-based image analysis agent that performs OCR text extraction, object detection, and visual question answering. Includes preprocessing pipelines and structured output formatting.

What an Image Analysis Agent Does

An image analysis agent accepts an image and a natural language question, then uses a combination of computer vision tools — OCR, object detection, and visual question answering — to produce a structured answer. Unlike a simple API call to a vision model, an agent can decide which tools to apply based on the question, chain multiple analysis steps, and format results according to the user's needs.

Setting Up the Vision Toolbox

The agent needs three core capabilities. Start by installing the dependencies:

pip install openai pillow pytesseract ultralytics

Note that pytesseract is only a Python wrapper; the Tesseract binary must be installed separately (for example, apt install tesseract-ocr on Debian/Ubuntu or brew install tesseract on macOS).

Each tool serves a distinct purpose:

  • OCR (Tesseract) — extracts text from images, useful for documents, signs, and labels
  • Object Detection (YOLO) — identifies and locates objects with bounding boxes
  • Visual QA (GPT-4o) — answers open-ended questions about image content

Image Preprocessing Pipeline

Raw images often need preprocessing before analysis. Resizing, normalization, and format conversion improve accuracy across all tools:

from PIL import Image, ImageEnhance, ImageFilter
import io

def preprocess_image(
    image_bytes: bytes,
    max_dimension: int = 2048,
    enhance_for_ocr: bool = False,
) -> Image.Image:
    """Preprocess an image for analysis."""
    img = Image.open(io.BytesIO(image_bytes))

    # Convert to RGB if necessary
    if img.mode != "RGB":
        img = img.convert("RGB")

    # Resize if too large (preserves aspect ratio)
    if max(img.size) > max_dimension:
        ratio = max_dimension / max(img.size)
        new_size = (int(img.width * ratio), int(img.height * ratio))
        img = img.resize(new_size, Image.LANCZOS)

    # Enhance for OCR: sharpen and increase contrast
    if enhance_for_ocr:
        img = img.filter(ImageFilter.SHARPEN)
        enhancer = ImageEnhance.Contrast(img)
        img = enhancer.enhance(1.5)

    return img
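The resize branch scales both dimensions by the same ratio, so aspect ratio is preserved. The arithmetic can be checked in isolation; fit_within is a hypothetical helper that mirrors the resize logic above:

```python
def fit_within(width: int, height: int, max_dimension: int = 2048) -> tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_dimension."""
    longest = max(width, height)
    if longest <= max_dimension:
        return width, height  # already small enough; no resize needed
    ratio = max_dimension / longest
    return int(width * ratio), int(height * ratio)

print(fit_within(4096, 1024))  # (2048, 512) -- longer side capped at 2048
print(fit_within(800, 600))    # (800, 600) -- untouched
```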

Building the OCR Tool

Tesseract handles text extraction. Wrap it as an agent tool with structured output:

import pytesseract
from dataclasses import dataclass

@dataclass
class OCRResult:
    full_text: str
    confidence: float
    word_count: int
    blocks: list[dict]

def extract_text(img: Image.Image) -> OCRResult:
    """Extract text from an image using Tesseract OCR."""
    # Get detailed data including confidence scores
    data = pytesseract.image_to_data(
        img, output_type=pytesseract.Output.DICT
    )

    words = []
    confidences = []
    for i, text in enumerate(data["text"]):
        # conf may arrive as a float-formatted string; -1 marks non-word boxes
        conf = int(float(data["conf"][i]))
        if conf > 0 and text.strip():
            words.append(text.strip())
            confidences.append(conf)

    full_text = " ".join(words)
    avg_confidence = (
        sum(confidences) / len(confidences) if confidences else 0.0
    )

    # Build text blocks by grouping lines
    blocks = []
    current_block = []
    current_block_num = -1
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue
        block_num = data["block_num"][i]
        if block_num != current_block_num:
            if current_block:
                blocks.append({"text": " ".join(current_block)})
            current_block = [text.strip()]
            current_block_num = block_num
        else:
            current_block.append(text.strip())
    if current_block:
        blocks.append({"text": " ".join(current_block)})

    return OCRResult(
        full_text=full_text,
        confidence=avg_confidence,
        word_count=len(words),
        blocks=blocks,
    )
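The block-grouping loop can be exercised without Tesseract installed by feeding it a synthetic Output.DICT-shaped payload. group_blocks below is a hypothetical standalone copy of the grouping logic inside extract_text:

```python
def group_blocks(data: dict) -> list[dict]:
    """Group non-empty words into text blocks by Tesseract block number."""
    blocks, current_block, current_block_num = [], [], -1
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue  # skip empty entries Tesseract emits for layout
        block_num = data["block_num"][i]
        if block_num != current_block_num:
            if current_block:
                blocks.append({"text": " ".join(current_block)})
            current_block = [text.strip()]
            current_block_num = block_num
        else:
            current_block.append(text.strip())
    if current_block:
        blocks.append({"text": " ".join(current_block)})
    return blocks

# Synthetic payload shaped like pytesseract.Output.DICT
fake = {
    "text": ["Invoice", "#42", "", "Total:", "$10"],
    "block_num": [1, 1, 1, 2, 2],
}
print(group_blocks(fake))  # [{'text': 'Invoice #42'}, {'text': 'Total: $10'}]
```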

Object Detection with YOLO

The YOLO model identifies objects and their locations within an image:

from ultralytics import YOLO

@dataclass
class DetectedObject:
    label: str
    confidence: float
    bbox: tuple[int, int, int, int]  # x1, y1, x2, y2

_yolo_model: YOLO | None = None

def detect_objects(
    img: Image.Image, confidence_threshold: float = 0.5
) -> list[DetectedObject]:
    """Detect objects in an image using YOLOv8."""
    global _yolo_model
    if _yolo_model is None:
        # Load the nano model once and reuse it -- reloading weights per call is slow
        _yolo_model = YOLO("yolov8n.pt")
    results = _yolo_model(img, verbose=False)

    detected = []
    for result in results:
        for box in result.boxes:
            conf = float(box.conf[0])
            if conf >= confidence_threshold:
                x1, y1, x2, y2 = box.xyxy[0].tolist()
                label = result.names[int(box.cls[0])]
                detected.append(DetectedObject(
                    label=label,
                    confidence=round(conf, 3),
                    bbox=(int(x1), int(y1), int(x2), int(y2)),
                ))
    return detected

The Agent: Routing Questions to Tools

The agent decides which tools to use based on the user's question. A keyword-based router works well for most cases:

import openai
import base64

class ImageAnalysisAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()

    def _select_tools(self, question: str) -> list[str]:
        """Select which tools to run based on the question."""
        q = question.lower()
        tools = []
        if any(kw in q for kw in ["text", "read", "ocr", "written", "says"]):
            tools.append("ocr")
        if any(kw in q for kw in ["object", "detect", "find", "count", "how many"]):
            tools.append("detection")
        # Always include VQA as the reasoning backbone
        tools.append("vqa")
        return tools

    async def analyze(
        self, image_bytes: bytes, question: str
    ) -> dict:
        selected_tools = self._select_tools(question)
        context_parts = []

        img = preprocess_image(image_bytes)

        if "ocr" in selected_tools:
            ocr_result = extract_text(
                preprocess_image(image_bytes, enhance_for_ocr=True)
            )
            context_parts.append(
                f"OCR extracted text ({ocr_result.word_count} words, "
                f"confidence {ocr_result.confidence:.1f}%): "
                f"{ocr_result.full_text}"
            )

        if "detection" in selected_tools:
            objects = detect_objects(img)
            obj_summary = ", ".join(
                f"{o.label} ({o.confidence:.0%})" for o in objects
            )
            context_parts.append(
                f"Detected objects: {obj_summary or 'none'}"
            )

        # VQA with GPT-4o, enriched by tool outputs
        # Note: the data URL below assumes PNG input; sniff the real format in production
        b64 = base64.b64encode(image_bytes).decode()
        tool_context = "\n".join(context_parts)

        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            f"Tool analysis results:\n{tool_context}\n\n"
                            f"Question: {question}"
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }],
        )
        return {
            "answer": response.choices[0].message.content,
            "tools_used": selected_tools,
        }
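Because the routing step is deterministic, it can be sanity-checked without an API key. select_tools is a hypothetical standalone copy of the _select_tools keyword logic:

```python
def select_tools(question: str) -> list[str]:
    """Pick tools from question keywords; VQA is always included."""
    q = question.lower()
    tools = []
    if any(kw in q for kw in ["text", "read", "ocr", "written", "says"]):
        tools.append("ocr")
    if any(kw in q for kw in ["object", "detect", "find", "count", "how many"]):
        tools.append("detection")
    tools.append("vqa")  # reasoning backbone runs on every request
    return tools

print(select_tools("What text is written on the sign?"))  # ['ocr', 'vqa']
print(select_tools("How many cars are visible?"))         # ['detection', 'vqa']
print(select_tools("Describe the mood of the scene"))     # ['vqa']
```

A keyword router is cheap and predictable; if the keyword lists grow unwieldy, the selection step itself can be delegated to a small LLM call at the cost of latency.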

Structured Output Formatting

For programmatic consumers, format the analysis results as structured JSON:


from pydantic import BaseModel

class ImageAnalysisResult(BaseModel):
    answer: str
    extracted_text: str | None = None
    detected_objects: list[dict] | None = None
    tools_used: list[str]
    confidence: float

FAQ

When should I use OCR versus a vision language model for text extraction?

Use Tesseract OCR when you need precise character-level extraction from clean documents, invoices, or printed text. Use a vision language model like GPT-4o when the text is embedded in complex scenes, handwritten, or when you also need to understand the context around the text. For best results, run both and let the agent cross-reference the outputs.
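One simple cross-referencing policy: trust Tesseract when its average confidence is high, and fall back to the vision model's transcription otherwise. pick_transcription is a hypothetical helper sketching this, with the threshold as an assumption to tune:

```python
def pick_transcription(
    ocr_text: str,
    ocr_confidence: float,
    vlm_text: str,
    threshold: float = 80.0,
) -> tuple[str, str]:
    """Return (chosen_text, source); ocr_confidence is on Tesseract's 0-100 scale."""
    if ocr_text.strip() and ocr_confidence >= threshold:
        return ocr_text, "tesseract"
    return vlm_text, "vlm"

print(pick_transcription("Total: $10", 93.5, "The receipt says total $10"))
# ('Total: $10', 'tesseract')
print(pick_transcription("T0ta1 Sl0", 41.2, "The receipt says total $10"))
# ('The receipt says total $10', 'vlm')
```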

How do I handle images that are too large for the API?

Resize images to a maximum dimension of 2048 pixels while preserving the aspect ratio, as shown in the preprocessing function. For GPT-4o specifically, the API automatically handles resizing, but sending smaller images reduces latency and cost. If detail is critical for a specific region, crop that region and send it as a separate analysis request.

Can this agent process multiple images in a single request?

Yes. Extend the analyze method to accept a list of image bytes. Process each image independently through the tool pipeline, then send all results along with all images to the VQA step. GPT-4o supports multiple images in a single message, so the reasoning model can compare and cross-reference across images.
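Assembling the multi-image payload is plain data construction and can be tested offline. build_multi_image_message is a hypothetical helper that pairs one text part with N image parts in the content-list format the chat completions API accepts:

```python
def build_multi_image_message(question: str, images_b64: list[str]) -> dict:
    """Build a single user message carrying a question plus several images."""
    content = [{"type": "text", "text": question}]
    for b64 in images_b64:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    return {"role": "user", "content": content}

msg = build_multi_image_message("Which photo shows the invoice?", ["AAA=", "BBB="])
print(len(msg["content"]))  # 3 -- one text part plus two image parts
```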


#ImageAnalysis #OCR #ObjectDetection #VisualQA #ComputerVision #AgenticAI #LearnAI #AIEngineering
