Screenshot Analysis Agent: Understanding UI Elements and Generating Descriptions

Build a screenshot analysis agent that detects UI elements, analyzes layouts, and generates accessibility descriptions. Learn techniques for button detection, form analysis, and hierarchical layout understanding.

Why Screenshot Analysis Matters for AI Agents

Screenshot analysis is the foundation of computer use agents, automated QA testing, and accessibility tooling. An agent that can look at a screenshot and understand what UI elements are present — buttons, text fields, navigation menus, data tables — can then interact with those elements, verify their correctness, or generate descriptions for users who rely on screen readers.

Setting Up the Agent

pip install openai pillow numpy

The agent combines vision-model analysis with structured output parsing to deliver actionable UI understanding. The high-level agent loop looks like this:

flowchart LR
    INPUT(["User intent"])
    PARSE["Parse plus<br/>classify"]
    PLAN["Plan and tool<br/>selection"]
    AGENT["Agent loop<br/>LLM plus tools"]
    GUARD{"Guardrails<br/>and policy"}
    EXEC["Execute and<br/>verify result"]
    OBS[("Trace and metrics")]
    OUT(["Outcome plus<br/>next action"])
    INPUT --> PARSE --> PLAN --> AGENT --> GUARD
    GUARD -->|Pass| EXEC --> OUT
    GUARD -->|Fail| AGENT
    AGENT --> OBS
    style AGENT fill:#4f46e5,stroke:#4338ca,color:#fff
    style GUARD fill:#f59e0b,stroke:#d97706,color:#1f2937
    style OBS fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style OUT fill:#059669,stroke:#047857,color:#fff

Detecting UI Elements with Vision Models

Rather than training custom object detection models for every UI framework, modern vision language models can identify UI elements directly from screenshots:

import openai
import base64
from dataclasses import dataclass
from pydantic import BaseModel

class UIElement(BaseModel):
    element_type: str  # button, input, link, text, image, etc.
    label: str
    bounding_box: dict  # {x, y, width, height} as percentages
    state: str = "default"  # default, disabled, focused, error
    description: str = ""

class ScreenAnalysis(BaseModel):
    page_type: str  # login, dashboard, form, list, etc.
    elements: list[UIElement]
    layout_description: str
    accessibility_issues: list[str]

async def analyze_screenshot(
    image_bytes: bytes,
    client: openai.AsyncOpenAI,
) -> ScreenAnalysis:
    """Analyze a screenshot and identify all UI elements."""
    b64 = base64.b64encode(image_bytes).decode()

    response = await client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a UI analysis expert. Analyze the "
                    "screenshot and identify all interactive and "
                    "informational UI elements. For each element, "
                    "provide its type, label, approximate bounding "
                    "box as percentage coordinates (x, y from "
                    "top-left, width, height), current state, and "
                    "a brief description. Also identify the page "
                    "type, overall layout, and any accessibility "
                    "issues."
                ),
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{b64}"
                        },
                    },
                    {
                        "type": "text",
                        "text": "Analyze this UI screenshot.",
                    },
                ],
            },
        ],
        response_format=ScreenAnalysis,
    )
    return response.choices[0].message.parsed
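The encoding step above is easy to verify offline. A minimal sketch (no API call, standard library only) showing how raw PNG bytes become the data URL the vision model expects; the 8-byte PNG signature stands in for a real screenshot:

```python
import base64

# A tiny stand-in for real screenshot bytes -- the 8-byte PNG signature.
# In practice these come from a file or a screen-capture library.
image_bytes = b"\x89PNG\r\n\x1a\n"

# Encode to base64 and build the data URL, as in analyze_screenshot.
b64 = base64.b64encode(image_bytes).decode()
data_url = f"data:image/png;base64,{b64}"

print(data_url)  # data:image/png;base64,iVBORw0KGgo=
```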

Layout Analysis: Understanding Spatial Relationships

Beyond identifying individual elements, the agent must understand how elements relate to each other spatially. This is critical for generating meaningful descriptions and for computer use agents that need to navigate layouts:

@dataclass
class LayoutRegion:
    name: str  # header, sidebar, main_content, footer, modal
    elements: list[UIElement]
    bounds: dict  # {x, y, width, height}

def group_elements_by_region(
    elements: list[UIElement],
) -> list[LayoutRegion]:
    """Group UI elements into layout regions based on position."""
    regions = {
        "header": LayoutRegion("header", [], {
            "x": 0, "y": 0, "width": 100, "height": 15
        }),
        "sidebar": LayoutRegion("sidebar", [], {
            "x": 0, "y": 15, "width": 20, "height": 70
        }),
        "main_content": LayoutRegion("main_content", [], {
            "x": 20, "y": 15, "width": 80, "height": 70
        }),
        "footer": LayoutRegion("footer", [], {
            "x": 0, "y": 85, "width": 100, "height": 15
        }),
    }

    for element in elements:
        box = element.bounding_box
        center_x = box.get("x", 0) + box.get("width", 0) / 2
        center_y = box.get("y", 0) + box.get("height", 0) / 2

        assigned = False
        for region in regions.values():
            rb = region.bounds
            if (rb["x"] <= center_x <= rb["x"] + rb["width"]
                    and rb["y"] <= center_y <= rb["y"] + rb["height"]):
                region.elements.append(element)
                assigned = True
                break

        if not assigned:
            regions["main_content"].elements.append(element)

    return [r for r in regions.values() if r.elements]
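The center-point assignment can be sanity-checked without the Pydantic models. A minimal sketch using plain dicts, with the same hard-coded percentage bounds as `group_elements_by_region`:

```python
# Region bounds as viewport percentages, matching the defaults above.
regions = {
    "header":       {"x": 0,  "y": 0,  "width": 100, "height": 15},
    "sidebar":      {"x": 0,  "y": 15, "width": 20,  "height": 70},
    "main_content": {"x": 20, "y": 15, "width": 80,  "height": 70},
    "footer":       {"x": 0,  "y": 85, "width": 100, "height": 15},
}

def assign_region(box: dict) -> str:
    """Return the first region whose bounds contain the element's center."""
    cx = box.get("x", 0) + box.get("width", 0) / 2
    cy = box.get("y", 0) + box.get("height", 0) / 2
    for name, rb in regions.items():
        if (rb["x"] <= cx <= rb["x"] + rb["width"]
                and rb["y"] <= cy <= rb["y"] + rb["height"]):
            return name
    return "main_content"  # fallback, as in the full implementation

# A logo in the top-left corner lands in the header...
print(assign_region({"x": 2, "y": 2, "width": 10, "height": 5}))    # header
# ...while a centered submit button lands in the main content area.
print(assign_region({"x": 45, "y": 50, "width": 10, "height": 5}))  # main_content
```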

Generating Accessibility Descriptions

A key application is generating descriptions for accessibility auditing or screen reader content:

def generate_accessibility_description(
    analysis: ScreenAnalysis,
) -> str:
    """Generate an accessibility-oriented description of the UI."""
    regions = group_elements_by_region(analysis.elements)

    lines = [
        f"Page type: {analysis.page_type}",
        f"Layout: {analysis.layout_description}",
        "",
    ]

    for region in regions:
        lines.append(f"## {region.name.replace('_', ' ').title()}")
        for elem in region.elements:
            state_info = (
                f" ({elem.state})" if elem.state != "default" else ""
            )
            lines.append(
                f"- [{elem.element_type}] {elem.label}{state_info}"
            )
            if elem.description:
                lines.append(f"  {elem.description}")
        lines.append("")

    if analysis.accessibility_issues:
        lines.append("## Accessibility Issues")
        for issue in analysis.accessibility_issues:
            lines.append(f"- {issue}")

    return "\n".join(lines)
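To get a feel for the output shape, here is the same line format applied to two hand-written elements; plain dicts stand in for the UIElement model, and the labels are illustrative:

```python
elements = [
    {"element_type": "input", "label": "Email", "state": "error",
     "description": "Email field showing a validation error."},
    {"element_type": "button", "label": "Sign in", "state": "default",
     "description": ""},
]

# Same formatting rules as generate_accessibility_description:
# state shown only when non-default, description indented beneath.
lines = ["## Main Content"]
for elem in elements:
    state_info = f" ({elem['state']})" if elem["state"] != "default" else ""
    lines.append(f"- [{elem['element_type']}] {elem['label']}{state_info}")
    if elem["description"]:
        lines.append(f"  {elem['description']}")

print("\n".join(lines))
```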

The Complete Screenshot Agent

class ScreenshotAnalysisAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()
        self.last_analysis: ScreenAnalysis | None = None

    async def analyze(self, image_bytes: bytes) -> dict:
        self.last_analysis = await analyze_screenshot(
            image_bytes, self.client
        )
        description = generate_accessibility_description(
            self.last_analysis
        )
        return {
            "page_type": self.last_analysis.page_type,
            "element_count": len(self.last_analysis.elements),
            "description": description,
            "issues": self.last_analysis.accessibility_issues,
        }

    def find_element(self, label: str) -> UIElement | None:
        """Find a UI element by its label."""
        if not self.last_analysis:
            return None
        label_lower = label.lower()
        for elem in self.last_analysis.elements:
            if label_lower in elem.label.lower():
                return elem
        return None

FAQ

How accurate are vision models at detecting UI elements compared to DOM-based approaches?

Vision models like GPT-4o achieve approximately 85-90% accuracy for common UI element detection, which is sufficient for most use cases. DOM-based approaches are more precise when available, but they require browser access and do not work for native applications, images of UIs, or design mockups. The vision-based approach is universally applicable — it works on any screenshot regardless of the technology behind the UI.


Can this agent handle dynamic UI elements like dropdown menus or modals?

Yes. When a dropdown is open or a modal is visible, those elements appear in the screenshot and the vision model identifies them. For comprehensive analysis of a dynamic page, take multiple screenshots showing different states — the initial state, after clicking a dropdown, after opening a modal — and analyze each separately. The agent can compare analyses to build a complete picture of the UI's interactive behavior.
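That comparison step can be sketched with set operations. The `diff_states` helper below is hypothetical (not part of the agent above); it takes the element labels pulled from two ScreenAnalysis results and reports what the interaction revealed or hid:

```python
def diff_states(before: list[str], after: list[str]) -> dict:
    """Compare element labels from two analyses of the same page,
    e.g. before and after opening a dropdown."""
    before_set, after_set = set(before), set(after)
    return {
        "revealed": sorted(after_set - before_set),  # appeared after the interaction
        "hidden": sorted(before_set - after_set),    # disappeared
    }

# Opening a "File" menu reveals its items:
delta = diff_states(
    before=["File", "Edit", "Help"],
    after=["File", "Edit", "Help", "New", "Open", "Save"],
)
print(delta)  # {'revealed': ['New', 'Open', 'Save'], 'hidden': []}
```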

How do I use this for automated accessibility auditing?

Run the agent on every page of your application and collect the accessibility_issues array from each analysis. Common issues the model identifies include missing alt text on images, low contrast text, unlabeled form fields, and tiny click targets. While this does not replace a full WCAG compliance audit, it catches the most impactful issues quickly and can run as part of a CI pipeline on screenshot snapshots.
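A sketch of that CI-style aggregation, assuming you have already collected one analysis per page; the page names and issue strings here are illustrative:

```python
def build_audit_report(page_issues: dict[str, list[str]]) -> dict:
    """Aggregate per-page accessibility_issues into a pass/fail report."""
    flagged = {page: issues for page, issues in page_issues.items() if issues}
    return {
        "pages_scanned": len(page_issues),
        "pages_with_issues": len(flagged),
        "passed": not flagged,  # CI gate: fail the build if any page has issues
        "issues": flagged,
    }

report = build_audit_report({
    "login": ["Submit button has low contrast"],
    "dashboard": [],
    "settings": ["Unlabeled toggle switch", "Icon button missing alt text"],
})
print(report["pages_with_issues"], report["passed"])  # 2 False
```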


#ScreenshotAnalysis #UIDetection #Accessibility #LayoutAnalysis #Python #AgenticAI #LearnAI #AIEngineering

