
OpenAI Vision API: Building Applications That Understand Images

Learn how to use OpenAI's Vision API to analyze images, send base64-encoded and URL-based images, build multi-modal prompts, and create practical image understanding applications.

What Is the Vision API?

OpenAI's Vision API lets you send images alongside text to models like GPT-4o and receive intelligent analysis, descriptions, or data extraction based on the visual content. The model can read text in images, describe scenes, analyze charts, identify objects, compare images, and answer questions about visual content.

This capability unlocks applications that were previously impossible with text-only models: document processing, visual question answering, accessibility tooling, UI analysis, and more.

Sending an Image via URL

The simplest approach is to pass a publicly accessible image URL:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image? Describe it in detail."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg",
                    },
                },
            ],
        },
    ],
)

print(response.choices[0].message.content)

Notice that the content field is now an array of content parts, mixing text and image inputs. This is the multi-modal message format.


Sending Base64-Encoded Images

For local files or dynamically generated images, encode them as base64:

import base64
from openai import OpenAI

client = OpenAI()

def encode_image(image_path: str) -> str:
    with open(image_path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

image_data = encode_image("screenshot.png")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract all text visible in this screenshot."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{image_data}",
                    },
                },
            ],
        },
    ],
)

print(response.choices[0].message.content)

Supported formats include PNG, JPEG, GIF (first frame), and WebP. The data URL must include the correct MIME type.
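Hard-coding the MIME type is error-prone once an application handles mixed formats. A small helper can infer it from the file extension using the standard library; this is a sketch, and the `to_data_url` name and allowed-type set (mirroring the formats listed above) are ours:

```python
import base64
import mimetypes

# Formats the Vision API accepts, per the list above.
_ALLOWED = {"image/png", "image/jpeg", "image/gif", "image/webp"}

def to_data_url(image_path: str) -> str:
    """Encode a local image as a data URL, inferring the MIME type from the extension."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime not in _ALLOWED:
        raise ValueError(f"Unsupported image type for {image_path!r}: {mime}")
    with open(image_path, "rb") as f:
        encoded = base64.standard_b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```

The returned string can be dropped straight into the `url` field shown in the example above.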

Controlling Image Detail Level

The detail parameter controls how the model processes the image:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Is this a cat or a dog?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/pet.jpg",
                        "detail": "low",  # or "high" or "auto"
                    },
                },
            ],
        },
    ],
)
  • low — Uses a fixed 512x512 thumbnail. Fastest and cheapest. Good for simple classification tasks.
  • high — Processes the full-resolution image with multiple crops. Best for reading small text, analyzing details, or complex visual tasks.
  • auto (default) — The model decides based on the image size and content.

Multiple Images in One Request

Send several images for comparison or batch analysis:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two UI designs. Which one has better visual hierarchy and why?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/design_a.png"},
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/design_b.png"},
                },
            ],
        },
    ],
)

print(response.choices[0].message.content)
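When the number of images varies at runtime, it is cleaner to assemble the content parts programmatically. A minimal sketch, where the `image_message` helper is our own convenience function, not part of the SDK:

```python
def image_message(prompt: str, urls: list[str], detail: str = "auto") -> dict:
    """Build one user message: a text part followed by one image part per URL."""
    parts = [{"type": "text", "text": prompt}]
    parts += [
        {"type": "image_url", "image_url": {"url": u, "detail": detail}}
        for u in urls
    ]
    return {"role": "user", "content": parts}
```

You would then pass `messages=[image_message("Compare these designs.", urls)]` to the same `chat.completions.create` call shown above.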

Practical Example: Document Data Extraction

Combine vision with structured outputs to extract data from images of forms, receipts, or documents:


import base64
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class LineItem(BaseModel):
    name: str
    quantity: int
    price: float

class ReceiptData(BaseModel):
    store_name: str
    date: str
    items: list[LineItem]  # strict structured outputs requires a defined schema, not free-form dicts
    subtotal: float
    tax: float
    total: float
    payment_method: str

def extract_receipt(image_path: str) -> ReceiptData:
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Extract all information from this receipt image into structured data.",
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Parse this receipt."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_data}",
                            "detail": "high",
                        },
                    },
                ],
            },
        ],
        response_format=ReceiptData,
    )

    return response.choices[0].message.parsed

receipt = extract_receipt("receipt.jpg")
print(f"Store: {receipt.store_name}")
print(f"Total: ${receipt.total:.2f}")

Building an Accessibility Description Generator

Use vision to create alt text for images automatically:

import base64
from openai import OpenAI

client = OpenAI()

def generate_alt_text(image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Generate concise, descriptive alt text for web accessibility. "
                           "Focus on the key visual content and context. Keep it under 125 characters.",
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Generate alt text for this image."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{image_data}",
                            "detail": "low",
                        },
                    },
                ],
            },
        ],
        max_tokens=100,
    )

    return response.choices[0].message.content

alt = generate_alt_text("hero-banner.png")
print(f'<img src="hero-banner.png" alt="{alt}" />')

FAQ

What is the maximum image size I can send?

OpenAI accepts images up to 20MB each. For base64-encoded images, the encoded string will be approximately 33% larger than the original file. If your image is too large, resize it before sending — the model works well with images in the 1024x1024 to 2048x2048 range.
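Before resizing (with Pillow or any other imaging library), you need the target dimensions. A small pure-Python helper is enough; this sketch assumes a 2048px long-side cap and never upscales:

```python
def fit_within(width: int, height: int, max_side: int = 2048) -> tuple[int, int]:
    """Scale dimensions down (never up) so neither side exceeds max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return round(width * scale), round(height * scale)
```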

How are images counted toward the token limit?

Images consume tokens based on their resolution and detail setting. A low detail image costs a fixed 85 tokens. A high detail image is split into 512x512 tiles, each costing 170 tokens, plus a base 85 tokens. A 2048x2048 high-detail image costs around 765 tokens.
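These rules can be turned into a quick cost estimator. The sketch below follows the tiling math described above (scale to fit within 2048x2048, then shrink the shortest side to 768px, then count 512px tiles); treat the result as an estimate, since exact accounting can vary by model:

```python
import math

def vision_token_cost(width: int, height: int, detail: str = "high") -> int:
    """Estimate input tokens for one image, following the tiling math above."""
    if detail == "low":
        return 85  # flat cost, regardless of resolution
    # High detail: first scale to fit within a 2048x2048 square...
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # ...then shrink so the shortest side is at most 768px...
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # ...then count 512x512 tiles at 170 tokens each, plus the 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles
```

For example, `vision_token_cost(2048, 2048)` reproduces the 765-token figure quoted above (a 768x768 scaled image yields four 512px tiles: 4 × 170 + 85 = 765).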

Can the model generate images or only analyze them?

The Chat Completions API with vision is analysis-only — it understands images but does not create them. For image generation, use the DALL-E API via client.images.generate().

