
Vision-Capable Voice Agents (Property Photos): CallSphere vs Vapi

How CallSphere Real Estate uses GPT-4o vision on buyer-uploaded property photos during voice calls. Vapi is voice-only — what that means in practice.

TL;DR

Vapi is voice-only — no native vision, no image-aware tool, no ability to ground a voice answer in a photo a caller just uploaded. CallSphere ships a vision-capable Property Search specialist in the Real Estate vertical that accepts buyer-uploaded photos via SMS/MMS or web link, runs GPT-4o vision analysis, and feeds structured visual features into the conversation.

This unlocks "find me a kitchen that looks like this one" as a real product, not a vaporware demo.

Why Voice + Vision Together

Most voice AI platforms are text-token-stream-to-audio pipelines. Vision is missing because the original product surface (phone calls) didn't have it. But customer expectations have moved:

  • A buyer texts a Zillow listing, then calls about "the one with the white kitchen"
  • A homeowner snaps a photo of a leaking pipe and calls plumbing dispatch
  • An insurance claimant photographs damage on a roadside call

In all three, the vision artifact is the central context. A voice-only agent has to fall back to "describe the photo to me," which is a worse experience than the human alternative.

Vapi's Vision Story

As of April 2026, Vapi has:

  • No native multimodal input
  • No image upload primitive
  • No vision tool
  • Workaround: send the image to your own backend, run vision externally, return a text description, feed that text to Vapi as context

The workaround works for "describe an image and tell the agent" but loses two things:

  1. Latency — the round-trip to your vision service plus the agent's next turn is 1-2s extra
  2. Grounding — the agent reasons over a text description, not the actual image, so any nuance the description misses is gone forever
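The workaround can be sketched as a thin adapter. This is a hypothetical illustration, not Vapi's API: `describeImage` stands in for whatever external vision service you run, and the returned message is what you would inject into the agent's context as plain text. Everything the description omits is invisible to the agent forever — that is the grounding loss.

```typescript
// Hypothetical sketch of the voice-only workaround: vision runs outside the
// platform, and only a lossy text description ever reaches the agent.
type VisionDescriber = (imageUrl: string) => Promise<string>;

// Stand-in for your external vision call (e.g. your own GPT-4o endpoint).
const describeImage: VisionDescriber = async (imageUrl) =>
  `Photo at ${imageUrl}: white galley kitchen, marble countertops.`;

// Turn the image into a text system message the voice agent can consume.
async function imageToAgentContext(
  imageUrl: string,
): Promise<{ role: string; content: string }> {
  const description = await describeImage(imageUrl);
  return { role: 'system', content: `Caller photo description: ${description}` };
}
```

The extra round-trip through `describeImage` plus the agent's next turn is where the 1-2s of added latency comes from.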

CallSphere Vision Approach

CallSphere's Real Estate Property Search specialist accepts photos via:

  • MMS through Twilio during the call ("text us the photo at this number")
  • Web link entered into a portal ("upload at callsphere.example/upload?call=...")
  • Returning user photo history pulled from Postgres on caller-ID match

The flow:

  1. User says "I want a kitchen like the one I just texted you"
  2. Twilio MMS webhook stores the image in S3, emits a photo_received event tagged with the active call ID
  3. The agent sees a photo_available signal in its context and calls vision_analyze
  4. vision_analyze invokes GPT-4o with the image plus a structured prompt: "Extract: cabinet color, countertop material, layout type, ceiling height estimate, lighting style, square footage estimate"
  5. Returns structured JSON {cabinet_color: "white", countertop: "marble", layout: "galley", ...}
  6. Agent calls search_listings with the structured features as filters
  7. Agent verbally summarizes matches: "I found 4 listings with white cabinets and marble countertops in your search area"
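Step 2 can be sketched as a pure mapping from Twilio's webhook payload to the `photo_received` event. The field names `NumMedia`, `MediaUrl0`, and `From` are Twilio's MMS webhook parameters; the S3 upload and the caller-to-call lookup are stubbed assumptions here.

```typescript
// Sketch: map a Twilio MMS webhook payload to a photo_received event
// tagged with the active call ID (S3 storage and call lookup stubbed).
interface PhotoReceivedEvent {
  type: 'photo_received';
  callId: string;
  photoId: string;
  mediaUrl: string;
}

// Hypothetical lookup: active call ID keyed by caller phone number.
const activeCallByNumber = new Map<string, string>([['+14155550123', 'call_789']]);

function handleMmsWebhook(form: Record<string, string>): PhotoReceivedEvent | null {
  if (Number(form['NumMedia'] ?? '0') < 1) return null; // text-only SMS, no photo
  const mediaUrl = form['MediaUrl0'];
  const callId = activeCallByNumber.get(form['From'] ?? '');
  if (!mediaUrl || !callId) return null; // no media or no live call to tag
  return {
    type: 'photo_received',
    callId,
    photoId: `photo_${Date.now()}`, // stand-in for the S3 object key
    mediaUrl,
  };
}
```

Once the event is emitted, the agent runtime surfaces it as the `photo_available` signal in step 3.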

Tool Schema

export const visionAnalyzeTool = {
  type: 'function' as const,
  name: 'vision_analyze',
  description:
    'Analyze a photo the buyer uploaded during this call. Returns structured ' +
    'features that can be passed to search_listings. Only call after photo_available.',
  parameters: {
    type: 'object',
    properties: {
      photo_id: {
        type: 'string',
        description: 'ID from the photo_available event in conversation context',
      },
      analysis_focus: {
        type: 'string',
        enum: ['kitchen', 'bathroom', 'exterior', 'living_space', 'general'],
        description: 'Hint to the vision model on what features matter most',
      },
    },
    required: ['photo_id', 'analysis_focus'],
  },
};
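On the runtime side, a tool call with this schema has to be routed to an implementation. The registry and handler below are a hypothetical sketch of that plumbing, with the vision result stubbed; the real dispatch mechanism is not shown in this article.

```typescript
// Sketch: route the model's vision_analyze tool call to a handler.
interface VisionAnalyzeArgs {
  photo_id: string;
  analysis_focus: 'kitchen' | 'bathroom' | 'exterior' | 'living_space' | 'general';
}

type ToolHandler = (args: VisionAnalyzeArgs) => Promise<unknown>;

// Hypothetical registry mapping tool names to implementations.
const toolHandlers: Record<string, ToolHandler> = {
  vision_analyze: async ({ photo_id, analysis_focus }) => ({
    photo_id,
    analysis_focus,
    cabinet_color: 'white', // stubbed vision result for illustration
  }),
};

// Tool-call arguments arrive from the model as a JSON string.
async function dispatchToolCall(name: string, rawArgs: string): Promise<unknown> {
  const handler = toolHandlers[name];
  if (!handler) throw new Error(`unknown tool: ${name}`);
  return handler(JSON.parse(rawArgs) as VisionAnalyzeArgs);
}
```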

The Vision Prompt

The prompt the agent ships to GPT-4o is intentionally narrow:

You are a property feature extractor. Given the image and the focus area
({analysis_focus}), return strict JSON with these keys ONLY:

  cabinet_color: string | null
  countertop_material: string | null
  flooring_material: string | null
  layout_type: string | null  // e.g., "galley", "open", "u-shape"
  lighting_style: string | null  // e.g., "pendant", "recessed", "natural"
  estimated_sqft: number | null  // null if not estimable
  notable_features: string[]  // max 5

Do not return prose. Do not add keys. Use null for unknown.

The strict-JSON contract is enforced via OpenAI's structured output. A failure here returns null fields, which the agent handles gracefully ("I could see the kitchen but couldn't make out the countertop material — can you tell me?").
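Even with structured output enforcing the contract server-side, a defensive parse on the client keeps a malformed response from crashing a live call. This is a minimal sketch, assuming the response arrives as a raw JSON string; any failure degrades to all-null fields, which is exactly the state the agent's fallback question handles.

```typescript
// Sketch: coerce a raw vision response into the strict feature contract.
// Parse failures degrade to all-null fields instead of crashing mid-call.
interface PropertyFeatures {
  cabinet_color: string | null;
  countertop_material: string | null;
  flooring_material: string | null;
  layout_type: string | null;
  lighting_style: string | null;
  estimated_sqft: number | null;
  notable_features: string[];
}

const NULL_FEATURES: PropertyFeatures = {
  cabinet_color: null,
  countertop_material: null,
  flooring_material: null,
  layout_type: null,
  lighting_style: null,
  estimated_sqft: null,
  notable_features: [],
};

function parseVisionResponse(raw: string): PropertyFeatures {
  try {
    const data = JSON.parse(raw) as Partial<PropertyFeatures>;
    return {
      ...NULL_FEATURES,
      ...data,
      // Enforce the max-5 cap from the prompt contract.
      notable_features: (data.notable_features ?? []).slice(0, 5),
    };
  } catch {
    return NULL_FEATURES; // malformed output → graceful all-null fallback
  }
}
```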

The structured features become filters:

# Feed the extracted visual features into listing search as filters.
features = await vision_analyze(photo_id, analysis_focus="kitchen")
matches = await search_listings(
    city=ctx.user_filters.city,
    beds=ctx.user_filters.beds,
    feature_filters={
        "kitchen.cabinet_color": features.cabinet_color,
        "kitchen.countertop": features.countertop_material,
    },
    sort_by="visual_similarity",  # ranked by embedding distance to the photo
)

The visual_similarity sort ranks listings by embedding distance to the buyer's photo using a CLIP-style listing image embedding stored on each property record.
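The ranking itself reduces to cosine similarity over stored embeddings. A minimal sketch, using toy 3-dimensional vectors in place of real CLIP-style embeddings:

```typescript
// Sketch of the visual_similarity sort: rank listings by cosine similarity
// between the buyer's photo embedding and each listing's stored embedding.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface Listing {
  id: string;
  embedding: number[]; // CLIP-style image embedding stored on the record
}

function rankByVisualSimilarity(query: number[], listings: Listing[]): Listing[] {
  // Sort a copy, most similar first.
  return [...listings].sort(
    (x, y) =>
      cosineSimilarity(query, y.embedding) - cosineSimilarity(query, x.embedding),
  );
}
```

In production this sort would run in the database (e.g. a vector index) rather than in application code, but the distance metric is the same.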

Vapi vs CallSphere Vision Comparison

| Dimension | Vapi | CallSphere |
| --- | --- | --- |
| Native vision | No | Yes (GPT-4o) |
| Image input channel | Out-of-band, DIY | MMS, web link, history |
| Latency to first vision answer | 1-2s extra (external) | 600-900ms inline |
| Grounding | Text description proxy | Direct image reasoning |
| Structured output | DIY parsing | OpenAI structured output |
| Multi-image conversation | Awkward | Native; agent tracks photo set |
| Privacy | Image touches 2 vendors | Image touches OpenAI only |
| Use case fit | Voice-only | Voice + visual context |

Vision-Enriched Search Flow

sequenceDiagram
    participant Buyer
    participant Twilio
    participant Agent as Property Search Agent
    participant Vision as GPT-4o Vision
    participant DB as Listings DB

    Buyer->>Agent: "I want a kitchen like this"
    Agent->>Buyer: "Text the photo to (415) 555-0123"
    Buyer->>Twilio: MMS with photo
    Twilio->>Agent: photo_received event
    Agent->>Agent: photo_available signal in context
    Agent->>Vision: vision_analyze(photo_id, focus=kitchen)
    Vision-->>Agent: { cabinet_color: "white", countertop: "marble", ... }
    Agent->>DB: search_listings(city, beds, feature_filters)
    DB-->>Agent: 4 matches sorted by visual_similarity
    Agent->>Buyer: "Found 4 with white cabinets, marble counters in your area"
    Buyer->>Agent: "Tell me about the second one"
    Agent->>DB: get_listing_details(id)
    Agent->>Buyer: "1247 Maple Ave, 3 bed 2 bath..."

Other Vertical Use Cases

The vision primitive in CallSphere generalizes:


  • Insurance — claimant texts photo of damage, agent extracts severity, auto-routes to adjuster
  • Healthcare — patient texts photo of rash or wound, triage agent classifies urgency (with PHI controls)
  • Field service — technician texts photo of broken part, dispatch agent identifies SKU and ETA

Each is a thin variant of the Real Estate pattern.
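One way to picture "thin variant" is a per-vertical configuration over the same vision primitive. The structure and names below are hypothetical illustrations, not CallSphere's actual config format: same tool, different `analysis_focus` values and downstream routing targets.

```typescript
// Hypothetical per-vertical config: the vision primitive stays constant,
// only the focus hints and the downstream tool change.
interface VerticalVisionConfig {
  focusValues: string[]; // allowed analysis_focus hints for this vertical
  routeTo: string; // downstream tool fed by the extracted features
}

const visionByVertical: Record<string, VerticalVisionConfig> = {
  real_estate: {
    focusValues: ['kitchen', 'bathroom', 'exterior', 'living_space', 'general'],
    routeTo: 'search_listings',
  },
  insurance: { focusValues: ['damage'], routeTo: 'route_to_adjuster' },
  healthcare: { focusValues: ['skin', 'wound'], routeTo: 'triage_urgency' },
  field_service: { focusValues: ['part'], routeTo: 'identify_sku' },
};
```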

Privacy and Security

  • Photos are stored in tenant-isolated S3 buckets with bucket-level encryption
  • Default retention 30 days, configurable
  • Healthcare deployments use a HIPAA-compliant variant with shorter retention and BAA coverage
  • The agent never narrates the image content beyond what is needed to answer; the full image never enters audio output

FAQ

Does vision_analyze block the conversation?

No — the agent emits filler audio ("let me look at that photo") while the vision call runs. Total perceived gap is ~1s.
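The filler pattern is easy to sketch: only speak the filler line if the vision call is slow enough for the caller to notice. `say` stands in for the runtime's TTS output; the 300ms threshold is an assumption, not a documented CallSphere default.

```typescript
// Sketch: speak a filler line only if the async work outlasts a threshold,
// so fast vision calls produce no unnecessary chatter.
async function withFiller<T>(
  say: (text: string) => void,
  work: Promise<T>,
  thresholdMs = 300,
): Promise<T> {
  const timer = setTimeout(() => say('Let me look at that photo...'), thresholdMs);
  try {
    return await work; // conversation resumes as soon as vision returns
  } finally {
    clearTimeout(timer); // fast path: cancel before the filler ever plays
  }
}
```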

What if the buyer sends a non-property photo (selfie, etc.)?

The structured prompt returns mostly nulls, and the agent gracefully says "that doesn't look like a property photo — can you check?"
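The off-topic check falls out of the strict-JSON contract: if every field is null and no notable features survived, the photo probably is not a property. A minimal sketch of that guard (the helper name is hypothetical):

```typescript
// Sketch: detect a non-property photo from an all-null feature extraction.
function looksLikeProperty(features: Record<string, unknown>): boolean {
  return Object.entries(features).some(([key, value]) =>
    key === 'notable_features'
      ? Array.isArray(value) && value.length > 0
      : value !== null,
  );
}
```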

Can vision be used on the LLM's own outputs?

Yes — for QA, we run a vision pass on screenshots of search results to verify they match the agent's verbal description.

Is multi-image conversation supported?

Yes. The agent tracks a photo set for the call and can compare ("this kitchen vs the one you sent first").

Is this MMS-only, or can it work over WhatsApp?

WhatsApp Business is on the roadmap; SMS/MMS via Twilio is shipping.

See the Vision Demo

The /industries/real-estate page has a working video of the kitchen-photo flow, and /demo lets you trigger it live.

