Technical Guides

Voicemail Detection Accuracy: CallSphere vs Vapi (with Examples)

Voicemail detection accuracy makes or breaks outbound voice AI. CallSphere VoicemailAnalyzerAgent + Twilio AMD vs Vapi defaults. Real call examples included.

TL;DR

Voicemail detection (AMD, Answering Machine Detection) is the single biggest predictor of outbound campaign quality. False negatives (treating a voicemail as a human) burn message budget and look spammy; false positives (treating a human as voicemail) make you look broken. Vapi uses provider-default AMD with limited customization. CallSphere uses a three-stage cascade: Twilio AMD signals, an audio-fingerprint match against known carrier patterns, and a VoicemailAnalyzerAgent built on gpt-4o-mini that listens to the first 4 seconds and confirms voicemail vs human with structured reasoning.

In production traffic from After-Hours dispatch campaigns, the cascade lands at ~96% accuracy vs ~83% for AMD alone.

Why Voicemail Detection Is Hard

The naive heuristic — "wait for the beep" — fails because:

  • People answer with long greetings ("Hello? Hi, this is John, who is this?")
  • Voicemail systems have variable pre-beep delays (1.5s to 8s)
  • Some voicemails skip the beep entirely
  • Mobile carriers compress audio differently
  • Background noise on humans imitates voicemail tone shifts

A single signal source is never enough. Production systems cascade.

Vapi Voicemail Detection Approach

Vapi exposes a config block:

{
  "voicemailDetection": {
    "provider": "twilio",
    "enabled": true,
    "machineDetectionTimeout": 30,
    "machineDetectionSpeechThreshold": 2400,
    "machineDetectionSpeechEndThreshold": 1200,
    "machineDetectionSilenceTimeout": 5000
  }
}

This delegates to Twilio's AMD plus Vapi's own assistant-side hint detection. The thresholds are exposed but the assistant logic is opaque.


Strengths: sane defaults work for most simple use cases.

Weaknesses:

  • No second-pass LLM verification
  • No way to inject domain knowledge ("this customer's voicemail says X")
  • Hard to debug a false-positive
  • Action on detection is binary (leave message / hang up)

CallSphere Voicemail Detection Approach

CallSphere uses a three-stage cascade:

  1. Twilio AMD runs in parallel with the call connect, returning AnsweredBy within ~2-3s
  2. Audio fingerprint — first 1.5s of audio is matched against known voicemail intro patterns (regional carrier specifics)
  3. VoicemailAnalyzerAgent — a gpt-4o-mini agent listens to the first 4 seconds of transcript + audio features and returns {is_voicemail: bool, confidence: float, reasoning: string}

The decision short-circuits when the first two signals agree; when they disagree, the LLM breaks the tie.

Twilio AMD Configuration

client.calls.create(
    to=lead.phone,
    from_=campaign.caller_id,
    url=callback_url,
    machine_detection="DetectMessageEnd",  # waits for greeting end
    async_amd=True,                         # don't block call connect
    async_amd_status_callback=amd_callback_url,
    machine_detection_timeout=30,
    machine_detection_speech_threshold=2400,
    machine_detection_speech_end_threshold=1200,
    machine_detection_silence_timeout=5000,
)

DetectMessageEnd waits for the voicemail greeting to finish — important if you want to leave a message after the beep.
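With async_amd=True, Twilio posts the result to async_amd_status_callback as an AnsweredBy form field. A minimal sketch of normalizing that field into the coarse signals the cascade consumes — the field values are Twilio's documented AnsweredBy set; the mapping and the normalize_amd name are ours for illustration:

```python
# Map Twilio's documented AnsweredBy values onto the coarse signals
# the cascade consumes. The "machine_end_*" values only appear when
# machine_detection="DetectMessageEnd" is set on the call.
AMD_SIGNALS = {
    "human": "human",
    "machine_start": "machine_start",
    "machine_end_beep": "machine_start",
    "machine_end_silence": "machine_start",
    "machine_end_other": "machine_start",
    "fax": "machine_start",
    "unknown": "ambiguous",
}

def normalize_amd(answered_by: str) -> str:
    """Collapse Twilio's AnsweredBy field into human / machine_start / ambiguous."""
    return AMD_SIGNALS.get(answered_by, "ambiguous")
```

Treating "unknown" (and anything unexpected) as ambiguous is what routes those calls toward the LLM second pass rather than a premature verdict.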

VoicemailAnalyzerAgent

The second-pass agent is intentionally cheap (gpt-4o-mini) and structured:

voicemail_analyzer = Agent(
    name="VoicemailAnalyzerAgent",
    model="gpt-4o-mini",
    instructions="""You analyze the first 4 seconds of an outbound call.
    Return strict JSON.

    Voicemail signals:
    - "You've reached the voicemail of..."
    - "I'm not available right now..."
    - "Please leave a message after the tone"
    - Long uninterrupted single voice >3s
    - "Please record your message"

    Human signals:
    - Question response: "Hello?" "Who is this?"
    - Short utterance under 2s with rising intonation
    - Background noise + brief greeting
    - Conversational hesitation: "Uh, hi?"

    Return: {"is_voicemail": bool, "confidence": 0.0-1.0, "reasoning": "..."}
    """,
    output_type=VoicemailVerdict,
)
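The output_type=VoicemailVerdict referenced above isn't defined in the article. A minimal sketch of the shape it implies, written here as a plain dataclass (a Pydantic model would serve the same role if your agent framework expects one):

```python
from dataclasses import dataclass

@dataclass
class VoicemailVerdict:
    """Structured verdict returned by the VoicemailAnalyzerAgent second pass."""
    is_voicemail: bool
    confidence: float  # 0.0 - 1.0
    reasoning: str

    def __post_init__(self) -> None:
        # Reject out-of-range confidence at parse time, not downstream
        # in the cascade's threshold check.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.confidence}")
```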

Cascade Logic

async def detect_voicemail(call: OutboundCall) -> Verdict:
    twilio_signal = await call.amd_signal_within(2.5)
    audio_fingerprint = await call.audio_fingerprint_first_1500ms()

    if twilio_signal == "machine_start" and audio_fingerprint.match_voicemail:
        return Verdict.VOICEMAIL  # high confidence, skip LLM

    if twilio_signal == "human" and audio_fingerprint.match_human:
        return Verdict.HUMAN  # high confidence, skip LLM

    # Ambiguous — escalate to LLM
    transcript = await call.transcript_first_4s()
    audio_features = await call.audio_features_first_4s()
    verdict = await voicemail_analyzer.run({
        "transcript": transcript,
        "audio_features": audio_features.dict(),
        "twilio_amd": twilio_signal,
    })

    if verdict.confidence < 0.65:
        return Verdict.UNCERTAIN  # treat as human, log for review

    return Verdict.VOICEMAIL if verdict.is_voicemail else Verdict.HUMAN

The cascade only invokes the LLM (~$0.0002/call) when Twilio + fingerprint disagree, which is roughly 12% of calls. Net cost overhead is negligible.
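A quick sanity check on that cost claim, using the ~12% escalation rate and ~$0.0002 per LLM call quoted above:

```python
LLM_COST_PER_CALL = 0.0002  # gpt-4o-mini second pass, per the figures above
ESCALATION_RATE = 0.12      # share of calls where AMD and fingerprint disagree

# Blended overhead per outbound call: only escalated calls pay for the LLM.
blended = ESCALATION_RATE * LLM_COST_PER_CALL
print(f"${blended:.6f} per call")  # → $0.000024 per call
```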

Real-World Examples

Three calls from a recent campaign (sanitized):


Call A — clear voicemail

  • Audio: "You've reached Sarah Williams. I'm not available..."
  • Twilio AMD: machine_start
  • Fingerprint: voicemail match
  • LLM: not invoked
  • Verdict: VOICEMAIL ✓

Call B — ambiguous

  • Audio: "Hello, this is the answering service for Dr. Patel's office, please wait..."
  • Twilio AMD: human (a false negative: the "this is" framing reads like a live greeting)
  • Fingerprint: weak voicemail
  • LLM verdict: {is_voicemail: true, confidence: 0.81, reasoning: "answering service phrasing"}
  • Final: VOICEMAIL ✓ (cascade saved a wasted message)

Call C — long human greeting

  • Audio: "Hi! I'm so glad you called. Just one second, let me find a quieter spot..."
  • Twilio AMD: machine_start (false positive due to length)
  • Fingerprint: weak human
  • LLM verdict: {is_voicemail: false, confidence: 0.92, reasoning: "second person address, conversational"}
  • Final: HUMAN ✓ (cascade saved an awkward "leave a message")

Vapi vs CallSphere Voicemail Detection Comparison

| Metric | Vapi | CallSphere |
| --- | --- | --- |
| Detection signals | Twilio AMD + provider hints | Twilio AMD + audio fingerprint + LLM |
| LLM second pass | No | Yes (gpt-4o-mini) |
| Production accuracy (campaign) | ~83% | ~96% |
| Cost per detection | Bundled | +$0.0002 LLM cost on ambiguous calls |
| Custom voicemail rules | Limited | Full LLM prompt + fingerprint config |
| Action on detection | Leave message or hang up | Leave message, hang up, retry tomorrow, send SMS |
| Inspectability | Vapi log | Per-call cascade trace + reasoning |

Detection Cascade Diagram

graph TD
    Start[Outbound call connects] --> Twilio[Twilio AMD<br/>2.5s window]
    Start --> FP[Audio fingerprint<br/>1.5s window]
    Twilio --> Agree{Both agree?}
    FP --> Agree
    Agree -->|yes voicemail| VM[Verdict: VOICEMAIL<br/>cost: $0]
    Agree -->|yes human| H[Verdict: HUMAN<br/>cost: $0]
    Agree -->|disagree| LLM[VoicemailAnalyzerAgent<br/>gpt-4o-mini, 4s transcript]
    LLM --> Conf{conf > 0.65?}
    Conf -->|yes voicemail| VM2[Verdict: VOICEMAIL]
    Conf -->|yes human| H2[Verdict: HUMAN]
    Conf -->|no| U[Verdict: UNCERTAIN<br/>treat as human, log]
    VM --> Action{Leave msg?}
    VM2 --> Action
    Action -->|yes| Beep[Wait beep, deliver SMS-ready msg]
    Action -->|no| Hangup[Hang up, retry tomorrow]
    H --> Live[Run human conversation flow]
    H2 --> Live
    U --> Live

Practical Tips

  • Cascade > single signal. Always.
  • Use DetectMessageEnd, not Enable. Enable returns as soon as a machine is suspected, before the greeting finishes, so you start talking over the greeting instead of after the beep.
  • Log the LLM reasoning. When detection disagrees with reality, the reasoning tells you what to fix.
  • Per-region tuning. Audio fingerprints differ by carrier and region; ship a per-region config map.
  • Recheck weekly. Voicemail patterns drift as carriers update prompts.
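The per-region tip can be as simple as a config map with a default fallback. The names and values below are illustrative, not CallSphere's actual config schema:

```python
# Hypothetical per-region fingerprint tuning; values are illustrative.
REGION_FINGERPRINT_CONFIG = {
    "default": {"match_threshold": 0.80, "window_ms": 1500},
    "us-mobile": {"match_threshold": 0.75, "window_ms": 1500},  # heavier codec compression
    "uk": {"match_threshold": 0.85, "window_ms": 2000},         # longer carrier preambles
}

def fingerprint_config(region: str) -> dict:
    """Look up region-specific fingerprint tuning, falling back to the default."""
    return REGION_FINGERPRINT_CONFIG.get(region, REGION_FINGERPRINT_CONFIG["default"])
```

Keeping the fallback in the same map makes the weekly recheck a config diff rather than a code change.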

FAQ

Does the LLM second pass slow down the call?

Slightly — about 250-400ms on top of Twilio's 2.5s window. For outbound, this is invisible because the agent isn't speaking yet.

Can I customize the voicemail message left?

Yes — CallSphere After-Hours flows include a per-campaign voicemail script tool, so the left message reflects the call purpose.

What is the inbound counterpart?

Inbound rarely needs voicemail detection (the user is calling you), but the same cascade detects "you have reached an answering service for X" loops if you transfer.

How often does the LLM disagree with Twilio?

About 12% of ambiguous cases land on the LLM, of which ~30% flip the verdict. Net: ~3.5% of all calls have their verdict corrected by the LLM second pass.

What about regional/non-English voicemail?

The LLM prompt is multilingual; we ship Spanish-language voicemail patterns by default and add per-region configs as needed.

See It Live

The /features page lists per-vertical voicemail handling, and /demo includes an outbound test that triggers the full cascade you can inspect.

