Speech-to-Text Confidence Thresholds for Production Voice Bots
ASR confidence scores are noisy but usable once calibrated. Here are the 2026 patterns for threshold tuning and confidence-driven UX in voice bots.
What ASR Confidence Is
Production ASR engines (Deepgram, Whisper, AssemblyAI, OpenAI Realtime) emit per-word and per-utterance confidence scores. These are noisy approximations of "is this transcription right?" Tuned correctly, they drive better voice-bot UX; tuned poorly, they cause clarification loops and frustrated callers.
This piece walks through the 2026 patterns for using ASR confidence well.
Where Confidence Comes From
```mermaid
flowchart LR
  Audio[Audio chunks] --> Model[ASR model]
  Model --> Tokens[Token probs]
  Tokens --> Word[Word confidence]
  Tokens --> Utt[Utterance confidence]
```
Confidence is derived from token-level probabilities. Different providers compute and expose it differently. Calibration varies.
Three Confidence Signals to Use
- Per-word confidence: spot specific terms the ASR is unsure about (names, codes)
- Utterance confidence: overall reliability of the transcript
- Repeated-listen agreement: run the same audio through a second pass or a second model and check whether high-stakes words come back the same (rare but useful)
Most production teams use per-word and utterance confidence; repeated-listen is reserved for high-stakes turns.
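As a concrete sketch, here is how the first two signals might be pulled from a provider-agnostic result dict. The field names and shape are assumptions for illustration; Deepgram, AssemblyAI, and others nest these fields differently in their real responses.

```python
# Hypothetical provider-agnostic transcript result; real provider
# responses use different field names and nesting.
result = {
    "transcript": "my name is cassandra",
    "confidence": 0.91,  # utterance-level
    "words": [
        {"word": "my", "confidence": 0.98},
        {"word": "name", "confidence": 0.97},
        {"word": "is", "confidence": 0.99},
        {"word": "cassandra", "confidence": 0.62},  # proper noun: low
    ],
}

def confidence_signals(result: dict) -> tuple[float, list[tuple[str, float]]]:
    """Return (utterance_confidence, [(word, word_confidence), ...])."""
    words = [(w["word"], w["confidence"]) for w in result["words"]]
    return result["confidence"], words

utt, words = confidence_signals(result)
low = [w for w, c in words if c < 0.70]  # flags "cassandra"
```

Note how a high utterance score (0.91) can coexist with a low-confidence proper noun, which is exactly why both signals matter.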
Thresholds That Work
Calibrated thresholds for typical telephony audio (varies by provider):
- Above 0.85: trust the transcript, proceed
- 0.70-0.85: use but verify high-stakes pieces (names, account numbers)
- 0.50-0.70: ask for confirmation
- Below 0.50: ask for clarification or repeat
These are starting points. Tune to your audio quality and risk tolerance.
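These bands translate directly into a routing function. A minimal sketch, assuming the starting-point cutoffs above; the action names are illustrative, not a standard:

```python
def route(confidence: float) -> str:
    """Map utterance confidence to an action.

    Cutoffs are the starting points from the table above; tune per
    provider and audio channel.
    """
    if confidence >= 0.85:
        return "proceed"              # trust the transcript
    if confidence >= 0.70:
        return "verify_high_stakes"   # use, but confirm names/numbers
    if confidence >= 0.50:
        return "confirm"              # ask for confirmation
    return "clarify"                  # ask user to repeat

route(0.92)  # -> "proceed"
route(0.40)  # -> "clarify"
```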
What to Do With Low Confidence
```mermaid
flowchart TD
  Low[Low confidence] --> Stake{"What was the audio?"}
  Stake -->|Casual chat| Proceed[Proceed best-effort]
  Stake -->|Name / ID| Verify[Read it back, ask to confirm]
  Stake -->|Money / dates| Verify
  Stake -->|Long sentence| Ask[Ask user to repeat]
```
The right action depends on what the audio was supposed to convey.
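The flowchart above can be sketched as a decision function. The `field_kind` values here are an assumed taxonomy for illustration, not a standard:

```python
def next_action(confidence: float, field_kind: str) -> str:
    """Pick an action from confidence plus what the turn was meant
    to convey. `field_kind` values are illustrative assumptions."""
    if confidence >= 0.85:
        return "proceed"
    if field_kind in {"name", "id", "money", "date"}:
        return "read_back"            # verify before it propagates
    if field_kind == "long_sentence":
        return "ask_repeat"
    return "proceed_best_effort"      # casual chat

next_action(0.55, "name")   # -> "read_back"
next_action(0.55, "chat")   # -> "proceed_best_effort"
```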
Read-Back Patterns
For names, account numbers, and dates, read-back is the standard 2026 pattern:
- "I heard your account number as 4-7-2-1-9. Is that right?"
- "I caught the name as Cassandra; correct?"
- "Just to confirm, that's Tuesday the seventh at 2 pm?"
Read-back catches errors before they propagate to bookings, payments, or records.
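For digit strings, the read-back prompt can be generated mechanically. This helper is a hypothetical sketch, not a library API:

```python
def read_back_digits(label: str, value: str) -> str:
    """Build a digit-by-digit read-back prompt for codes and
    account numbers (hypothetical helper)."""
    spoken = "-".join(value)
    return f"I heard your {label} as {spoken}. Is that right?"

read_back_digits("account number", "47219")
# -> "I heard your account number as 4-7-2-1-9. Is that right?"
```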
Tuning by Audio Quality
```mermaid
flowchart TB
  Audio[Audio quality] --> Q1["Studio: thresholds high"]
  Audio --> Q2["Cell phone: thresholds mid"]
  Audio --> Q3["Drive-thru: thresholds low"]
```
Audio quality affects what counts as "high" confidence. Drive-thru audio at 0.6 may be the best you can get reliably; treat 0.6 as your "trust" level.
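One way to encode this is a per-channel threshold table. The numbers below are illustrative placeholders; calibrate them on your own audio:

```python
# Illustrative per-channel cutoffs; calibrate per deployment.
THRESHOLDS = {
    "studio":     {"trust": 0.90, "confirm": 0.70},
    "cell":       {"trust": 0.85, "confirm": 0.60},
    "drive_thru": {"trust": 0.60, "confirm": 0.45},
}

def trusted(confidence: float, channel: str) -> bool:
    """Is this confidence above the 'trust' bar for the channel?"""
    return confidence >= THRESHOLDS[channel]["trust"]

trusted(0.65, "drive_thru")  # -> True: fine for drive-thru audio
trusted(0.65, "studio")      # -> False: suspicious for studio audio
```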
Word-Level vs Utterance-Level
Use both:
- Utterance-level: is the whole turn reliable?
- Word-level: are specific high-stakes words reliable?
A high-utterance-confidence transcript with a low-confidence specific word ("Margaret" vs "Marguerite") still needs read-back of that word.
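Combining the two levels is a short filter. `high_stakes` here is an assumed set of slot words that your dialogue state marks as critical:

```python
def words_needing_read_back(
    words: list[tuple[str, float]],
    high_stakes: set[str],
    floor: float = 0.80,
) -> list[str]:
    """Flag high-stakes words for read-back even when the overall
    utterance confidence is high. `high_stakes` is an assumed set
    supplied by dialogue state."""
    return [w for w, c in words if w in high_stakes and c < floor]

words_needing_read_back(
    [("margaret", 0.55), ("on", 0.99), ("tuesday", 0.91)],
    high_stakes={"margaret", "tuesday"},
)  # -> ["margaret"]
```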
Custom Vocabularies
Names, product names, and domain terms have low default confidence because they are out-of-distribution. Most ASR providers support:
- Custom vocabulary lists
- Phonetic hints
- Custom-fine-tuned models for domain audio
Investing in vocabulary tuning lifts confidence on the words that matter most.
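As an illustration, keyword boosting is typically passed as request options. The payload below mirrors Deepgram-style `keyword:boost` pairs, but the field names and model name are assumptions; check your provider's request schema:

```python
# Hypothetical ASR request options; field names and the model name
# are illustrative, not a specific provider's schema.
request = {
    "model": "telephony-general",            # assumed model name
    "language": "en-US",
    "keywords": ["Marguerite:2", "CallSphere:3"],  # term:boost pairs
}
```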
Pre-Filtering Bad Audio
Before ASR, filter:
- Silence / no-speech (do not waste compute)
- Very short utterances ("uh huh": still transcribe these, but score and route them appropriately rather than treating them as full turns)
- Hold music, background TV
- Cross-talk
Each saves cost and reduces low-confidence noise.
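A minimal pre-filter sketch using raw RMS energy. A real pipeline would use a proper VAD model (e.g. Silero or WebRTC VAD) instead; the thresholds here are illustrative:

```python
import math

def is_speechlike(
    samples: list[float],
    rms_floor: float = 0.01,
    min_samples: int = 1600,  # ~0.1 s at 16 kHz
) -> bool:
    """Cheap gate: drop silent or too-short chunks before ASR.
    Thresholds are illustrative; use a real VAD in production."""
    if len(samples) < min_samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= rms_floor

is_speechlike([0.0] * 2000)  # -> False: silence, skip ASR
is_speechlike([0.1] * 2000)  # -> True: send to ASR
```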
Confidence in Multilingual Settings
Bilingual or accented speech often produces lower confidence. Patterns:
- Detect language up front; route to language-tuned ASR
- For code-switching, use a provider whose models handle mixed-language audio well (Whisper's multilingual models are among the stronger options)
- Adjust thresholds per language; the same threshold may not be right for English and Spanish
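Per-language cutoffs can live in the same kind of table as per-channel ones. The values below are placeholders to calibrate per deployment:

```python
# Illustrative per-language trust cutoffs; the same numeric threshold
# rarely behaves identically across languages or accents.
LANG_TRUST = {"en": 0.85, "es": 0.80, "default": 0.75}

def trust_cutoff(lang: str) -> float:
    """Look up the trust cutoff for a language, with a fallback."""
    return LANG_TRUST.get(lang, LANG_TRUST["default"])

trust_cutoff("es")  # -> 0.80
trust_cutoff("fr")  # -> 0.75 (fallback)
```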
What CallSphere Tracks
In production, we monitor:
- Average and p95 ASR confidence per call
- Read-back rate
- Read-back correction rate (how often the read-back value turns out to be wrong and the caller corrects it)
- Escalation rate from low-confidence audio
Each one is a leading indicator of UX quality.
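The p95 metric is worth computing carefully; a nearest-rank sketch (one reasonable definition among several):

```python
import math

def p95(values: list[float]) -> float:
    """p95 via the nearest-rank method; enough precision for
    per-call confidence dashboards."""
    ranked = sorted(values)
    idx = max(0, math.ceil(0.95 * len(ranked)) - 1)
    return ranked[idx]
```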