
Speech-to-Text for AI Agents: Comparing Whisper, Deepgram, and AssemblyAI

A practical comparison of the three leading STT engines for voice AI agents — OpenAI Whisper, Deepgram, and AssemblyAI — covering accuracy, latency, streaming capabilities, language support, and pricing.

Why STT Choice Matters for Voice Agents

The speech-to-text engine is the entry point for every voice AI agent. If transcription is slow, the entire pipeline stalls. If it is inaccurate, the language model receives garbled input and produces irrelevant responses. Choosing the right STT provider is one of the most consequential decisions in voice agent architecture.

This guide compares three production-grade options: OpenAI Whisper (self-hosted), Deepgram Nova, and AssemblyAI Universal. Each excels in different scenarios.

OpenAI Whisper: The Open-Source Powerhouse

Whisper is an open-source model from OpenAI trained on 680,000 hours of multilingual audio. It runs locally or via the OpenAI API; self-hosting gives you full control over cost and privacy.
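If you only need hosted transcription, the API route requires no GPU at all. A minimal sketch, assuming the official openai Python client and an API key in the OPENAI_API_KEY environment variable:

from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

with open("call_recording.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",   # Hosted Whisper
        file=audio_file,
        language="en",
    )

print(transcript.text)

For self-hosted use, a thin wrapper around the open-source package covers both file and raw-array transcription: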

import whisper
import numpy as np

class WhisperSTT:
    def __init__(self, model_size: str = "base"):
        # Model sizes: tiny, base, small, medium, large-v3
        self.model = whisper.load_model(model_size)

    def transcribe_file(self, audio_path: str) -> dict:
        result = self.model.transcribe(
            audio_path,
            language="en",
            fp16=True,           # Use half precision on GPU
            condition_on_previous_text=True,
        )
        return {
            "text": result["text"],
            "segments": result["segments"],
            "language": result["language"],
        }

    def transcribe_array(self, audio_array: np.ndarray) -> str:
        """Transcribe raw audio from a NumPy array (16 kHz mono, float32 in [-1, 1])."""
        result = self.model.transcribe(audio_array)
        return result["text"]

# Usage
stt = WhisperSTT("small")
result = stt.transcribe_file("call_recording.wav")
print(result["text"])

Strengths: Free when self-hosted, excellent accuracy on clean audio, supports 99 languages, full data privacy. Weaknesses: No native streaming support (batch-only), requires GPU for real-time performance, higher latency than cloud APIs.
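Teams often work around the batch-only limitation by buffering audio into short windows and transcribing each window as it fills. A rough sketch of that pattern, reusing the WhisperSTT class above and assuming 16 kHz float32 mono frames; note that words can be cut at window boundaries, so production systems usually segment on VAD-detected silence instead:

import numpy as np

def pseudo_stream(stt: WhisperSTT, frames, window_seconds: float = 5.0, sample_rate: int = 16000):
    """Yield a transcript per fixed-size window of buffered audio frames."""
    buffer = np.zeros(0, dtype=np.float32)
    window = int(window_seconds * sample_rate)
    for frame in frames:  # e.g. chunks from a microphone callback
        buffer = np.concatenate([buffer, frame])
        while len(buffer) >= window:
            chunk, buffer = buffer[:window], buffer[window:]
            yield stt.transcribe_array(chunk)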


For real-time agents, you can use faster-whisper, a CTranslate2 reimplementation that runs up to 4x faster than the reference implementation at the same accuracy:

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.wav", beam_size=5, vad_filter=True)

for segment in segments:
    print(f"[{segment.start:.1f}s -> {segment.end:.1f}s] {segment.text}")

Deepgram Nova: Built for Real-Time

Deepgram Nova-2 is purpose-built for low-latency streaming transcription. It consistently achieves the fastest time-to-first-transcript among cloud providers.

from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions
import asyncio

class DeepgramSTT:
    def __init__(self, api_key: str):
        self.client = DeepgramClient(api_key)

    async def stream_microphone(self, callback):
        connection = self.client.listen.asynclive.v("1")

        # The SDK calls handlers with the connection as the first argument,
        # so `self` here is the live connection, not DeepgramSTT.
        async def on_transcript(self, result, **kwargs):
            alt = result.channel.alternatives[0]
            if alt.transcript:
                callback(
                    text=alt.transcript,
                    is_final=result.is_final,
                    confidence=alt.confidence,
                    words=alt.words,
                )

        connection.on(LiveTranscriptionEvents.Transcript, on_transcript)

        options = LiveOptions(
            model="nova-2",
            language="en-US",
            encoding="linear16",     # Raw 16-bit PCM (required for unencoded audio)
            sample_rate=16000,
            smart_format=True,       # Auto punctuation and formatting
            diarize=True,            # Speaker identification
            interim_results=True,    # Emit partial transcripts as they form
            endpointing=300,         # ms of silence before finalizing a segment
            filler_words=False,      # Remove "um", "uh"
            utterance_end_ms=1000,   # UtteranceEnd event after 1s of silence
        )

        await connection.start(options)
        return connection

# Usage
stt = DeepgramSTT("your-api-key")

def handle_transcript(text, is_final, confidence, words):
    prefix = "FINAL" if is_final else "INTERIM"
    print(f"[{prefix}] ({confidence:.2f}) {text}")

async def main():
    connection = await stt.stream_microphone(handle_transcript)
    # Keep the event loop alive and feed audio into the connection
    # (see the microphone pump sketch below).
    await pump_microphone(connection)

asyncio.run(main())
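Note that stream_microphone opens the connection but does not capture or send any audio itself. The pump is a few lines with pyaudio; a minimal sketch, assuming 16 kHz linear16 microphone input and the v3 SDK's send/finish methods (chunk sizing here is illustrative):

import pyaudio

async def pump_microphone(connection, seconds: float = 30.0):
    pa = pyaudio.PyAudio()
    stream = pa.open(
        format=pyaudio.paInt16,  # linear16, matching the LiveOptions encoding
        channels=1,
        rate=16000,
        input=True,
        frames_per_buffer=1600,  # 100 ms of audio per read
    )
    try:
        for _ in range(int(seconds * 10)):  # 10 chunks per second
            await connection.send(stream.read(1600, exception_on_overflow=False))
    finally:
        stream.stop_stream()
        stream.close()
        pa.terminate()
        await connection.finish()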

Strengths: Sub-200ms streaming latency, built-in diarization, smart formatting, excellent for real-time agents. Weaknesses: Cloud-only (no self-hosted option), cost scales with usage.

AssemblyAI Universal: Best-in-Class Accuracy

AssemblyAI Universal-2 leads accuracy benchmarks, especially on noisy audio, accented speech, and domain-specific vocabulary.

import assemblyai as aai

class AssemblyAISTT:
    def __init__(self, api_key: str):
        aai.settings.api_key = api_key

    def transcribe_with_analysis(self, audio_url: str) -> dict:
        config = aai.TranscriptionConfig(
            speech_model=aai.SpeechModel.best,
            speaker_labels=True,
            auto_highlights=True,
            sentiment_analysis=True,
            entity_detection=True,
        )

        transcriber = aai.Transcriber()
        transcript = transcriber.transcribe(audio_url, config=config)

        return {
            "text": transcript.text,
            "utterances": [
                {"speaker": u.speaker, "text": u.text}
                for u in transcript.utterances
            ],
            "sentiment": transcript.sentiment_analysis,
            "entities": transcript.entities,
        }

    def stream_realtime(self, on_data):
        # Streams 16 kHz PCM over a websocket; on_data receives both
        # interim and final transcripts as they arrive.
        transcriber = aai.RealtimeTranscriber(
            sample_rate=16000,
            on_data=on_data,
            on_error=lambda e: print(f"Error: {e}"),
        )
        transcriber.connect()
        return transcriber
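On the streaming side, the on_data callback receives both interim and final transcripts. A brief usage sketch, assuming the SDK's RealtimeFinalTranscript type and the extras microphone helper (which requires pyaudio, via pip install "assemblyai[extras]"):

def print_transcript(transcript: aai.RealtimeTranscript):
    if not transcript.text:
        return
    if isinstance(transcript, aai.RealtimeFinalTranscript):
        print(f"[FINAL] {transcript.text}")
    else:
        print(f"[interim] {transcript.text}")

stt = AssemblyAISTT("your-api-key")
transcriber = stt.stream_realtime(print_transcript)
transcriber.stream(aai.extras.MicrophoneStream(sample_rate=16000))  # Blocks until interrupted
transcriber.close()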

Strengths: Highest accuracy on difficult audio, built-in NLU features (sentiment, entity detection, summarization), excellent streaming. Weaknesses: Higher per-minute pricing, fewer language options than Whisper.


Comparison Matrix

| Feature | Whisper (self-hosted) | Deepgram Nova-2 | AssemblyAI Universal-2 |
|---|---|---|---|
| Streaming | No (batch only) | Yes (sub-200ms) | Yes (sub-300ms) |
| WER (clean audio) | ~5% | ~6% | ~4.5% |
| Languages | 99 | 36 | 20+ |
| Self-hosted | Yes | No | No |
| Diarization | No (needs addon) | Built-in | Built-in |
| Price | Free (GPU cost) | $0.0043/min | $0.0062/min |
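To make the price row concrete: at 10,000 transcribed minutes per month, Deepgram works out to roughly $43 and AssemblyAI to about $62, while self-hosted Whisper's bill is whatever your GPU time costs, which pays off mainly at volumes high enough to keep the hardware busy.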

Choosing the Right Engine

For real-time voice agents where latency is critical, Deepgram Nova-2 is the strongest choice. For offline processing or when data privacy is paramount, self-hosted Whisper with faster-whisper gives you full control. For high-accuracy scenarios with challenging audio (call centers, medical transcription), AssemblyAI leads on accuracy benchmarks.

FAQ

Can I combine multiple STT engines for better results?

Yes, a common production pattern is to use Deepgram for real-time streaming during the conversation (optimizing for speed) and then re-transcribe the full recording with AssemblyAI or Whisper large-v3 afterward for analytics and compliance. This gives you the best of both worlds.
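A condensed sketch of that dual-pass pattern, reusing the pieces above; stream_with_deepgram is a hypothetical wrapper around the live connection from the Deepgram section:

from faster_whisper import WhisperModel

archive_model = WhisperModel("large-v3", device="cuda", compute_type="float16")

async def handle_call(audio_source, recording_path: str):
    # Pass 1: low-latency interim results drive the live conversation.
    await stream_with_deepgram(audio_source)  # hypothetical wrapper, see above

    # Pass 2: after hangup, re-transcribe the full recording at maximum
    # accuracy for analytics and compliance archiving.
    segments, _ = archive_model.transcribe(recording_path, beam_size=5)
    return " ".join(segment.text.strip() for segment in segments)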

How do I handle background noise and accents?

All three engines handle moderate noise well, but preprocessing helps. Apply noise reduction before sending audio to the STT engine. For accents, AssemblyAI consistently performs best. You can also fine-tune Whisper on domain-specific audio data to improve accuracy for your specific use case.
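For the noise-reduction step, a spectral-gating library such as noisereduce is a quick option; a sketch, assuming a WAV file readable by soundfile:

import noisereduce as nr
import soundfile as sf

audio, sample_rate = sf.read("noisy_call.wav")
cleaned = nr.reduce_noise(y=audio, sr=sample_rate)  # Spectral gating
sf.write("cleaned_call.wav", cleaned, sample_rate)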

What sample rate and format should I send audio in?

For all three providers, 16kHz mono PCM (linear16) is the standard. Higher sample rates like 48kHz do not improve accuracy and waste bandwidth. If your source audio is stereo, mix it to mono before sending.
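The downmix and resample are a few lines with NumPy and SciPy; a sketch assuming 48 kHz stereo source audio:

import soundfile as sf
from scipy.signal import resample_poly

audio, rate = sf.read("source_48k_stereo.wav")        # shape: (samples, 2)
mono = audio.mean(axis=1)                             # Stereo -> mono
audio_16k = resample_poly(mono, up=16000, down=rate)  # 48 kHz -> 16 kHz
sf.write("ready_for_stt.wav", audio_16k, 16000)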


