
Speech-to-Text for AI Agents: Comparing Whisper, Deepgram, and AssemblyAI

A practical comparison of the three leading STT engines for voice AI agents — OpenAI Whisper, Deepgram, and AssemblyAI — covering accuracy, latency, streaming capabilities, language support, and pricing.

Why STT Choice Matters for Voice Agents

The speech-to-text engine is the entry point for every voice AI agent. If transcription is slow, the entire pipeline stalls. If it is inaccurate, the language model receives garbled input and produces irrelevant responses. Choosing the right STT provider is one of the most consequential decisions in voice agent architecture.

This guide compares three production-grade options: OpenAI Whisper (self-hosted), Deepgram Nova, and AssemblyAI Universal. Each excels in different scenarios.

OpenAI Whisper: The Open-Source Powerhouse

Whisper is an open-source model from OpenAI trained on 680,000 hours of multilingual audio. It runs locally or via the OpenAI API; self-hosting gives you full control over cost and privacy.
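If you only need hosted transcription, the API route requires no GPU at all. A minimal sketch, assuming the official openai Python client and an API key in the OPENAI_API_KEY environment variable:

from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

with open("call_recording.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",   # Hosted Whisper
        file=audio_file,
        language="en",
    )

print(transcript.text)

For self-hosted use, a thin wrapper around the open-source package covers both file and raw-array transcription: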

import whisper
import numpy as np

class WhisperSTT:
    def __init__(self, model_size: str = "base"):
        # Model sizes: tiny, base, small, medium, large-v3
        self.model = whisper.load_model(model_size)

    def transcribe_file(self, audio_path: str) -> dict:
        result = self.model.transcribe(
            audio_path,
            language="en",
            fp16=True,           # Use half precision on GPU
            condition_on_previous_text=True,
        )
        return {
            "text": result["text"],
            "segments": result["segments"],
            "language": result["language"],
        }

    def transcribe_array(self, audio_array: np.ndarray) -> str:
        """Transcribe raw audio from a NumPy array (16 kHz mono, float32 in [-1, 1])."""
        result = self.model.transcribe(audio_array)
        return result["text"]

# Usage
stt = WhisperSTT("small")
result = stt.transcribe_file("call_recording.wav")
print(result["text"])

Strengths: Free when self-hosted, excellent accuracy on clean audio, supports 99 languages, full data privacy. Weaknesses: No native streaming support (batch-only), requires GPU for real-time performance, higher latency than cloud APIs.
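Teams often work around the batch-only limitation by buffering audio into short windows and transcribing each window as it fills. A rough sketch of that pattern, reusing the WhisperSTT class above and assuming 16 kHz float32 mono frames; note that words can be cut at window boundaries, so production systems usually segment on VAD-detected silence instead:

import numpy as np

def pseudo_stream(stt: WhisperSTT, frames, window_seconds: float = 5.0, sample_rate: int = 16000):
    """Yield a transcript per fixed-size window of buffered audio frames."""
    buffer = np.zeros(0, dtype=np.float32)
    window = int(window_seconds * sample_rate)
    for frame in frames:  # e.g. chunks from a microphone callback
        buffer = np.concatenate([buffer, frame])
        while len(buffer) >= window:
            chunk, buffer = buffer[:window], buffer[window:]
            yield stt.transcribe_array(chunk)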


For real-time agents, you can use faster-whisper, a CTranslate2 reimplementation that runs up to 4x faster than the reference implementation at the same accuracy:

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.wav", beam_size=5, vad_filter=True)

for segment in segments:
    print(f"[{segment.start:.1f}s -> {segment.end:.1f}s] {segment.text}")

Deepgram Nova: Built for Real-Time

Deepgram Nova-2 is purpose-built for low-latency streaming transcription. It consistently achieves the fastest time-to-first-transcript among cloud providers.

from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions
import asyncio

class DeepgramSTT:
    def __init__(self, api_key: str):
        self.client = DeepgramClient(api_key)

    async def stream_microphone(self, callback):
        connection = self.client.listen.asynclive.v("1")

        # The SDK calls handlers with the connection as the first argument,
        # so `self` here is the live connection, not DeepgramSTT.
        async def on_transcript(self, result, **kwargs):
            alt = result.channel.alternatives[0]
            if alt.transcript:
                callback(
                    text=alt.transcript,
                    is_final=result.is_final,
                    confidence=alt.confidence,
                    words=alt.words,
                )

        connection.on(LiveTranscriptionEvents.Transcript, on_transcript)

        options = LiveOptions(
            model="nova-2",
            language="en-US",
            encoding="linear16",     # Raw 16-bit PCM (required for unencoded audio)
            sample_rate=16000,
            smart_format=True,       # Auto punctuation and formatting
            diarize=True,            # Speaker identification
            interim_results=True,    # Emit partial transcripts as they form
            endpointing=300,         # ms of silence before finalizing a segment
            filler_words=False,      # Remove "um", "uh"
            utterance_end_ms=1000,   # UtteranceEnd event after 1s of silence
        )

        await connection.start(options)
        return connection

# Usage
stt = DeepgramSTT("your-api-key")

def handle_transcript(text, is_final, confidence, words):
    prefix = "FINAL" if is_final else "INTERIM"
    print(f"[{prefix}] ({confidence:.2f}) {text}")

async def main():
    connection = await stt.stream_microphone(handle_transcript)
    # Keep the event loop alive and feed audio into the connection
    # (see the microphone pump sketch below).
    await pump_microphone(connection)

asyncio.run(main())
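Note that stream_microphone opens the connection but does not capture or send any audio itself. The pump is a few lines with pyaudio; a minimal sketch, assuming 16 kHz linear16 microphone input and the v3 SDK's send/finish methods (chunk sizing here is illustrative):

import pyaudio

async def pump_microphone(connection, seconds: float = 30.0):
    pa = pyaudio.PyAudio()
    stream = pa.open(
        format=pyaudio.paInt16,  # linear16, matching the LiveOptions encoding
        channels=1,
        rate=16000,
        input=True,
        frames_per_buffer=1600,  # 100 ms of audio per read
    )
    try:
        for _ in range(int(seconds * 10)):  # 10 chunks per second
            await connection.send(stream.read(1600, exception_on_overflow=False))
    finally:
        stream.stop_stream()
        stream.close()
        pa.terminate()
        await connection.finish()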

Strengths: Sub-200ms streaming latency, built-in diarization, smart formatting, excellent for real-time agents. Weaknesses: Cloud-only (no self-hosted option), cost scales with usage.

AssemblyAI Universal: Best-in-Class Accuracy

AssemblyAI Universal-2 leads accuracy benchmarks, especially on noisy audio, accented speech, and domain-specific vocabulary.

import assemblyai as aai

class AssemblyAISTT:
    def __init__(self, api_key: str):
        aai.settings.api_key = api_key

    def transcribe_with_analysis(self, audio_url: str) -> dict:
        config = aai.TranscriptionConfig(
            speech_model=aai.SpeechModel.best,
            speaker_labels=True,
            auto_highlights=True,
            sentiment_analysis=True,
            entity_detection=True,
        )

        transcriber = aai.Transcriber()
        transcript = transcriber.transcribe(audio_url, config=config)

        return {
            "text": transcript.text,
            "utterances": [
                {"speaker": u.speaker, "text": u.text}
                for u in transcript.utterances
            ],
            "sentiment": transcript.sentiment_analysis,
            "entities": transcript.entities,
        }

    def stream_realtime(self, on_data):
        # Streams 16 kHz PCM over a websocket; on_data receives both
        # interim and final transcripts as they arrive.
        transcriber = aai.RealtimeTranscriber(
            sample_rate=16000,
            on_data=on_data,
            on_error=lambda e: print(f"Error: {e}"),
        )
        transcriber.connect()
        return transcriber
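On the streaming side, the on_data callback receives both interim and final transcripts. A brief usage sketch, assuming the SDK's RealtimeFinalTranscript type and the extras microphone helper (which requires pyaudio, via pip install "assemblyai[extras]"):

def print_transcript(transcript: aai.RealtimeTranscript):
    if not transcript.text:
        return
    if isinstance(transcript, aai.RealtimeFinalTranscript):
        print(f"[FINAL] {transcript.text}")
    else:
        print(f"[interim] {transcript.text}")

stt = AssemblyAISTT("your-api-key")
transcriber = stt.stream_realtime(print_transcript)
transcriber.stream(aai.extras.MicrophoneStream(sample_rate=16000))  # Blocks until interrupted
transcriber.close()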

Strengths: Highest accuracy on difficult audio, built-in NLU features (sentiment, entity detection, summarization), excellent streaming. Weaknesses: Higher per-minute pricing, fewer language options than Whisper.


Comparison Matrix

| Feature | Whisper (self-hosted) | Deepgram Nova-2 | AssemblyAI Universal-2 |
|---|---|---|---|
| Streaming | No (batch only) | Yes (sub-200ms) | Yes (sub-300ms) |
| WER (clean audio) | ~5% | ~6% | ~4.5% |
| Languages | 99 | 36 | 20+ |
| Self-hosted | Yes | No | No |
| Diarization | No (needs addon) | Built-in | Built-in |
| Price | Free (GPU cost) | $0.0043/min | $0.0062/min |
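To make the price row concrete: at 10,000 transcribed minutes per month, Deepgram works out to roughly $43 and AssemblyAI to about $62, while self-hosted Whisper's bill is whatever your GPU time costs, which pays off mainly at volumes high enough to keep the hardware busy.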

Choosing the Right Engine

For real-time voice agents where latency is critical, Deepgram Nova-2 is the strongest choice. For offline processing or when data privacy is paramount, self-hosted Whisper with faster-whisper gives you full control. For high-accuracy scenarios with challenging audio (call centers, medical transcription), AssemblyAI leads on accuracy benchmarks.

FAQ

Can I combine multiple STT engines for better results?

Yes, a common production pattern is to use Deepgram for real-time streaming during the conversation (optimizing for speed) and then re-transcribe the full recording with AssemblyAI or Whisper large-v3 afterward for analytics and compliance. This gives you the best of both worlds.
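A condensed sketch of that dual-pass pattern, reusing the pieces above; stream_with_deepgram is a hypothetical wrapper around the live connection from the Deepgram section:

from faster_whisper import WhisperModel

archive_model = WhisperModel("large-v3", device="cuda", compute_type="float16")

async def handle_call(audio_source, recording_path: str):
    # Pass 1: low-latency interim results drive the live conversation.
    await stream_with_deepgram(audio_source)  # hypothetical wrapper, see above

    # Pass 2: after hangup, re-transcribe the full recording at maximum
    # accuracy for analytics and compliance archiving.
    segments, _ = archive_model.transcribe(recording_path, beam_size=5)
    return " ".join(segment.text.strip() for segment in segments)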

How do I handle background noise and accents?

All three engines handle moderate noise well, but preprocessing helps. Apply noise reduction before sending audio to the STT engine. For accents, AssemblyAI consistently performs best. You can also fine-tune Whisper on domain-specific audio data to improve accuracy for your specific use case.
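For the noise-reduction step, a spectral-gating library such as noisereduce is a quick option; a sketch, assuming a WAV file readable by soundfile:

import noisereduce as nr
import soundfile as sf

audio, sample_rate = sf.read("noisy_call.wav")
cleaned = nr.reduce_noise(y=audio, sr=sample_rate)  # Spectral gating
sf.write("cleaned_call.wav", cleaned, sample_rate)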

What sample rate and format should I send audio in?

For all three providers, 16kHz mono PCM (linear16) is the standard. Higher sample rates like 48kHz do not improve accuracy and waste bandwidth. If your source audio is stereo, mix it to mono before sending.
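The downmix and resample are a few lines with NumPy and SciPy; a sketch assuming 48 kHz stereo source audio:

import soundfile as sf
from scipy.signal import resample_poly

audio, rate = sf.read("source_48k_stereo.wav")        # shape: (samples, 2)
mono = audio.mean(axis=1)                             # Stereo -> mono
audio_16k = resample_poly(mono, up=16000, down=rate)  # 48 kHz -> 16 kHz
sf.write("ready_for_stt.wav", audio_16k, 16000)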


