
Voice Activity Detection: Knowing When Users Start and Stop Speaking

Learn how Voice Activity Detection works in voice AI agents — from energy-based methods to ML-based VAD models like Silero — including configuration, sensitivity tuning, and practical implementation.

What Is Voice Activity Detection and Why Does It Matter?

Voice Activity Detection (VAD) is the process of determining whether a given audio segment contains human speech or just background noise. In voice AI agents, VAD serves three critical functions: it tells the STT engine when to start and stop processing, it enables the agent to detect when the user has finished their turn (endpointing), and it allows barge-in detection when the user interrupts the agent.

Without good VAD, your agent either starts transcribing background noise (false positives, wasting resources and producing garbage text) or misses the beginning of user speech (false negatives, cutting off words and frustrating users).

Energy-Based VAD: The Simple Approach

The simplest VAD method measures the energy (volume) of each audio frame. If the energy exceeds a threshold, the frame is classified as speech.

import numpy as np
from collections import deque

class EnergyVAD:
    def __init__(
        self,
        threshold_db: float = -35.0,
        frame_duration_ms: int = 30,
        sample_rate: int = 16000,
        min_speech_ms: int = 200,
        min_silence_ms: int = 500,
    ):
        self.threshold_db = threshold_db
        self.frame_size = int(sample_rate * frame_duration_ms / 1000)
        self.min_speech_frames = int(min_speech_ms / frame_duration_ms)
        self.min_silence_frames = int(min_silence_ms / frame_duration_ms)
        self.speech_count = 0
        self.silence_count = 0
        self.is_speaking = False

    def compute_rms_db(self, frame: np.ndarray) -> float:
        """Return the frame's RMS energy in dBFS, assuming 16-bit PCM samples."""
        rms = np.sqrt(np.mean(frame.astype(np.float32) ** 2))
        if rms == 0:
            return -100.0
        return 20 * np.log10(rms / 32768.0)

    def process_frame(self, frame: np.ndarray) -> dict:
        energy_db = self.compute_rms_db(frame)
        is_speech_frame = energy_db > self.threshold_db

        if is_speech_frame:
            self.speech_count += 1
            self.silence_count = 0
        else:
            self.silence_count += 1
            self.speech_count = 0

        # State transitions
        event = None
        if not self.is_speaking and self.speech_count >= self.min_speech_frames:
            self.is_speaking = True
            event = "speech_start"
        elif self.is_speaking and self.silence_count >= self.min_silence_frames:
            self.is_speaking = False
            event = "speech_end"

        return {
            "is_speaking": self.is_speaking,
            "energy_db": energy_db,
            "event": event,
        }

Energy-based VAD is fast and requires zero dependencies, but it struggles in noisy environments. A loud air conditioner or keyboard typing can easily exceed the threshold, triggering false positives.
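To build intuition for where a threshold like -35 dBFS sits, here is a small standalone sketch (assuming 16-bit PCM frames, as in the class above) comparing the RMS level of quiet room noise against a speech-level signal:

```python
import numpy as np

def rms_db(frame: np.ndarray) -> float:
    # Same dBFS computation as EnergyVAD.compute_rms_db, assuming 16-bit PCM
    rms = np.sqrt(np.mean(frame.astype(np.float32) ** 2))
    return -100.0 if rms == 0 else 20 * float(np.log10(rms / 32768.0))

sr = 16000
t = np.arange(int(sr * 0.03)) / sr  # one 30 ms frame

rng = np.random.default_rng(0)
noise = (rng.standard_normal(len(t)) * 50).astype(np.int16)      # quiet room hiss
speech = (np.sin(2 * np.pi * 200 * t) * 8000).astype(np.int16)   # speech-level signal

print(f"noise:  {rms_db(noise):6.1f} dBFS")   # far below the -35 dB threshold
print(f"speech: {rms_db(speech):6.1f} dBFS")  # well above it
```

The ~40 dB gap between the two signals is why the method works in quiet rooms, and why a loud fan that narrows that gap breaks it.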


ML-Based VAD: Silero VAD

Silero VAD is a lightweight neural network trained specifically for voice activity detection. It runs in real time on CPU and dramatically outperforms energy-based methods in noisy conditions.

import torch
import numpy as np

class SileroVAD:
    def __init__(self, threshold: float = 0.5, sample_rate: int = 16000):
        self.model, self.utils = torch.hub.load(
            repo_or_dir="snakers4/silero-vad",
            model="silero_vad",
            trust_repo=True,
        )
        self.threshold = threshold
        self.sample_rate = sample_rate
        self.is_speaking = False
        self.speech_frames = 0
        self.silence_frames = 0

    def process_chunk(self, audio_chunk: np.ndarray) -> dict:
        """Process a 512-sample chunk (32 ms at 16 kHz)."""
        # Silero expects float32 audio in [-1.0, 1.0]; normalize 16-bit PCM input
        if audio_chunk.dtype == np.int16:
            audio_chunk = audio_chunk.astype(np.float32) / 32768.0
        tensor = torch.from_numpy(audio_chunk).float()

        # Silero returns a probability of speech
        speech_prob = self.model(tensor, self.sample_rate).item()

        event = None
        if speech_prob >= self.threshold:
            self.speech_frames += 1
            self.silence_frames = 0
            if not self.is_speaking and self.speech_frames >= 4:
                self.is_speaking = True
                event = "speech_start"
        else:
            self.silence_frames += 1
            self.speech_frames = 0
            if self.is_speaking and self.silence_frames >= 16:
                self.is_speaking = False
                event = "speech_end"

        return {
            "speech_probability": speech_prob,
            "is_speaking": self.is_speaking,
            "event": event,
        }

# Usage
vad = SileroVAD(threshold=0.5)

def handle_audio_frame(frame):
    result = vad.process_chunk(frame)
    if result["event"] == "speech_start":
        print("User started speaking — activate STT")
    elif result["event"] == "speech_end":
        print("User stopped speaking — finalize transcript")

Silero VAD runs at less than 1ms per chunk on CPU, making it suitable for real-time applications. The model is only about 2MB, so it can even run in the browser via ONNX Runtime.

Browser-Side VAD with JavaScript

Running VAD in the browser reduces server load and enables faster speech detection because there is no network round-trip.

class BrowserVAD {
  constructor(options = {}) {
    this.threshold = options.threshold || 0.5;
    this.onSpeechStart = options.onSpeechStart || (() => {});
    this.onSpeechEnd = options.onSpeechEnd || (() => {});
    this.isSpeaking = false;
  }

  async init() {
    // Load the Silero VAD ONNX model in the browser. MicVAD is the vanilla
    // (non-React) entry point of @ricky0123/vad-web.
    const { MicVAD } = await import('@ricky0123/vad-web');

    this.vad = await MicVAD.new({
      positiveSpeechThreshold: this.threshold,
      negativeSpeechThreshold: this.threshold - 0.15,
      minSpeechFrames: 4,
      preSpeechPadFrames: 3,
      redemptionFrames: 8,
      onSpeechStart: () => {
        this.isSpeaking = true;
        this.onSpeechStart();
      },
      onSpeechEnd: (audio) => {
        this.isSpeaking = false;
        this.onSpeechEnd(audio);
      },
    });
  }

  start() { this.vad.start(); }
  pause() { this.vad.pause(); }
  destroy() { this.vad.destroy(); }
}

// Usage
const vad = new BrowserVAD({
  threshold: 0.5,
  onSpeechStart: () => console.log('Speech detected — open STT stream'),
  onSpeechEnd: (audio) => {
    console.log('Speech ended — send audio to server');
    sendAudioToServer(audio);
  },
});
await vad.init();
vad.start();

Tuning VAD Sensitivity

The key parameters to tune are the speech probability threshold, minimum speech duration, and silence timeout.

  • Threshold too low (0.3): More false positives — background noise triggers speech detection
  • Threshold too high (0.8): More false negatives — quiet or soft speech is missed
  • Silence timeout too short (200ms): Cuts off speech during natural pauses
  • Silence timeout too long (1500ms): Agent waits too long before responding

A good starting point is a threshold of 0.5, minimum speech of 150ms, and silence timeout of 600-800ms. From there, tune based on your specific environment and user feedback.
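Those starting values can be collected in one place so they are easy to tune per deployment. A minimal sketch — the class and field names are illustrative, not from any specific SDK:

```python
from dataclasses import dataclass

@dataclass
class VADConfig:
    # Illustrative names; map these onto whatever VAD library you use
    speech_threshold: float = 0.5    # probability above which a frame counts as speech
    min_speech_ms: int = 150         # sustained speech required before speech_start
    silence_timeout_ms: int = 700    # sustained silence required before speech_end

    def min_speech_frames(self, frame_ms: int = 32) -> int:
        """Convert the speech duration into a frame count for a given frame size."""
        return max(1, round(self.min_speech_ms / frame_ms))

    def silence_frames(self, frame_ms: int = 32) -> int:
        """Convert the silence timeout into a frame count for a given frame size."""
        return max(1, round(self.silence_timeout_ms / frame_ms))

cfg = VADConfig()
print(cfg.min_speech_frames(), cfg.silence_frames())  # 5 22 at 32 ms frames
```

Expressing the durations in milliseconds and deriving frame counts keeps the config portable if you later change the frame size.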

FAQ

Should I run VAD on the client or the server?

Running VAD on the client is ideal for bandwidth optimization — you only send audio to the server when speech is detected. This can reduce bandwidth by 60-80% in typical conversations. However, server-side VAD gives you more control and consistency. Many production systems run VAD on both sides: client-side for bandwidth savings and server-side for reliable endpointing.

How does VAD interact with echo cancellation?

Without echo cancellation, VAD will detect the agent's own speech playing through the speakers as user speech, creating a feedback loop. WebRTC's built-in AEC (Acoustic Echo Cancellation) handles this automatically. If you are using raw audio streams without WebRTC, you need to implement echo cancellation before VAD, or use a reference signal to suppress the agent's output from the input stream.

Can VAD distinguish between speech and non-speech sounds like coughing or typing?

ML-based VAD models like Silero are specifically trained to detect human speech patterns, so they handle most non-speech sounds well. However, they can still be triggered by sounds that resemble speech patterns, such as music with vocals or TV audio in the background. For these edge cases, combining VAD with a short STT verification step — checking if the transcription is meaningful — provides an additional layer of filtering.
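That STT verification step can be sketched as a simple transcript filter. The filler list and function name below are illustrative; real systems would tune the list to their STT engine's output:

```python
import re

# Hypothetical post-VAD filter: run the captured audio through STT, then
# accept the turn only if the transcript contains at least one real word.
FILLERS = {"uh", "um", "hmm", "mm", "ah", "er"}

def is_meaningful(transcript: str) -> bool:
    words = re.findall(r"[a-z']+", transcript.lower())
    return any(w not in FILLERS for w in words)

print(is_meaningful("book me a table for two"))  # True  -> pass to the agent
print(is_meaningful("um"))                       # False -> discard as noise
print(is_meaningful(""))                         # False -> empty transcript
```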


#VoiceActivityDetection #VAD #SileroVAD #VoiceAI #AudioProcessing #SpeechDetection #AgenticAI #LearnAI #AIEngineering
