
Voice Activity Detection: Knowing When Users Start and Stop Speaking

Learn how Voice Activity Detection works in voice AI agents — from energy-based methods to ML-based VAD models like Silero — including configuration, sensitivity tuning, and practical implementation.

What Is Voice Activity Detection and Why Does It Matter?

Voice Activity Detection (VAD) is the process of determining whether a given audio segment contains human speech or just background noise. In voice AI agents, VAD serves three critical functions: it tells the STT engine when to start and stop processing, it enables the agent to detect when the user has finished their turn (endpointing), and it allows barge-in detection when the user interrupts the agent.

Without good VAD, your agent either starts transcribing background noise (false positives, wasting resources and producing garbage text) or misses the beginning of user speech (false negatives, cutting off words and frustrating users).

Energy-Based VAD: The Simple Approach

The simplest VAD method measures the energy (volume) of each audio frame. If the energy exceeds a threshold, the frame is classified as speech.

import numpy as np
from collections import deque

class EnergyVAD:
    def __init__(
        self,
        threshold_db: float = -35.0,
        frame_duration_ms: int = 30,
        sample_rate: int = 16000,
        min_speech_ms: int = 200,
        min_silence_ms: int = 500,
    ):
        self.threshold_db = threshold_db
        self.frame_size = int(sample_rate * frame_duration_ms / 1000)
        self.min_speech_frames = int(min_speech_ms / frame_duration_ms)
        self.min_silence_frames = int(min_silence_ms / frame_duration_ms)
        self.speech_count = 0
        self.silence_count = 0
        self.is_speaking = False

    def compute_rms_db(self, frame: np.ndarray) -> float:
        """Return the frame's RMS energy in dBFS, assuming 16-bit PCM samples."""
        rms = np.sqrt(np.mean(frame.astype(np.float32) ** 2))
        if rms == 0:
            return -100.0
        return 20 * np.log10(rms / 32768.0)

    def process_frame(self, frame: np.ndarray) -> dict:
        energy_db = self.compute_rms_db(frame)
        is_speech_frame = energy_db > self.threshold_db

        if is_speech_frame:
            self.speech_count += 1
            self.silence_count = 0
        else:
            self.silence_count += 1
            self.speech_count = 0

        # State transitions
        event = None
        if not self.is_speaking and self.speech_count >= self.min_speech_frames:
            self.is_speaking = True
            event = "speech_start"
        elif self.is_speaking and self.silence_count >= self.min_silence_frames:
            self.is_speaking = False
            event = "speech_end"

        return {
            "is_speaking": self.is_speaking,
            "energy_db": energy_db,
            "event": event,
        }

Energy-based VAD is fast and requires zero dependencies, but it struggles in noisy environments. A loud air conditioner or keyboard typing can easily exceed the threshold, triggering false positives.
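To build intuition for where a threshold like -35 dBFS sits, here is a small standalone sketch (assuming 16-bit PCM frames, as in the class above) comparing the RMS level of quiet room noise against a speech-level signal:

```python
import numpy as np

def rms_db(frame: np.ndarray) -> float:
    # Same dBFS computation as EnergyVAD.compute_rms_db, assuming 16-bit PCM
    rms = np.sqrt(np.mean(frame.astype(np.float32) ** 2))
    return -100.0 if rms == 0 else 20 * float(np.log10(rms / 32768.0))

sr = 16000
t = np.arange(int(sr * 0.03)) / sr  # one 30 ms frame

rng = np.random.default_rng(0)
noise = (rng.standard_normal(len(t)) * 50).astype(np.int16)      # quiet room hiss
speech = (np.sin(2 * np.pi * 200 * t) * 8000).astype(np.int16)   # speech-level signal

print(f"noise:  {rms_db(noise):6.1f} dBFS")   # far below the -35 dB threshold
print(f"speech: {rms_db(speech):6.1f} dBFS")  # well above it
```

The ~40 dB gap between the two signals is why the method works in quiet rooms, and why a loud fan that narrows that gap breaks it.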


ML-Based VAD: Silero VAD

Silero VAD is a lightweight neural network trained specifically for voice activity detection. It runs in real time on CPU and dramatically outperforms energy-based methods in noisy conditions.

import torch
import numpy as np

class SileroVAD:
    def __init__(self, threshold: float = 0.5, sample_rate: int = 16000):
        self.model, self.utils = torch.hub.load(
            repo_or_dir="snakers4/silero-vad",
            model="silero_vad",
            trust_repo=True,
        )
        self.threshold = threshold
        self.sample_rate = sample_rate
        self.is_speaking = False
        self.speech_frames = 0
        self.silence_frames = 0

    def process_chunk(self, audio_chunk: np.ndarray) -> dict:
        """Process a 512-sample chunk (32 ms at 16 kHz)."""
        # Silero expects float32 audio in [-1.0, 1.0]; normalize 16-bit PCM input
        if audio_chunk.dtype == np.int16:
            audio_chunk = audio_chunk.astype(np.float32) / 32768.0
        tensor = torch.from_numpy(audio_chunk).float()

        # Silero returns a probability of speech
        speech_prob = self.model(tensor, self.sample_rate).item()

        event = None
        if speech_prob >= self.threshold:
            self.speech_frames += 1
            self.silence_frames = 0
            if not self.is_speaking and self.speech_frames >= 4:
                self.is_speaking = True
                event = "speech_start"
        else:
            self.silence_frames += 1
            self.speech_frames = 0
            if self.is_speaking and self.silence_frames >= 16:
                self.is_speaking = False
                event = "speech_end"

        return {
            "speech_probability": speech_prob,
            "is_speaking": self.is_speaking,
            "event": event,
        }

# Usage
vad = SileroVAD(threshold=0.5)

def handle_audio_frame(frame):
    result = vad.process_chunk(frame)
    if result["event"] == "speech_start":
        print("User started speaking — activate STT")
    elif result["event"] == "speech_end":
        print("User stopped speaking — finalize transcript")

Silero VAD runs at less than 1ms per chunk on CPU, making it suitable for real-time applications. The model is only about 2MB, so it can even run in the browser via ONNX Runtime.

Browser-Side VAD with JavaScript

Running VAD in the browser reduces server load and enables faster speech detection because there is no network round-trip.

class BrowserVAD {
  constructor(options = {}) {
    this.threshold = options.threshold || 0.5;
    this.onSpeechStart = options.onSpeechStart || (() => {});
    this.onSpeechEnd = options.onSpeechEnd || (() => {});
    this.isSpeaking = false;
  }

  async init() {
    // Load the Silero VAD ONNX model in the browser. MicVAD is the vanilla
    // (non-React) entry point of @ricky0123/vad-web.
    const { MicVAD } = await import('@ricky0123/vad-web');

    this.vad = await MicVAD.new({
      positiveSpeechThreshold: this.threshold,
      negativeSpeechThreshold: this.threshold - 0.15,
      minSpeechFrames: 4,
      preSpeechPadFrames: 3,
      redemptionFrames: 8,
      onSpeechStart: () => {
        this.isSpeaking = true;
        this.onSpeechStart();
      },
      onSpeechEnd: (audio) => {
        this.isSpeaking = false;
        this.onSpeechEnd(audio);
      },
    });
  }

  start() { this.vad.start(); }
  pause() { this.vad.pause(); }
  destroy() { this.vad.destroy(); }
}

// Usage
const vad = new BrowserVAD({
  threshold: 0.5,
  onSpeechStart: () => console.log('Speech detected — open STT stream'),
  onSpeechEnd: (audio) => {
    console.log('Speech ended — send audio to server');
    sendAudioToServer(audio);
  },
});
await vad.init();
vad.start();

Tuning VAD Sensitivity

The key parameters to tune are the speech probability threshold, minimum speech duration, and silence timeout.

  • Threshold too low (0.3): More false positives — background noise triggers speech detection
  • Threshold too high (0.8): More false negatives — quiet or soft speech is missed
  • Silence timeout too short (200ms): Cuts off speech during natural pauses
  • Silence timeout too long (1500ms): Agent waits too long before responding

A good starting point is a threshold of 0.5, minimum speech of 150ms, and silence timeout of 600-800ms. From there, tune based on your specific environment and user feedback.
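Those starting values can be collected in one place so they are easy to tune per deployment. A minimal sketch — the class and field names are illustrative, not from any specific SDK:

```python
from dataclasses import dataclass

@dataclass
class VADConfig:
    # Illustrative names; map these onto whatever VAD library you use
    speech_threshold: float = 0.5    # probability above which a frame counts as speech
    min_speech_ms: int = 150         # sustained speech required before speech_start
    silence_timeout_ms: int = 700    # sustained silence required before speech_end

    def min_speech_frames(self, frame_ms: int = 32) -> int:
        """Convert the speech duration into a frame count for a given frame size."""
        return max(1, round(self.min_speech_ms / frame_ms))

    def silence_frames(self, frame_ms: int = 32) -> int:
        """Convert the silence timeout into a frame count for a given frame size."""
        return max(1, round(self.silence_timeout_ms / frame_ms))

cfg = VADConfig()
print(cfg.min_speech_frames(), cfg.silence_frames())  # 5 22 at 32 ms frames
```

Expressing the durations in milliseconds and deriving frame counts keeps the config portable if you later change the frame size.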

FAQ

Should I run VAD on the client or the server?

Running VAD on the client is ideal for bandwidth optimization — you only send audio to the server when speech is detected. This can reduce bandwidth by 60-80% in typical conversations. However, server-side VAD gives you more control and consistency. Many production systems run VAD on both sides: client-side for bandwidth savings and server-side for reliable endpointing.

How does VAD interact with echo cancellation?

Without echo cancellation, VAD will detect the agent's own speech playing through the speakers as user speech, creating a feedback loop. WebRTC's built-in AEC (Acoustic Echo Cancellation) handles this automatically. If you are using raw audio streams without WebRTC, you need to implement echo cancellation before VAD, or use a reference signal to suppress the agent's output from the input stream.

Can VAD distinguish between speech and non-speech sounds like coughing or typing?

ML-based VAD models like Silero are specifically trained to detect human speech patterns, so they handle most non-speech sounds well. However, they can still be triggered by sounds that resemble speech patterns, such as music with vocals or TV audio in the background. For these edge cases, combining VAD with a short STT verification step — checking if the transcription is meaningful — provides an additional layer of filtering.
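That STT verification step can be sketched as a simple transcript filter. The filler list and function name below are illustrative; real systems would tune the list to their STT engine's output:

```python
import re

# Hypothetical post-VAD filter: run the captured audio through STT, then
# accept the turn only if the transcript contains at least one real word.
FILLERS = {"uh", "um", "hmm", "mm", "ah", "er"}

def is_meaningful(transcript: str) -> bool:
    words = re.findall(r"[a-z']+", transcript.lower())
    return any(w not in FILLERS for w in words)

print(is_meaningful("book me a table for two"))  # True  -> pass to the agent
print(is_meaningful("um"))                       # False -> discard as noise
print(is_meaningful(""))                         # False -> empty transcript
```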


#VoiceActivityDetection #VAD #SileroVAD #VoiceAI #AudioProcessing #SpeechDetection #AgenticAI #LearnAI #AIEngineering
