Voice Activity Detection and Turn Management in Conversational AI
Master voice activity detection algorithms, turn-taking strategies, overlapping speech handling, and silence threshold tuning to build natural-sounding conversational AI agents.
The Invisible Foundation of Voice Agents
When you talk to another person, you instinctively know when they have finished speaking. You detect pauses, falling intonation, syntactic completeness, and body language. Machines have none of these instincts. They need Voice Activity Detection (VAD) and explicit turn management logic to decide when to listen, when to speak, and when to yield.
Get this wrong and your voice agent either cuts users off mid-sentence or sits in awkward silence for seconds after they stop talking. Get it right and the conversation feels as fluid as talking to a human colleague.
What Is Voice Activity Detection?
VAD is the process of determining whether an audio frame contains human speech or is just background noise. It sounds simple, but the real world is messy: keyboard clicks, air conditioning hum, dogs barking, other people talking in the background. A production VAD system must distinguish intentional speech from all of this.
```mermaid
flowchart LR
    CALLER(["Caller"])
    subgraph TEL["Telephony"]
        SIP["Twilio SIP and PSTN"]
    end
    subgraph BRAIN["Business AI Agent"]
        STT["Streaming STT<br/>Deepgram or Whisper"]
        NLU{"Intent and<br/>Entity Extraction"}
        TOOLS["Tool Calls"]
        TTS["Streaming TTS<br/>ElevenLabs or Rime"]
    end
    subgraph DATA["Live Data Plane"]
        CRM[("CRM and Notes")]
        CAL[("Calendar and<br/>Schedule")]
        KB[("Knowledge Base<br/>and Policies")]
    end
    subgraph OUT["Outcomes"]
        O1(["Booking captured"])
        O2(["CRM record created"])
        O3(["Human handoff"])
    end
    CALLER --> SIP --> STT --> NLU
    NLU -->|Lookup| TOOLS
    TOOLS <--> CRM
    TOOLS <--> CAL
    TOOLS <--> KB
    NLU --> TTS --> SIP --> CALLER
    NLU -->|Resolved| O1
    NLU -->|Schedule| O2
    NLU -->|Escalate| O3
    style CALLER fill:#f1f5f9,stroke:#64748b,color:#0f172a
    style NLU fill:#4f46e5,stroke:#4338ca,color:#fff
    style O1 fill:#059669,stroke:#047857,color:#fff
    style O2 fill:#0ea5e9,stroke:#0369a1,color:#fff
    style O3 fill:#f59e0b,stroke:#d97706,color:#1f2937
```
Energy-Based VAD
The simplest approach measures the signal energy (volume) of each audio frame:
```python
import numpy as np

def energy_vad(audio_frame: np.ndarray, threshold: float = 0.02) -> bool:
    """Return True if the frame contains speech based on energy."""
    rms = np.sqrt(np.mean(audio_frame ** 2))
    return rms > threshold
```
Energy-based VAD is fast and cheap but fails in noisy environments. A loud air conditioner can register as speech, while a soft-spoken user can fall below the threshold.
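One common mitigation is to track a running noise floor and set the threshold relative to it, rather than using a fixed value. A minimal sketch of this idea follows; the 1.5x margin and the smoothing factor are illustrative assumptions, not tuned values:

```python
import numpy as np

class AdaptiveEnergyVAD:
    """Energy VAD with a slowly tracked noise floor (illustrative sketch)."""

    def __init__(self, margin: float = 1.5, smoothing: float = 0.05):
        self.margin = margin        # speech must exceed the noise floor by this factor
        self.smoothing = smoothing  # how quickly the noise floor adapts
        self.noise_floor = 0.01     # initial estimate, updated on silent frames

    def is_speech(self, audio_frame: np.ndarray) -> bool:
        rms = float(np.sqrt(np.mean(audio_frame ** 2)))
        speech = rms > self.noise_floor * self.margin
        if not speech:
            # Update the floor only on non-speech frames so sustained
            # speech does not raise the threshold against itself.
            self.noise_floor += (rms - self.noise_floor) * self.smoothing
        return speech
```

With this scheme a steady air-conditioner hum gradually raises the floor, while a clearly louder utterance still trips the detector.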
Zero-Crossing Rate VAD
Speech has characteristic patterns in how often the audio signal crosses zero. Combining zero-crossing rate with energy gives a more robust detector:
```python
def zero_crossing_rate(audio_frame: np.ndarray) -> float:
    """Calculate the zero-crossing rate of an audio frame."""
    signs = np.sign(audio_frame)
    crossings = np.sum(np.abs(np.diff(signs)) > 0)
    return crossings / len(audio_frame)

def combined_vad(
    audio_frame: np.ndarray,
    energy_threshold: float = 0.02,
    zcr_range: tuple = (0.1, 0.5),
) -> bool:
    """Combine energy and zero-crossing rate for VAD."""
    rms = np.sqrt(np.mean(audio_frame ** 2))
    zcr = zero_crossing_rate(audio_frame)
    has_energy = rms > energy_threshold
    has_speech_zcr = zcr_range[0] <= zcr <= zcr_range[1]
    return has_energy and has_speech_zcr
```
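In a streaming pipeline these detectors run on short fixed-size frames. A small framing helper makes that explicit; this is a sketch, and the 30 ms non-overlapping frame size is an assumption chosen to match typical VAD hop sizes:

```python
import numpy as np

def frame_generator(signal: np.ndarray, sample_rate: int = 16000,
                    frame_ms: int = 30):
    """Yield consecutive non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        yield signal[start:start + frame_len]

def is_speech_frame(frame: np.ndarray) -> bool:
    """Energy + ZCR check, inlined from the detectors above."""
    rms = np.sqrt(np.mean(frame ** 2))
    zcr = np.sum(np.abs(np.diff(np.sign(frame))) > 0) / len(frame)
    return rms > 0.02 and 0.1 <= zcr <= 0.5
```

For example, one second of a loud 1 kHz tone at 16 kHz passes both checks in every frame, while a 200 Hz hum fails the ZCR band even when it is loud.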
Neural VAD Models
Modern production systems use dedicated VAD models such as Silero VAD (a lightweight neural network trained on large multilingual datasets) or WebRTC VAD (a fast classical GMM-based detector). Both handle noise far better than the heuristics above:
```python
import torch

# Silero VAD — lightweight, runs on CPU in real time
model, utils = torch.hub.load(
    repo_or_dir="snakers4/silero-vad",
    model="silero_vad",
    force_reload=False,
)
(get_speech_timestamps, _, read_audio, _, _) = utils

def detect_speech_segments(audio_path: str) -> list:
    """Return timestamps of speech segments in the audio file."""
    wav = read_audio(audio_path, sampling_rate=16000)
    speech_timestamps = get_speech_timestamps(
        wav, model, sampling_rate=16000
    )
    return speech_timestamps
```
Silero VAD scores short audio chunks (roughly 30 ms; recent releases expect 512-sample chunks at 16 kHz) and returns a speech probability between 0 and 1. A threshold of 0.5 works well for most environments, but you can tune it based on your deployment context.
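Raw per-chunk probabilities flicker around any single threshold. A common smoothing trick is hysteresis: require a higher probability to enter the speech state than to leave it, in the spirit of Silero's own `VADIterator`. A minimal sketch, where the 0.5/0.35 thresholds are illustrative assumptions:

```python
class HysteresisVAD:
    """Smooth noisy speech probabilities with separate enter/leave thresholds."""

    def __init__(self, enter: float = 0.5, leave: float = 0.35):
        self.enter = enter    # probability needed to start a speech segment
        self.leave = leave    # probability below which the segment ends
        self.in_speech = False

    def update(self, prob: float) -> bool:
        if self.in_speech:
            self.in_speech = prob >= self.leave
        else:
            self.in_speech = prob >= self.enter
        return self.in_speech
```

A probability that dips briefly to 0.4 mid-utterance no longer splits the segment, while a genuine drop below 0.35 ends it.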
Turn-Taking Strategies
Detecting speech is only the first step. You also need to decide when a user has finished their turn so the agent can respond. This is the turn-taking problem.
Silence-Based Turn Detection
The most common strategy: if the user stops speaking for a configurable duration, assume their turn is complete.
```python
import time
from dataclasses import dataclass, field

@dataclass
class TurnDetector:
    silence_threshold: float = 0.7  # seconds of silence before turn ends
    _last_speech_time: float = field(default=0.0, init=False)
    _is_speaking: bool = field(default=False, init=False)

    def process_frame(self, is_speech: bool) -> str:
        """Process a VAD result and return the turn state."""
        now = time.time()
        if is_speech:
            self._last_speech_time = now
            if not self._is_speaking:
                self._is_speaking = True
                return "turn_started"
            return "speaking"
        if self._is_speaking:
            silence_duration = now - self._last_speech_time
            if silence_duration >= self.silence_threshold:
                self._is_speaking = False
                return "turn_ended"
            return "pause"
        return "idle"
```
The silence threshold is the single most impactful parameter in turn management. Too short (under 0.4 seconds) and you cut off users who are pausing to think. Too long (over 1.5 seconds) and the agent feels sluggish.
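Because a wall-clock detector is awkward to reason about offline, the same logic can be driven by frame counts: with 20 ms frames, a 0.7 s threshold is 35 consecutive silent frames. A sketch of that replayable variant (frame size and threshold here are illustrative):

```python
def turn_events(vad_flags, frame_ms: int = 20,
                silence_threshold_s: float = 0.7):
    """Replay per-frame VAD flags and yield (frame_index, event) pairs."""
    silence_frames_needed = int(silence_threshold_s * 1000 / frame_ms)
    speaking, silent_run = False, 0
    for i, is_speech in enumerate(vad_flags):
        if is_speech:
            silent_run = 0
            if not speaking:
                speaking = True
                yield i, "turn_started"
        elif speaking:
            silent_run += 1
            if silent_run >= silence_frames_needed:
                speaking = False
                yield i, "turn_ended"
```

Replaying recorded VAD output through this function is also a cheap way to A/B different thresholds against real call logs before changing production settings.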
Adaptive Silence Thresholds
A fixed threshold does not fit every situation. Some users speak quickly with short pauses; others think carefully between phrases. Adaptive thresholds adjust in real time:
```python
from dataclasses import dataclass, field

@dataclass
class AdaptiveTurnDetector:
    base_threshold: float = 0.7
    min_threshold: float = 0.4
    max_threshold: float = 1.5
    adaptation_rate: float = 0.1
    _pause_history: list = field(default_factory=list, init=False)
    _current_threshold: float = field(default=0.7, init=False)

    def record_pause(self, pause_duration: float):
        """Record a mid-turn pause to adapt the threshold."""
        self._pause_history.append(pause_duration)
        if len(self._pause_history) > 20:
            self._pause_history.pop(0)
        if len(self._pause_history) >= 3:
            avg_pause = sum(self._pause_history) / len(self._pause_history)
            target = avg_pause * 2.0  # aim for 2x the average pause
            self._current_threshold += (
                (target - self._current_threshold) * self.adaptation_rate
            )
            self._current_threshold = max(
                self.min_threshold,
                min(self.max_threshold, self._current_threshold),
            )

    @property
    def threshold(self) -> float:
        return self._current_threshold
```
This detector learns the user's speaking rhythm. If a user consistently pauses for 0.3 seconds between thoughts, the threshold settles around 0.6 seconds — fast enough to feel responsive but not so fast that it interrupts mid-thought pauses.
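The update rule is an exponential move toward 2x the average pause, so its fixed point is easy to check with a quick standalone simulation that re-implements just that rule:

```python
def simulate_threshold(pauses, base: float = 0.7, rate: float = 0.1,
                       lo: float = 0.4, hi: float = 1.5) -> float:
    """Replay the adaptive update: move 10% of the way toward 2x the avg pause."""
    history, threshold = [], base
    for p in pauses:
        history.append(p)
        history = history[-20:]           # keep the 20 most recent pauses
        if len(history) >= 3:
            target = 2.0 * sum(history) / len(history)
            threshold += (target - threshold) * rate
            threshold = max(lo, min(hi, threshold))
    return threshold

final = simulate_threshold([0.3] * 50)    # settles near 0.60 s
```

After fifty 0.3 s pauses the threshold has converged to roughly 0.6 s, matching the behavior described above.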
Handling Overlapping Speech
Real conversations have overlap. Users sometimes start speaking before the agent finishes, or they provide brief acknowledgments ("uh-huh", "yeah") while the agent is talking. Your system must handle these gracefully.
Overlap Classification
Not all overlaps are the same. Classify them to respond appropriately:
```python
from enum import Enum

class OverlapType(str, Enum):
    BACKCHANNEL = "backchannel"    # "uh-huh", "yeah", "ok"
    INTERRUPTION = "interruption"  # user wants to take the floor
    COLLISION = "collision"        # both started at the same time

def classify_overlap(
    user_audio_energy: float,
    user_speech_duration: float,
    agent_is_speaking: bool,
) -> OverlapType:
    """Classify the type of speech overlap."""
    if not agent_is_speaking:
        return OverlapType.COLLISION
    # Short, low-energy speech during agent turn = backchannel
    if user_speech_duration < 0.5 and user_audio_energy < 0.05:
        return OverlapType.BACKCHANNEL
    # Sustained speech during agent turn = interruption
    return OverlapType.INTERRUPTION
```
For backchannels, the agent should continue speaking. For interruptions, the agent should stop and yield the floor. This distinction prevents the agent from halting every time a user says "mm-hmm."
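That response policy can be made explicit as a small dispatch table. This is a sketch, and the action names are assumptions for illustration, not a fixed API:

```python
OVERLAP_POLICY = {
    "backchannel": "continue_speaking",   # acknowledge silently, keep the floor
    "interruption": "stop_and_yield",     # cancel TTS, start listening
    "collision": "back_off",              # pause briefly, then retry the turn
}

def overlap_action(overlap_type: str) -> str:
    """Map a classified overlap to the agent's next action."""
    return OVERLAP_POLICY.get(overlap_type, "stop_and_yield")
```

Defaulting unknown cases to yielding is the safer failure mode: talking over a user is worse than an unnecessary pause.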
Integrating VAD with OpenAI Realtime API
The OpenAI Realtime API provides built-in server-side VAD, but understanding how to configure it is essential:
```python
import json
import websockets

async def configure_realtime_session(ws):
    """Configure the OpenAI Realtime API session with VAD settings."""
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "server_vad",
                "threshold": 0.5,
                "prefix_padding_ms": 300,
                "silence_duration_ms": 700,
            },
            "input_audio_transcription": {
                "model": "whisper-1",
            },
        },
    }))
```
The three key parameters are threshold (VAD sensitivity, 0.0 to 1.0), prefix_padding_ms (how much audio before detected speech to include, preventing clipped beginnings), and silence_duration_ms (how long to wait after speech ends before finalizing the turn).
Production Tuning Guidelines
After deploying VAD and turn management across multiple voice agents, these guidelines consistently produce the best results:
- Start with server VAD at threshold 0.5 and silence 700ms, then tune based on user feedback
- Log every turn event — turn_started, turn_ended, interruption, backchannel — with timestamps for analysis
- Measure end-of-turn latency as the time between the user stopping speech and the agent beginning its response; target under 500ms total
- Test with diverse audio conditions: quiet rooms, noisy cafes, speakerphone, Bluetooth headsets
- Add a visual indicator (for screen-based agents) showing whether the system thinks the user is speaking — this helps users adjust their behavior
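The end-of-turn latency metric can be computed directly from such an event log. A sketch, where `agent_response_started` is an assumed event name added alongside the turn events listed above:

```python
def end_of_turn_latencies(events):
    """Pair each user turn_ended with the next agent_response_started.

    `events` is a list of (timestamp_seconds, name) tuples, assumed sorted.
    """
    latencies, pending = [], None
    for ts, name in events:
        if name == "turn_ended":
            pending = ts
        elif name == "agent_response_started" and pending is not None:
            latencies.append(ts - pending)
            pending = None
    return latencies
```

Plotting the distribution of these values (not just the mean) is what reveals the occasional multi-second outliers that users remember.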
The difference between a frustrating voice agent and a delightful one often comes down to 200 milliseconds of silence threshold tuning. Invest the time to get it right.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available, no signup required.