Learn Agentic AI

Handling Voice Agent Interruptions and Barge-In

Learn how to handle user interruptions and barge-in events in voice agents with lifecycle management, audio muting, graceful cancellation, and response resumption strategies.

Why Interruptions Are Inevitable

In natural conversation, people interrupt each other constantly. A user might say "actually, never mind" halfway through the agent's response. They might correct a misunderstood detail before the agent finishes acting on it. Or they might already know the information being delivered and want to skip ahead.

A voice agent that ignores interruptions — that bulldozes through its response regardless of what the user says — feels robotic and frustrating. Handling barge-in correctly is one of the hallmarks of a well-built voice experience.

The Barge-In Lifecycle

Barge-in is the event where a user starts speaking while the agent is still producing audio output. Handling it involves a sequence of steps:

  1. Detect — VAD identifies user speech during agent playback
  2. Classify — Determine whether it is a true interruption or a backchannel
  3. Cancel — Stop the agent's current audio output
  4. Capture — Record and transcribe the user's interrupting speech
  5. Resume — Process the interruption and generate an appropriate response

These stages map naturally onto a couple of small data types:
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import asyncio
import time

class InterruptionType(str, Enum):
    CORRECTION = "correction"       # "No, I said Tuesday"
    CANCELLATION = "cancellation"   # "Never mind" / "Stop"
    REDIRECT = "redirect"           # "Actually, can you help with..."
    BACKCHANNEL = "backchannel"     # "Uh-huh" / "OK"
    CLARIFICATION = "clarification" # "Wait, what was that?"

@dataclass
class InterruptionEvent:
    timestamp: float
    type: InterruptionType
    user_transcript: str
    agent_was_saying: str
    agent_progress_pct: float  # how far through the response
    handled: bool = False

Detecting True Interruptions vs Backchannels

Not every user utterance during agent speech is an interruption. The first challenge is distinguishing between a backchannel ("mm-hmm") and a genuine attempt to take the floor. We covered the basics in the VAD post — here we build a more sophisticated classifier:

@dataclass
class BargeInDetector:
    energy_threshold: float = 0.04
    duration_threshold: float = 0.6  # seconds
    backchannel_words: set = field(default_factory=lambda: {
        "uh-huh", "mm-hmm", "yeah", "yes", "ok", "okay",
        "right", "sure", "got it", "i see", "mhm",
    })
    _speech_start: Optional[float] = field(default=None, init=False)
    _accumulated_text: str = field(default="", init=False)

    def on_user_speech_start(self):
        """Called when VAD detects user speech during agent output."""
        self._speech_start = time.time()
        self._accumulated_text = ""

    def on_partial_transcript(self, text: str) -> Optional[InterruptionType]:
        """Process partial transcription to classify the interruption."""
        self._accumulated_text = text.strip().lower()

        # Check for backchannel
        if self._accumulated_text in self.backchannel_words:
            return InterruptionType.BACKCHANNEL

        # Check for explicit cancellation
        cancel_phrases = {"stop", "never mind", "nevermind", "cancel", "shut up"}
        if self._accumulated_text in cancel_phrases:
            return InterruptionType.CANCELLATION

        # Check for corrections
        if self._accumulated_text.startswith(("no ", "not ", "actually ")):
            return InterruptionType.CORRECTION

        # Check for redirects
        if self._accumulated_text.startswith(("can you ", "what about ", "instead ")):
            return InterruptionType.REDIRECT

        # If speech has been going long enough, it is a real interruption
        if self._speech_start and (time.time() - self._speech_start) > self.duration_threshold:
            return InterruptionType.REDIRECT

        return None  # Not enough data yet

The key insight is that classification is progressive. You start making a decision as soon as partial transcription arrives and refine it as more words come in. This minimizes the delay between the user speaking and the agent reacting.
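To make the progressive behavior concrete, here is a standalone sketch that mirrors the classifier above as a plain function. The names and word lists are illustrative, not a fixed API — the point is that the same partial can resolve to different labels as more words arrive:

```python
from typing import Optional

# Illustrative subsets of the word lists used by BargeInDetector above.
BACKCHANNELS = {"uh-huh", "mm-hmm", "yeah", "ok", "okay", "right"}
CANCELS = {"stop", "never mind", "nevermind", "cancel"}

def classify_partial(text: str) -> Optional[str]:
    """Return a tentative label for a partial transcript, or None if undecided."""
    t = text.strip().lower()
    if t in BACKCHANNELS:
        return "backchannel"
    if t in CANCELS:
        return "cancellation"
    if t.startswith(("no ", "not ", "actually ")):
        return "correction"
    if t.startswith(("can you ", "what about ", "instead ")):
        return "redirect"
    return None

# The label sharpens as the transcript grows:
print(classify_partial("no"))                 # None — too short to decide
print(classify_partial("no i said tuesday"))  # correction
```

Because the early partials return `None`, the agent keeps speaking until there is real evidence, which is exactly the low-latency behavior the paragraph above describes.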

Muting and Cancelling Agent Output

Once you determine the user is truly interrupting, you need to stop the agent's audio output immediately. With the OpenAI Realtime API, this means sending a cancel event:

import json

async def cancel_agent_response(ws):
    """Cancel the in-progress agent response on the Realtime API."""
    await ws.send(json.dumps({
        "type": "response.cancel",
    }))

async def truncate_audio_output(ws, item_id: str, content_index: int, audio_end_ms: int):
    """Truncate the audio output at the current playback position."""
    await ws.send(json.dumps({
        "type": "conversation.item.truncate",
        "item_id": item_id,
        "content_index": content_index,
        "audio_end_ms": audio_end_ms,
    }))
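The `audio_end_ms` value should reflect how much audio the user actually heard, not how much the model generated. A sketch of that conversion, assuming mono PCM16 at 24 kHz (the Realtime API's default output format) and that you count the bytes your player has consumed:

```python
def played_ms(bytes_played: int, sample_rate: int = 24000,
              bytes_per_sample: int = 2) -> int:
    """Convert bytes of mono PCM16 audio already played into milliseconds."""
    samples = bytes_played // bytes_per_sample
    return int(samples * 1000 / sample_rate)

# 48,000 bytes of 24 kHz PCM16 is one second of audio.
print(played_ms(48000))  # 1000
```

Truncating at this position keeps the server-side conversation history consistent with what the user heard, so the model does not believe it said words that were never played.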

On the client side, you also need to immediately stop audio playback. If there is buffered audio waiting to be played, flush it:


@dataclass
class AudioPlaybackManager:
    _buffer: list = field(default_factory=list, init=False)
    _is_playing: bool = field(default=False, init=False)
    _muted: bool = field(default=False, init=False)

    def mute(self):
        """Immediately stop playback and clear the buffer."""
        self._muted = True
        self._is_playing = False
        self._buffer.clear()

    def unmute(self):
        """Allow playback to resume."""
        self._muted = False

    def enqueue(self, audio_chunk: bytes):
        """Add audio to the playback buffer."""
        if not self._muted:
            self._buffer.append(audio_chunk)

    def flush(self):
        """Clear all buffered audio without playing it."""
        self._buffer.clear()
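Why clear the buffer on mute rather than just pausing? Any queued chunks are speech the user has already talked over. A standalone miniature of the manager above (same behavior, trimmed for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class MiniPlayback:
    _buffer: list = field(default_factory=list)
    _muted: bool = False

    def enqueue(self, chunk: bytes):
        # Chunks arriving while muted are dropped, not queued for later.
        if not self._muted:
            self._buffer.append(chunk)

    def mute(self):
        # Clearing on mute guarantees no stale agent audio plays after barge-in.
        self._muted = True
        self._buffer.clear()

pb = MiniPlayback()
pb.enqueue(b"chunk-1")
pb.enqueue(b"chunk-2")
pb.mute()
pb.enqueue(b"chunk-3")
print(len(pb._buffer))  # 0 — nothing survives the barge-in
```

If you only paused, the stale chunks would resume playing after the interruption was handled, which sounds like the agent ignoring what the user just said.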

Graceful Cancellation Patterns

Abruptly stopping mid-word sounds jarring. A more polished approach is to let the current word or phrase finish before stopping, then acknowledge the interruption:

async def handle_interruption(
    ws,
    event: InterruptionEvent,
    playback: AudioPlaybackManager,
):
    """Handle a classified interruption event."""
    if event.type == InterruptionType.BACKCHANNEL:
        # Do nothing — agent continues speaking
        return

    # Stop agent audio
    playback.mute()

    if event.type == InterruptionType.CANCELLATION:
        playback.flush()
        await send_agent_message(
            ws,
            "Understood, I will stop. What would you like to do instead?",
        )

    elif event.type == InterruptionType.CORRECTION:
        playback.flush()
        await send_agent_message(
            ws,
            f"Sorry about that. Let me address your correction: "
            f"{event.user_transcript}",
        )

    elif event.type == InterruptionType.REDIRECT:
        playback.flush()
        await send_agent_message(
            ws,
            "Of course, let me help with that instead.",
        )

    elif event.type == InterruptionType.CLARIFICATION:
        playback.flush()
        await send_agent_message(
            ws,
            "Let me repeat that more clearly.",
        )

    event.handled = True
    playback.unmute()

async def send_agent_message(ws, text: str):
    """Inject a text message for the agent to speak."""
    await ws.send(json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "text", "text": text}],
        },
    }))
    await ws.send(json.dumps({"type": "response.create"}))

Tracking Interruption Context

The agent needs to know what it was saying when interrupted so it can resume or adjust. Track the context:

@dataclass
class ConversationTracker:
    _current_response_text: str = field(default="", init=False)
    _current_item_id: Optional[str] = field(default=None, init=False)
    _interruption_history: list = field(default_factory=list, init=False)

    def on_response_text_delta(self, item_id: str, delta: str):
        """Track the agent's response as it streams."""
        self._current_item_id = item_id
        self._current_response_text += delta

    def on_interruption(self, user_text: str) -> InterruptionEvent:
        """Create an interruption event with full context."""
        # Rough heuristic: only streamed text is available mid-response, so
        # assume roughly 50 characters of the reply were still unspoken.
        progress = len(self._current_response_text)
        event = InterruptionEvent(
            timestamp=time.time(),
            type=InterruptionType.REDIRECT,
            user_transcript=user_text,
            agent_was_saying=self._current_response_text,
            agent_progress_pct=min(progress / max(progress + 50, 1), 1.0),
        )
        self._interruption_history.append(event)
        self._current_response_text = ""
        return event

    @property
    def interruption_rate(self) -> float:
        """Track how often the user interrupts — high rates suggest issues."""
        if not self._interruption_history:
            return 0.0
        recent = [
            e for e in self._interruption_history
            if time.time() - e.timestamp < 300  # last 5 minutes
        ]
        return len(recent) / 5.0  # interruptions per minute

A high interruption rate is a signal that something is wrong. The agent might be speaking too slowly, providing irrelevant information, or misunderstanding the user. Log and monitor this metric.
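One hypothetical way to act on this metric is a rolling-window alert check layered on top of the tracker's timestamps; the threshold below is an assumed budget, not a recommendation:

```python
import time

ALERT_THRESHOLD_PER_MIN = 2.0  # assumed budget — tune per deployment
WINDOW_SECONDS = 300           # matches the tracker's 5-minute window

def interruptions_per_minute(timestamps: list, now: float) -> float:
    """Interruptions per minute over the trailing window."""
    recent = [t for t in timestamps if now - t < WINDOW_SECONDS]
    return len(recent) / (WINDOW_SECONDS / 60)

now = time.time()
events = [now - 10, now - 40, now - 90, now - 600]  # last one falls outside
rate = interruptions_per_minute(events, now)
print(rate)                              # 0.6 per minute
print(rate > ALERT_THRESHOLD_PER_MIN)    # False — under budget
```

Wiring this into whatever metrics pipeline you already run (StatsD, Prometheus, plain logs) gives you an early warning before users start complaining.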

Production Best Practices

  1. Always prefer false negatives over false positives — briefly talking over a genuine interruption is less damaging than cutting off a response because of a cough or a passing noise
  2. Add a minimum speech duration (200-300ms) before triggering barge-in to filter out transient noises
  3. Track what was interrupted so the agent can offer to continue: "I was explaining the refund policy. Would you like me to continue?"
  4. Test with real users early — interruption patterns vary wildly between people, cultures, and contexts
  5. Log every interruption event with timestamps, classification, and user transcript for iterative improvement
  6. Set up alerts on interruption rate spikes — they often indicate a regression in agent behavior or audio quality
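Practice 2 above, the minimum speech duration gate, can be sketched as a small debounce. The class name and `MIN_SPEECH_MS` constant are illustrative; the idea is simply that VAD onset alone should not trigger barge-in:

```python
from typing import Optional

MIN_SPEECH_MS = 250  # within the 200-300 ms range suggested above

class BargeInGate:
    """Debounce VAD onsets so coughs and clicks do not trigger barge-in."""

    def __init__(self):
        self._onset: Optional[float] = None  # timestamp of current speech onset

    def on_speech_start(self, now: float):
        self._onset = now

    def should_trigger(self, now: float) -> bool:
        # Only fire once speech has persisted past the minimum duration.
        if self._onset is None:
            return False
        return (now - self._onset) * 1000 >= MIN_SPEECH_MS

gate = BargeInGate()
t0 = 100.0
gate.on_speech_start(t0)
print(gate.should_trigger(t0 + 0.1))  # False — only 100 ms of speech so far
print(gate.should_trigger(t0 + 0.3))  # True — 300 ms clears the gate
```

In practice you would poll `should_trigger` (or schedule a timer) on each audio frame while the agent is speaking, and only then run the classifier from earlier in the post.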

Handling interruptions well is what separates a demo-grade voice agent from one that users actually want to talk to. The investment in barge-in logic pays off in every single conversation.

