
Audio Preprocessing for Voice Agents: Noise Reduction, Echo Cancellation, and Normalization

Build a complete audio preprocessing pipeline for voice AI agents — covering noise reduction, echo cancellation, gain normalization, and both client-side Web Audio API and server-side Python implementations.

Why Preprocessing Matters

Raw microphone audio is messy. It contains background noise (fans, traffic, other conversations), echo from the agent's own speech playing through speakers, volume inconsistencies (some users speak quietly, others shout), and room reverberation. Feeding raw audio directly to your STT engine degrades transcription accuracy and produces unreliable results.

A well-designed preprocessing pipeline cleans the audio before it reaches the STT engine, dramatically improving word accuracy and reducing hallucinated transcriptions. The goal is to deliver clean, normalized speech at a consistent volume level.

Client-Side Preprocessing with Web Audio API

The browser's Web Audio API lets you process audio in real time before sending it to the server. This reduces bandwidth and offloads processing from your backend.

class AudioPreprocessor {
  constructor() {
    this.audioContext = null;
    this.sourceNode = null;
  }

  async init(stream) {
    this.audioContext = new AudioContext({ sampleRate: 16000 });
    this.sourceNode = this.audioContext.createMediaStreamSource(stream);

    // High-pass filter to remove low-frequency rumble (below 80Hz)
    const highPass = this.audioContext.createBiquadFilter();
    highPass.type = 'highpass';
    highPass.frequency.value = 80;
    highPass.Q.value = 0.7;

    // Low-pass filter to remove high-frequency hiss (above 8kHz)
    const lowPass = this.audioContext.createBiquadFilter();
    lowPass.type = 'lowpass';
    lowPass.frequency.value = 8000;
    lowPass.Q.value = 0.7;

    // Compressor for volume normalization
    const compressor = this.audioContext.createDynamicsCompressor();
    compressor.threshold.value = -30;   // Start compressing at -30dB
    compressor.knee.value = 10;
    compressor.ratio.value = 4;         // 4:1 compression ratio
    compressor.attack.value = 0.005;    // 5ms attack
    compressor.release.value = 0.1;     // 100ms release

    // Gain to boost after compression
    const gainNode = this.audioContext.createGain();
    gainNode.gain.value = 1.5;

    // Connect the chain
    this.sourceNode
      .connect(highPass)
      .connect(lowPass)
      .connect(compressor)
      .connect(gainNode);

    return gainNode;
  }

  getProcessedStream(gainNode) {
    const destination = this.audioContext.createMediaStreamDestination();
    gainNode.connect(destination);
    return destination.stream;
  }
}

// Usage
const rawStream = await navigator.mediaDevices.getUserMedia({ audio: true });
const preprocessor = new AudioPreprocessor();
const outputNode = await preprocessor.init(rawStream);
const cleanStream = preprocessor.getProcessedStream(outputNode);
// Use cleanStream for WebRTC or recording

AudioWorklet for Advanced Processing

For more sophisticated processing, use an AudioWorklet, which runs on a separate audio rendering thread so it does not block the main UI. The example below implements a simplified magnitude-based noise suppressor; production-grade spectral noise reduction operates on FFT frames rather than raw time-domain samples.

// noise-suppressor-worklet.js
class NoiseSuppressorProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    this.noiseFloor = new Float32Array(128).fill(0.001);
    this.alpha = 0.98;  // Smoothing factor for noise estimation
  }

  process(inputs, outputs) {
    const input = inputs[0][0];
    const output = outputs[0][0];

    if (!input) return true;

    for (let i = 0; i < input.length; i++) {
      const magnitude = Math.abs(input[i]);

      // Update noise floor estimate during silence
      if (magnitude < this.noiseFloor[i % 128] * 3) {
        this.noiseFloor[i % 128] =
          this.alpha * this.noiseFloor[i % 128] +
          (1 - this.alpha) * magnitude;
      }

      // Magnitude subtraction: a time-domain stand-in for spectral subtraction
      const noiseEst = this.noiseFloor[i % 128] * 2;
      if (magnitude > noiseEst) {
        output[i] = input[i] * (1 - noiseEst / magnitude);
      } else {
        output[i] = input[i] * 0.05;  // Soft gate, don't zero out
      }
    }

    return true;
  }
}

registerProcessor('noise-suppressor', NoiseSuppressorProcessor);

Register and use the worklet in your main code:

await audioContext.audioWorklet.addModule('noise-suppressor-worklet.js');
const suppressorNode = new AudioWorkletNode(audioContext, 'noise-suppressor');

// Insert into the processing chain
sourceNode.connect(suppressorNode).connect(compressor);

Server-Side Preprocessing with Python

When you need more powerful noise reduction than what the browser can provide, process audio on the server using libraries like noisereduce and scipy.

import numpy as np
import noisereduce as nr
from scipy.signal import butter, sosfilt
from scipy.io import wavfile

class ServerAudioPreprocessor:
    def __init__(self, sample_rate: int = 16000):
        self.sample_rate = sample_rate
        self.target_rms = 0.1  # Target RMS for normalization

    def preprocess(self, audio: np.ndarray) -> np.ndarray:
        """Full preprocessing pipeline."""
        audio = audio.astype(np.float32)
        if np.abs(audio).max() > 1.0:
            audio = audio / 32768.0  # Convert int16 range to [-1.0, 1.0]

        audio = self._bandpass_filter(audio, low=80, high=8000)
        audio = self._reduce_noise(audio)
        audio = self._normalize(audio)
        audio = self._trim_silence(audio)

        return audio

    def _bandpass_filter(
        self, audio: np.ndarray, low: int, high: int
    ) -> np.ndarray:
        # Clamp the upper edge below Nyquist; butter() raises if high >= fs/2
        high = min(high, int(0.99 * self.sample_rate / 2))
        sos = butter(
            5, [low, high], btype='band',
            fs=self.sample_rate, output='sos',
        )
        return sosfilt(sos, audio)

    def _reduce_noise(self, audio: np.ndarray) -> np.ndarray:
        return nr.reduce_noise(
            y=audio,
            sr=self.sample_rate,
            stationary=False,   # Non-stationary noise (better for real-world)
            prop_decrease=0.8,  # Reduce noise by 80%
            n_fft=512,
            hop_length=128,
        )

    def _normalize(self, audio: np.ndarray) -> np.ndarray:
        rms = np.sqrt(np.mean(audio ** 2))
        if rms > 0:
            audio = audio * (self.target_rms / rms)
        return np.clip(audio, -1.0, 1.0)

    def _trim_silence(
        self, audio: np.ndarray, threshold: float = 0.01
    ) -> np.ndarray:
        mask = np.abs(audio) > threshold
        if not mask.any():
            return audio
        first = mask.argmax()
        last = len(mask) - mask[::-1].argmax()
        # Keep small padding
        pad = int(0.05 * self.sample_rate)
        return audio[max(0, first - pad):min(len(audio), last + pad)]

# Usage
preprocessor = ServerAudioPreprocessor(sample_rate=16000)
sample_rate, raw_audio = wavfile.read("recording.wav")
clean_audio = preprocessor.preprocess(raw_audio)
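
The usage above assumes the file is already at 16 kHz. wavfile.read returns the file's actual rate, so resample first when it differs. A minimal sketch using scipy's resample_poly:

from math import gcd
from scipy.signal import resample_poly

target_rate = 16000
if sample_rate != target_rate:
    g = gcd(target_rate, sample_rate)
    # Polyphase resampling: upsample by target/g, then downsample by source/g
    raw_audio = resample_poly(raw_audio, target_rate // g, sample_rate // g)
clean_audio = preprocessor.preprocess(raw_audio)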

Echo Cancellation

Echo cancellation removes the agent's own voice from the user's microphone input. The browser handles this when you request echoCancellation: true in your getUserMedia audio constraints (it is enabled by default in most browsers). For server-side echo cancellation, you need the agent's output audio as a reference signal.

import numpy as np

class SimpleAEC:
    """Simplified Acoustic Echo Cancellation using cross-correlation."""

    def __init__(self, filter_length: int = 4096):
        self.filter_length = filter_length
        self.filter_coeffs = np.zeros(filter_length)
        self.mu = 0.01  # Learning rate

    def cancel_echo(
        self, mic_signal: np.ndarray, ref_signal: np.ndarray
    ) -> np.ndarray:
        """Remove echo of ref_signal from mic_signal."""
        n = len(mic_signal)
        output = np.zeros(n)

        for i in range(self.filter_length, n):
            ref_chunk = ref_signal[i - self.filter_length:i][::-1]
            echo_estimate = np.dot(self.filter_coeffs, ref_chunk)
            error = mic_signal[i] - echo_estimate
            output[i] = error

            # Adaptive filter update (NLMS)
            power = np.dot(ref_chunk, ref_chunk) + 1e-10
            self.filter_coeffs += self.mu * error * ref_chunk / power

        return output
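
# Usage (sketch): filenames are hypothetical; both signals are assumed
# time-aligned and at the same sample rate
from scipy.io import wavfile

aec = SimpleAEC(filter_length=1024)
_, mic = wavfile.read("mic_input.wav")
_, ref = wavfile.read("agent_output.wav")
mic = mic.astype(np.float32) / 32768.0
ref = ref.astype(np.float32) / 32768.0
n = min(len(mic), len(ref))
clean_mic = aec.cancel_echo(mic[:n], ref[:n])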

In practice, WebRTC's built-in AEC is far more sophisticated and handles non-linear echo, double-talk, and dynamic room conditions. Use it whenever possible.


FAQ

Should I preprocess audio on the client or the server?

Do both. Client-side preprocessing (filtering, compression, gain) reduces bandwidth and gives the server cleaner input. Server-side preprocessing (noise reduction, echo cancellation) handles the heavy lifting. This layered approach is standard in production voice systems. The browser's built-in audio constraints (echoCancellation, noiseSuppression, autoGainControl) provide a solid baseline that handles 80% of cases.

Does preprocessing degrade STT accuracy?

Aggressive preprocessing can remove speech content along with noise, particularly overly aggressive noise reduction or narrow bandpass filters. The key is to tune your preprocessing parameters on representative audio samples and measure the STT word error rate before and after. In most cases, well-tuned preprocessing improves STT accuracy by 10-30% compared to raw audio.
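
To quantify this on your own data, compare word error rate with and without preprocessing. A minimal sketch, assuming the jiwer package (pip install jiwer) and a hypothetical transcribe() wrapper around your STT engine:

from jiwer import wer

reference = "please schedule my appointment for tuesday"
raw_hypothesis = transcribe(raw_audio)      # hypothetical STT call
clean_hypothesis = transcribe(clean_audio)  # same call on preprocessed audio

print(f"WER raw:          {wer(reference, raw_hypothesis):.2%}")
print(f"WER preprocessed: {wer(reference, clean_hypothesis):.2%}")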

How do I handle audio from different microphone types?

Different microphones (laptop built-in, USB headset, phone) have vastly different frequency responses and sensitivity levels. Normalization is the key — apply automatic gain control to bring all inputs to a consistent RMS level. The compressor in the Web Audio API chain handles this well. Additionally, the bandpass filter removes frequencies that are outside the speech range regardless of microphone type.
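
For server-side streams, the same idea applies per frame. A minimal sketch of a smoothed RMS-based AGC; the target level, smoothing constant, and gain cap are illustrative values:

import numpy as np

def agc_frame(frame: np.ndarray, state: dict,
              target_rms: float = 0.1, smooth: float = 0.9) -> np.ndarray:
    """Scale one audio frame toward a target RMS using a smoothed estimate."""
    rms = np.sqrt(np.mean(frame ** 2)) + 1e-8
    state["rms"] = smooth * state.get("rms", rms) + (1 - smooth) * rms
    gain = min(target_rms / state["rms"], 10.0)  # cap gain so silence isn't boosted into noise
    return np.clip(frame * gain, -1.0, 1.0)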


#AudioPreprocessing #NoiseReduction #EchoCancellation #WebAudioAPI #VoiceAI #SignalProcessing #AgenticAI #LearnAI #AIEngineering

