
Building a Voice UI for AI Agents: Microphone Input, Waveform Visualization, and Playback

Implement a voice interface for AI agents using the MediaRecorder API, real-time audio waveform visualization with Canvas, and audio playback controls in React.

Why Voice Interfaces for Agents

Voice interaction removes the typing bottleneck. Users can describe complex problems, provide context, and issue multi-step instructions faster through speech than text. Building a voice UI for an AI agent requires three capabilities: capturing microphone input, visualizing audio in real time, and playing back agent audio responses.

Requesting Microphone Access

Microphone capture through navigator.mediaDevices.getUserMedia requires explicit user permission. Wrap the permission request in a hook that tracks the microphone state.

import { useState, useCallback, useRef } from "react";

type MicStatus = "idle" | "requesting" | "active" | "denied" | "error";

function useMicrophone() {
  const [status, setStatus] = useState<MicStatus>("idle");
  const streamRef = useRef<MediaStream | null>(null);

  const requestAccess = useCallback(async () => {
    setStatus("requesting");
    try {
      const stream = await navigator.mediaDevices.getUserMedia({
        audio: {
          echoCancellation: true,
          noiseSuppression: true,
          sampleRate: 16000,
        },
      });
      streamRef.current = stream;
      setStatus("active");
      return stream;
    } catch (err) {
      const name = (err as DOMException).name;
      setStatus(name === "NotAllowedError" ? "denied" : "error");
      return null;
    }
  }, []);

  const stopMic = useCallback(() => {
    streamRef.current?.getTracks().forEach((t) => t.stop());
    streamRef.current = null;
    setStatus("idle");
  }, []);

  return { status, requestAccess, stopMic, stream: streamRef };
}

The sampleRate: 16000 constraint matters because most speech-to-text APIs expect 16 kHz audio, and requesting it upfront can avoid client-side resampling. Browsers treat the constraint as a hint rather than a guarantee, though, so check what you actually received and plan to resample server-side if needed.
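
A minimal sketch of that check, assuming the hook above has already produced an active stream (the function name is mine, not from the hook):

function logGrantedSampleRate(stream: MediaStream) {
  // Inspect the settings the browser actually applied to the audio track.
  // Safari may not report sampleRate at all, hence the fallback.
  const track = stream.getAudioTracks()[0];
  const settings = track.getSettings();
  console.log("Granted sample rate:", settings.sampleRate ?? "not reported");
}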

Recording Audio with MediaRecorder

The MediaRecorder API captures audio chunks from the microphone stream. Collect chunks in an array and assemble them into a Blob when recording stops.

function useAudioRecorder() {
  const [isRecording, setIsRecording] = useState(false);
  const recorderRef = useRef<MediaRecorder | null>(null);
  const chunksRef = useRef<Blob[]>([]);

  const startRecording = useCallback((stream: MediaStream) => {
    chunksRef.current = [];
    const recorder = new MediaRecorder(stream, {
      mimeType: "audio/webm;codecs=opus",
    });

    recorder.ondataavailable = (e) => {
      if (e.data.size > 0) chunksRef.current.push(e.data);
    };

    recorder.start(250); // Collect data every 250ms
    recorderRef.current = recorder;
    setIsRecording(true);
  }, []);

  const stopRecording = useCallback((): Promise<Blob | null> => {
    return new Promise((resolve) => {
      const recorder = recorderRef.current;
      if (!recorder) {
        // Resolve with null instead of leaving the promise pending forever
        resolve(null);
        return;
      }

      recorder.onstop = () => {
        const blob = new Blob(chunksRef.current, {
          type: "audio/webm",
        });
        recorderRef.current = null;
        resolve(blob);
      };

      recorder.stop();
      setIsRecording(false);
    });
  }, []);

  return { isRecording, startRecording, stopRecording };
}

The 250ms timeslice in recorder.start(250) balances responsiveness against overhead: shorter intervals produce more, smaller chunks, which is what you want for low-latency streaming to the server.
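
One caveat the recorder above glosses over: it hardcodes audio/webm;codecs=opus, which Safari does not support and will throw on. A small feature-detection helper (a sketch; the function name and candidate order are my assumptions):

function pickRecordingMimeType(): string {
  // audio/mp4 covers Safari; an empty string lets the browser pick its default.
  const candidates = ["audio/webm;codecs=opus", "audio/webm", "audio/mp4"];
  for (const type of candidates) {
    if (MediaRecorder.isTypeSupported(type)) return type;
  }
  return "";
}

Pass the result as the mimeType option when constructing the MediaRecorder, and reuse it as the Blob type when assembling the chunks.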

Real-Time Waveform Visualization

A waveform gives visual feedback that audio is being captured. Use an AnalyserNode from the Web Audio API and draw the waveform on a Canvas element.

import { useEffect, useRef } from "react";

function WaveformVisualizer({
  stream,
  isActive,
}: {
  stream: MediaStream | null;
  isActive: boolean;
}) {
  const canvasRef = useRef<HTMLCanvasElement>(null);

  useEffect(() => {
    if (!stream || !isActive || !canvasRef.current) return;

    const audioCtx = new AudioContext();
    const analyser = audioCtx.createAnalyser();
    analyser.fftSize = 256;
    const source = audioCtx.createMediaStreamSource(stream);
    source.connect(analyser);

    const canvas = canvasRef.current;
    const ctx = canvas.getContext("2d")!;
    const bufferLength = analyser.frequencyBinCount;
    const dataArray = new Uint8Array(bufferLength);
    let animId: number;

    function draw() {
      animId = requestAnimationFrame(draw);
      analyser.getByteTimeDomainData(dataArray);

      ctx.fillStyle = "#f9fafb";
      ctx.fillRect(0, 0, canvas.width, canvas.height);
      ctx.lineWidth = 2;
      ctx.strokeStyle = "#3b82f6";
      ctx.beginPath();

      const sliceWidth = canvas.width / bufferLength;
      let x = 0;

      for (let i = 0; i < bufferLength; i++) {
        const v = dataArray[i] / 128.0;
        const y = (v * canvas.height) / 2;
        if (i === 0) ctx.moveTo(x, y);
        else ctx.lineTo(x, y);
        x += sliceWidth;
      }

      ctx.lineTo(canvas.width, canvas.height / 2);
      ctx.stroke();
    }

    draw();

    return () => {
      cancelAnimationFrame(animId);
      source.disconnect();
      audioCtx.close();
    };
  }, [stream, isActive]);

  return (
    <canvas
      ref={canvasRef}
      width={300}
      height={80}
      className="rounded-lg border"
    />
  );
}
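
If a full waveform is more than you need, the same AnalyserNode data reduces to a single loudness value you can drive a simple level meter with. A hedged sketch (computeRmsLevel is my name, not part of the component above):

function computeRmsLevel(analyser: AnalyserNode): number {
  // Byte samples center on 128, so normalize to -1..1 before squaring.
  const data = new Uint8Array(analyser.fftSize);
  analyser.getByteTimeDomainData(data);
  let sumSquares = 0;
  for (let i = 0; i < data.length; i++) {
    const centered = (data[i] - 128) / 128;
    sumSquares += centered * centered;
  }
  // Returns a 0..1 loudness estimate
  return Math.sqrt(sumSquares / data.length);
}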

Audio Playback for Agent Responses

When the agent returns an audio response, create an Audio element and manage playback state.

function useAudioPlayback() {
  const [isPlaying, setIsPlaying] = useState(false);
  const audioRef = useRef<HTMLAudioElement | null>(null);

  const play = useCallback((audioUrl: string) => {
    const audio = new Audio(audioUrl);
    audioRef.current = audio;
    audio.onended = () => setIsPlaying(false);
    // play() returns a promise that rejects when autoplay is blocked
    audio
      .play()
      .then(() => setIsPlaying(true))
      .catch(() => setIsPlaying(false));
  }, []);

  const stop = useCallback(() => {
    audioRef.current?.pause();
    audioRef.current = null;
    setIsPlaying(false);
  }, []);

  return { isPlaying, play, stop };
}
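
The hook above assumes the agent returns a URL to fetch. If your endpoint responds with raw audio bytes instead of a URL (an assumption; the example below returns { audioUrl }), wrap them in an object URL first. A sketch:

async function playAgentAudio(play: (url: string) => void, endpoint: string) {
  // Fetch raw audio bytes and hand useAudioPlayback a playable object URL.
  // Revoke the URL once playback ends to avoid leaking memory.
  const res = await fetch(endpoint);
  const blob = await res.blob();
  const url = URL.createObjectURL(blob);
  play(url);
}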

Putting It All Together

Combine the hooks into a voice interaction component with record, send, and playback controls.

function VoiceAgentUI() {
  const mic = useMicrophone();
  const recorder = useAudioRecorder();
  const playback = useAudioPlayback();

  const handleRecord = async () => {
    const stream = await mic.requestAccess();
    if (stream) recorder.startRecording(stream);
  };

  const handleStop = async () => {
    const blob = await recorder.stopRecording();
    mic.stopMic();
    if (!blob) return; // Nothing was recorded
    // Send blob to your agent API
    const formData = new FormData();
    formData.append("audio", blob, "recording.webm");
    const res = await fetch("/api/agent/voice", {
      method: "POST",
      body: formData,
    });
    const { audioUrl } = await res.json();
    playback.play(audioUrl);
  };

  return (
    <div className="flex flex-col items-center gap-4 p-6">
      <WaveformVisualizer
        stream={mic.stream.current}
        isActive={recorder.isRecording}
      />
      <button
        onClick={recorder.isRecording ? handleStop : handleRecord}
        className={`w-16 h-16 rounded-full ${
          recorder.isRecording ? "bg-red-500" : "bg-blue-600"
        } text-white`}
      >
        {recorder.isRecording ? "Stop" : "Mic"}
      </button>
    </div>
  );
}

FAQ

What audio format should I send to the speech-to-text API?

Most APIs accept audio/webm with the Opus codec, which is what MediaRecorder produces by default in Chrome and Firefox. Safari records audio/mp4 instead, so feature-detect with MediaRecorder.isTypeSupported (see the helper sketch earlier). If your API requires WAV or PCM, use a library like audiobuffer-to-wav to convert the recorded blob before sending.

How do I handle the microphone permission prompt appearing multiple times?

The browser remembers permission grants per origin. If you serve your app over HTTPS, the user only sees the prompt once unless they explicitly revoke it. On localhost during development the prompt may reappear. Check navigator.permissions.query({ name: "microphone" }) to determine the current permission state before calling getUserMedia, keeping in mind that support for the "microphone" permission name varies across browsers.
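
A sketch of that check (the cast is an assumption because some TypeScript lib versions omit "microphone" from PermissionName):

async function getMicPermissionState(): Promise<PermissionState> {
  // Query the current microphone permission without triggering a prompt.
  // Fall back to "prompt" where the Permissions API is unavailable.
  if (!navigator.permissions?.query) return "prompt";
  try {
    const status = await navigator.permissions.query({
      name: "microphone" as PermissionName,
    });
    return status.state; // "granted" | "denied" | "prompt"
  } catch {
    return "prompt";
  }
}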

Can I stream audio to the agent in real-time instead of recording first?

Yes. Use the ondataavailable callback with a short interval (100-250ms) and send each chunk to a WebSocket endpoint as it arrives. This enables real-time speech-to-text and reduces perceived latency because the agent starts processing before the user finishes speaking.
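
A minimal sketch of that pattern, assuming a WebSocket you have already opened to your own endpoint (the server protocol is up to you):

function streamToServer(stream: MediaStream, socket: WebSocket) {
  // Stream recorded chunks to the socket as they arrive instead of
  // buffering the whole recording in memory.
  const recorder = new MediaRecorder(stream, {
    mimeType: "audio/webm;codecs=opus",
  });
  recorder.ondataavailable = (e) => {
    if (e.data.size > 0 && socket.readyState === WebSocket.OPEN) {
      socket.send(e.data); // Each Blob chunk becomes one binary frame
    }
  };
  recorder.start(250); // Emit a chunk roughly every 250ms
  return recorder; // Caller stops it with recorder.stop()
}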


