
Building a Voice UI for AI Agents: Microphone Input, Waveform Visualization, and Playback

Implement a voice interface for AI agents using the MediaRecorder API, real-time audio waveform visualization with Canvas, and audio playback controls in React.

Why Voice Interfaces for Agents

Voice interaction removes the typing bottleneck. Users can describe complex problems, provide context, and issue multi-step instructions faster through speech than text. Building a voice UI for an AI agent requires three capabilities: capturing microphone input, visualizing audio in real time, and playing back agent audio responses.

Requesting Microphone Access

Microphone capture through navigator.mediaDevices.getUserMedia requires explicit user permission. Wrap the permission request in a hook that tracks the microphone state.

import { useState, useCallback, useRef } from "react";

type MicStatus = "idle" | "requesting" | "active" | "denied" | "error";

function useMicrophone() {
  const [status, setStatus] = useState<MicStatus>("idle");
  const streamRef = useRef<MediaStream | null>(null);

  const requestAccess = useCallback(async () => {
    setStatus("requesting");
    try {
      const stream = await navigator.mediaDevices.getUserMedia({
        audio: {
          echoCancellation: true,
          noiseSuppression: true,
          sampleRate: 16000,
        },
      });
      streamRef.current = stream;
      setStatus("active");
      return stream;
    } catch (err) {
      const name = (err as DOMException).name;
      setStatus(name === "NotAllowedError" ? "denied" : "error");
      return null;
    }
  }, []);

  const stopMic = useCallback(() => {
    streamRef.current?.getTracks().forEach((t) => t.stop());
    streamRef.current = null;
    setStatus("idle");
  }, []);

  return { status, requestAccess, stopMic, stream: streamRef };
}

The sampleRate: 16000 constraint matters because most speech-to-text APIs expect 16 kHz audio, and requesting it upfront can avoid client-side resampling. Browsers treat the constraint as a hint rather than a guarantee, though, so check what you actually received and plan to resample server-side if needed.
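
A minimal sketch of that check, assuming the hook above has already produced an active stream (the function name is mine, not from the hook):

function logGrantedSampleRate(stream: MediaStream) {
  // Inspect the settings the browser actually applied to the audio track.
  // Safari may not report sampleRate at all, hence the fallback.
  const track = stream.getAudioTracks()[0];
  const settings = track.getSettings();
  console.log("Granted sample rate:", settings.sampleRate ?? "not reported");
}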

Recording Audio with MediaRecorder

The MediaRecorder API captures audio chunks from the microphone stream. Collect chunks in an array and assemble them into a Blob when recording stops.

function useAudioRecorder() {
  const [isRecording, setIsRecording] = useState(false);
  const recorderRef = useRef<MediaRecorder | null>(null);
  const chunksRef = useRef<Blob[]>([]);

  const startRecording = useCallback((stream: MediaStream) => {
    chunksRef.current = [];
    const recorder = new MediaRecorder(stream, {
      mimeType: "audio/webm;codecs=opus",
    });

    recorder.ondataavailable = (e) => {
      if (e.data.size > 0) chunksRef.current.push(e.data);
    };

    recorder.start(250); // Collect data every 250ms
    recorderRef.current = recorder;
    setIsRecording(true);
  }, []);

  const stopRecording = useCallback((): Promise<Blob | null> => {
    return new Promise((resolve) => {
      const recorder = recorderRef.current;
      if (!recorder) {
        // Resolve with null instead of leaving the promise pending forever
        resolve(null);
        return;
      }

      recorder.onstop = () => {
        const blob = new Blob(chunksRef.current, {
          type: "audio/webm",
        });
        recorderRef.current = null;
        resolve(blob);
      };

      recorder.stop();
      setIsRecording(false);
    });
  }, []);

  return { isRecording, startRecording, stopRecording };
}

The 250ms timeslice in recorder.start(250) balances responsiveness against overhead: shorter intervals produce more, smaller chunks, which is what you want for low-latency streaming to the server.
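
One caveat the recorder above glosses over: it hardcodes audio/webm;codecs=opus, which Safari does not support and will throw on. A small feature-detection helper (a sketch; the function name and candidate order are my assumptions):

function pickRecordingMimeType(): string {
  // audio/mp4 covers Safari; an empty string lets the browser pick its default.
  const candidates = ["audio/webm;codecs=opus", "audio/webm", "audio/mp4"];
  for (const type of candidates) {
    if (MediaRecorder.isTypeSupported(type)) return type;
  }
  return "";
}

Pass the result as the mimeType option when constructing the MediaRecorder, and reuse it as the Blob type when assembling the chunks.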

Real-Time Waveform Visualization

A waveform gives visual feedback that audio is being captured. Use an AnalyserNode from the Web Audio API and draw the waveform on a Canvas element.

import { useEffect, useRef } from "react";

function WaveformVisualizer({
  stream,
  isActive,
}: {
  stream: MediaStream | null;
  isActive: boolean;
}) {
  const canvasRef = useRef<HTMLCanvasElement>(null);

  useEffect(() => {
    if (!stream || !isActive || !canvasRef.current) return;

    const audioCtx = new AudioContext();
    const analyser = audioCtx.createAnalyser();
    analyser.fftSize = 256;
    const source = audioCtx.createMediaStreamSource(stream);
    source.connect(analyser);

    const canvas = canvasRef.current;
    const ctx = canvas.getContext("2d")!;
    const bufferLength = analyser.frequencyBinCount;
    const dataArray = new Uint8Array(bufferLength);
    let animId: number;

    function draw() {
      animId = requestAnimationFrame(draw);
      analyser.getByteTimeDomainData(dataArray);

      ctx.fillStyle = "#f9fafb";
      ctx.fillRect(0, 0, canvas.width, canvas.height);
      ctx.lineWidth = 2;
      ctx.strokeStyle = "#3b82f6";
      ctx.beginPath();

      const sliceWidth = canvas.width / bufferLength;
      let x = 0;

      for (let i = 0; i < bufferLength; i++) {
        const v = dataArray[i] / 128.0;
        const y = (v * canvas.height) / 2;
        if (i === 0) ctx.moveTo(x, y);
        else ctx.lineTo(x, y);
        x += sliceWidth;
      }

      ctx.lineTo(canvas.width, canvas.height / 2);
      ctx.stroke();
    }

    draw();

    return () => {
      cancelAnimationFrame(animId);
      source.disconnect();
      audioCtx.close();
    };
  }, [stream, isActive]);

  return (
    <canvas
      ref={canvasRef}
      width={300}
      height={80}
      className="rounded-lg border"
    />
  );
}
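
If a full waveform is more than you need, the same AnalyserNode data reduces to a single loudness value you can drive a simple level meter with. A hedged sketch (computeRmsLevel is my name, not part of the component above):

function computeRmsLevel(analyser: AnalyserNode): number {
  // Byte samples center on 128, so normalize to -1..1 before squaring.
  const data = new Uint8Array(analyser.fftSize);
  analyser.getByteTimeDomainData(data);
  let sumSquares = 0;
  for (let i = 0; i < data.length; i++) {
    const centered = (data[i] - 128) / 128;
    sumSquares += centered * centered;
  }
  // Returns a 0..1 loudness estimate
  return Math.sqrt(sumSquares / data.length);
}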

Audio Playback for Agent Responses

When the agent returns an audio response, create an Audio element and manage playback state.

function useAudioPlayback() {
  const [isPlaying, setIsPlaying] = useState(false);
  const audioRef = useRef<HTMLAudioElement | null>(null);

  const play = useCallback((audioUrl: string) => {
    const audio = new Audio(audioUrl);
    audioRef.current = audio;
    audio.onended = () => setIsPlaying(false);
    // play() returns a promise that rejects when autoplay is blocked
    audio
      .play()
      .then(() => setIsPlaying(true))
      .catch(() => setIsPlaying(false));
  }, []);

  const stop = useCallback(() => {
    audioRef.current?.pause();
    audioRef.current = null;
    setIsPlaying(false);
  }, []);

  return { isPlaying, play, stop };
}
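
The hook above assumes the agent returns a URL to fetch. If your endpoint responds with raw audio bytes instead of a URL (an assumption; the example below returns { audioUrl }), wrap them in an object URL first. A sketch:

async function playAgentAudio(play: (url: string) => void, endpoint: string) {
  // Fetch raw audio bytes and hand useAudioPlayback a playable object URL.
  // Revoke the URL once playback ends to avoid leaking memory.
  const res = await fetch(endpoint);
  const blob = await res.blob();
  const url = URL.createObjectURL(blob);
  play(url);
}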

Putting It All Together

Combine the hooks into a voice interaction component with record, send, and playback controls.

function VoiceAgentUI() {
  const mic = useMicrophone();
  const recorder = useAudioRecorder();
  const playback = useAudioPlayback();

  const handleRecord = async () => {
    const stream = await mic.requestAccess();
    if (stream) recorder.startRecording(stream);
  };

  const handleStop = async () => {
    const blob = await recorder.stopRecording();
    mic.stopMic();
    if (!blob) return; // Nothing was recorded
    // Send blob to your agent API
    const formData = new FormData();
    formData.append("audio", blob, "recording.webm");
    const res = await fetch("/api/agent/voice", {
      method: "POST",
      body: formData,
    });
    const { audioUrl } = await res.json();
    playback.play(audioUrl);
  };

  return (
    <div className="flex flex-col items-center gap-4 p-6">
      <WaveformVisualizer
        stream={mic.stream.current}
        isActive={recorder.isRecording}
      />
      <button
        onClick={recorder.isRecording ? handleStop : handleRecord}
        className={`w-16 h-16 rounded-full ${
          recorder.isRecording ? "bg-red-500" : "bg-blue-600"
        } text-white`}
      >
        {recorder.isRecording ? "Stop" : "Mic"}
      </button>
    </div>
  );
}

FAQ

What audio format should I send to the speech-to-text API?

Most APIs accept audio/webm with the Opus codec, which is what MediaRecorder produces by default in Chrome and Firefox. Safari records audio/mp4 instead, so feature-detect with MediaRecorder.isTypeSupported (see the helper sketch earlier). If your API requires WAV or PCM, use a library like audiobuffer-to-wav to convert the recorded blob before sending.

How do I handle the microphone permission prompt appearing multiple times?

The browser remembers permission grants per origin. If you serve your app over HTTPS, the user only sees the prompt once unless they explicitly revoke it. On localhost during development the prompt may reappear. Check navigator.permissions.query({ name: "microphone" }) to determine the current permission state before calling getUserMedia, keeping in mind that support for the "microphone" permission name varies across browsers.
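
A sketch of that check (the cast is an assumption because some TypeScript lib versions omit "microphone" from PermissionName):

async function getMicPermissionState(): Promise<PermissionState> {
  // Query the current microphone permission without triggering a prompt.
  // Fall back to "prompt" where the Permissions API is unavailable.
  if (!navigator.permissions?.query) return "prompt";
  try {
    const status = await navigator.permissions.query({
      name: "microphone" as PermissionName,
    });
    return status.state; // "granted" | "denied" | "prompt"
  } catch {
    return "prompt";
  }
}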

Can I stream audio to the agent in real-time instead of recording first?

Yes. Use the ondataavailable callback with a short interval (100-250ms) and send each chunk to a WebSocket endpoint as it arrives. This enables real-time speech-to-text and reduces perceived latency because the agent starts processing before the user finishes speaking.
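
A minimal sketch of that pattern, assuming a WebSocket you have already opened to your own endpoint (the server protocol is up to you):

function streamToServer(stream: MediaStream, socket: WebSocket) {
  // Stream recorded chunks to the socket as they arrive instead of
  // buffering the whole recording in memory.
  const recorder = new MediaRecorder(stream, {
    mimeType: "audio/webm;codecs=opus",
  });
  recorder.ondataavailable = (e) => {
    if (e.data.size > 0 && socket.readyState === WebSocket.OPEN) {
      socket.send(e.data); // Each Blob chunk becomes one binary frame
    }
  };
  recorder.start(250); // Emit a chunk roughly every 250ms
  return recorder; // Caller stops it with recorder.stop()
}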


