Build a Voice Agent with Hume EVI 3: Emotionally Intelligent Voice (2026)
Hume EVI 3 is one model for STT+LLM+TTS with prosody-aware reactions. Build a customizable speech-to-speech agent — TypeScript code, voice prompting, pitfalls.
TL;DR — Hume EVI 3 is a single speech-language model that handles transcription, language, and speech in one shot — and it tracks the user's vocal emotion in real time. You can describe ANY voice in a prompt ("a warm 40-year-old British woman"), point it at Claude or Gemini, and get sub-300ms emotionally aware replies.
What you'll build
A Next.js app using Hume's TypeScript SDK to open an EVI 3 WebSocket session, render the live emotion meter, and let users design a voice via plain-English prompt — all under 250 lines.
Architecture
flowchart LR
MIC[Browser mic] -- WS audio --> EV[Hume EVI 3]
EV -- prosody + transcript --> APP[Your client]
EV -- voice audio --> APP --> SP[Speakers]
EV -- llm_call --> CLD[Claude 4 / Gemini 2.5]
Step 1 — Install
```bash
npm i hume @humeai/voice-react

# server-side only:
npm i hume jsonwebtoken
```
Step 2 — Mint an access token (server)
```ts
// app/api/hume-token/route.ts
import { fetchAccessToken } from "@humeai/voice";

export async function GET() {
  const accessToken = await fetchAccessToken({
    apiKey: process.env.HUME_API_KEY!,
    secretKey: process.env.HUME_SECRET_KEY!,
  });
  return Response.json({ accessToken });
}
```
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Step 3 — Configure the EVI
In platform.hume.ai → EVI → Configs, create a config with:
- Model: `evi-3`
- Voice description (prompt): "A warm, calm 35-year-old American woman who sounds like a kind nurse."
- LLM: `anthropic/claude-3-5-sonnet` (or `google/gemini-2.5-flash`)
- System prompt: "You are Ava, a clinic concierge. Adapt tone to the caller's emotion."
- Tools: optional (function calls work like OpenAI's)

Copy the resulting `configId`.
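The dashboard fields above boil down to a small payload. A minimal sketch of that shape — the field names here are illustrative assumptions, not the exact Hume API schema; the dashboard handles serialization for you:

```typescript
// Illustrative shape of an EVI config. Field names are assumptions
// for explanation, not the exact Hume API schema.
interface EviConfig {
  model: string;            // which EVI version to run
  voiceDescription: string; // plain-English voice prompt
  llm: string;              // provider/model slug for the language step
  systemPrompt: string;     // persona + behavioral instructions
}

const avaConfig: EviConfig = {
  model: "evi-3",
  voiceDescription:
    "A warm, calm 35-year-old American woman who sounds like a kind nurse.",
  llm: "anthropic/claude-3-5-sonnet",
  systemPrompt:
    "You are Ava, a clinic concierge. Adapt tone to the caller's emotion.",
};

console.log(JSON.stringify(avaConfig, null, 2));
```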
Step 4 — Client provider
```tsx
"use client";
import { useEffect, useState } from "react";
import { VoiceProvider, useVoice } from "@humeai/voice-react";

export default function Page() {
  const [token, setToken] = useState<string | null>(null);

  // Fetch a short-lived access token from our server route
  useEffect(() => {
    fetch("/api/hume-token").then((r) => r.json()).then((j) => setToken(j.accessToken));
  }, []);

  if (!token) return null;
  return (
    <VoiceProvider auth={{ type: "accessToken", value: token }}
      configId={process.env.NEXT_PUBLIC_HUME_CONFIG_ID!}>
      <Concierge />
    </VoiceProvider>
  );
}
```
Step 5 — Render the emotion meter
```tsx
function Concierge() {
  const { connect, disconnect, status, messages } = useVoice();
  const last = messages[messages.length - 1];

  // Top 3 prosody (vocal-emotion) scores from the latest message
  const top3 = last?.models?.prosody?.scores
    ? Object.entries(last.models.prosody.scores)
        .sort((a, b) => (b[1] as number) - (a[1] as number))
        .slice(0, 3)
    : [];

  return (
    <>
      <button onClick={() => (status.value === "connected" ? disconnect() : connect())}>
        {status.value === "connected" ? "Hang up" : "Talk"}
      </button>
      <ul>
        {top3.map(([k, v]) => (
          <li key={k}>{k}: {(v as number).toFixed(2)}</li>
        ))}
      </ul>
    </>
  );
}
```
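The sort-and-slice can be pulled into a small pure helper. A sketch, assuming `scores` is the flat name-to-number map EVI's prosody model emits:

```typescript
// Return the top-n emotion labels from a prosody score map,
// e.g. { Joy: 0.8, Calmness: 0.6, Anger: 0.1 }, sorted descending.
function topEmotions(
  scores: Record<string, number>,
  n = 3
): [string, number][] {
  return Object.entries(scores)
    .sort((a, b) => b[1] - a[1])
    .slice(0, n);
}

const top = topEmotions({ Joy: 0.82, Calmness: 0.61, Anger: 0.07, Awe: 0.3 });
console.log(top); // → [["Joy", 0.82], ["Calmness", 0.61], ["Awe", 0.3]]
```

Keeping the ranking pure makes it trivial to unit-test away from the WebSocket session.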
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Step 6 — Design a voice in code
```ts
// node script
import { HumeClient } from "hume";

const hume = new HumeClient({ apiKey: process.env.HUME_API_KEY! });

const voice = await hume.empathicVoice.customVoices.create({
  name: "Sunrise Ava",
  baseVoice: "ITO",
  parameterModel: "20240715-4parameter",
  parameters: { gender: 2, assertiveness: -1, buoyancy: 1, confidence: 0 },
});

console.log(voice.id);
```
Step 7 — Hook tool calls
EVI 3 tool events arrive as `{ type: "tool_call", name, parameters, tool_call_id }` — handle them in `onMessage` and reply with `{ type: "tool_response", tool_call_id, content }`.
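A minimal handler sketch for that round trip. The message shapes follow the event format above; `lookupAppointment` and its registry are hypothetical, and real error handling is left out:

```typescript
// Shapes from the EVI tool-call event format described above.
interface ToolCall {
  type: "tool_call";
  name: string;
  parameters: string; // JSON-encoded arguments
  tool_call_id: string;
}
interface ToolResponse {
  type: "tool_response";
  tool_call_id: string;
  content: string;
}

// Map tool names to implementations; lookupAppointment is hypothetical.
const tools: Record<string, (args: any) => string> = {
  lookupAppointment: ({ patientId }) => `Next visit for ${patientId}: Tue 10am`,
};

// Dispatch a tool_call and build the matching tool_response.
function handleToolCall(msg: ToolCall): ToolResponse {
  const impl = tools[msg.name];
  const content = impl
    ? impl(JSON.parse(msg.parameters))
    : `Unknown tool: ${msg.name}`;
  return { type: "tool_response", tool_call_id: msg.tool_call_id, content };
}
```

Echoing back the same `tool_call_id` is what lets the model pair the result with its request.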
Pitfalls
- WebSocket only: no HTTP REST surface for EVI; budget your reconnect logic.
- Voice description quality: vague prompts ("nice voice") yield generic output — be specific (age, accent, energy).
- Latency vs realism: `evi-3` is ~280ms p50; switching to `evi-3-fast` drops to ~180ms with slightly less expressive prosody.
- Multi-language: excellent on EN; for 60+ languages pair EVI 3 STT with Soniox or Universal-3.
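Since EVI is WebSocket-only, reconnects are on you. A minimal exponential-backoff sketch — the delays, cap, and attempt count here are arbitrary choices, not Hume recommendations:

```typescript
// Exponential backoff with a cap: 500ms, 1s, 2s, 4s, ... up to 10s.
function backoffDelay(attempt: number, baseMs = 500, capMs = 10_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Retry an async connect function, sleeping between attempts.
async function connectWithRetry(
  open: () => Promise<void>, // e.g. your EVI socket connect()
  maxAttempts = 5
): Promise<void> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await open();
    } catch {
      await new Promise((r) => setTimeout(r, backoffDelay(attempt)));
    }
  }
  throw new Error("EVI reconnect failed after max attempts");
}
```

Adding jitter to `backoffDelay` is worthwhile in production so many clients don't reconnect in lockstep.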
How CallSphere does this
CallSphere uses EVI 3 in the Behavioral Health vertical where emotional adaptation is core to UX — running across 37 agents · 90+ tools · 115+ DB tables · 6 verticals. $149/$499/$1,499 · 14-day trial · 22% affiliate.
FAQ
Cost? Per-minute pricing on EVI 3 is comparable to GPT-4o Realtime — ~$0.18/min combined.
Custom LLM? Yes — point the config at OpenAI / Anthropic / Google / Mistral via the dashboard.
Voice cloning? With 30 seconds of audio, EVI 3 captures timbre, rhythm, and tone.
Phone calls? Twilio Media Streams bridge ships in the docs — wire WS-to-WS and you have PSTN.
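The gist of the Twilio bridge is relaying Media Streams frames into the EVI socket. A sketch of decoding one inbound frame — the `media` event shape is Twilio's (base64 mu-law under `media.payload`); the EVI side of the relay is omitted:

```typescript
// One frame from a Twilio Media Streams WebSocket: base64-encoded
// 8kHz mu-law audio under media.payload.
interface TwilioMediaEvent {
  event: "media";
  streamSid: string;
  media: { payload: string };
}

// Decode a raw WS message; returns audio bytes for media frames,
// or null for non-media events (start/stop/mark).
function extractAudio(raw: string): Buffer | null {
  const msg = JSON.parse(raw);
  if (msg.event !== "media") return null;
  return Buffer.from(msg.media.payload, "base64");
}
```

From there you forward the bytes to the EVI socket (transcoding mu-law if your config expects linear PCM) and do the reverse for EVI's audio output.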
Sources
- Hume Blog - Introducing EVI 3 - https://www.hume.ai/blog/introducing-evi-3
- Hume Blog - Announcing EVI 3 API - https://www.hume.ai/blog/announcing-evi-3-api
- Hume API Docs - Speech-to-Speech (EVI) - https://dev.hume.ai/docs/speech-to-speech-evi/overview
- Vercel Template - Hume Empathic Voice Starter - https://vercel.com/templates/next.js/empathic-voice-interface-starter
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.