Build a Voice Agent with Coqui TTS XTTS-v2 (Voice Cloning, Local)
XTTS-v2 clones a voice from 6 seconds of audio and speaks 17 languages. Here's how to wire it into a real voice agent with faster-whisper STT and a local LLM — no API keys.
TL;DR — XTTS-v2 is the open voice-cloning model worth running. The original Coqui org wound down, but coqui-tts (a community fork) is at 0.28 with prebuilt wheels for macOS and Windows. Six seconds of clean audio gives you a usable clone.
What you'll build
A voice agent that answers in your voice. Mic in → faster-whisper → Ollama → XTTS-v2 (cloning your reference clip) → speaker out. Useful for accessibility, language tutoring, and on-brand IVR demos.
Prerequisites
- Python 3.11 (the pinned XTTS wheels do not yet build cleanly on 3.13).
- pip install coqui-tts faster-whisper sounddevice numpy ollama.
- NVIDIA GPU with 6 GB+ VRAM strongly recommended (XTTS on CPU runs roughly 12x slower than realtime).
- A 6–15 second WAV of the voice you want to clone (clean, mono, 22050 Hz, no music).
- Ollama running with a small model (ollama pull llama3.2:3b).
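XTTS is picky about the reference clip, so it's worth verifying it before the first synthesis run. A minimal stdlib sanity check — a hypothetical helper, not part of coqui-tts — against the spec above:

```python
import wave

def check_reference(path, want_rate=22050, min_s=6.0, max_s=15.0):
    """Return a list of problems with a reference WAV (empty list = looks OK)."""
    problems = []
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1:
            problems.append("not mono")
        if w.getframerate() != want_rate:
            problems.append(f"sample rate {w.getframerate()}, want {want_rate}")
        duration = w.getnframes() / w.getframerate()
        if not min_s <= duration <= max_s:
            problems.append(f"duration {duration:.1f}s outside {min_s}-{max_s}s")
    return problems
```

If it flags the rate or channel count, a single pass through ffmpeg (ffmpeg -i in.wav -ac 1 -ar 22050 out.wav) fixes both.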
Architecture
```mermaid
flowchart LR
    MIC[Microphone] --> STT[faster-whisper]
    STT --> LLM[Ollama llama3.2:3b]
    LLM --> XTTS[XTTS-v2 + speaker.wav]
    XTTS --> SPK[Speaker]
```
Step 1 — Install the maintained fork
```bash
python3.11 -m venv .venv && source .venv/bin/activate
pip install -U coqui-tts torch torchaudio
```
Avoid the abandoned TTS package on PyPI — it pins old transformers and numpy versions that conflict with everything in 2026.
Step 2 — Verify the clone with a 30-second test
```python
import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
tts.tts_to_file(
    text="Hello, this is the cloned voice running fully locally.",
    speaker_wav="my_voice_6s.wav",
    language="en",
    file_path="clone_test.wav",
)
```
If clone_test.wav sounds recognisable as you (not a generic narrator), the clone took.
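Before trusting your ears, you can sanity-check the file objectively. This wav_stats helper (hypothetical, stdlib wave plus numpy) confirms the output is non-trivial — a near-zero peak level means synthesis silently produced silence:

```python
import wave
import numpy as np

def wav_stats(path):
    """Return (duration_seconds, peak_level_0_to_1) for a 16-bit mono WAV."""
    with wave.open(path, "rb") as w:
        samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
        duration = w.getnframes() / w.getframerate()
    return duration, float(np.abs(samples).max()) / 32768
```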
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Step 3 — Cache the speaker embedding (critical for latency)
XTTS computes a speaker embedding on every call by default. Pre-compute and reuse it:
```python
gpt_cond, speaker_emb = tts.synthesizer.tts_model.get_conditioning_latents(
    audio_path=["my_voice_6s.wav"]
)
```
Now each subsequent call drops the conditioning step (~1.2 s saved per utterance).
Step 4 — Stream synthesis with inference_stream
```python
import numpy as np
import sounddevice as sd

def speak(text):
    chunks = tts.synthesizer.tts_model.inference_stream(
        text, "en", gpt_cond, speaker_emb,
        stream_chunk_size=20,  # smaller = lower time-to-first-audio
    )
    with sd.OutputStream(samplerate=24000, channels=1, dtype="float32") as out:
        for chunk in chunks:
            out.write(chunk.cpu().numpy().astype(np.float32))
```
stream_chunk_size=20 gives ~250 ms time-to-first-audio on an RTX 4090.
Step 5 — STT + LLM glue
```python
import ollama
from faster_whisper import WhisperModel

stt = WhisperModel("small.en", device="cuda", compute_type="float16")
history = [{"role": "system",
            "content": "You are a friendly, brief voice assistant."}]

def turn(audio_int16):
    audio = audio_int16.astype(np.float32) / 32768
    segs, _ = stt.transcribe(audio, language="en", vad_filter=True)
    user = " ".join(s.text for s in segs).strip()
    if not user:
        return
    history.append({"role": "user", "content": user})
    r = ollama.chat(model="llama3.2:3b", messages=history,
                    options={"num_predict": 140})
    history.append(r["message"])
    speak(r["message"]["content"])
```
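One latency win this glue code leaves on the table: ollama.chat can stream tokens (stream=True), so you can hand each completed sentence to speak() instead of waiting for the full reply. A sentence-splitter sketch (hypothetical helper; works on any iterable of text fragments):

```python
import re

def sentence_stream(fragments):
    """Yield complete sentences as text fragments arrive, so TTS starts early."""
    buf = ""
    for frag in fragments:
        buf += frag
        while True:
            m = re.search(r"[.!?]\s", buf)  # sentence-ending punctuation + space
            if not m:
                break
            yield buf[: m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()  # flush whatever remains at end of stream
```

Wiring it up looks like for s in sentence_stream(chunk["message"]["content"] for chunk in ollama.chat(..., stream=True)): speak(s) — the exact chunk shape depends on your ollama client version.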
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Step 6 — Mic loop with VAD
```python
def record(threshold=0.012, max_s=8):
    frames, silent = [], 0
    with sd.InputStream(samplerate=16000, channels=1, dtype="int16") as s:
        # stop after ~0.56 s of trailing silence (9000 samples) or max_s seconds
        while silent < 9000 and len(frames) * 1600 < 16000 * max_s:
            chunk, _ = s.read(1600)
            frames.append(chunk)
            rms = np.sqrt(np.mean((chunk.astype(np.float32) / 32768) ** 2))
            silent = silent + 1600 if rms < threshold else 0
    return np.concatenate(frames).flatten()

while True:
    turn(record())
```
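The fixed threshold=0.012 breaks in noisy rooms. A small calibration sketch (hypothetical helper) derives the threshold from a second of ambient noise captured at startup:

```python
import numpy as np

def calibrate_threshold(ambient_int16, factor=2.5, floor=0.005):
    """Derive a VAD threshold from a short ambient-noise recording."""
    x = ambient_int16.astype(np.float32) / 32768
    rms = float(np.sqrt(np.mean(x ** 2)))
    return max(factor * rms, floor)  # never drop below a sane floor
```

Grab the ambient sample with sounddevice (for example sd.rec(16000, samplerate=16000, channels=1, dtype="int16") while the user is quiet), then call record(threshold=calibrate_threshold(...)).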
Common pitfalls
- CPU is too slow. XTTS runs at ~1x realtime on an RTX 4090 but only ~0.25x realtime on an M2 Max — a live agent needs a GPU.
- License. XTTS-v2 weights are CPML (non-commercial). Use ElevenLabs or Voxtral TTS for commercial production.
- Speaker embedding drift. Cache and reuse — recomputing per turn destroys latency.
How CallSphere does this in production
CallSphere's 37 agents across 6 verticals use commercial voice models (ElevenLabs, OpenAI) for production calls because XTTS's licence excludes commercial use. We use XTTS for internal demo personas and offline UX research only. Healthcare's 14-tool FastAPI :8084 stack uses OpenAI Realtime; OneRoof's 10 specialists use ElevenLabs over WebRTC. Pricing $149/$499/$1499 flat — 14-day trial · 22% affiliate · /pricing.
FAQ
Is XTTS-v2 commercially usable? No — Coqui Public Model License is non-commercial. For paid SaaS, switch to Voxtral or ElevenLabs.
How much reference audio do I need? 6 seconds works; 15+ is better.
Can it do emotion? Limited — it tracks the reference's tone. For real emotion control, use prompt-driven prosody.
Languages? 17 (EN/ES/FR/DE/IT/PT/PL/TR/RU/NL/CS/AR/ZH/JA/HU/KO/HI).
Streaming TTS? Yes — inference_stream since 0.22.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.