Modal vs Replicate vs Baseten for Voice AI: When Self-Host Wins
Serverless GPU at $0.59–$3.95 per hour looks tempting until you measure cold start. Here is the honest break-even for self-hosting voice TTS or STT vs paying Deepgram or ElevenLabs.
The cost problem
When voice teams hit ~$5k/month on Deepgram or ElevenLabs, someone always asks: "should we self-host an open-source STT or TTS on Modal/Replicate/Baseten?" The serverless GPU pricing — $1.10/hr for an A10, $2.10/hr for A100-40GB, $3.95/hr for H100 — looks dramatically cheaper than $0.0048/min × thousands of minutes.
But the simple "GPU $/hr ÷ minutes per hour" math is wrong. It ignores cold start, idle time, model loading, batching, and the engineering cost of running production GPU.
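A quick sketch of the corrected napkin math. The utilization and cold-start figures below are illustrative assumptions, not measurements — the point is that the GPU side needs overhead multipliers the naive formula omits:

```python
# Napkin math: serverless GPU vs vendor API for voice inference.
# Rates and multipliers are illustrative assumptions; plug in your own.

def vendor_cost(minutes_per_month: float, rate_per_min: float) -> float:
    """Vendor API cost: you pay only for audio minutes processed."""
    return minutes_per_month * rate_per_min

def gpu_cost(gpus: int, rate_per_hr: float,
             utilization: float = 0.5, cold_start_tax: float = 0.20) -> float:
    """Serverless GPU cost: billed hours, padded for idle and cold starts.

    utilization < 1.0 models autoscaling trimming idle hours;
    cold_start_tax adds warm-up/spike overhead on top.
    """
    hours_per_month = 730
    billed = gpus * rate_per_hr * hours_per_month * utilization
    return billed * (1 + cold_start_tax)

# Example: 100k min/mo of STT via Deepgram Nova-3 at $0.0048/min
# vs five Modal A10s at $1.10/hr with aggressive autoscaling.
print(vendor_cost(100_000, 0.0048))   # 480.0
print(round(gpu_cost(5, 1.10), 2))    # 2409.0
```

Even with generous autoscaling assumptions, the GPU side carries multipliers that the bare "$/hr ÷ minutes" formula never sees.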
How each one prices it
Modal (May 2026):
- A10: $1.10/hour
- L40S: $1.95/hour
- A100-40GB: $2.10/hour
- A100-80GB: $2.50/hour
- H100: $3.95/hour
- Per-second billing
- $30/month free credits on Starter
Replicate:
- A100-80GB: ~$5.04/hour ($0.001400/sec) on custom deployments
- Per-second billing
- Cold start can run 30s–5min depending on model
- Many community models priced per-prediction
Baseten:
- T4: $0.63/hour
- A100: ~$3/hour
- H100: ~$5/hour
- B200: $9.98/hour
- Minute-level billing with idle time charged unless scaled to zero
Honest math: self-host Whisper-large-v3 STT
Pretend you have 100k minutes/month of streaming STT.
Buy from Deepgram Nova-3: 100k × $0.0048 = $480/month
Self-host Whisper-large-v3 on Modal A10:
- Real-time factor of 0.3× on A10 (one A10 handles ~3.3 concurrent streams continuously)
- Need ~5 A10s to hold peak concurrency at 100k min/mo with bursty traffic
- 5 × $1.10 × 730 = $4,015/mo, or ~$2,200/mo with autoscaling and 50% idle reduction
So self-hosting Whisper on Modal is 4–8× more expensive than Deepgram at this volume. Modal wins only if (a) Deepgram cannot meet your latency or accuracy bar, (b) you need on-prem / air-gapped, or (c) you scale past Deepgram's enterprise commit pricing.
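The STT math above, reduced to a script you can rerun with your own volumes (the 0.55 autoscaling factor is an assumption standing in for "~50% idle reduction"):

```python
# Effective per-minute cost of the Whisper-on-Modal setup above,
# using the article's illustrative numbers (not a benchmark).
minutes = 100_000
deepgram = minutes * 0.0048                # $480/mo at Nova-3 rates
modal_always_on = 5 * 1.10 * 730           # five A10s, 24/7: ~$4,015/mo
modal_autoscaled = modal_always_on * 0.55  # ~$2,208/mo with scale-to-zero

print(f"Deepgram: ${deepgram:,.0f} (${deepgram / minutes:.4f}/min)")
print(f"Modal:    ${modal_autoscaled:,.0f} (${modal_autoscaled / minutes:.4f}/min)")
print(f"Ratio:    {modal_autoscaled / deepgram:.1f}x")
```

At this volume, self-hosting works out to roughly $0.022/min against Deepgram's $0.0048/min — the per-minute gap, not the headline $/hr, is what decides it.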
Honest math: self-host Coqui XTTS or F5-TTS
100k minutes of agent speech ≈ 50M characters, assuming roughly 500 characters per spoken minute.
Buy from ElevenLabs Flash: 50M × $0.05 / 1k = $2,500/month
Buy from Deepgram Aura-2: 50M × $0.030 / 1k = $1,500/month
Self-host F5-TTS on Modal A10:
- ~12× real-time on A10
- Peak concurrency for 100k min/mo evening peaks: 4–6 A10s sustained
- 5 × $1.10 × 730 = $4,015/mo, or ~$2,400/mo with autoscaling
So TTS self-host roughly matches ElevenLabs and is more expensive than Aura-2 at this scale. Self-host wins for TTS only when:
- You need a fully-cloned brand voice you cannot get from a vendor
- You need offline / air-gapped
- You are above 500k min/month and can amortize H100 commits
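The TTS comparison above as the same kind of script. The 500 chars/min conversion and the 0.6 autoscaling factor are assumptions carried over from the estimates in this section:

```python
# TTS math check: 100k minutes of agent speech vs per-character vendor pricing.
# Assumes ~500 characters per spoken minute (the 50M-char estimate above).
chars = 100_000 * 500                    # 50M characters

elevenlabs_flash = chars / 1_000 * 0.05  # $0.05 per 1k chars
aura2 = chars / 1_000 * 0.030            # $0.030 per 1k chars
modal_f5 = 5 * 1.10 * 730 * 0.6          # ~5 A10s with autoscaling assumed

print(round(elevenlabs_flash))  # 2500
print(round(aura2))             # 1500
print(round(modal_f5))          # 2409
```

Self-hosted F5-TTS lands within noise of ElevenLabs Flash and clearly above Aura-2 — which is why the qualitative criteria above (cloned voice, air-gap, very high volume) matter more than the price delta.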
Where serverless GPU actually wins for voice
- Custom voice cloning — train a brand voice on your CEO once, serve thousands of calls.
- Niche language coverage — low-resource languages that Deepgram/ElevenLabs do not support.
- Custom safety models — hallucination detection, PII redaction running alongside main inference.
- Embedding for retrieval — small models like bge-small-en, very cheap, very fast.
- Async post-call analytics — Whisper batch transcription, sentiment, coaching scores.
How CallSphere optimizes
CallSphere does not self-host live STT or TTS today — Deepgram, ElevenLabs, and OpenAI win on cost and latency at our 6-vertical scale (37 agents, 90+ tools, 115+ DB tables).
We do use Modal for two specific async paths:
- Healthcare post-call analytics uses GPT-4o-mini with prompt caching for the live transcription summary, but we run a smaller embedding model on Modal for retrieval — that is where the cost math swings.
- Salon GlamBook custom voice clones for premium-tier salon clients who want a branded receptionist voice that ElevenLabs would not host. Modal A10 with F5-TTS, ~$0.04 per 5-min call after batching.
The decision rule we follow: if a serverless GPU saves under 30% vs the equivalent vendor API, we do not self-host because the operational tax is real. The pricing tiers ($149 / $499 / $1499) plus the 14-day no-card trial keep us honest — we cannot afford to pay an ops team to babysit GPUs unless the savings are substantial.
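The 30% rule reduces to a one-liner. This is a sketch of the decision as described above, not CallSphere's actual tooling:

```python
# Self-host only if the savings clear the operational tax.
def should_self_host(vendor_monthly: float, gpu_monthly: float,
                     min_savings: float = 0.30) -> bool:
    """True only when the GPU path saves at least `min_savings` vs the vendor."""
    return gpu_monthly < vendor_monthly * (1 - min_savings)

print(should_self_host(480, 2_400))      # False: GPUs cost 5x more here
print(should_self_host(10_000, 6_000))   # True: 40% savings clears the bar
```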
Optimization checklist
- Always do the napkin math first: hours of GPU × $/hr vs vendor minutes × $/min.
- Measure your real concurrency p95, not p50 — that is what you must provision for.
- Add 15–25% to GPU cost for cold-start tax during traffic spikes.
- Use spot/preemptible GPUs only for batch — not for live voice.
- Modal autoscale-to-zero is great for bursty workloads, painful for steady ones.
- Replicate is best for prototyping; Modal/Baseten win on production reliability.
- Use Baseten for production-critical workloads where uptime contracts matter.
- Batch async work (post-call summaries) to amortize GPU.
- Quantize models to FP8/INT8 — often roughly 2× throughput on the same GPU.
- Re-evaluate monthly — H100/B200 prices keep falling.
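The p95-provisioning and cold-start items on the checklist combine into one sizing formula. The RTF and headroom values are assumptions to tune for your model and traffic:

```python
# Sizing sketch: GPUs needed for a given p95 concurrency, with headroom
# so traffic spikes do not trigger cold starts on the live-call path.
import math

def gpus_needed(p95_concurrent_streams: int, rtf: float = 0.3,
                headroom: float = 0.8) -> int:
    """Streams per GPU = 1/RTF; provision at `headroom` of raw capacity."""
    streams_per_gpu = (1 / rtf) * headroom
    return math.ceil(p95_concurrent_streams / streams_per_gpu)

# 100k min/mo averages only ~2.3 concurrent streams (100_000 / (730 * 60)),
# but a bursty business-hours p95 can be 5-6x the mean:
print(gpus_needed(13))   # -> 5, matching the fleet size used above
```

Sizing from the mean instead of p95 would suggest a single A10 — and guarantee cold starts at every peak.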
FAQ
Is self-hosting STT cheaper than Deepgram? Below 1M min/month, almost never. Above that with negotiated commits, sometimes.
What about open-source Whisper vs Deepgram quality? Whisper-large-v3 matches Deepgram on broad English; Deepgram wins on streaming TTFT and on phone audio.
Should I use Replicate or Modal? Replicate for prototyping (no infra setup). Modal for production scale.
What is Baseten's value prop? Production reliability, enterprise SLAs, embedded engineering support — pay premium for less ops risk.
When should I switch to fully self-hosted GPUs? Above ~$25k/month in vendor inference, on stable workloads, with a dedicated ML platform team.
Sources
- Modal Pricing — https://modal.com/pricing
- Replicate Pricing — https://replicate.com/pricing
- Baseten Pricing — https://baseten.co/pricing
- HostFleet serverless GPU comparison — https://hostfleet.net/serverless-gpu-pricing-matrix-2026/
- Spheron GPU Cloud Pricing 2026 — https://www.spheron.network/blog/gpu-cloud-pricing-comparison-2026/
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available — no signup required.