Modal vs Replicate vs Baseten for Voice AI: When Self-Host Wins
Serverless GPU at $0.59–$3.95 per hour looks tempting until you measure cold start. Here is the honest break-even for self-hosting voice TTS or STT vs paying Deepgram or ElevenLabs.
The cost problem
When voice teams hit ~$5k/month on Deepgram or ElevenLabs, someone always asks: "should we self-host an open-source STT or TTS on Modal/Replicate/Baseten?" The serverless GPU pricing — $1.10/hr for an A10, $2.10/hr for A100-40GB, $3.95/hr for H100 — looks dramatically cheaper than $0.0048/min × thousands of minutes.
But the simple "GPU $/hr ÷ minutes per hour" math is wrong. It ignores cold start, idle time, model loading, batching, and the engineering cost of running production GPU.
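A quick sketch of the corrected napkin math. The utilization and cold-start figures below are illustrative assumptions, not measurements — the point is that the GPU side needs overhead multipliers the naive formula omits:

```python
# Napkin math: serverless GPU vs vendor API for voice inference.
# Rates and multipliers are illustrative assumptions; plug in your own.

def vendor_cost(minutes_per_month: float, rate_per_min: float) -> float:
    """Vendor API cost: you pay only for audio minutes processed."""
    return minutes_per_month * rate_per_min

def gpu_cost(gpus: int, rate_per_hr: float,
             utilization: float = 0.5, cold_start_tax: float = 0.20) -> float:
    """Serverless GPU cost: billed hours, padded for idle and cold starts.

    utilization < 1.0 models autoscaling trimming idle hours;
    cold_start_tax adds warm-up/spike overhead on top.
    """
    hours_per_month = 730
    billed = gpus * rate_per_hr * hours_per_month * utilization
    return billed * (1 + cold_start_tax)

# Example: 100k min/mo of STT via Deepgram Nova-3 at $0.0048/min
# vs five Modal A10s at $1.10/hr with aggressive autoscaling.
print(vendor_cost(100_000, 0.0048))   # 480.0
print(round(gpu_cost(5, 1.10), 2))    # 2409.0
```

Even with generous autoscaling assumptions, the GPU side carries multipliers that the bare "$/hr ÷ minutes" formula never sees.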
How each one prices it
Modal (May 2026):
- A10: $1.10/hour
- L40S: $1.95/hour
- A100-40GB: $2.10/hour
- A100-80GB: $2.50/hour
- H100: $3.95/hour
- Per-second billing
- $30/month free credits on Starter
Replicate:
- A100-80GB: ~$5.04/hour ($0.001400/sec) on custom deployments
- Per-second billing
- Cold start can run 30s–5min depending on model
- Many community models priced per-prediction
Baseten:
- T4: $0.63/hour
- A100: ~$3/hour
- H100: ~$5/hour
- B200: $9.98/hour
- Minute-level billing with idle time charged unless scaled to zero
Honest math: self-host Whisper-large-v3 STT
Pretend you have 100k minutes/month of streaming STT.
Buy from Deepgram Nova-3: 100k × $0.0048 = $480/month
Self-host Whisper-large-v3 on Modal A10:
- Real-time factor of 0.3× on A10 (one A10 handles ~3.3 concurrent streams continuously)
- Need ~5 A10s to hold peak concurrency at 100k min/mo with bursty traffic
- 5 × $1.10 × 730 = $4,015/mo, or ~$2,200/mo with autoscaling and 50% idle reduction
So self-hosting Whisper on Modal is 4–8× more expensive than Deepgram at this volume. Modal wins only if (a) Deepgram cannot meet your latency or accuracy bar, (b) you need on-prem / air-gapped, or (c) you scale past Deepgram's enterprise commit pricing.
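The STT math above, reduced to a script you can rerun with your own volumes (the 0.55 autoscaling factor is an assumption standing in for "~50% idle reduction"):

```python
# Effective per-minute cost of the Whisper-on-Modal setup above,
# using the article's illustrative numbers (not a benchmark).
minutes = 100_000
deepgram = minutes * 0.0048                # $480/mo at Nova-3 rates
modal_always_on = 5 * 1.10 * 730           # five A10s, 24/7: ~$4,015/mo
modal_autoscaled = modal_always_on * 0.55  # ~$2,208/mo with scale-to-zero

print(f"Deepgram: ${deepgram:,.0f} (${deepgram / minutes:.4f}/min)")
print(f"Modal:    ${modal_autoscaled:,.0f} (${modal_autoscaled / minutes:.4f}/min)")
print(f"Ratio:    {modal_autoscaled / deepgram:.1f}x")
```

At this volume, self-hosting works out to roughly $0.022/min against Deepgram's $0.0048/min — the per-minute gap, not the headline $/hr, is what decides it.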
Honest math: self-host Coqui XTTS or F5-TTS
100k minutes of agent speech ≈ 50M characters, assuming roughly 500 characters per spoken minute.
Buy from ElevenLabs Flash: 50M × $0.05 / 1k = $2,500/month
Buy from Deepgram Aura-2: 50M × $0.030 / 1k = $1,500/month
Self-host F5-TTS on Modal A10:
- ~12× real-time on A10
- Peak concurrency for 100k min/mo evening peaks: 4–6 A10s sustained
- 5 × $1.10 × 730 = $4,015/mo, or ~$2,400/mo with autoscaling
So TTS self-host roughly matches ElevenLabs and is more expensive than Aura-2 at this scale. Self-host wins for TTS only when:
- You need a fully-cloned brand voice you cannot get from a vendor
- You need offline / air-gapped
- You are above 500k min/month and can amortize H100 commits
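The TTS comparison above as the same kind of script. The 500 chars/min conversion and the 0.6 autoscaling factor are assumptions carried over from the estimates in this section:

```python
# TTS math check: 100k minutes of agent speech vs per-character vendor pricing.
# Assumes ~500 characters per spoken minute (the 50M-char estimate above).
chars = 100_000 * 500                    # 50M characters

elevenlabs_flash = chars / 1_000 * 0.05  # $0.05 per 1k chars
aura2 = chars / 1_000 * 0.030            # $0.030 per 1k chars
modal_f5 = 5 * 1.10 * 730 * 0.6          # ~5 A10s with autoscaling assumed

print(round(elevenlabs_flash))  # 2500
print(round(aura2))             # 1500
print(round(modal_f5))          # 2409
```

Self-hosted F5-TTS lands within noise of ElevenLabs Flash and clearly above Aura-2 — which is why the qualitative criteria above (cloned voice, air-gap, very high volume) matter more than the price delta.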
Where serverless GPU actually wins for voice
- Custom voice cloning — train a brand voice on your CEO once, serve thousands of calls.
- Niche language coverage — low-resource languages that Deepgram/ElevenLabs do not support.
- Custom safety models — hallucination detection, PII redaction running alongside main inference.
- Embedding for retrieval — small models like bge-small-en, very cheap, very fast.
- Async post-call analytics — Whisper batch transcription, sentiment, coaching scores.
How CallSphere optimizes
CallSphere does not self-host live STT or TTS today — Deepgram, ElevenLabs, and OpenAI win on cost and latency at our 6-vertical scale (37 agents, 90+ tools, 115+ DB tables).
We do use Modal for two specific async paths:
- Healthcare post-call analytics uses GPT-4o-mini with prompt caching for the live transcription summary, but we run a smaller embedding model on Modal for retrieval — that is where the cost math swings.
- Salon GlamBook custom voice clones for premium-tier salon clients who want a branded receptionist voice that ElevenLabs would not host. Modal A10 with F5-TTS, ~$0.04 per 5-min call after batching.
The decision rule we follow: if a serverless GPU saves under 30% vs the equivalent vendor API, we do not self-host because the operational tax is real. The pricing tiers ($149 / $499 / $1499) plus the 14-day no-card trial keep us honest — we cannot afford to pay an ops team to babysit GPUs unless the savings are substantial.
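The 30% rule reduces to a one-liner. This is a sketch of the decision as described above, not CallSphere's actual tooling:

```python
# Self-host only if the savings clear the operational tax.
def should_self_host(vendor_monthly: float, gpu_monthly: float,
                     min_savings: float = 0.30) -> bool:
    """True only when the GPU path saves at least `min_savings` vs the vendor."""
    return gpu_monthly < vendor_monthly * (1 - min_savings)

print(should_self_host(480, 2_400))      # False: GPUs cost 5x more here
print(should_self_host(10_000, 6_000))   # True: 40% savings clears the bar
```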
Optimization checklist
- Always do the napkin math first: hours of GPU × $/hr vs vendor minutes × $/min.
- Measure your real concurrency p95, not p50 — that is what you must provision for.
- Add 15–25% to GPU cost for cold-start tax during traffic spikes.
- Use spot/preemptible GPUs only for batch — not for live voice.
- Modal autoscale-to-zero is great for bursty workloads, painful for steady ones.
- Replicate is best for prototyping; Modal/Baseten win on production reliability.
- Use Baseten for production-critical workloads where uptime contracts matter.
- Batch async work (post-call summaries) to amortize GPU.
- Quantize models to FP8/INT8 — often roughly 2× throughput on the same GPU.
- Re-evaluate monthly — H100/B200 prices keep falling.
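The p95-provisioning and cold-start items on the checklist combine into one sizing formula. The RTF and headroom values are assumptions to tune for your model and traffic:

```python
# Sizing sketch: GPUs needed for a given p95 concurrency, with headroom
# so traffic spikes do not trigger cold starts on the live-call path.
import math

def gpus_needed(p95_concurrent_streams: int, rtf: float = 0.3,
                headroom: float = 0.8) -> int:
    """Streams per GPU = 1/RTF; provision at `headroom` of raw capacity."""
    streams_per_gpu = (1 / rtf) * headroom
    return math.ceil(p95_concurrent_streams / streams_per_gpu)

# 100k min/mo averages only ~2.3 concurrent streams (100_000 / (730 * 60)),
# but a bursty business-hours p95 can be 5-6x the mean:
print(gpus_needed(13))   # -> 5, matching the fleet size used above
```

Sizing from the mean instead of p95 would suggest a single A10 — and guarantee cold starts at every peak.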
FAQ
Is self-hosting STT cheaper than Deepgram? Below 1M min/month, almost never. Above that with negotiated commits, sometimes.
What about open-source Whisper vs Deepgram quality? Whisper-large-v3 matches Deepgram on broad English; Deepgram wins on streaming TTFT and on phone audio.
Should I use Replicate or Modal? Replicate for prototyping (no infra setup). Modal for production scale.
What is Baseten's value prop? Production reliability, enterprise SLAs, embedded engineering support — pay premium for less ops risk.
When should I switch to fully self-hosted GPUs? Above ~$25k/month in vendor inference, on stable workloads, with a dedicated ML platform team.
Sources
- Modal Pricing — https://modal.com/pricing
- Replicate Pricing — https://replicate.com/pricing
- Baseten Pricing — https://baseten.co/pricing
- HostFleet serverless GPU comparison — https://hostfleet.net/serverless-gpu-pricing-matrix-2026/
- Spheron GPU Cloud Pricing 2026 — https://www.spheron.network/blog/gpu-cloud-pricing-comparison-2026/
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available — no signup required.