
Build a Voice Agent with Dialogflow CX + Gemini Live API (2026)

Combine Dialogflow CX's deterministic flows with Gemini Live API's bidirectional streaming for hybrid voice agents. Phone Gateway, generative fallback, real working integration.

TL;DR — Dialogflow CX flows are still the gold standard for compliant, deterministic conversation paths (insurance verification, ID-and-V). Gemini Live API is OpenAI Realtime's GCP equivalent — bidirectional WebSocket, native audio. Hybrid agents use CX for the regulated path and a Gemini Live fallback for everything else.

What you'll build

A Dialogflow CX agent that handles a structured intake flow (verify identity, capture appointment intent), with a "Generative Fallback" that hands the live audio stream to Gemini Live API for free-form Q&A. Phone Gateway provides the PSTN front-end. The Gemini Live bridge runs on Cloud Run.

Prerequisites

  1. GCP project with Dialogflow CX, Vertex AI APIs enabled.
  2. Service account with roles/dialogflow.client and roles/aiplatform.user.
  3. google-cloud-dialogflow-cx, google-genai Python packages.
  4. Cloud Run for the Gemini Live bridge (or Cloud Functions Gen2).

Architecture

```mermaid
flowchart TD
  PSTN[Caller] --> PG[Phone Gateway]
  PG --> CX[Dialogflow CX Flow]
  CX -->|deterministic intents| FB[Fulfillment webhook]
  FB --> CRM[(CRM)]
  CX -->|generative fallback| GLB[Gemini Live Bridge Cloud Run]
  GLB <-->|wss| GL[Gemini Live API]
  GLB --> CX
  CX -->|Chirp TTS| PG
  PG --> PSTN
```

Step 1 — Create the CX agent and Phone Gateway number

In the Conversational Agents console: New agent → name hybrid-voice → region us-central1 → enable Generative AI features. Under Manage → Integrations → Phone Gateway click Configure new number.

Step 2 — Define the deterministic flow

Build a single flow identity-verification with pages:

  • collect_dob (parameter @sys.date required)
  • collect_member_id (parameter @sys.number-sequence)
  • verify (calls webhook, transitions on success/failure)

Set the Default Welcome Intent to route into identity-verification. The fulfillment webhook is a Cloud Run service.
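
The verify page's webhook can be sketched as a plain function over the CX webhook JSON. This is a minimal sketch under stated assumptions, not a definitive implementation: the parameter names dob and member_id, the verified session parameter, and the in-memory MEMBERS table are all hypothetical. The sessionInfo.parameters and fulfillmentResponse.messages fields follow the CX webhook JSON format.

```python
from datetime import date

# Hypothetical member directory; a real webhook would query the CRM instead.
MEMBERS = {"12345678": date(1985, 3, 14)}


def handle_verify(request: dict) -> dict:
    """Build a CX webhook response for the verify page."""
    params = request.get("sessionInfo", {}).get("parameters", {})
    member_id = str(params.get("member_id", ""))  # assumed parameter name
    dob = params.get("dob") or {}  # @sys.date arrives as {"year", "month", "day"}

    try:
        spoken_dob = date(int(dob["year"]), int(dob["month"]), int(dob["day"]))
    except (KeyError, TypeError, ValueError):
        spoken_dob = None  # missing or malformed date counts as a failed check

    verified = spoken_dob is not None and MEMBERS.get(member_id) == spoken_dob
    text = (
        "Thanks, you're verified."
        if verified
        else "Sorry, I couldn't verify those details."
    )
    return {
        "fulfillmentResponse": {"messages": [{"text": {"text": [text]}}]},
        # Page routes can branch on $session.params.verified (assumed name).
        "sessionInfo": {"parameters": {"verified": verified}},
    }
```

Wrapping this in a FastAPI POST route on Cloud Run is then a one-line handler that passes the request body through. Keeping the logic in a pure function makes the success and failure transitions easy to unit test.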

Step 3 — Add the Generative Fallback

In the Default Start Flow, open Event Handlers and attach a Generator to the sys.no-match-default handler:


```yaml
generator:
  prompt: |
    The user said: $last-user-utterance
    Reply briefly as a friendly receptionist.
  model: gemini-2.5-flash
```

This handles single-turn fallbacks. For multi-turn free-form, route to a webhook that hands off to Gemini Live.

Step 4 — The Gemini Live bridge (Cloud Run)

```python
# bridge.py
import asyncio
import os

from fastapi import FastAPI, WebSocket
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project=os.environ["PROJECT"], location="us-central1")
app = FastAPI()


@app.websocket("/live")
async def live(ws: WebSocket):
    await ws.accept()
    config = types.LiveConnectConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Charon")
            )
        ),
        system_instruction="You are a friendly receptionist. Keep replies short.",
    )
    async with client.aio.live.connect(model="gemini-2.5-flash-live", config=config) as session:

        async def from_caller():
            # Forward 16 kHz PCM frames from the caller to Gemini.
            async for frame in ws.iter_bytes():
                await session.send_realtime_input(
                    audio=types.Blob(data=frame, mime_type="audio/pcm;rate=16000")
                )

        async def to_caller():
            # session.receive() ends after each completed turn, so keep re-entering it.
            while True:
                async for resp in session.receive():
                    if resp.data:
                        await ws.send_bytes(resp.data)

        # Stop relaying model audio once the caller side disconnects.
        receiver = asyncio.create_task(to_caller())
        try:
            await from_caller()
        finally:
            receiver.cancel()
```

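Deploy with gcloud run deploy bridge --source . from the directory that contains bridge.py. A small service like this typically builds and deploys in about 90 seconds.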
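
The bridge above sends audio to Gemini as audio/pcm;rate=16000, but a telephony leg often delivers something else. As a hedged sketch, assuming the carrier hands you 8 kHz G.711 mu-law frames (an assumption; check what your telephony provider actually sends), a conversion step could look like this. The function names are hypothetical, and the upsampling is deliberately naive sample repetition rather than a proper resampling filter.

```python
import struct


def ulaw_byte_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~byte & 0xFF  # mu-law bytes are stored bit-inverted
    magnitude = (((u & 0x0F) << 3) + 0x84) << ((u >> 4) & 0x07)
    magnitude -= 0x84  # remove the encoder bias
    return -magnitude if u & 0x80 else magnitude


def ulaw8k_to_pcm16k(frame: bytes) -> bytes:
    """Decode mu-law and repeat each sample to go from 8 kHz to 16 kHz."""
    samples = []
    for b in frame:
        s = ulaw_byte_to_pcm16(b)
        samples.extend((s, s))  # naive 2x upsampling by repetition
    # Little-endian signed 16-bit PCM, the layout the bridge forwards.
    return struct.pack(f"<{len(samples)}h", *samples)
```

In the bridge, from_caller would call ulaw8k_to_pcm16k(frame) before building the Blob. A 20 ms mu-law frame (160 bytes) becomes 640 bytes of 16 kHz PCM.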


Step 5 — Hand off audio from CX to the bridge

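Dialogflow CX doesn't expose raw audio to webhooks directly, so for free-form moments you "park" the CX session and hand the call to the bridge via the Phone Gateway's Custom Telephony Provider field. CX's live-agent-handoff event triggers a SIP REFER on the carrier side. Note that a SIP REFER targets a SIP URI and Cloud Run does not terminate SIP or RTP, so the carrier (or a media gateway in front of the bridge) has to convert the call audio into the WebSocket stream that the /live endpoint expects.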
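
On the CX side, the handoff can be signalled from a webhook by returning a liveAgentHandoff response message. The sketch below assumes that approach. The metadata keys (call_sid, bridge_url) and the handoff_requested session parameter are hypothetical, and how your telephony layer consumes them depends on the integration.

```python
def build_handoff_response(call_sid: str, bridge_url: str) -> dict:
    """Build a CX webhook response that asks the telephony layer to hand off."""
    return {
        "fulfillmentResponse": {
            "messages": [
                {"text": {"text": ["One moment while I connect you."]}},
                {
                    # Free-form metadata passed along with the handoff signal.
                    "liveAgentHandoff": {
                        "metadata": {"call_sid": call_sid, "bridge_url": bridge_url}
                    }
                },
            ]
        },
        # Lets later pages know the call was parked (assumed parameter name).
        "sessionInfo": {"parameters": {"handoff_requested": True}},
    }
```

Passing the call identifier in the metadata gives the bridge the key it needs to tie its logs back to the CX session.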

Step 6 — Telemetry

CX writes every turn to Conversation History; the bridge writes to Cloud Logging. Stitch sessions by call-sid (CX exposes it as a session parameter via Phone Gateway).
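
The stitching step can be sketched as follows. This assumes each record, whether exported from CX Conversation History or logged by the bridge, carries a call_sid and a numeric ts timestamp; both field names are assumptions for illustration. The logging helper relies on Cloud Run forwarding JSON lines written to stdout into Cloud Logging as structured entries.

```python
import json
from collections import defaultdict


def log_bridge_event(call_sid: str, event: str, **fields) -> str:
    """Print one JSON log line keyed by call_sid and return it."""
    line = json.dumps({"severity": "INFO", "call_sid": call_sid, "event": event, **fields})
    print(line)
    return line


def stitch(cx_turns: list[dict], bridge_events: list[dict]) -> dict[str, list[dict]]:
    """Merge CX turns and bridge events into one time-ordered list per call."""
    timelines = defaultdict(list)
    for source, records in (("cx", cx_turns), ("bridge", bridge_events)):
        for record in records:
            timelines[record["call_sid"]].append({**record, "source": source})
    for records in timelines.values():
        records.sort(key=lambda r: r["ts"])  # interleave both sources by time
    return dict(timelines)
```

With both sources tagged and sorted, a single timeline shows exactly where a call left the deterministic flow and entered the Live API segment.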

Pitfalls

  • Generator vs Live API: Generators are stateless single-turn — perfect for "I didn't understand" repair. Live API is for multi-turn open-ended.
  • Phone Gateway latency budget is tight; STT-TTS round-trip via CX adds ~400ms before the model even sees text. Use Gemini Live for low-latency segments.
  • Webhook timeouts: CX enforces a per-webhook timeout (5 seconds by default, configurable up to a maximum of 30 seconds). Async work belongs on Pub/Sub.
  • Region binding: Phone Gateway numbers live in one region; cross-region calls add 60-100ms.
  • Generative AI safety filters can block medical answers — tune safety_settings on the Generator and Live config.
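
For the webhook-timeout pitfall, one defensive pattern is to put an explicit time budget on backend work and return a "deferred" answer when it is exceeded, leaving the slow work to a queue. The sketch below is an illustration only: the budget value and the status labels are assumptions, and the actual Pub/Sub publish is left out.

```python
import asyncio

# Assumption: keep well under whatever timeout the CX webhook is configured with.
WEBHOOK_BUDGET_S = 4.0


async def call_with_budget(work, budget_s: float = WEBHOOK_BUDGET_S) -> dict:
    """Await `work` (a coroutine), giving up once the time budget is spent."""
    try:
        result = await asyncio.wait_for(work, timeout=budget_s)
        return {"status": "ok", "result": result}
    except asyncio.TimeoutError:
        # The caller should enqueue the work (e.g. on Pub/Sub) and reply politely.
        return {"status": "deferred", "result": None}
```

A webhook handler would check the status and, on "deferred", respond with something like "I'm still checking that for you" instead of letting CX cut the request off.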

How CallSphere does this in production

CallSphere's Healthcare vertical (37 agents, 115+ DB tables) doesn't use Dialogflow CX — we built our own flow engine on FastAPI :8084 that routes between OpenAI Realtime and Anthropic Claude per turn based on PHI sensitivity. CX is excellent for teams that already have IVR investments; for greenfield, our managed product at $149/$499/$1499 (14-day trial, 22% affiliate) ships in days vs weeks. 90+ tools, 6 verticals.

FAQ

Q: When do I pick CX over a pure Live API stack? When you have hard compliance gates that must be deterministic (insurance verification, KYC). Live API alone isn't auditable enough for most regulated flows.

Q: Can Live API call tools mid-stream? Yes — Live API supports function calling natively; declare tools in LiveConnectConfig.
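
As a rough sketch of what declaring a tool looks like, the config below uses the dict form of the Live config with one function declaration, plus a small dispatcher for incoming calls. The book_appointment tool and its handler are hypothetical. In the bridge, tool calls would arrive on the receive stream and results would go back through the session's tool-response method (send_tool_response in google-genai, to my understanding); that wiring is omitted here.

```python
# Hypothetical tool, declared in the JSON shape used for function declarations.
BOOK_APPOINTMENT = {
    "name": "book_appointment",
    "description": "Book an appointment slot for the caller.",
    "parameters": {
        "type": "OBJECT",
        "properties": {"date": {"type": "STRING"}, "time": {"type": "STRING"}},
        "required": ["date", "time"],
    },
}

# Dict form of the Live config, passed as config= when connecting.
LIVE_CONFIG = {
    "response_modalities": ["AUDIO"],
    "tools": [{"function_declarations": [BOOK_APPOINTMENT]}],
}


def book_appointment(date: str, time: str) -> dict:
    """Stand-in handler; a real one would write to the scheduling system."""
    return {"status": "booked", "date": date, "time": time}


HANDLERS = {"book_appointment": book_appointment}


def run_tool(name: str, args: dict) -> dict:
    """Dispatch a model-issued function call to its local handler."""
    handler = HANDLERS.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    return handler(**args)
```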

Q: What's the cost? CX is $0.007/request (text) or $0.06/min (voice with Phone Gateway). Live API on Vertex is $0.0006/sec audio in + $0.0024/sec audio out at flash-live rates.
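
To make those rates concrete, here is a small estimator that uses only the figures quoted above. The split of a call into CX minutes and Live API seconds is an assumption you would replace with your own traffic numbers.

```python
CX_VOICE_PER_MIN = 0.06    # CX voice with Phone Gateway, USD per minute
LIVE_IN_PER_SEC = 0.0006   # Live API audio in, USD per second
LIVE_OUT_PER_SEC = 0.0024  # Live API audio out, USD per second


def call_cost(cx_minutes: float, live_in_seconds: float, live_out_seconds: float) -> float:
    """Estimate the cost of one hybrid call in USD."""
    cx = cx_minutes * CX_VOICE_PER_MIN
    live = live_in_seconds * LIVE_IN_PER_SEC + live_out_seconds * LIVE_OUT_PER_SEC
    return round(cx + live, 4)
```

For example, a call with 3 minutes in CX followed by a Live API segment of 120 seconds of caller audio and 60 seconds of model audio comes to 0.18 + 0.072 + 0.144, or about $0.40.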

Q: Does CX support barge-in? Yes — enable in Speech settings; default end-of-speech timeout is 500ms.

Q: Cross-region failover? Replicate the agent config via Terraform; Phone Gateway numbers can fail over to a backup CX agent in another region via the carrier-side route plan.
