Learn Agentic AI

Capstone: Building a Voice-Enabled Appointment Booking System from Scratch

Build a complete voice-powered appointment booking system using Twilio, speech-to-text, text-to-speech, calendar integration, and intelligent booking logic with a FastAPI backend.

System Architecture

A voice-enabled appointment booking system takes an inbound phone call, converts speech to text, processes the request through an AI agent, books or modifies appointments in a calendar, and speaks the response back to the caller. This capstone integrates Twilio for telephony, Deepgram for speech-to-text, OpenAI for the conversational agent, ElevenLabs for natural text-to-speech, and a PostgreSQL database for appointment storage.

The call flow is: Twilio receives the call and opens a WebSocket media stream to your backend. Your FastAPI backend receives raw audio frames, streams them to Deepgram for real-time transcription, sends the transcript to an AI agent, receives the agent response, converts it to speech via ElevenLabs, and streams the audio back through the Twilio WebSocket.

Database Schema for Appointments

# models.py
import uuid

from sqlalchemy import Boolean, Column, DateTime, ForeignKey, String, func
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Provider(Base):
    __tablename__ = "providers"
    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    name = Column(String(200), nullable=False)
    specialty = Column(String(100))
    timezone = Column(String(50), default="America/New_York")

class TimeSlot(Base):
    __tablename__ = "time_slots"
    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    provider_id = Column(UUID(as_uuid=True), ForeignKey("providers.id"), nullable=False)
    start_time = Column(DateTime, nullable=False)
    end_time = Column(DateTime, nullable=False)
    is_available = Column(Boolean, default=True)

class Appointment(Base):
    __tablename__ = "appointments"
    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    # unique=True gives a database-level guard against double-booking a slot
    slot_id = Column(UUID(as_uuid=True), ForeignKey("time_slots.id"), unique=True)
    patient_name = Column(String(200), nullable=False)
    patient_phone = Column(String(20), nullable=False)
    reason = Column(String(500))
    confirmed = Column(Boolean, default=False)
    created_at = Column(DateTime, server_default=func.now())
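To exercise this schema you need seed data. Here is a minimal sketch of a slot generator, assuming 30-minute appointments during 9-to-5 business hours (the interval and hours are illustrative, and `generate_slots` is a hypothetical helper, not part of the schema above):

```python
from datetime import datetime, timedelta

def generate_slots(day: datetime, start_hour: int = 9, end_hour: int = 17,
                   minutes: int = 30) -> list[tuple[datetime, datetime]]:
    """Return (start, end) pairs covering a provider's working day."""
    slots = []
    cursor = day.replace(hour=start_hour, minute=0, second=0, microsecond=0)
    end_of_day = day.replace(hour=end_hour, minute=0, second=0, microsecond=0)
    while cursor + timedelta(minutes=minutes) <= end_of_day:
        slots.append((cursor, cursor + timedelta(minutes=minutes)))
        cursor += timedelta(minutes=minutes)
    return slots
```

Each pair maps onto one TimeSlot row (`start_time`, `end_time`) for a given provider; a seed script would loop over providers and dates and insert the rows.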

Twilio WebSocket Integration

Twilio sends a webhook when a call arrives. You respond with TwiML that opens a bidirectional media stream to your server.

flowchart LR
    CALLER([Caller]) --> TWILIO["Twilio<br/>phone number"]
    TWILIO -- "WebSocket<br/>media stream" --> API["FastAPI<br/>backend"]
    API --> STT["Deepgram<br/>STT"]
    STT --> AGENT["Booking agent<br/>(OpenAI)"]
    AGENT --> DB[("PostgreSQL<br/>appointments")]
    AGENT --> TTS["ElevenLabs<br/>TTS"]
    TTS -- "audio frames" --> API
    API --> TWILIO
    style AGENT fill:#4f46e5,stroke:#4338ca,color:#fff
    style DB fill:#f59e0b,stroke:#d97706,color:#1f2937
    style TTS fill:#059669,stroke:#047857,color:#fff
# routes/twilio.py
from fastapi import APIRouter, Request
from fastapi.responses import Response

router = APIRouter()

@router.post("/incoming-call")
async def handle_incoming_call(request: Request):
    twiml = """<?xml version="1.0" encoding="UTF-8"?>
    <Response>
        <Connect>
            <Stream url="wss://your-domain.com/media-stream" />
        </Connect>
    </Response>"""
    return Response(content=twiml, media_type="application/xml")
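Before trusting the webhook, verify Twilio's X-Twilio-Signature header so that only Twilio can trigger your call flow. A sketch of the documented scheme, which is HMAC-SHA1 over the full request URL followed by the sorted POST parameters, base64-encoded (the function name is our own; the Twilio helper library offers an equivalent `RequestValidator`):

```python
import base64
import hashlib
import hmac

def valid_twilio_signature(auth_token: str, url: str,
                           params: dict[str, str], signature: str) -> bool:
    """Recompute Twilio's signature: the URL, then each POST parameter name
    and value concatenated in sorted order, HMAC-SHA1 keyed by the auth token."""
    payload = url + "".join(k + v for k, v in sorted(params.items()))
    digest = hmac.new(auth_token.encode(), payload.encode(), hashlib.sha1).digest()
    expected = base64.b64encode(digest).decode()
    return hmac.compare_digest(expected, signature)
```

In the route, compare against `request.headers["X-Twilio-Signature"]` and return a 403 on mismatch.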

The WebSocket handler receives audio frames from Twilio and manages the conversation loop.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →
# routes/media_stream.py
import base64
import json

from fastapi import APIRouter, WebSocket, WebSocketDisconnect

from services.stt import connect_deepgram

router = APIRouter()

@router.websocket("/media-stream")
async def media_stream(ws: WebSocket):
    await ws.accept()
    stream_sid = None
    deepgram_ws = await connect_deepgram()
    # A separate task would consume Deepgram transcripts, call the agent,
    # and stream TTS audio back; this loop handles the inbound leg.

    try:
        async for raw in ws.iter_text():
            msg = json.loads(raw)

            if msg["event"] == "start":
                stream_sid = msg["start"]["streamSid"]

            elif msg["event"] == "media":
                # Twilio sends base64-encoded 8 kHz mu-law audio frames
                audio_bytes = base64.b64decode(msg["media"]["payload"])
                await deepgram_ws.send(audio_bytes)

            elif msg["event"] == "stop":
                break
    except WebSocketDisconnect:
        pass
    finally:
        await deepgram_ws.close()
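To play audio back to the caller, Twilio expects JSON media messages on the same WebSocket, with the payload base64-encoded and tagged with the stream SID. A sketch of the framing helper (the helper name is ours; the TTS bytes must already be 8 kHz mu-law):

```python
import base64
import json

def twilio_media_message(stream_sid: str, audio: bytes) -> str:
    """Wrap raw mu-law audio in Twilio's outbound media frame."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(audio).decode()},
    })
```

Inside the handler you would call `await ws.send_text(twilio_media_message(stream_sid, tts_audio))` for each chunk of TTS output.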

Booking Agent with Tool Calls

The AI agent uses tools to check availability, book slots, and cancel appointments.

# agents/booking_agent.py
from datetime import datetime, timedelta

from agents import Agent, function_tool

from db import db  # SQLAlchemy session, defined elsewhere in the project
from models import Appointment, Provider, TimeSlot

@function_tool
def check_availability(provider_name: str, date: str) -> str:
    """Check available time slots for a provider on a given date."""
    target = datetime.strptime(date, "%Y-%m-%d")
    slots = db.query(TimeSlot).join(Provider).filter(
        Provider.name.ilike(f"%{provider_name}%"),
        TimeSlot.start_time >= target,
        TimeSlot.start_time < target + timedelta(days=1),
        TimeSlot.is_available == True,
    ).order_by(TimeSlot.start_time).all()
    if not slots:
        return f"No availability for {provider_name} on {date}."
    times = [s.start_time.strftime("%I:%M %p") for s in slots]
    return f"Available times: {', '.join(times)}"

@function_tool
def book_appointment(slot_time: str, patient_name: str,
                     patient_phone: str, reason: str) -> str:
    """Book an appointment at the specified time."""
    slot = db.query(TimeSlot).filter(
        TimeSlot.start_time == datetime.strptime(slot_time, "%Y-%m-%d %H:%M"),
        TimeSlot.is_available == True,
    ).first()
    if not slot:
        return "That time slot is no longer available."
    slot.is_available = False
    appt = Appointment(
        slot_id=slot.id,
        patient_name=patient_name,
        patient_phone=patient_phone,  # required by the schema; capture it on the call
        reason=reason,
        confirmed=True,
    )
    db.add(appt)
    db.commit()
    return f"Appointment booked for {patient_name} at {slot_time}."

booking_agent = Agent(
    name="Booking Agent",
    instructions="""You are a friendly appointment booking assistant on a phone call.
    Always confirm the provider, date, time, and reason before booking.
    Speak naturally since the caller is listening to TTS output.
    Keep responses under 2 sentences for quick voice delivery.""",
    tools=[check_availability, book_appointment],
)

Speech-to-Text and Text-to-Speech Pipeline

Connect Deepgram for real-time STT with interim results, and ElevenLabs for low-latency TTS streaming.

# services/stt.py
import os

import httpx
import websockets

VOICE_ID = os.environ.get("ELEVENLABS_VOICE_ID", "")

async def connect_deepgram():
    # Twilio media streams carry 8 kHz mu-law audio, so tell Deepgram the encoding
    url = (
        "wss://api.deepgram.com/v1/listen"
        "?model=nova-2&punctuate=true&encoding=mulaw&sample_rate=8000"
    )
    # Note: websockets v13+ renamed this parameter to additional_headers
    return await websockets.connect(url, extra_headers={
        "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"
    })

async def stream_tts(text: str) -> bytes:
    """Convert text to speech using the ElevenLabs streaming API."""
    audio = b""
    async with httpx.AsyncClient() as client:
        async with client.stream(
            "POST",
            f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream",
            headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
            params={"output_format": "ulaw_8000"},  # matches Twilio's audio format
            json={"text": text, "model_id": "eleven_turbo_v2"},
        ) as resp:
            async for chunk in resp.aiter_bytes():
                audio += chunk  # in production, forward each chunk immediately
    return audio

Deployment and Testing

Deploy with Docker Compose using three services: the FastAPI backend, PostgreSQL, and an ngrok container for exposing your local WebSocket to Twilio during development. For production, deploy behind an nginx reverse proxy with TLS and configure Twilio to point to your domain.

Test the booking flow end-to-end by calling your Twilio number, requesting an appointment, confirming the details, and verifying the database record. Automated testing uses recorded audio fixtures played through the WebSocket handler.
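One way to sketch the fixture idea: chop a recorded mu-law file into 20 ms frames and wrap each as a Twilio media event, then feed the frames to the handler. The helper below is ours and assumes 8 kHz mu-law, one byte per sample:

```python
import base64
import json

def fixture_frames(audio: bytes, frame_ms: int = 20,
                   sample_rate: int = 8000) -> list[str]:
    """Split raw mu-law audio into Twilio-style media messages for replaying
    through the WebSocket handler in tests."""
    frame_size = sample_rate * frame_ms // 1000  # bytes per frame (1 byte/sample)
    frames = []
    for i in range(0, len(audio), frame_size):
        chunk = audio[i:i + frame_size]
        frames.append(json.dumps({
            "event": "media",
            "media": {"payload": base64.b64encode(chunk).decode()},
        }))
    return frames
```

A test would send a synthetic "start" event, these frames, then "stop", and assert on the resulting appointment rows.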


FAQ

How do I handle interruptions when the caller speaks over the AI?

Implement barge-in detection by monitoring the Deepgram transcript stream while TTS audio is playing. When new speech is detected, immediately stop the TTS playback by sending a clear message on the Twilio WebSocket, then process the new utterance.
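A sketch of the clear frame Twilio accepts on bidirectional streams, which flushes any buffered outbound audio (the helper name is ours):

```python
import json

def twilio_clear_message(stream_sid: str) -> str:
    """Tell Twilio to drop queued outbound audio so the agent stops talking."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})
```

On barge-in, send this with `await ws.send_text(...)`, cancel any in-flight TTS task, and route the new transcript to the agent.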

What latency should I target for a natural voice experience?

Aim for under 800ms total round-trip from end-of-speech to start-of-response-audio. Deepgram Nova-2 typically returns final transcripts within 200ms, the LLM response takes 300-400ms, and ElevenLabs streaming TTS begins output within 200ms.

How do I prevent double-booking?

Use a database-level unique constraint or a SELECT FOR UPDATE lock on the time slot row. Wrap the availability check and booking in a single database transaction so that concurrent callers cannot book the same slot.
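A sketch of the transactional pattern with SQLAlchemy, shown here against a simplified stand-in for the TimeSlot model (the row lock serializes concurrent callers on PostgreSQL; SQLite ignores FOR UPDATE, so this demo relies on the transaction alone):

```python
from sqlalchemy import Boolean, Column, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Slot(Base):  # simplified stand-in for the TimeSlot model above
    __tablename__ = "slots"
    id = Column(Integer, primary_key=True)
    is_available = Column(Boolean, default=True)

def book_slot(session: Session, slot_id: int) -> bool:
    """Check availability and book inside one transaction so two callers
    cannot both claim the same slot."""
    with session.begin():
        slot = (
            session.query(Slot)
            .filter(Slot.id == slot_id, Slot.is_available == True)
            .with_for_update()  # row lock held until commit
            .first()
        )
        if slot is None:
            return False
        slot.is_available = False
    return True
```

The same shape drops into book_appointment: query with `.with_for_update()`, mutate, and let the transaction commit release the lock.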


#CapstoneProject #VoiceAI #Twilio #AppointmentBooking #STTTTS #FullStackAI #AgenticAI #LearnAI #AIEngineering
