Learn Agentic AI

Capstone: Building a Voice-Enabled Appointment Booking System from Scratch

Build a complete voice-powered appointment booking system using Twilio, speech-to-text, text-to-speech, calendar integration, and intelligent booking logic with a FastAPI backend.

System Architecture

A voice-enabled appointment booking system takes an inbound phone call, converts speech to text, processes the request through an AI agent, books or modifies appointments in a calendar, and speaks the response back to the caller. This capstone integrates Twilio for telephony, Deepgram for speech-to-text, OpenAI for the conversational agent, ElevenLabs for natural text-to-speech, and a PostgreSQL database for appointment storage.

The call flow is: Twilio receives the call and opens a WebSocket media stream to your backend. Your FastAPI backend receives raw audio frames, streams them to Deepgram for real-time transcription, sends the transcript to an AI agent, receives the agent response, converts it to speech via ElevenLabs, and streams the audio back through the Twilio WebSocket.

Database Schema for Appointments

# models.py
import uuid

from sqlalchemy import Boolean, Column, DateTime, ForeignKey, String, func
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Provider(Base):
    __tablename__ = "providers"
    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    name = Column(String(200), nullable=False)
    specialty = Column(String(100))
    timezone = Column(String(50), default="America/New_York")

class TimeSlot(Base):
    __tablename__ = "time_slots"
    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    provider_id = Column(UUID(as_uuid=True), ForeignKey("providers.id"), nullable=False)
    start_time = Column(DateTime, nullable=False)
    end_time = Column(DateTime, nullable=False)
    is_available = Column(Boolean, default=True)

class Appointment(Base):
    __tablename__ = "appointments"
    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    # unique=True gives a database-level guard against double-booking a slot
    slot_id = Column(UUID(as_uuid=True), ForeignKey("time_slots.id"), unique=True)
    patient_name = Column(String(200), nullable=False)
    patient_phone = Column(String(20), nullable=False)
    reason = Column(String(500))
    confirmed = Column(Boolean, default=False)
    created_at = Column(DateTime, server_default=func.now())
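To exercise this schema you need seed data. Here is a minimal sketch of a slot generator, assuming 30-minute appointments during 9-to-5 business hours (the interval and hours are illustrative, and `generate_slots` is a hypothetical helper, not part of the schema above):

```python
from datetime import datetime, timedelta

def generate_slots(day: datetime, start_hour: int = 9, end_hour: int = 17,
                   minutes: int = 30) -> list[tuple[datetime, datetime]]:
    """Return (start, end) pairs covering a provider's working day."""
    slots = []
    cursor = day.replace(hour=start_hour, minute=0, second=0, microsecond=0)
    end_of_day = day.replace(hour=end_hour, minute=0, second=0, microsecond=0)
    while cursor + timedelta(minutes=minutes) <= end_of_day:
        slots.append((cursor, cursor + timedelta(minutes=minutes)))
        cursor += timedelta(minutes=minutes)
    return slots
```

Each pair maps onto one TimeSlot row (`start_time`, `end_time`) for a given provider; a seed script would loop over providers and dates and insert the rows.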

Twilio WebSocket Integration

Twilio sends a webhook when a call arrives. You respond with TwiML that opens a bidirectional media stream to your server.

flowchart LR
    CALLER([Caller]) --> TWILIO["Twilio<br/>phone number"]
    TWILIO -- "WebSocket<br/>media stream" --> API["FastAPI<br/>backend"]
    API --> STT["Deepgram<br/>STT"]
    STT --> AGENT["Booking agent<br/>(OpenAI)"]
    AGENT --> DB[("PostgreSQL<br/>appointments")]
    AGENT --> TTS["ElevenLabs<br/>TTS"]
    TTS -- "audio frames" --> API
    API --> TWILIO
    style AGENT fill:#4f46e5,stroke:#4338ca,color:#fff
    style DB fill:#f59e0b,stroke:#d97706,color:#1f2937
    style TTS fill:#059669,stroke:#047857,color:#fff
# routes/twilio.py
from fastapi import APIRouter, Request
from fastapi.responses import Response

router = APIRouter()

@router.post("/incoming-call")
async def handle_incoming_call(request: Request):
    twiml = """<?xml version="1.0" encoding="UTF-8"?>
    <Response>
        <Connect>
            <Stream url="wss://your-domain.com/media-stream" />
        </Connect>
    </Response>"""
    return Response(content=twiml, media_type="application/xml")
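Before trusting the webhook, verify Twilio's X-Twilio-Signature header so that only Twilio can trigger your call flow. A sketch of the documented scheme, which is HMAC-SHA1 over the full request URL followed by the sorted POST parameters, base64-encoded (the function name is our own; the Twilio helper library offers an equivalent `RequestValidator`):

```python
import base64
import hashlib
import hmac

def valid_twilio_signature(auth_token: str, url: str,
                           params: dict[str, str], signature: str) -> bool:
    """Recompute Twilio's signature: the URL, then each POST parameter name
    and value concatenated in sorted order, HMAC-SHA1 keyed by the auth token."""
    payload = url + "".join(k + v for k, v in sorted(params.items()))
    digest = hmac.new(auth_token.encode(), payload.encode(), hashlib.sha1).digest()
    expected = base64.b64encode(digest).decode()
    return hmac.compare_digest(expected, signature)
```

In the route, compare against `request.headers["X-Twilio-Signature"]` and return a 403 on mismatch.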

The WebSocket handler receives audio frames from Twilio and manages the conversation loop.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →
# routes/media_stream.py
import base64
import json

from fastapi import APIRouter, WebSocket, WebSocketDisconnect

from services.stt import connect_deepgram

router = APIRouter()

@router.websocket("/media-stream")
async def media_stream(ws: WebSocket):
    await ws.accept()
    stream_sid = None
    deepgram_ws = await connect_deepgram()
    # A separate task would consume Deepgram transcripts, call the agent,
    # and stream TTS audio back; this loop handles the inbound leg.

    try:
        async for raw in ws.iter_text():
            msg = json.loads(raw)

            if msg["event"] == "start":
                stream_sid = msg["start"]["streamSid"]

            elif msg["event"] == "media":
                # Twilio sends base64-encoded 8 kHz mu-law audio frames
                audio_bytes = base64.b64decode(msg["media"]["payload"])
                await deepgram_ws.send(audio_bytes)

            elif msg["event"] == "stop":
                break
    except WebSocketDisconnect:
        pass
    finally:
        await deepgram_ws.close()
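To play audio back to the caller, Twilio expects JSON media messages on the same WebSocket, with the payload base64-encoded and tagged with the stream SID. A sketch of the framing helper (the helper name is ours; the TTS bytes must already be 8 kHz mu-law):

```python
import base64
import json

def twilio_media_message(stream_sid: str, audio: bytes) -> str:
    """Wrap raw mu-law audio in Twilio's outbound media frame."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(audio).decode()},
    })
```

Inside the handler you would call `await ws.send_text(twilio_media_message(stream_sid, tts_audio))` for each chunk of TTS output.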

Booking Agent with Tool Calls

The AI agent uses tools to check availability, book slots, and cancel appointments.

# agents/booking_agent.py
from datetime import datetime, timedelta

from agents import Agent, function_tool

from db import db  # SQLAlchemy session, defined elsewhere in the project
from models import Appointment, Provider, TimeSlot

@function_tool
def check_availability(provider_name: str, date: str) -> str:
    """Check available time slots for a provider on a given date."""
    target = datetime.strptime(date, "%Y-%m-%d")
    slots = db.query(TimeSlot).join(Provider).filter(
        Provider.name.ilike(f"%{provider_name}%"),
        TimeSlot.start_time >= target,
        TimeSlot.start_time < target + timedelta(days=1),
        TimeSlot.is_available == True,
    ).order_by(TimeSlot.start_time).all()
    if not slots:
        return f"No availability for {provider_name} on {date}."
    times = [s.start_time.strftime("%I:%M %p") for s in slots]
    return f"Available times: {', '.join(times)}"

@function_tool
def book_appointment(slot_time: str, patient_name: str,
                     patient_phone: str, reason: str) -> str:
    """Book an appointment at the specified time."""
    slot = db.query(TimeSlot).filter(
        TimeSlot.start_time == datetime.strptime(slot_time, "%Y-%m-%d %H:%M"),
        TimeSlot.is_available == True,
    ).first()
    if not slot:
        return "That time slot is no longer available."
    slot.is_available = False
    appt = Appointment(
        slot_id=slot.id,
        patient_name=patient_name,
        patient_phone=patient_phone,  # required by the schema; capture it on the call
        reason=reason,
        confirmed=True,
    )
    db.add(appt)
    db.commit()
    return f"Appointment booked for {patient_name} at {slot_time}."

booking_agent = Agent(
    name="Booking Agent",
    instructions="""You are a friendly appointment booking assistant on a phone call.
    Always confirm the provider, date, time, and reason before booking.
    Speak naturally since the caller is listening to TTS output.
    Keep responses under 2 sentences for quick voice delivery.""",
    tools=[check_availability, book_appointment],
)

Speech-to-Text and Text-to-Speech Pipeline

Connect Deepgram for real-time STT with interim results, and ElevenLabs for low-latency TTS streaming.

# services/stt.py
import os

import httpx
import websockets

VOICE_ID = os.environ.get("ELEVENLABS_VOICE_ID", "")

async def connect_deepgram():
    # Twilio media streams carry 8 kHz mu-law audio, so tell Deepgram the encoding
    url = (
        "wss://api.deepgram.com/v1/listen"
        "?model=nova-2&punctuate=true&encoding=mulaw&sample_rate=8000"
    )
    # Note: websockets v13+ renamed this parameter to additional_headers
    return await websockets.connect(url, extra_headers={
        "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"
    })

async def stream_tts(text: str) -> bytes:
    """Convert text to speech using the ElevenLabs streaming API."""
    audio = b""
    async with httpx.AsyncClient() as client:
        async with client.stream(
            "POST",
            f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream",
            headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
            params={"output_format": "ulaw_8000"},  # matches Twilio's audio format
            json={"text": text, "model_id": "eleven_turbo_v2"},
        ) as resp:
            async for chunk in resp.aiter_bytes():
                audio += chunk  # in production, forward each chunk immediately
    return audio

Deployment and Testing

Deploy with Docker Compose using three services: the FastAPI backend, PostgreSQL, and an ngrok container for exposing your local WebSocket to Twilio during development. For production, deploy behind an nginx reverse proxy with TLS and configure Twilio to point to your domain.

Test the booking flow end-to-end by calling your Twilio number, requesting an appointment, confirming the details, and verifying the database record. Automated testing uses recorded audio fixtures played through the WebSocket handler.
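One way to sketch the fixture idea: chop a recorded mu-law file into 20 ms frames and wrap each as a Twilio media event, then feed the frames to the handler. The helper below is ours and assumes 8 kHz mu-law, one byte per sample:

```python
import base64
import json

def fixture_frames(audio: bytes, frame_ms: int = 20,
                   sample_rate: int = 8000) -> list[str]:
    """Split raw mu-law audio into Twilio-style media messages for replaying
    through the WebSocket handler in tests."""
    frame_size = sample_rate * frame_ms // 1000  # bytes per frame (1 byte/sample)
    frames = []
    for i in range(0, len(audio), frame_size):
        chunk = audio[i:i + frame_size]
        frames.append(json.dumps({
            "event": "media",
            "media": {"payload": base64.b64encode(chunk).decode()},
        }))
    return frames
```

A test would send a synthetic "start" event, these frames, then "stop", and assert on the resulting appointment rows.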


FAQ

How do I handle interruptions when the caller speaks over the AI?

Implement barge-in detection by monitoring the Deepgram transcript stream while TTS audio is playing. When new speech is detected, immediately stop the TTS playback by sending a clear message on the Twilio WebSocket, then process the new utterance.
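A sketch of the clear frame Twilio accepts on bidirectional streams, which flushes any buffered outbound audio (the helper name is ours):

```python
import json

def twilio_clear_message(stream_sid: str) -> str:
    """Tell Twilio to drop queued outbound audio so the agent stops talking."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})
```

On barge-in, send this with `await ws.send_text(...)`, cancel any in-flight TTS task, and route the new transcript to the agent.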

What latency should I target for a natural voice experience?

Aim for under 800ms total round-trip from end-of-speech to start-of-response-audio. Deepgram Nova-2 typically returns final transcripts within 200ms, the LLM response takes 300-400ms, and ElevenLabs streaming TTS begins output within 200ms.

How do I prevent double-booking?

Use a database-level unique constraint or a SELECT FOR UPDATE lock on the time slot row. Wrap the availability check and booking in a single database transaction so that concurrent callers cannot book the same slot.
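A sketch of the transactional pattern with SQLAlchemy, shown here against a simplified stand-in for the TimeSlot model (the row lock serializes concurrent callers on PostgreSQL; SQLite ignores FOR UPDATE, so this demo relies on the transaction alone):

```python
from sqlalchemy import Boolean, Column, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Slot(Base):  # simplified stand-in for the TimeSlot model above
    __tablename__ = "slots"
    id = Column(Integer, primary_key=True)
    is_available = Column(Boolean, default=True)

def book_slot(session: Session, slot_id: int) -> bool:
    """Check availability and book inside one transaction so two callers
    cannot both claim the same slot."""
    with session.begin():
        slot = (
            session.query(Slot)
            .filter(Slot.id == slot_id, Slot.is_available == True)
            .with_for_update()  # row lock held until commit
            .first()
        )
        if slot is None:
            return False
        slot.is_available = False
    return True
```

The same shape drops into book_appointment: query with `.with_for_update()`, mutate, and let the transaction commit release the lock.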


#CapstoneProject #VoiceAI #Twilio #AppointmentBooking #STTTTS #FullStackAI #AgenticAI #LearnAI #AIEngineering
