
Call Recording and Transcription for AI Analysis: Building a Call Analytics Pipeline

Build a complete call analytics pipeline that records calls, transcribes them, and extracts actionable insights using AI. Covers recording APIs, speaker diarization, sentiment analysis, and trend detection.

Why Call Analytics Matters

Every phone call your business handles is a goldmine of unstructured data — customer pain points, competitor mentions, product feedback, and sales signals. Without a structured analytics pipeline, these insights vanish the moment the call ends. A call analytics pipeline captures recordings, transcribes them accurately, and uses AI to extract structured insights at scale.

The pipeline has four stages: recording, transcription, analysis, and storage. Each stage feeds the next, and the final output is a structured dataset you can query, visualize, and act on.
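
End to end, the flow looks like this:

flowchart LR
    CALL(["Phone call"])
    REC["Stage 1<br/>Record"]
    STT["Stage 2<br/>Transcribe + diarize"]
    LLM["Stage 3<br/>LLM analysis"]
    DB[("Stage 4<br/>PostgreSQL")]
    CALL --> REC --> STT --> LLM --> DB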

Stage 1: Recording Calls

Using Twilio as an example, you can start a dual-channel recording on a live call with a single REST API request from your incoming-call webhook:

import os

from fastapi import FastAPI, Request
from fastapi.responses import Response
from twilio.rest import Client
from twilio.twiml.voice_response import VoiceResponse

app = FastAPI()
twilio_client = Client(
    os.environ["TWILIO_ACCOUNT_SID"], os.environ["TWILIO_AUTH_TOKEN"]
)

@app.post("/incoming-call")
async def handle_call(request: Request):
    form = await request.form()
    call_sid = form["CallSid"]

    # Start a dual-channel recording on the live call: separate audio
    # tracks for caller and agent. The callback URL must be publicly
    # reachable by Twilio.
    twilio_client.calls(call_sid).recordings.create(
        recording_channels="dual",
        recording_status_callback="https://your-domain.example/recording-status",
        recording_status_callback_event=["completed"],
    )

    response = VoiceResponse()
    response.say("Thank you for calling. How can I help?")
    response.gather(input="speech", action="/handle-speech")
    return Response(content=str(response), media_type="application/xml")

@app.post("/recording-status")
async def recording_complete(request: Request):
    """Webhook called when recording is finalized."""
    form = await request.form()
    recording_sid = form["RecordingSid"]
    recording_url = form["RecordingUrl"]
    duration = int(form["RecordingDuration"])
    call_sid = form["CallSid"]

    # Trigger the transcription pipeline (see "The Complete Pipeline"
    # for a version that does not block the webhook response)
    await start_transcription_pipeline(
        recording_sid=recording_sid,
        recording_url=f"{recording_url}.wav",
        duration=duration,
        call_sid=call_sid,
    )
    return {"status": "accepted"}

Dual-channel recording is critical for analytics — it puts each speaker on a separate audio track, which dramatically improves transcription accuracy and makes speaker diarization trivial.


Stage 2: Transcription with Speaker Diarization

Download the recording and run it through a speech-to-text engine with speaker separation:

import os

import httpx
from deepgram import DeepgramClient, PrerecordedOptions

deepgram = DeepgramClient(os.environ["DEEPGRAM_API_KEY"])

async def transcribe_recording(recording_url: str, auth_token: str):
    """Download recording and transcribe with speaker diarization."""
    # Download the recording from Twilio
    async with httpx.AsyncClient() as client:
        resp = await client.get(
            recording_url,
            auth=(os.environ["TWILIO_ACCOUNT_SID"], auth_token),
        )
        audio_bytes = resp.content

    # Transcribe with Deepgram (diarization + punctuation)
    options = PrerecordedOptions(
        model="nova-2",
        smart_format=True,
        diarize=True,
        punctuate=True,
        utterances=True,
        language="en-US",
    )

    response = await deepgram.listen.asyncrest.v("1").transcribe_file(
        {"buffer": audio_bytes, "mimetype": "audio/wav"},
        options,
    )

    # Structure the transcript by speaker
    utterances = response.results.utterances
    structured_transcript = []
    for utterance in utterances:
        structured_transcript.append({
            "speaker": f"Speaker {utterance.speaker}",
            "text": utterance.transcript,
            "start": utterance.start,
            "end": utterance.end,
            "confidence": utterance.confidence,
        })

    return structured_transcript
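
Since the recording is dual-channel, you can also skip statistical diarization and let the engine transcribe each audio track separately. A minimal sketch, assuming Deepgram's multichannel option with the same SDK as above (in Twilio dual recordings, channel 0 is the caller and channel 1 the agent):

options = PrerecordedOptions(
    model="nova-2",
    smart_format=True,
    multichannel=True,  # one transcript per audio track
    utterances=True,
)
# Each utterance then carries a channel index (utterance.channel)
# instead of an inferred speaker label, so speaker mapping is exact.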

Stage 3: AI-Powered Analysis

With a structured transcript in hand, use an LLM to extract insights:

import json

from openai import AsyncOpenAI

client = AsyncOpenAI()

ANALYSIS_PROMPT = """Analyze this call transcript and extract:
1. **Summary**: 2-3 sentence summary of the call
2. **Sentiment**: overall (positive/neutral/negative), and per-speaker
3. **Intent**: caller's primary intent (support, sales, complaint, etc.)
4. **Key Topics**: list of topics discussed
5. **Action Items**: any follow-up actions promised
6. **Satisfaction Score**: 1-10 estimate of caller satisfaction
7. **Escalation Risk**: low/medium/high
8. **Competitor Mentions**: any competitor names mentioned

Return a valid JSON object with exactly these keys: summary, sentiment,
intent, key_topics, action_items, satisfaction_score, escalation_risk,
competitor_mentions."""

async def analyze_transcript(transcript: list[dict]) -> dict:
    """Run AI analysis on a structured transcript."""
    # Format transcript for the LLM
    formatted = "\n".join(
        f"[{t['speaker']}] ({t['start']:.1f}s): {t['text']}"
        for t in transcript
    )

    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": ANALYSIS_PROMPT},
            {"role": "user", "content": formatted},
        ],
        response_format={"type": "json_object"},
        temperature=0.2,
    )

    return json.loads(response.choices[0].message.content)
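
LLM output can drift from the requested shape, so it is worth validating the JSON before it reaches storage. A minimal sketch using Pydantic, where the field names mirror the prompt's keys (the validator itself is an assumption, not part of the pipeline above):

from pydantic import BaseModel, Field

class CallAnalysis(BaseModel):
    summary: str
    sentiment: dict           # overall plus per-speaker, per the prompt
    intent: str
    key_topics: list[str]
    action_items: list[str]
    satisfaction_score: int = Field(ge=1, le=10)
    escalation_risk: str      # low / medium / high
    competitor_mentions: list[str] = []

async def analyze_validated(transcript: list[dict]) -> dict:
    """Raise a ValidationError instead of corrupting the analytics table."""
    raw = await analyze_transcript(transcript)
    return CallAnalysis.model_validate(raw).model_dump()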

Stage 4: Storage and Querying

Store the raw transcript and analysis results in a database optimized for querying:
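
The code below assumes a call_analytics table along these lines, a sketch whose column types (JSONB vs TEXT, arrays) you may want to adjust:

import asyncpg

async def create_schema(pool: asyncpg.Pool):
    """One-time setup for the analytics table (illustrative schema)."""
    await pool.execute(
        """
        CREATE TABLE IF NOT EXISTS call_analytics (
            call_sid            TEXT PRIMARY KEY,
            transcript          JSONB,
            summary             TEXT,
            sentiment           JSONB,
            intent              TEXT,
            topics              TEXT[],
            action_items        JSONB,
            satisfaction_score  INT,
            escalation_risk     TEXT,
            competitor_mentions TEXT[],
            duration_seconds    INT,
            analyzed_at         TIMESTAMPTZ
        )
        """
    )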

import asyncpg
import json
from datetime import datetime, timezone

async def store_call_analysis(
    pool: asyncpg.Pool,
    call_sid: str,
    transcript: list[dict],
    analysis: dict,
    duration: int,
):
    """Persist call data and analysis to PostgreSQL."""
    await pool.execute(
        """
        INSERT INTO call_analytics (
            call_sid, transcript, summary, sentiment,
            intent, topics, action_items, satisfaction_score,
            escalation_risk, competitor_mentions,
            duration_seconds, analyzed_at
        ) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12)
        """,
        call_sid,
        json.dumps(transcript),
        analysis["summary"],
        analysis["sentiment"],
        analysis["intent"],
        analysis["key_topics"],
        json.dumps(analysis["action_items"]),
        analysis["satisfaction_score"],
        analysis["escalation_risk"],
        analysis.get("competitor_mentions", []),
        duration,
        datetime.now(timezone.utc),
    )

async def get_insights_summary(pool: asyncpg.Pool, days: int = 7):
    """Query aggregate insights over a time period."""
    return await pool.fetch(
        """
        WITH recent AS (
            SELECT *
            FROM call_analytics
            WHERE analyzed_at >= NOW() - make_interval(days => $1)
        )
        SELECT
            intent,
            COUNT(*) AS call_count,
            AVG(satisfaction_score) AS avg_satisfaction,
            COUNT(*) FILTER (WHERE escalation_risk = 'high') AS escalations,
            -- Aggregate topics in a correlated subquery so the unnest
            -- does not multiply per-call rows and inflate the counts
            (SELECT array_agg(DISTINCT topic)
             FROM recent r, LATERAL unnest(r.topics) AS topic
             WHERE r.intent = recent.intent) AS all_topics
        FROM recent
        GROUP BY intent
        ORDER BY call_count DESC
        """,
        days,
    )
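
Calling the aggregate query from a reporting script is then straightforward (a usage sketch; pool creation is assumed):

async def print_weekly_report(pool: asyncpg.Pool):
    for row in await get_insights_summary(pool, days=7):
        print(
            f"{row['intent']}: {row['call_count']} calls, "
            f"avg satisfaction {row['avg_satisfaction']:.1f}, "
            f"{row['escalations']} high-risk"
        )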

The Complete Pipeline

Wire all four stages together in one orchestration function; in production, trigger it from an async task queue so the webhook stays fast (a sketch follows the code):


async def start_transcription_pipeline(
    recording_sid: str,
    recording_url: str,
    duration: int,
    call_sid: str,
):
    """Orchestrate the full recording-to-insights pipeline."""
    # Stage 2: Transcribe
    transcript = await transcribe_recording(
        recording_url, os.environ["TWILIO_AUTH_TOKEN"]
    )

    # Stage 3: Analyze
    analysis = await analyze_transcript(transcript)

    # Stage 4: Store (db_pool is an asyncpg pool created at app startup)
    await store_call_analysis(
        db_pool, call_sid, transcript, analysis, duration
    )

    print(f"Pipeline complete for call {call_sid}: "
          f"intent={analysis['intent']}, "
          f"satisfaction={analysis['satisfaction_score']}/10")

FAQ

How long does the pipeline take per call?

Transcription takes roughly 20-30% of the call duration with modern engines like Deepgram Nova-2. AI analysis adds 2-5 seconds. For a 5-minute call, expect the full pipeline to complete in roughly 60-90 seconds. Run it asynchronously after the call ends so it never impacts call quality.

Is it legal to record customer calls?

Recording laws vary by jurisdiction. In "two-party consent" states (like California) and countries (like Germany), you must inform all parties and obtain consent before recording. Add a recording disclosure at the start of every call and implement a mechanism to disable recording if consent is denied. Consult legal counsel for your specific jurisdictions.

How accurate is modern speech-to-text for phone calls?

Modern engines like Deepgram Nova-2 and OpenAI Whisper achieve 90-95% accuracy on clean phone audio. Accuracy drops with heavy accents, background noise, or poor phone connections. Dual-channel recording improves accuracy by 5-10% because each speaker has a clean audio track. Always store the raw recording alongside the transcript so you can re-transcribe as models improve.



