Technical Guides

ChromaDB RAG for Voice Agents: CallSphere vs Vapi Knowledge Base

How CallSphere uses ChromaDB embeddings + a Lookup specialist agent for voice RAG vs Vapi PDF Knowledge Base. Retrieval quality, indexing, costs.

TL;DR

Vapi Knowledge Base lets you upload PDFs and documents that the assistant can cite during a call — managed embedding, managed retrieval, opaque chunking. CallSphere runs ChromaDB as a self-hosted vector store with a dedicated Lookup specialist agent in IT Helpdesk that performs explicit retrieve-then-answer. Both work for FAQ-style queries; CallSphere's approach gives you tunable chunking, custom retrievers (BM25 hybrid, MMR), and the ability to inspect every retrieval that influenced an answer.

If you can ship one PDF and never look back, Vapi is fine. If you need to know why the agent answered "30-day return policy" instead of "60-day," you need an inspectable RAG pipeline.

Voice RAG Is Different From Chat RAG

Voice agents have constraints chat does not:

  • Latency budget — you have ~250ms before the user notices a gap
  • Token cost — every retrieved chunk lives in the Realtime context across turns
  • Truncation — the LLM has to summarize rather than quote, because a listener cannot follow citation footnotes in audio
  • Failure handling — when retrieval misses, the model must say "I'm not sure" rather than hallucinate

These constraints push you toward smaller chunks, fewer of them, and explicit confidence thresholds.
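Those constraints can be budgeted explicitly. A back-of-envelope sketch (the per-token latency constant and function names are illustrative assumptions, not measurements):

```python
# Rough budget arithmetic for a voice RAG turn.
# All constants are illustrative assumptions, not measured numbers.

CHUNK_TOKENS = 180             # target chunk size
CHUNKS_IN_CONTEXT = 3          # chunks injected per turn
MS_PER_1K_PROMPT_TOKENS = 30   # assumed prompt-processing latency

def retrieval_context_cost(turns: int) -> dict:
    """Tokens and added first-token latency contributed by retrieved chunks."""
    tokens_per_turn = CHUNK_TOKENS * CHUNKS_IN_CONTEXT
    return {
        "tokens_per_turn": tokens_per_turn,
        # Retrieved chunks stay in the Realtime context, so a worst case
        # pays for them on every subsequent turn
        "tokens_over_call": tokens_per_turn * turns,
        "added_latency_ms": tokens_per_turn / 1000 * MS_PER_1K_PROMPT_TOKENS,
    }
```

Three 180-token chunks already cost ~540 prompt tokens per turn, which is why the pipeline below prunes to top-3 rather than injecting all eight candidates.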

Vapi Knowledge Base Approach

Vapi exposes a Knowledge Base as a per-assistant resource:

{
  "knowledgeBase": {
    "provider": "trieve",
    "topK": 5,
    "fileIds": ["file_abc123", "file_def456"]
  }
}

Behind the scenes: documents are chunked, embedded, and indexed in Trieve's managed vector store. At call time, every user query triggers a retrieval, and the top-K chunks are injected into the LLM context.


Strengths: zero infra, drop a PDF, done.

Weaknesses:

  • Chunking strategy is fixed
  • Hybrid retrieval (BM25 + dense) is not exposed
  • You cannot inspect which chunks were retrieved for a given turn
  • Re-indexing on document update is manual
  • No metadata filtering (e.g., "only retrieve from Q1 2026 docs")
  • Citations in voice responses are vague

CallSphere ChromaDB Approach

CallSphere ships with ChromaDB embedded in the IT Helpdesk vertical. The architecture is:

User question
   ↓
Orchestrator (IT Triage)
   ↓ hand_off if knowledge query
Lookup Specialist Agent
   ↓ tool: retrieve_kb(query, filters, k=8)
ChromaDB (sentence-transformers/all-MiniLM-L6-v2 embeddings)
   ↓ top-K chunks with metadata
Re-rank (Cohere rerank-3 optional, BM25 hybrid)
   ↓ top-3 chunks
LLM (gpt-4o-realtime) generates audio response
   ↓
Postgres call_logs.retrievals[] for audit

Indexing Pipeline

The IT Helpdesk ingestion script:

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="/data/chroma")
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
col = client.get_or_create_collection(
    name="it_kb",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"},
)

def index_doc(doc_path: str, doc_meta: dict):
    # semantic_chunker is CallSphere-internal: section-aware splitting
    # toward a token target, with overlap across boundaries
    chunks = semantic_chunker(doc_path, target_tokens=180, overlap=30)
    # Batch the upsert: one call per document, not one per chunk
    col.upsert(
        ids=[f"{doc_meta['id']}::{i}" for i in range(len(chunks))],
        documents=[chunk.text for chunk in chunks],
        metadatas=[{
            **doc_meta,
            "chunk_index": i,
            "section": chunk.section,
            "updated_at": chunk.updated_at,
        } for i, chunk in enumerate(chunks)],
    )

Three deliberate choices:

  • 180-token chunks — small enough for voice context, large enough for semantic coherence
  • 30-token overlap — preserves cross-boundary entities
  • Metadata-rich — enables filters like {"$and": [{"section": "returns"}, {"updated_at": {"$gte": "2026-01-01"}}]} (ChromaDB requires $and to combine multiple conditions)
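`semantic_chunker` is internal to CallSphere; as a rough stand-in with the same parameters, a fixed-window chunker over whitespace tokens might look like this (a sketch — the real chunker is section-aware and returns objects, not strings):

```python
def simple_chunker(text: str, target_tokens: int = 180, overlap: int = 30) -> list[str]:
    """Naive stand-in for a semantic chunker: fixed-size windows with overlap.
    Uses whitespace tokens; a real chunker respects sentence/section bounds."""
    tokens = text.split()
    step = target_tokens - overlap  # each window restarts `overlap` tokens back
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + target_tokens]
        if not window:
            break
        chunks.append(" ".join(window))
        if start + target_tokens >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

With the defaults, consecutive chunks share their last/first 30 tokens, which is what preserves entities that straddle a chunk boundary.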

Retrieval Tool

The Lookup specialist exposes a single tool:

import uuid

@tool
async def retrieve_kb(
    query: str,
    section_filter: str | None = None,
    k: int = 8,
) -> RetrievalResult:
    # ChromaDB rejects an empty where dict; pass None when unfiltered
    where = {"section": section_filter} if section_filter else None
    raw = col.query(
        query_texts=[query],
        n_results=k,
        where=where,
    )

    # Hybrid: convert cosine distances to similarities, then blend with
    # BM25 scores from a parallel index (alpha weights the dense side)
    dense_sims = [1 - d for d in raw["distances"][0]]
    bm25_scores = bm25_index.get_scores(query, raw["ids"][0])
    blended = blend(dense_sims, bm25_scores, alpha=0.7)

    # Re-order candidates by blended score, then rerank top-8 down to top-3
    ranked = [doc for _, doc in sorted(
        zip(blended, raw["documents"][0]), key=lambda p: p[0], reverse=True
    )]
    top3 = cohere_rerank(query, ranked, top_n=3)

    return RetrievalResult(
        chunks=top3,
        confidence=max(blended),
        retrieval_id=str(uuid.uuid4()),
    )
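`blend` and `bm25_index` are left undefined above; a minimal `blend` could look like this (a sketch, assuming both score lists are put on a comparable [0, 1] scale by min-max normalization before mixing):

```python
def blend(dense: list[float], sparse: list[float], alpha: float = 0.7) -> list[float]:
    """Convex combination of dense-similarity and BM25 scores.
    Both lists are min-max normalized to [0, 1] first so the scales match."""
    def norm(xs: list[float]) -> list[float]:
        lo, hi = min(xs), max(xs)
        if hi == lo:
            return [1.0] * len(xs)  # degenerate case: all candidates tie
        return [(x - lo) / (hi - lo) for x in xs]
    d, s = norm(dense), norm(sparse)
    return [alpha * di + (1 - alpha) * si for di, si in zip(d, s)]
```

alpha=0.7 weights dense retrieval over keyword match; tuning it per vertical is one of the knobs a managed knowledge base does not expose.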

Confidence Threshold

If confidence < 0.55, the specialist tells the user "I am not sure — let me transfer you to a human agent" rather than hallucinate an answer. This is the single most important RAG pattern for voice.
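As a sketch, the gate itself is a few lines (function and field names are hypothetical; the 0.55 floor is the threshold from the text):

```python
CONFIDENCE_FLOOR = 0.55

def route_response(result) -> str:
    """Answer from retrieved chunks only when confidence clears the floor;
    otherwise escalate instead of letting the model guess."""
    if result.confidence < CONFIDENCE_FLOOR:
        return "escalate"  # spoken as: "I am not sure, let me transfer you"
    return "answer"
```

The point is that the decision is explicit and logged, not buried in the model's sampling behavior.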


Inspectability

Every retrieval gets a retrieval_id written to Postgres:

SELECT
  cl.call_id,
  r.retrieval_id,
  r.query,
  r.chunks_returned,
  r.confidence,
  r.influenced_response_id
FROM call_logs cl
JOIN retrievals r ON r.call_id = cl.call_id
WHERE cl.created_at > NOW() - INTERVAL '24 hours'
  AND r.confidence < 0.7;

This query surfaces low-confidence retrievals from the last day, which feeds the weekly content gap report — "we kept failing to answer X, write a doc."
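Feeding the report is then a matter of grouping those low-confidence queries and counting repeats; a minimal sketch (`content_gaps` is a hypothetical helper over rows shaped like the SQL result above):

```python
from collections import Counter

def content_gaps(rows: list[dict], min_hits: int = 3) -> list[tuple[str, int]]:
    """Group low-confidence retrieval queries and surface repeat offenders.
    rows: dicts with a 'query' key, e.g. fetched via the audit SQL."""
    def normalize(q: str) -> str:
        # Lowercase and collapse whitespace so near-duplicates cluster
        return " ".join(q.lower().split())
    counts = Counter(normalize(r["query"]) for r in rows)
    return [(q, n) for q, n in counts.most_common() if n >= min_hits]
```

Anything crossing min_hits in a week becomes a "write a doc" ticket.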

Vapi vs CallSphere RAG Comparison

| Dimension | Vapi Knowledge Base | CallSphere ChromaDB |
| --- | --- | --- |
| Vector store | Managed (Trieve) | ChromaDB, self-hosted |
| Embedding model | Provider default | all-MiniLM-L6-v2 (swappable) |
| Chunking | Fixed | Configurable, semantic |
| Hybrid retrieval | Not exposed | BM25 + dense blend |
| Reranking | Built-in (opaque) | Cohere rerank-3 optional |
| Metadata filter | Limited | Full where-clause |
| Confidence threshold | Implicit | Explicit, configurable |
| Inspect retrieval logs | No | Per-turn in Postgres |
| Re-indexing | Manual upload | CI/CD pipeline |
| Cost | Bundled in Vapi pricing | Compute + embedding |

RAG Retrieval Pipeline

graph LR
    Q[User voice query] --> Orch[Orchestrator]
    Orch -->|hand_off| Lookup[Lookup Specialist]
    Lookup -->|retrieve_kb| Embed[Embed query<br/>MiniLM-L6-v2]
    Embed --> Chroma[(ChromaDB<br/>cosine)]
    Lookup --> BM25[BM25 index]
    Chroma --> Blend[Blend α=0.7]
    BM25 --> Blend
    Blend --> Rerank[Cohere rerank-3]
    Rerank --> Conf{conf > 0.55?}
    Conf -->|yes| LLM[gpt-4o-realtime]
    Conf -->|no| Escalate[Escalate to human]
    LLM --> Audio[PCM16 response]
    LLM --> Log[(retrievals log)]

Practical Tips

  • Cap context at 3 chunks. More chunks = more tokens = more first-token latency.
  • Embed FAQ-shaped paraphrases of source docs. Customers ask in question form; docs are written in declarative form.
  • Re-embed when you change chunkers. Mixing chunkers in one collection is a silent quality-killer.
  • Always log query + chunk IDs. Without this you cannot debug RAG failures.
  • Use metadata filters aggressively. "Only retrieve from active products" beats relevance ranking on stale data.
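The second tip can be made concrete: emit question-form variants alongside each declarative chunk, each carrying a pointer back to the source chunk. A sketch (`expand_with_paraphrases` and the metadata keys are hypothetical; the paraphrases themselves would come from an LLM pass or manual authoring):

```python
def expand_with_paraphrases(chunk_text: str, chunk_id: str,
                            paraphrases: list[str]) -> list[dict]:
    """Emit the declarative chunk plus question-form variants.
    Each variant embeds separately but resolves to the same source chunk."""
    records = [{"id": chunk_id, "text": chunk_text,
                "meta": {"source_id": chunk_id, "kind": "source"}}]
    for i, q in enumerate(paraphrases):
        records.append({"id": f"{chunk_id}::q{i}", "text": q,
                        "meta": {"source_id": chunk_id, "kind": "paraphrase"}})
    return records
```

At answer time, the agent deduplicates on source_id so three paraphrase hits still surface one source chunk.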

FAQ

Why ChromaDB and not pgvector?

Both work. ChromaDB has lighter operational overhead for the IT Helpdesk scale (50K-500K chunks). At 5M+ chunks, pgvector or a hosted vector DB wins.

Can I use my own embeddings?

Yes — the embedding function is a config knob. We have run OpenAI text-embedding-3-small and bge-large-en-v1.5 in production.

Does the voice latency budget kill RAG?

Only if you inject too many chunks; the rerank step exists to prune. With k=8 → rerank → top-3, the total retrieval round-trip is 80-150ms.

How do you keep the KB fresh?

GitHub repo of source docs → CI/CD pipeline re-chunks and upserts on push. ChromaDB upsert is idempotent on chunk ID.

Can the agent cite sources verbally?

Yes — each chunk carries a source_title metadata field, and the system prompt asks the agent to say "according to our returns policy, ..." when relevant.

Try the IT Helpdesk Demo

The /demo flow includes the IT Helpdesk RAG path; ask it a policy question and inspect the retrieval log. /industries/it-helpdesk has full architecture diagrams.

