
Text-to-SQL Evaluation: Spider, BIRD, and Custom Benchmarks for Accuracy Testing

Understand how to evaluate text-to-SQL systems using the Spider and BIRD benchmarks, implement execution accuracy metrics, and build custom evaluation datasets for your specific database schema.

Why Standard Evaluation Matters

Claiming your text-to-SQL system "works well" without rigorous evaluation is meaningless. Two systems that feel similar in casual testing might differ by 20% in accuracy on edge cases. Benchmarks give you objective measurements to compare models, track improvements, and identify weaknesses.

The text-to-SQL community has developed standardized benchmarks that test across hundreds of databases and thousands of question-query pairs. Understanding these benchmarks — and knowing when to build your own — is essential for production systems.

The Spider Benchmark

Spider is the most widely used text-to-SQL benchmark. It contains 10,181 questions across 200 databases covering 138 domains. Questions are categorized by difficulty: easy (single table, no aggregation), medium (joins, grouping), hard (subqueries, set operations), and extra hard (nested queries, multiple conditions).


Key characteristics:

  • Cross-database evaluation — the test set uses databases not seen during training
  • SQL complexity levels — from simple SELECT to multi-level nested queries
  • Multiple valid SQL representations — the same question might have several correct SQL formulations
# Example Spider dataset entry
spider_example = {
    "db_id": "concert_singer",
    "question": "How many singers do we have?",
    "query": "SELECT count(*) FROM singer",
    "difficulty": "easy",
}
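Given files in this shape, a useful first sanity check is a tally by difficulty. The sketch below assumes each entry carries a difficulty key as in the example above; the official Spider release derives difficulty inside its evaluation script rather than storing it in the JSON.

```python
import json
from collections import Counter

def difficulty_histogram(path: str) -> Counter:
    """Count dataset entries per difficulty level.

    Assumes a JSON list of entries, each optionally carrying a
    'difficulty' key as in the example above.
    """
    with open(path) as f:
        entries = json.load(f)
    return Counter(e.get("difficulty", "unknown") for e in entries)
```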

The BIRD Benchmark

BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) addresses limitations in Spider by using real-world databases with messy data, requiring external knowledge, and including value-based questions.

Key differences from Spider:

  • Dirty data — databases contain NULLs, inconsistent formats, and realistic noise
  • External knowledge — some questions require understanding domain conventions (e.g., "fiscal year starts in April")
  • Larger databases — tables with millions of rows where query efficiency matters (BIRD also reports a Valid Efficiency Score that rewards correct queries for running faster)
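For comparison with the Spider entry above, a BIRD-style entry adds an evidence field carrying the external knowledge a question needs. Field names here follow the released dev set, but the question and SQL are invented for illustration; verify against your copy of the data.

```python
# Sketch of a BIRD-style dataset entry. The 'evidence' field holds the
# external-knowledge hint; question and SQL are invented examples.
bird_example = {
    "db_id": "financial",
    "question": "What was the average loan amount in the last fiscal year?",
    "evidence": "fiscal year starts in April",  # domain convention the model needs
    "SQL": "SELECT AVG(amount) FROM loan "
           "WHERE date >= '2025-04-01' AND date < '2026-04-01'",
    "difficulty": "moderate",
}
```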

Evaluation Metrics

Exact Match Accuracy (EM) compares the predicted SQL string to the reference SQL. This is too strict — SELECT name FROM users and SELECT users.name FROM users are both correct but do not match.

Execution Accuracy (EX) runs both the predicted and reference SQL against the database and compares results. This is the standard metric because it correctly handles multiple valid SQL formulations.

import sqlite3
from typing import Any

def execution_accuracy(predicted_sql: str, reference_sql: str,
                       db_path: str) -> bool:
    """Check if predicted and reference SQL return the same results."""
    conn = sqlite3.connect(db_path)

    try:
        # Set comparison ignores row order and duplicate rows. That is
        # usually what you want, but for ORDER BY questions or queries
        # where duplicates matter, compare lists or multisets instead.
        pred_results = set(
            tuple(row) for row in conn.execute(predicted_sql).fetchall()
        )
        ref_results = set(
            tuple(row) for row in conn.execute(reference_sql).fetchall()
        )
        return pred_results == ref_results
    except Exception:
        # A predicted query that fails to execute counts as incorrect.
        return False
    finally:
        conn.close()

def evaluate_batch(test_cases: list[dict], model_fn, db_dir: str) -> dict:
    """Evaluate a text-to-SQL model on a batch of test cases."""
    results = {"total": 0, "correct": 0, "errors": 0, "by_difficulty": {}}

    for case in test_cases:
        results["total"] += 1
        db_path = f"{db_dir}/{case['db_id']}/{case['db_id']}.sqlite"

        try:
            predicted = model_fn(case["question"], db_path)
            is_correct = execution_accuracy(predicted, case["query"], db_path)

            if is_correct:
                results["correct"] += 1

            # Track by difficulty
            diff = case.get("difficulty", "unknown")
            if diff not in results["by_difficulty"]:
                results["by_difficulty"][diff] = {"total": 0, "correct": 0}
            results["by_difficulty"][diff]["total"] += 1
            if is_correct:
                results["by_difficulty"][diff]["correct"] += 1

        except Exception:
            results["errors"] += 1

    results["accuracy"] = results["correct"] / results["total"] if results["total"] > 0 else 0
    return results
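A smoke test of the metric against a throwaway database shows why execution accuracy accepts formulations that exact match would reject. The helper is repeated inline so the snippet runs standalone:

```python
import os
import sqlite3
import tempfile

def execution_accuracy(predicted_sql, reference_sql, db_path):
    # Same logic as the function above, repeated for a standalone run.
    conn = sqlite3.connect(db_path)
    try:
        pred = set(tuple(r) for r in conn.execute(predicted_sql).fetchall())
        ref = set(tuple(r) for r in conn.execute(reference_sql).fetchall())
        return pred == ref
    except Exception:
        return False
    finally:
        conn.close()

# Build a throwaway table in the style of Spider's concert_singer database.
path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE singer (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [(1, "A"), (2, "B")])
conn.commit()
conn.close()

# Two different but equivalent formulations agree on results...
assert execution_accuracy("SELECT count(*) FROM singer",
                          "SELECT COUNT(id) FROM singer", path)
# ...while a genuinely different query is rejected.
assert not execution_accuracy("SELECT name FROM singer WHERE id = 1",
                              "SELECT name FROM singer", path)
```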

Building a Custom Evaluation Dataset

Standard benchmarks tell you how your model performs in general. But production accuracy depends on your specific schema, your users' question patterns, and your data characteristics. Build a custom evaluation set.

import json
import sqlite3
from dataclasses import dataclass, asdict

@dataclass
class EvalCase:
    question: str
    reference_sql: str
    difficulty: str  # easy, medium, hard
    category: str    # e.g., "aggregation", "join", "filter", "date_range"
    notes: str = ""  # Why this case is interesting

class EvalDatasetBuilder:
    """Build and manage a custom text-to-SQL evaluation dataset."""

    def __init__(self, db_path: str):
        self.db_path = db_path
        self.cases: list[EvalCase] = []

    def add_case(self, question: str, reference_sql: str,
                 difficulty: str, category: str, notes: str = ""):
        # Verify the reference SQL actually works
        conn = sqlite3.connect(self.db_path)
        try:
            conn.execute(reference_sql)
        except Exception as e:
            raise ValueError(
                f"Reference SQL is invalid: {e}\nSQL: {reference_sql}"
            )
        finally:
            conn.close()

        self.cases.append(EvalCase(
            question=question,
            reference_sql=reference_sql,
            difficulty=difficulty,
            category=category,
            notes=notes,
        ))

    def save(self, path: str):
        with open(path, "w") as f:
            json.dump([asdict(c) for c in self.cases], f, indent=2)

    def load(self, path: str):
        with open(path) as f:
            self.cases = [EvalCase(**c) for c in json.load(f)]

    def summary(self) -> dict:
        from collections import Counter
        return {
            "total_cases": len(self.cases),
            "by_difficulty": dict(Counter(c.difficulty for c in self.cases)),
            "by_category": dict(Counter(c.category for c in self.cases)),
        }

# Build your dataset
builder = EvalDatasetBuilder("production_analytics.db")
builder.add_case(
    question="How many orders were placed in January 2026?",
    reference_sql="SELECT COUNT(*) FROM orders WHERE order_date >= '2026-01-01' AND order_date < '2026-02-01'",
    difficulty="easy",
    category="date_range",
    notes="Tests date range filtering with boundary conditions",
)
builder.add_case(
    question="Which product category has the highest average order value?",
    reference_sql="""
        SELECT p.category, AVG(oi.unit_price * oi.quantity) as avg_value
        FROM order_items oi
        JOIN products p ON oi.product_id = p.id
        GROUP BY p.category
        ORDER BY avg_value DESC
        LIMIT 1
    """,
    difficulty="medium",
    category="aggregation",
    notes="Requires JOIN and aggregation with sorting",
)
builder.save("eval_dataset.json")

For a production system, aim for at least 100 test cases with this distribution:

  • 30% easy — single-table filters, counts, simple aggregations
  • 40% medium — two-table JOINs, GROUP BY with HAVING, date ranges
  • 20% hard — three+ table JOINs, subqueries, window functions
  • 10% adversarial — ambiguous questions, questions with no valid answer, domain-specific terminology
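A quick way to check that a saved dataset matches those targets, assuming the JSON layout produced by builder.save above (a list of cases, each with a difficulty field):

```python
import json
from collections import Counter

# Target shares from the distribution recommended above.
TARGETS = {"easy": 0.30, "medium": 0.40, "hard": 0.20, "adversarial": 0.10}

def distribution_report(path: str, tolerance: float = 0.05) -> dict:
    """Compare a dataset's difficulty mix against the target shares.

    Returns {difficulty: (actual_share, within_tolerance)}.
    """
    with open(path) as f:
        cases = json.load(f)
    counts = Counter(c.get("difficulty", "unknown") for c in cases)
    total = len(cases) or 1
    return {
        d: (counts.get(d, 0) / total,
            abs(counts.get(d, 0) / total - target) <= tolerance)
        for d, target in TARGETS.items()
    }
```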

Cover every table and relationship in your schema. Ensure you have cases for each common question pattern your users ask.
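One way to audit table coverage, sketched for SQLite and the same JSON layout as above: a crude word-boundary scan of the reference SQL. A real SQL parser would be more robust, but this is good enough to flag tables no test case touches.

```python
import json
import re
import sqlite3

def untested_tables(db_path: str, dataset_path: str) -> set:
    """Return schema tables that never appear in any reference SQL.

    Uses crude word-boundary matching on table names, which can be
    fooled by aliases or comments; treat results as a starting point.
    """
    conn = sqlite3.connect(db_path)
    tables = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    conn.close()

    with open(dataset_path) as f:
        all_sql = " ".join(c["reference_sql"] for c in json.load(f)).lower()

    return {t for t in tables
            if not re.search(rf"\b{re.escape(t.lower())}\b", all_sql)}
```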

FAQ

How often should I re-evaluate my text-to-SQL system?

Re-evaluate whenever you change the model, update the schema, modify the prompt, or add new tables. At minimum, run your evaluation suite weekly in CI/CD. Schema changes are the most common cause of accuracy regression — a renamed column can silently break queries the model previously got right.
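Schema-change detection can be as simple as fingerprinting the DDL and re-running the suite whenever the hash moves. A minimal sketch for SQLite:

```python
import hashlib
import sqlite3

def schema_fingerprint(db_path: str) -> str:
    """Hash the database DDL. A changed hash means the schema changed
    and the evaluation suite should be re-run."""
    conn = sqlite3.connect(db_path)
    ddl = sorted(
        row[0] for row in conn.execute(
            "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL")
    )
    conn.close()
    return hashlib.sha256("\n".join(ddl).encode()).hexdigest()
```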

Is 80% accuracy good enough for production?

It depends on the use case. For exploratory analytics where users can verify results, 80% is workable with good error messaging. For automated reporting or dashboards where results are consumed without review, you need 95%+ accuracy. Most production systems use error correction loops to bridge this gap.
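An error-correction loop in the sense used above: execute the model's SQL, and on a database error feed the message back for a retry. Here model_fn is a placeholder for your LLM call; it takes a prompt string and returns a SQL string.

```python
import sqlite3
from typing import Callable

def generate_with_correction(question: str, db_path: str,
                             model_fn: Callable[[str], str],
                             max_retries: int = 2) -> tuple:
    """Generate SQL, retrying with the error message appended on failure.

    model_fn is a stand-in for your LLM call: prompt in, SQL out.
    Returns (final_sql, list_of_errors_encountered).
    """
    prompt = question
    errors = []
    sql = ""
    for _ in range(max_retries + 1):
        sql = model_fn(prompt)
        conn = sqlite3.connect(db_path)
        try:
            conn.execute(sql)
            return sql, errors  # executed cleanly
        except Exception as e:
            # Feed the database error back to the model and retry.
            errors.append(str(e))
            prompt = f"{question}\nPrevious SQL failed with: {e}\nFix it."
        finally:
            conn.close()
    return sql, errors  # exhausted retries; last attempt returned as-is
```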

Can I use Spider or BIRD results to predict production accuracy?

Benchmark accuracy provides a ceiling estimate, not a prediction. Your production accuracy will typically be 5-15% lower than benchmark scores because real users ask messier questions, your schema has domain-specific quirks, and benchmark questions are carefully written to be unambiguous. Always supplement benchmarks with custom evaluation on your own data.


#Evaluation #SpiderBenchmark #BIRDBenchmark #TextToSQL #AccuracyTesting #AgenticAI #MLOps #BenchmarkDriven

Explore the leading web agent benchmarks including WebArena, MiniWoB++, and Mind2Web. Learn how evaluation methodology, success metrics, and reproducible environments drive progress in autonomous browser agents.