
Text-to-SQL Evaluation: Spider, BIRD, and Custom Benchmarks for Accuracy Testing

Understand how to evaluate text-to-SQL systems using the Spider and BIRD benchmarks, implement execution accuracy metrics, and build custom evaluation datasets for your specific database schema.

Why Standard Evaluation Matters

Claiming your text-to-SQL system "works well" without rigorous evaluation is meaningless. Two systems that feel similar in casual testing might differ by 20% in accuracy on edge cases. Benchmarks give you objective measurements to compare models, track improvements, and identify weaknesses.

The text-to-SQL community has developed standardized benchmarks that test across hundreds of databases and thousands of question-query pairs. Understanding these benchmarks — and knowing when to build your own — is essential for production systems.

The Spider Benchmark

Spider is the most widely used text-to-SQL benchmark. It contains 10,181 questions across 200 databases covering 138 domains. Questions are categorized by difficulty: easy (single table, no aggregation), medium (joins, grouping), hard (subqueries, set operations), and extra hard (nested queries, multiple conditions).


Key characteristics:

  • Cross-database evaluation — the test set uses databases not seen during training
  • SQL complexity levels — from simple SELECT to multi-level nested queries
  • Multiple valid SQL representations — the same question might have several correct SQL formulations
# Example Spider dataset entry
spider_example = {
    "db_id": "concert_singer",
    "question": "How many singers do we have?",
    "query": "SELECT count(*) FROM singer",
    "difficulty": "easy",
}
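Given files in this shape, a useful first sanity check is a tally by difficulty. The sketch below assumes each entry carries a difficulty key as in the example above; the official Spider release derives difficulty inside its evaluation script rather than storing it in the JSON.

```python
import json
from collections import Counter

def difficulty_histogram(path: str) -> Counter:
    """Count dataset entries per difficulty level.

    Assumes a JSON list of entries, each optionally carrying a
    'difficulty' key as in the example above.
    """
    with open(path) as f:
        entries = json.load(f)
    return Counter(e.get("difficulty", "unknown") for e in entries)
```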

The BIRD Benchmark

BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) addresses limitations in Spider by using real-world databases with messy data, requiring external knowledge, and including value-based questions.

Key differences from Spider:

  • Dirty data — databases contain NULLs, inconsistent formats, and realistic noise
  • External knowledge — some questions require understanding domain conventions (e.g., "fiscal year starts in April")
  • Larger databases — tables with millions of rows where query efficiency matters (BIRD also reports a Valid Efficiency Score that rewards correct queries for running faster)
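For comparison with the Spider entry above, a BIRD-style entry adds an evidence field carrying the external knowledge a question needs. Field names here follow the released dev set, but the question and SQL are invented for illustration; verify against your copy of the data.

```python
# Sketch of a BIRD-style dataset entry. The 'evidence' field holds the
# external-knowledge hint; question and SQL are invented examples.
bird_example = {
    "db_id": "financial",
    "question": "What was the average loan amount in the last fiscal year?",
    "evidence": "fiscal year starts in April",  # domain convention the model needs
    "SQL": "SELECT AVG(amount) FROM loan "
           "WHERE date >= '2025-04-01' AND date < '2026-04-01'",
    "difficulty": "moderate",
}
```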

Evaluation Metrics

Exact Match Accuracy (EM) compares the predicted SQL string to the reference SQL. This is too strict — SELECT name FROM users and SELECT users.name FROM users are both correct but do not match.

Execution Accuracy (EX) runs both the predicted and reference SQL against the database and compares results. This is the standard metric because it correctly handles multiple valid SQL formulations.

import sqlite3
from typing import Any

def execution_accuracy(predicted_sql: str, reference_sql: str,
                       db_path: str) -> bool:
    """Check if predicted and reference SQL return the same results."""
    conn = sqlite3.connect(db_path)

    try:
        # Set comparison ignores row order and duplicate rows. That is
        # usually what you want, but for ORDER BY questions or queries
        # where duplicates matter, compare lists or multisets instead.
        pred_results = set(
            tuple(row) for row in conn.execute(predicted_sql).fetchall()
        )
        ref_results = set(
            tuple(row) for row in conn.execute(reference_sql).fetchall()
        )
        return pred_results == ref_results
    except Exception:
        # A predicted query that fails to execute counts as incorrect.
        return False
    finally:
        conn.close()

def evaluate_batch(test_cases: list[dict], model_fn, db_dir: str) -> dict:
    """Evaluate a text-to-SQL model on a batch of test cases."""
    results = {"total": 0, "correct": 0, "errors": 0, "by_difficulty": {}}

    for case in test_cases:
        results["total"] += 1
        db_path = f"{db_dir}/{case['db_id']}/{case['db_id']}.sqlite"

        try:
            predicted = model_fn(case["question"], db_path)
            is_correct = execution_accuracy(predicted, case["query"], db_path)

            if is_correct:
                results["correct"] += 1

            # Track by difficulty
            diff = case.get("difficulty", "unknown")
            if diff not in results["by_difficulty"]:
                results["by_difficulty"][diff] = {"total": 0, "correct": 0}
            results["by_difficulty"][diff]["total"] += 1
            if is_correct:
                results["by_difficulty"][diff]["correct"] += 1

        except Exception:
            results["errors"] += 1

    results["accuracy"] = results["correct"] / results["total"] if results["total"] > 0 else 0
    return results
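A smoke test of the metric against a throwaway database shows why execution accuracy accepts formulations that exact match would reject. The helper is repeated inline so the snippet runs standalone:

```python
import os
import sqlite3
import tempfile

def execution_accuracy(predicted_sql, reference_sql, db_path):
    # Same logic as the function above, repeated for a standalone run.
    conn = sqlite3.connect(db_path)
    try:
        pred = set(tuple(r) for r in conn.execute(predicted_sql).fetchall())
        ref = set(tuple(r) for r in conn.execute(reference_sql).fetchall())
        return pred == ref
    except Exception:
        return False
    finally:
        conn.close()

# Build a throwaway table in the style of Spider's concert_singer database.
path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE singer (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [(1, "A"), (2, "B")])
conn.commit()
conn.close()

# Two different but equivalent formulations agree on results...
assert execution_accuracy("SELECT count(*) FROM singer",
                          "SELECT COUNT(id) FROM singer", path)
# ...while a genuinely different query is rejected.
assert not execution_accuracy("SELECT name FROM singer WHERE id = 1",
                              "SELECT name FROM singer", path)
```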

Building a Custom Evaluation Dataset

Standard benchmarks tell you how your model performs in general. But production accuracy depends on your specific schema, your users' question patterns, and your data characteristics. Build a custom evaluation set.

import json
import sqlite3
from dataclasses import dataclass, asdict

@dataclass
class EvalCase:
    question: str
    reference_sql: str
    difficulty: str  # easy, medium, hard
    category: str    # e.g., "aggregation", "join", "filter", "date_range"
    notes: str = ""  # Why this case is interesting

class EvalDatasetBuilder:
    """Build and manage a custom text-to-SQL evaluation dataset."""

    def __init__(self, db_path: str):
        self.db_path = db_path
        self.cases: list[EvalCase] = []

    def add_case(self, question: str, reference_sql: str,
                 difficulty: str, category: str, notes: str = ""):
        # Verify the reference SQL actually works
        conn = sqlite3.connect(self.db_path)
        try:
            conn.execute(reference_sql)
        except Exception as e:
            raise ValueError(
                f"Reference SQL is invalid: {e}\nSQL: {reference_sql}"
            )
        finally:
            conn.close()

        self.cases.append(EvalCase(
            question=question,
            reference_sql=reference_sql,
            difficulty=difficulty,
            category=category,
            notes=notes,
        ))

    def save(self, path: str):
        with open(path, "w") as f:
            json.dump([asdict(c) for c in self.cases], f, indent=2)

    def load(self, path: str):
        with open(path) as f:
            self.cases = [EvalCase(**c) for c in json.load(f)]

    def summary(self) -> dict:
        from collections import Counter
        return {
            "total_cases": len(self.cases),
            "by_difficulty": dict(Counter(c.difficulty for c in self.cases)),
            "by_category": dict(Counter(c.category for c in self.cases)),
        }

# Build your dataset
builder = EvalDatasetBuilder("production_analytics.db")
builder.add_case(
    question="How many orders were placed in January 2026?",
    reference_sql="SELECT COUNT(*) FROM orders WHERE order_date >= '2026-01-01' AND order_date < '2026-02-01'",
    difficulty="easy",
    category="date_range",
    notes="Tests date range filtering with boundary conditions",
)
builder.add_case(
    question="Which product category has the highest average order value?",
    reference_sql="""
        SELECT p.category, AVG(oi.unit_price * oi.quantity) as avg_value
        FROM order_items oi
        JOIN products p ON oi.product_id = p.id
        GROUP BY p.category
        ORDER BY avg_value DESC
        LIMIT 1
    """,
    difficulty="medium",
    category="aggregation",
    notes="Requires JOIN and aggregation with sorting",
)
builder.save("eval_dataset.json")

For a production system, aim for at least 100 test cases with this distribution:

  • 30% easy — single-table filters, counts, simple aggregations
  • 40% medium — two-table JOINs, GROUP BY with HAVING, date ranges
  • 20% hard — three+ table JOINs, subqueries, window functions
  • 10% adversarial — ambiguous questions, questions with no valid answer, domain-specific terminology
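A quick way to check that a saved dataset matches those targets, assuming the JSON layout produced by builder.save above (a list of cases, each with a difficulty field):

```python
import json
from collections import Counter

# Target shares from the distribution recommended above.
TARGETS = {"easy": 0.30, "medium": 0.40, "hard": 0.20, "adversarial": 0.10}

def distribution_report(path: str, tolerance: float = 0.05) -> dict:
    """Compare a dataset's difficulty mix against the target shares.

    Returns {difficulty: (actual_share, within_tolerance)}.
    """
    with open(path) as f:
        cases = json.load(f)
    counts = Counter(c.get("difficulty", "unknown") for c in cases)
    total = len(cases) or 1
    return {
        d: (counts.get(d, 0) / total,
            abs(counts.get(d, 0) / total - target) <= tolerance)
        for d, target in TARGETS.items()
    }
```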

Cover every table and relationship in your schema. Ensure you have cases for each common question pattern your users ask.
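One way to audit table coverage, sketched for SQLite and the same JSON layout as above: a crude word-boundary scan of the reference SQL. A real SQL parser would be more robust, but this is good enough to flag tables no test case touches.

```python
import json
import re
import sqlite3

def untested_tables(db_path: str, dataset_path: str) -> set:
    """Return schema tables that never appear in any reference SQL.

    Uses crude word-boundary matching on table names, which can be
    fooled by aliases or comments; treat results as a starting point.
    """
    conn = sqlite3.connect(db_path)
    tables = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    conn.close()

    with open(dataset_path) as f:
        all_sql = " ".join(c["reference_sql"] for c in json.load(f)).lower()

    return {t for t in tables
            if not re.search(rf"\b{re.escape(t.lower())}\b", all_sql)}
```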

FAQ

How often should I re-evaluate my text-to-SQL system?

Re-evaluate whenever you change the model, update the schema, modify the prompt, or add new tables. At minimum, run your evaluation suite weekly in CI/CD. Schema changes are the most common cause of accuracy regression — a renamed column can silently break queries the model previously got right.
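Schema-change detection can be as simple as fingerprinting the DDL and re-running the suite whenever the hash moves. A minimal sketch for SQLite:

```python
import hashlib
import sqlite3

def schema_fingerprint(db_path: str) -> str:
    """Hash the database DDL. A changed hash means the schema changed
    and the evaluation suite should be re-run."""
    conn = sqlite3.connect(db_path)
    ddl = sorted(
        row[0] for row in conn.execute(
            "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL")
    )
    conn.close()
    return hashlib.sha256("\n".join(ddl).encode()).hexdigest()
```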

Is 80% accuracy good enough for production?

It depends on the use case. For exploratory analytics where users can verify results, 80% is workable with good error messaging. For automated reporting or dashboards where results are consumed without review, you need 95%+ accuracy. Most production systems use error correction loops to bridge this gap.
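An error-correction loop in the sense used above: execute the model's SQL, and on a database error feed the message back for a retry. Here model_fn is a placeholder for your LLM call; it takes a prompt string and returns a SQL string.

```python
import sqlite3
from typing import Callable

def generate_with_correction(question: str, db_path: str,
                             model_fn: Callable[[str], str],
                             max_retries: int = 2) -> tuple:
    """Generate SQL, retrying with the error message appended on failure.

    model_fn is a stand-in for your LLM call: prompt in, SQL out.
    Returns (final_sql, list_of_errors_encountered).
    """
    prompt = question
    errors = []
    sql = ""
    for _ in range(max_retries + 1):
        sql = model_fn(prompt)
        conn = sqlite3.connect(db_path)
        try:
            conn.execute(sql)
            return sql, errors  # executed cleanly
        except Exception as e:
            # Feed the database error back to the model and retry.
            errors.append(str(e))
            prompt = f"{question}\nPrevious SQL failed with: {e}\nFix it."
        finally:
            conn.close()
    return sql, errors  # exhausted retries; last attempt returned as-is
```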

Can I use Spider or BIRD results to predict production accuracy?

Benchmark accuracy provides a ceiling estimate, not a prediction. Your production accuracy will typically be 5-15% lower than benchmark scores because real users ask messier questions, your schema has domain-specific quirks, and benchmark questions are carefully written to be unambiguous. Always supplement benchmarks with custom evaluation on your own data.


#Evaluation #SpiderBenchmark #BIRDBenchmark #TextToSQL #AccuracyTesting #AgenticAI #MLOps #BenchmarkDriven

Explore the leading web agent benchmarks including WebArena, MiniWoB++, and Mind2Web. Learn how evaluation methodology, success metrics, and reproducible environments drive progress in autonomous browser agents.