Learn Agentic AI

Building Evaluation Datasets: Synthetic Generation, Human Labeling, and Active Learning

A practical guide to creating high-quality evaluation datasets for AI agents using synthetic data generation, human annotation pipelines, active learning for efficient labeling, and dataset versioning strategies.

The Dataset Is the Evaluation

Your evaluation is only as good as your dataset. A perfect scoring pipeline running against a biased or unrepresentative dataset gives you false confidence. Building evaluation datasets for AI agents is particularly challenging because agent interactions are multi-turn, involve tool calls, and have complex success criteria that go beyond simple text matching.

This guide covers three complementary approaches: synthetic generation for scale, human labeling for quality, and active learning for efficiency. Used together, they give you a dataset that is large enough for statistical reliability, accurate enough for trust, and continuously improving as your agent evolves.

Synthetic Dataset Generation

Use an LLM to generate diverse evaluation samples at scale. The key is generating both the user inputs and the expected agent behavior.

flowchart LR
    PR(["PR opened"])
    UNIT["Unit tests"]
    EVAL["Eval harness<br/>PromptFoo or Braintrust"]
    GOLD[("Golden set<br/>200 tagged cases")]
    JUDGE["LLM as judge<br/>plus regex graders"]
    SCORE["Aggregate score<br/>and per slice"]
    GATE{"Score regress<br/>more than 2 percent?"}
    BLOCK(["Block merge"])
    MERGE(["Merge to main"])
    PR --> UNIT --> EVAL --> GOLD --> JUDGE --> SCORE --> GATE
    GATE -->|Yes| BLOCK
    GATE -->|No| MERGE
    style EVAL fill:#4f46e5,stroke:#4338ca,color:#fff
    style GATE fill:#f59e0b,stroke:#d97706,color:#1f2937
    style BLOCK fill:#dc2626,stroke:#b91c1c,color:#fff
    style MERGE fill:#059669,stroke:#047857,color:#fff
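The gate step in the diagram reduces to a small comparison. A minimal sketch (the function name and 2 percent threshold mirror the diagram; both are illustrative, not a specific CI tool's API):

```python
def should_block_merge(
    baseline_score: float,
    candidate_score: float,
    max_regression_pct: float = 2.0,
) -> bool:
    """Block the merge if the candidate's aggregate eval score
    regresses more than max_regression_pct versus the baseline."""
    if baseline_score <= 0:
        return False  # no baseline yet; let the merge through
    regression_pct = (
        (baseline_score - candidate_score) / baseline_score * 100
    )
    return regression_pct > max_regression_pct

# 0.85 -> 0.82 is roughly a 3.5 percent regression: blocked
print(should_block_merge(0.85, 0.82))   # True
# 0.85 -> 0.845 is under 2 percent: allowed through
print(should_block_merge(0.85, 0.845))  # False
```

In practice, run the same comparison per slice (difficulty, tag) as well, so a regression on hard cases cannot hide behind an improved aggregate.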
import json
import asyncio
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SyntheticSample:
    sample_id: str
    user_input: str
    expected_response: str
    expected_tool_calls: list[dict] = field(
        default_factory=list
    )
    difficulty: str = "medium"
    tags: list[str] = field(default_factory=list)
    generated_by: str = "synthetic"

async def generate_synthetic_samples(
    llm_client,
    task_description: str,
    tool_definitions: list[dict],
    count: int = 20,
    difficulties: Optional[list[str]] = None,
) -> list[SyntheticSample]:
    difficulties = difficulties or ["easy", "medium", "hard"]
    tools_text = json.dumps(tool_definitions, indent=2)

    prompt = f"""Generate {count} diverse evaluation samples for
an AI agent with the following task and available tools.

## Task Description
{task_description}

## Available Tools
{tools_text}

For each sample, generate:
1. A realistic user input message
2. The expected agent response (or key points)
3. Expected tool calls with parameters
4. Difficulty level: {difficulties}
5. Tags describing the capability tested

Vary the samples across:
- Different user phrasings and communication styles
- Edge cases and unusual requests
- Multi-step and single-step tasks
- Clear and ambiguous intents

Return a JSON object with a "samples" array:
{{
  "samples": [
    {{
      "user_input": "...",
      "expected_response_summary": "...",
      "expected_tool_calls": [{{"name": "...", "params": {{}}}}],
      "difficulty": "easy|medium|hard",
      "tags": ["..."]
    }}
  ]
}}"""

    response = await llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.9,
    )
    raw = json.loads(response.choices[0].message.content)
    items = raw if isinstance(raw, list) else raw.get("samples", [])

    samples = []
    for i, item in enumerate(items):
        samples.append(SyntheticSample(
            sample_id=f"syn-{i:04d}",
            user_input=item["user_input"],
            expected_response=item.get(
                "expected_response_summary", ""
            ),
            expected_tool_calls=item.get(
                "expected_tool_calls", []
            ),
            difficulty=item.get("difficulty", "medium"),
            tags=item.get("tags", []),
        ))
    return samples

Set the temperature high (0.8 to 1.0) for generation to maximize diversity. Then filter and validate the results. Synthetic data is a starting point — it fills the volume gap while you build out human-labeled gold sets.


Human Annotation Pipeline

Human-labeled data is your ground truth. Design the annotation workflow to maximize consistency and minimize annotator fatigue.

@dataclass
class AnnotationTask:
    task_id: str
    conversation: list[dict]
    agent_response: str
    questions: list[dict]  # What to annotate

@dataclass
class Annotation:
    task_id: str
    annotator_id: str
    labels: dict
    confidence: float  # 0.0 to 1.0
    time_seconds: float
    notes: Optional[str] = None

class AnnotationPipeline:
    def __init__(self, min_annotators: int = 2):
        self.min_annotators = min_annotators
        self.tasks: list[AnnotationTask] = []
        self.annotations: list[Annotation] = []

    def add_task(self, task: AnnotationTask):
        self.tasks.append(task)

    def submit_annotation(self, annotation: Annotation):
        self.annotations.append(annotation)

    def get_consensus(self, task_id: str) -> Optional[dict]:
        task_annotations = [
            a for a in self.annotations
            if a.task_id == task_id
        ]
        if len(task_annotations) < self.min_annotators:
            return None

        # Majority vote per label field
        label_keys = task_annotations[0].labels.keys()
        consensus = {}
        for key in label_keys:
            values = [a.labels[key] for a in task_annotations]
            consensus[key] = max(set(values), key=values.count)

        # Agreement rate
        agreements = sum(
            1 for key in label_keys
            if len(set(a.labels[key] for a in task_annotations)) == 1
        )
        agreement_rate = agreements / len(label_keys)

        return {
            "task_id": task_id,
            "consensus_labels": consensus,
            "agreement_rate": round(agreement_rate, 3),
            "annotator_count": len(task_annotations),
            "avg_confidence": round(
                sum(a.confidence for a in task_annotations)
                / len(task_annotations),
                3,
            ),
        }

Require at least two annotators per sample to catch individual mistakes. When annotators disagree, route the sample to a third annotator or a domain expert. Samples with persistent disagreement often reveal genuinely ambiguous cases that deserve special handling in your evaluation.
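The raw agreement rate computed in `get_consensus` does not account for agreement that happens by chance. Cohen's kappa corrects for that; a minimal sketch for the two-annotator case:

```python
def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators over the same samples.
    1.0 = perfect agreement, 0.0 = chance-level agreement."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of samples labeled identically
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in categories
    )
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

A common rule of thumb treats kappa above roughly 0.6 as acceptable for subjective labels; persistently low kappa usually means the annotation guidelines, not the annotators, need work.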

Active Learning for Efficient Labeling

Label the samples that matter most — the ones your agent currently gets wrong or is uncertain about.

import random

class ActiveLearningSelector:
    def __init__(self, uncertainty_threshold: float = 0.6):
        self.threshold = uncertainty_threshold
        self.labeled: list[dict] = []
        self.unlabeled: list[dict] = []

    def add_unlabeled(self, samples: list[dict]):
        self.unlabeled.extend(samples)

    def score_uncertainty(self, sample: dict) -> float:
        """Score how uncertain the agent is on this sample.
        Higher = more valuable to label."""
        agent_confidence = sample.get(
            "agent_confidence", 0.5
        )
        # Invert: low agent confidence = high labeling value
        uncertainty = 1.0 - agent_confidence

        # Boost novel patterns
        if sample.get("is_novel_pattern", False):
            uncertainty = min(1.0, uncertainty + 0.2)

        return uncertainty

    def select_batch(self, batch_size: int = 50) -> list[dict]:
        scored = [
            (self.score_uncertainty(s), s)
            for s in self.unlabeled
        ]
        # Mix: 70% highest uncertainty, 30% random
        scored.sort(key=lambda x: -x[0])
        n_uncertain = int(batch_size * 0.7)
        n_random = batch_size - n_uncertain

        selected = [s for _, s in scored[:n_uncertain]]
        remaining = [s for _, s in scored[n_uncertain:]]
        if remaining:
            selected.extend(
                random.sample(
                    remaining, min(n_random, len(remaining))
                )
            )

        # Remove selected from unlabeled pool
        selected_ids = {s.get("id") for s in selected}
        self.unlabeled = [
            s for s in self.unlabeled
            if s.get("id") not in selected_ids
        ]
        return selected

The 70/30 split between uncertain and random samples is important. Pure uncertainty sampling can create a biased dataset that only covers hard cases. The random component ensures your dataset still represents the full distribution of user requests.
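The split logic can also be expressed as a standalone helper, which makes the 70/30 behavior easy to test in isolation (the function name `split_batch` is hypothetical; it mirrors the selection step in `select_batch` above):

```python
import random

def split_batch(scored, batch_size=50, uncertain_frac=0.7):
    """Mix the highest-uncertainty samples with a random draw
    from the rest. scored is a list of (uncertainty, sample)."""
    ranked = sorted(scored, key=lambda x: -x[0])
    n_uncertain = int(batch_size * uncertain_frac)
    top = [s for _, s in ranked[:n_uncertain]]
    rest = [s for _, s in ranked[n_uncertain:]]
    filler = random.sample(
        rest, min(batch_size - n_uncertain, len(rest))
    )
    return top + filler
```

With `batch_size=10`, the first 7 returned samples are always the most uncertain ones, and the last 3 vary run to run, which is the property that keeps the labeled set representative.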

Dataset Versioning and Quality Control

Track every change to your dataset so evaluation results are always reproducible.


import hashlib
from datetime import datetime

@dataclass
class DatasetVersion:
    version: str
    fingerprint: str
    sample_count: int
    created_at: str
    parent_version: Optional[str] = None
    changes: list[str] = field(default_factory=list)

class VersionedDataset:
    def __init__(self, name: str):
        self.name = name
        self.samples: list[dict] = []
        self.versions: list[DatasetVersion] = []

    def fingerprint(self) -> str:
        content = json.dumps(self.samples, sort_keys=True)
        return hashlib.sha256(
            content.encode()
        ).hexdigest()[:12]

    def commit(
        self, version: str, changes: list[str]
    ) -> DatasetVersion:
        parent = (
            self.versions[-1].version
            if self.versions else None
        )
        v = DatasetVersion(
            version=version,
            fingerprint=self.fingerprint(),
            sample_count=len(self.samples),
            created_at=datetime.utcnow().isoformat(),
            parent_version=parent,
            changes=changes,
        )
        self.versions.append(v)
        return v

    def quality_report(self) -> dict:
        tags_coverage = set()
        difficulties = {"easy": 0, "medium": 0, "hard": 0}
        for sample in self.samples:
            tags_coverage.update(sample.get("tags", []))
            diff = sample.get("difficulty", "medium")
            difficulties[diff] = difficulties.get(diff, 0) + 1

        return {
            "total_samples": len(self.samples),
            "unique_tags": len(tags_coverage),
            "difficulty_distribution": difficulties,
            "fingerprint": self.fingerprint(),
            "version": (
                self.versions[-1].version
                if self.versions else "uncommitted"
            ),
        }

Always reference the dataset fingerprint alongside evaluation results. When a score changes, you can immediately determine whether it was caused by a model change or a dataset change.
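One lightweight way to enforce this is to make the fingerprint a required field of every stored result. A sketch, assuming a hypothetical `record_eval_run` helper and a JSON-lines result store:

```python
import json
from datetime import datetime, timezone

def record_eval_run(
    model_name: str,
    score: float,
    dataset_fingerprint: str,
    dataset_version: str,
) -> str:
    """Serialize one evaluation result with the dataset identity
    attached, so score changes can be attributed to the model
    versus the data."""
    run = {
        "model": model_name,
        "score": score,
        "dataset_fingerprint": dataset_fingerprint,
        "dataset_version": dataset_version,
        "run_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(run, sort_keys=True)
```

When two runs disagree, compare their `dataset_fingerprint` fields first: identical fingerprints mean the dataset is ruled out as the cause.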

FAQ

How many samples do I need for a reliable evaluation dataset?

Aim for at least 50 samples per capability or task type. For statistical significance when comparing two models, you need 200 or more samples per comparison. Start with synthetic generation to reach volume, then replace low-quality synthetic samples with human-labeled ones over time. A 500-sample dataset that is 30 percent human-labeled and 70 percent high-quality synthetic is a strong starting point.
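The 200-sample figure follows from the standard error of a measured pass rate. A quick sanity check, assuming a binomial pass/fail metric and the normal approximation:

```python
import math

def pass_rate_margin(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Approximate 95 percent margin of error for a pass rate
    measured on n samples (normal approximation to the binomial)."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)

# At an 80 percent pass rate, 50 samples give a margin of roughly
# 11 points; 200 samples narrow it to roughly 5.5 points
print(round(pass_rate_margin(0.8, 50), 3))
print(round(pass_rate_margin(0.8, 200), 3))
```

If two models differ by less than the margin at your sample size, the comparison is noise; grow the dataset before drawing conclusions.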

How do I detect and remove bad synthetic samples?

Run three quality filters. First, a deterministic filter that catches formatting issues, empty fields, and duplicate inputs. Second, a self-consistency check where you generate the same task twice with different seeds and compare — inconsistent outputs suggest an underspecified prompt. Third, a human spot-check on 10 percent of each generated batch. Track the rejection rate to improve your generation prompts.

When should I create a new dataset version versus modifying the existing one?

Create a new version whenever you add more than 10 percent new samples, remove samples, change annotation guidelines, or fix systematic labeling errors. For small additions (under 10 percent), append and increment a minor version. Always preserve old versions so you can re-run evaluations against them for trend analysis.
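That rule can be encoded directly, which keeps version bumps consistent across the team. A sketch assuming a simple two-part major.minor version string (the function name and scheme are illustrative, not part of the `VersionedDataset` class above):

```python
def next_version(
    current: str,
    added: int,
    removed: int,
    total_before: int,
    guidelines_changed: bool = False,
) -> str:
    """Major bump for removals, guideline changes, or growth over
    10 percent of the prior size; otherwise a minor bump."""
    major, minor = (int(x) for x in current.split("."))
    big_change = (
        removed > 0
        or guidelines_changed
        or added > 0.10 * total_before
    )
    if big_change:
        return f"{major + 1}.0"
    return f"{major}.{minor + 1}"
```

For example, adding 5 samples to a 100-sample dataset at version 1.2 yields 1.3, while removing any sample or adding 20 yields 2.0.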


#EvaluationDatasets #SyntheticData #DataLabeling #ActiveLearning #Python #AgenticAI #LearnAI #AIEngineering
