Learn Agentic AI

Building Evaluation Datasets: Synthetic Generation, Human Labeling, and Active Learning

A practical guide to creating high-quality evaluation datasets for AI agents using synthetic data generation, human annotation pipelines, active learning for efficient labeling, and dataset versioning strategies.

The Dataset Is the Evaluation

Your evaluation is only as good as your dataset. A perfect scoring pipeline running against a biased or unrepresentative dataset gives you false confidence. Building evaluation datasets for AI agents is particularly challenging because agent interactions are multi-turn, involve tool calls, and have complex success criteria that go beyond simple text matching.

This guide covers three complementary approaches: synthetic generation for scale, human labeling for quality, and active learning for efficiency. Used together, they give you a dataset that is large enough for statistical reliability, accurate enough for trust, and continuously improving as your agent evolves.

Synthetic Dataset Generation

Use an LLM to generate diverse evaluation samples at scale. The key is generating both the user inputs and the expected agent behavior.

flowchart LR
    PR(["PR opened"])
    UNIT["Unit tests"]
    EVAL["Eval harness<br/>PromptFoo or Braintrust"]
    GOLD[("Golden set<br/>200 tagged cases")]
    JUDGE["LLM as judge<br/>plus regex graders"]
    SCORE["Aggregate score<br/>and per slice"]
    GATE{"Score regress<br/>more than 2 percent?"}
    BLOCK(["Block merge"])
    MERGE(["Merge to main"])
    PR --> UNIT --> EVAL --> GOLD --> JUDGE --> SCORE --> GATE
    GATE -->|Yes| BLOCK
    GATE -->|No| MERGE
    style EVAL fill:#4f46e5,stroke:#4338ca,color:#fff
    style GATE fill:#f59e0b,stroke:#d97706,color:#1f2937
    style BLOCK fill:#dc2626,stroke:#b91c1c,color:#fff
    style MERGE fill:#059669,stroke:#047857,color:#fff
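The gate step in the diagram reduces to a small comparison. A minimal sketch (the function name and 2 percent threshold mirror the diagram; both are illustrative, not a specific CI tool's API):

```python
def should_block_merge(
    baseline_score: float,
    candidate_score: float,
    max_regression_pct: float = 2.0,
) -> bool:
    """Block the merge if the candidate's aggregate eval score
    regresses more than max_regression_pct versus the baseline."""
    if baseline_score <= 0:
        return False  # no baseline yet; let the merge through
    regression_pct = (
        (baseline_score - candidate_score) / baseline_score * 100
    )
    return regression_pct > max_regression_pct

# 0.85 -> 0.82 is roughly a 3.5 percent regression: blocked
print(should_block_merge(0.85, 0.82))   # True
# 0.85 -> 0.845 is under 2 percent: allowed through
print(should_block_merge(0.85, 0.845))  # False
```

In practice, run the same comparison per slice (difficulty, tag) as well, so a regression on hard cases cannot hide behind an improved aggregate.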
import json
import asyncio
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SyntheticSample:
    sample_id: str
    user_input: str
    expected_response: str
    expected_tool_calls: list[dict] = field(
        default_factory=list
    )
    difficulty: str = "medium"
    tags: list[str] = field(default_factory=list)
    generated_by: str = "synthetic"

async def generate_synthetic_samples(
    llm_client,
    task_description: str,
    tool_definitions: list[dict],
    count: int = 20,
    difficulties: Optional[list[str]] = None,
) -> list[SyntheticSample]:
    difficulties = difficulties or ["easy", "medium", "hard"]
    tools_text = json.dumps(tool_definitions, indent=2)

    prompt = f"""Generate {count} diverse evaluation samples for
an AI agent with the following task and available tools.

## Task Description
{task_description}

## Available Tools
{tools_text}

For each sample, generate:
1. A realistic user input message
2. The expected agent response (or key points)
3. Expected tool calls with parameters
4. Difficulty level: {difficulties}
5. Tags describing the capability tested

Vary the samples across:
- Different user phrasings and communication styles
- Edge cases and unusual requests
- Multi-step and single-step tasks
- Clear and ambiguous intents

Return a JSON object with a "samples" array:
{{
  "samples": [
    {{
      "user_input": "...",
      "expected_response_summary": "...",
      "expected_tool_calls": [{{"name": "...", "params": {{}}}}],
      "difficulty": "easy|medium|hard",
      "tags": ["..."]
    }}
  ]
}}"""

    response = await llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.9,
    )
    raw = json.loads(response.choices[0].message.content)
    items = raw if isinstance(raw, list) else raw.get("samples", [])

    samples = []
    for i, item in enumerate(items):
        samples.append(SyntheticSample(
            sample_id=f"syn-{i:04d}",
            user_input=item["user_input"],
            expected_response=item.get(
                "expected_response_summary", ""
            ),
            expected_tool_calls=item.get(
                "expected_tool_calls", []
            ),
            difficulty=item.get("difficulty", "medium"),
            tags=item.get("tags", []),
        ))
    return samples

Set the temperature high (0.8 to 1.0) for generation to maximize diversity. Then filter and validate the results. Synthetic data is a starting point — it fills the volume gap while you build out human-labeled gold sets.


Human Annotation Pipeline

Human-labeled data is your ground truth. Design the annotation workflow to maximize consistency and minimize annotator fatigue.

@dataclass
class AnnotationTask:
    task_id: str
    conversation: list[dict]
    agent_response: str
    questions: list[dict]  # What to annotate

@dataclass
class Annotation:
    task_id: str
    annotator_id: str
    labels: dict
    confidence: float  # 0.0 to 1.0
    time_seconds: float
    notes: Optional[str] = None

class AnnotationPipeline:
    def __init__(self, min_annotators: int = 2):
        self.min_annotators = min_annotators
        self.tasks: list[AnnotationTask] = []
        self.annotations: list[Annotation] = []

    def add_task(self, task: AnnotationTask):
        self.tasks.append(task)

    def submit_annotation(self, annotation: Annotation):
        self.annotations.append(annotation)

    def get_consensus(self, task_id: str) -> Optional[dict]:
        task_annotations = [
            a for a in self.annotations
            if a.task_id == task_id
        ]
        if len(task_annotations) < self.min_annotators:
            return None

        # Majority vote per label field
        label_keys = task_annotations[0].labels.keys()
        consensus = {}
        for key in label_keys:
            values = [a.labels[key] for a in task_annotations]
            consensus[key] = max(set(values), key=values.count)

        # Agreement rate
        agreements = sum(
            1 for key in label_keys
            if len(set(a.labels[key] for a in task_annotations)) == 1
        )
        agreement_rate = agreements / len(label_keys)

        return {
            "task_id": task_id,
            "consensus_labels": consensus,
            "agreement_rate": round(agreement_rate, 3),
            "annotator_count": len(task_annotations),
            "avg_confidence": round(
                sum(a.confidence for a in task_annotations)
                / len(task_annotations),
                3,
            ),
        }

Require at least two annotators per sample to catch individual mistakes. When annotators disagree, route the sample to a third annotator or a domain expert. Samples with persistent disagreement often reveal genuinely ambiguous cases that deserve special handling in your evaluation.
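The raw agreement rate computed in `get_consensus` does not account for agreement that happens by chance. Cohen's kappa corrects for that; a minimal sketch for the two-annotator case:

```python
def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators over the same samples.
    1.0 = perfect agreement, 0.0 = chance-level agreement."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of samples labeled identically
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n)
        for c in categories
    )
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

A common rule of thumb treats kappa above roughly 0.6 as acceptable for subjective labels; persistently low kappa usually means the annotation guidelines, not the annotators, need work.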

Active Learning for Efficient Labeling

Label the samples that matter most — the ones your agent currently gets wrong or is uncertain about.

import random

class ActiveLearningSelector:
    def __init__(self, uncertainty_threshold: float = 0.6):
        self.threshold = uncertainty_threshold
        self.labeled: list[dict] = []
        self.unlabeled: list[dict] = []

    def add_unlabeled(self, samples: list[dict]):
        self.unlabeled.extend(samples)

    def score_uncertainty(self, sample: dict) -> float:
        """Score how uncertain the agent is on this sample.
        Higher = more valuable to label."""
        agent_confidence = sample.get(
            "agent_confidence", 0.5
        )
        # Invert: low agent confidence = high labeling value
        uncertainty = 1.0 - agent_confidence

        # Boost novel patterns
        if sample.get("is_novel_pattern", False):
            uncertainty = min(1.0, uncertainty + 0.2)

        return uncertainty

    def select_batch(self, batch_size: int = 50) -> list[dict]:
        scored = [
            (self.score_uncertainty(s), s)
            for s in self.unlabeled
        ]
        # Mix: 70% highest uncertainty, 30% random
        scored.sort(key=lambda x: -x[0])
        n_uncertain = int(batch_size * 0.7)
        n_random = batch_size - n_uncertain

        selected = [s for _, s in scored[:n_uncertain]]
        remaining = [s for _, s in scored[n_uncertain:]]
        if remaining:
            selected.extend(
                random.sample(
                    remaining, min(n_random, len(remaining))
                )
            )

        # Remove selected from unlabeled pool
        selected_ids = {s.get("id") for s in selected}
        self.unlabeled = [
            s for s in self.unlabeled
            if s.get("id") not in selected_ids
        ]
        return selected

The 70/30 split between uncertain and random samples is important. Pure uncertainty sampling can create a biased dataset that only covers hard cases. The random component ensures your dataset still represents the full distribution of user requests.
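The split logic can also be expressed as a standalone helper, which makes the 70/30 behavior easy to test in isolation (the function name `split_batch` is hypothetical; it mirrors the selection step in `select_batch` above):

```python
import random

def split_batch(scored, batch_size=50, uncertain_frac=0.7):
    """Mix the highest-uncertainty samples with a random draw
    from the rest. scored is a list of (uncertainty, sample)."""
    ranked = sorted(scored, key=lambda x: -x[0])
    n_uncertain = int(batch_size * uncertain_frac)
    top = [s for _, s in ranked[:n_uncertain]]
    rest = [s for _, s in ranked[n_uncertain:]]
    filler = random.sample(
        rest, min(batch_size - n_uncertain, len(rest))
    )
    return top + filler
```

With `batch_size=10`, the first 7 returned samples are always the most uncertain ones, and the last 3 vary run to run, which is the property that keeps the labeled set representative.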

Dataset Versioning and Quality Control

Track every change to your dataset so evaluation results are always reproducible.


import hashlib
from datetime import datetime

@dataclass
class DatasetVersion:
    version: str
    fingerprint: str
    sample_count: int
    created_at: str
    parent_version: Optional[str] = None
    changes: list[str] = field(default_factory=list)

class VersionedDataset:
    def __init__(self, name: str):
        self.name = name
        self.samples: list[dict] = []
        self.versions: list[DatasetVersion] = []

    def fingerprint(self) -> str:
        content = json.dumps(self.samples, sort_keys=True)
        return hashlib.sha256(
            content.encode()
        ).hexdigest()[:12]

    def commit(
        self, version: str, changes: list[str]
    ) -> DatasetVersion:
        parent = (
            self.versions[-1].version
            if self.versions else None
        )
        v = DatasetVersion(
            version=version,
            fingerprint=self.fingerprint(),
            sample_count=len(self.samples),
            created_at=datetime.utcnow().isoformat(),
            parent_version=parent,
            changes=changes,
        )
        self.versions.append(v)
        return v

    def quality_report(self) -> dict:
        tags_coverage = set()
        difficulties = {"easy": 0, "medium": 0, "hard": 0}
        for sample in self.samples:
            tags_coverage.update(sample.get("tags", []))
            diff = sample.get("difficulty", "medium")
            difficulties[diff] = difficulties.get(diff, 0) + 1

        return {
            "total_samples": len(self.samples),
            "unique_tags": len(tags_coverage),
            "difficulty_distribution": difficulties,
            "fingerprint": self.fingerprint(),
            "version": (
                self.versions[-1].version
                if self.versions else "uncommitted"
            ),
        }

Always reference the dataset fingerprint alongside evaluation results. When a score changes, you can immediately determine whether it was caused by a model change or a dataset change.
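One lightweight way to enforce this is to make the fingerprint a required field of every stored result. A sketch, assuming a hypothetical `record_eval_run` helper and a JSON-lines result store:

```python
import json
from datetime import datetime, timezone

def record_eval_run(
    model_name: str,
    score: float,
    dataset_fingerprint: str,
    dataset_version: str,
) -> str:
    """Serialize one evaluation result with the dataset identity
    attached, so score changes can be attributed to the model
    versus the data."""
    run = {
        "model": model_name,
        "score": score,
        "dataset_fingerprint": dataset_fingerprint,
        "dataset_version": dataset_version,
        "run_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(run, sort_keys=True)
```

When two runs disagree, compare their `dataset_fingerprint` fields first: identical fingerprints mean the dataset is ruled out as the cause.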

FAQ

How many samples do I need for a reliable evaluation dataset?

Aim for at least 50 samples per capability or task type. For statistical significance when comparing two models, you need 200 or more samples per comparison. Start with synthetic generation to reach volume, then replace low-quality synthetic samples with human-labeled ones over time. A 500-sample dataset that is 30 percent human-labeled and 70 percent high-quality synthetic is a strong starting point.
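The 200-sample figure follows from the standard error of a measured pass rate. A quick sanity check, assuming a binomial pass/fail metric and the normal approximation:

```python
import math

def pass_rate_margin(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Approximate 95 percent margin of error for a pass rate
    measured on n samples (normal approximation to the binomial)."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)

# At an 80 percent pass rate, 50 samples give a margin of roughly
# 11 points; 200 samples narrow it to roughly 5.5 points
print(round(pass_rate_margin(0.8, 50), 3))
print(round(pass_rate_margin(0.8, 200), 3))
```

If two models differ by less than the margin at your sample size, the comparison is noise; grow the dataset before drawing conclusions.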

How do I detect and remove bad synthetic samples?

Run three quality filters. First, a deterministic filter that catches formatting issues, empty fields, and duplicate inputs. Second, a self-consistency check where you generate the same task twice with different seeds and compare — inconsistent outputs suggest an underspecified prompt. Third, a human spot-check on 10 percent of each generated batch. Track the rejection rate to improve your generation prompts.

When should I create a new dataset version versus modifying the existing one?

Create a new version whenever you add more than 10 percent new samples, remove samples, change annotation guidelines, or fix systematic labeling errors. For small additions (under 10 percent), append and increment a minor version. Always preserve old versions so you can re-run evaluations against them for trend analysis.
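That rule can be encoded directly, which keeps version bumps consistent across the team. A sketch assuming a simple two-part major.minor version string (the function name and scheme are illustrative, not part of the `VersionedDataset` class above):

```python
def next_version(
    current: str,
    added: int,
    removed: int,
    total_before: int,
    guidelines_changed: bool = False,
) -> str:
    """Major bump for removals, guideline changes, or growth over
    10 percent of the prior size; otherwise a minor bump."""
    major, minor = (int(x) for x in current.split("."))
    big_change = (
        removed > 0
        or guidelines_changed
        or added > 0.10 * total_before
    )
    if big_change:
        return f"{major + 1}.0"
    return f"{major}.{minor + 1}"
```

For example, adding 5 samples to a 100-sample dataset at version 1.2 yields 1.3, while removing any sample or adding 20 yields 2.0.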


#EvaluationDatasets #SyntheticData #DataLabeling #ActiveLearning #Python #AgenticAI #LearnAI #AIEngineering
