LLM-as-Judge: Using AI to Evaluate AI Agent Responses Automatically

Learn how to use LLMs as automated judges to evaluate AI agent responses with scoring rubrics, calibration techniques, and multi-criteria evaluation frameworks in Python.

Why Use an LLM to Judge Another LLM

Human evaluation is the gold standard for assessing agent quality, but it does not scale. Reviewing 500 agent responses manually takes days. LLM-as-Judge is a technique where you use a strong language model to score the outputs of your agent automatically, giving you scalable evaluation that correlates well with human judgment when calibrated correctly.

Research from teams at Google, Anthropic, and OpenAI shows that GPT-4-class models achieve 80-90% agreement with human raters on well-defined criteria. The key is writing precise rubrics and calibrating the judge against human labels.

Basic Judge Implementation

A judge is simply an LLM call with a structured prompt that asks for a score and a justification. The flowchart below shows where a judge typically sits in a CI evaluation gate; the code that follows implements the judge call itself.

flowchart LR
    PR(["PR opened"])
    UNIT["Unit tests"]
    EVAL["Eval harness<br/>PromptFoo or Braintrust"]
    GOLD[("Golden set<br/>200 tagged cases")]
    JUDGE["LLM as judge<br/>plus regex graders"]
    SCORE["Aggregate score<br/>and per slice"]
    GATE{"Score regress<br/>more than 2 percent?"}
    BLOCK(["Block merge"])
    MERGE(["Merge to main"])
    PR --> UNIT --> EVAL --> GOLD --> JUDGE --> SCORE --> GATE
    GATE -->|Yes| BLOCK
    GATE -->|No| MERGE
    style EVAL fill:#4f46e5,stroke:#4338ca,color:#fff
    style GATE fill:#f59e0b,stroke:#d97706,color:#1f2937
    style BLOCK fill:#dc2626,stroke:#b91c1c,color:#fff
    style MERGE fill:#059669,stroke:#047857,color:#fff

import openai
import json
from dataclasses import dataclass

@dataclass
class JudgeResult:
    score: int           # 1-5
    reasoning: str
    criteria_scores: dict[str, int]

def evaluate_response(
    question: str,
    agent_response: str,
    reference_answer: str,
    client: openai.OpenAI,
    model: str = "gpt-4o",
) -> JudgeResult:
    prompt = f"""You are an expert evaluator. Score the following agent response.

Question: {question}
Reference Answer: {reference_answer}
Agent Response: {agent_response}

Score each criterion from 1 (poor) to 5 (excellent):
1. Correctness: Is the information accurate?
2. Completeness: Does it address all parts of the question?
3. Clarity: Is the response well-organized and easy to understand?

Return JSON:
{{"correctness": <int>, "completeness": <int>, "clarity": <int>, "overall": <int>, "reasoning": "<explanation>"}}
"""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    data = json.loads(response.choices[0].message.content)
    return JudgeResult(
        score=data["overall"],
        reasoning=data["reasoning"],
        criteria_scores={
            k: data[k] for k in ["correctness", "completeness", "clarity"]
        },
    )
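
A minimal usage sketch (the question, agent response, and reference answer are made up for illustration, and the client assumes an OPENAI_API_KEY in the environment):

client = openai.OpenAI()

result = evaluate_response(
    question="What is the refund window for annual plans?",
    agent_response="Annual plans can be refunded within 30 days of purchase.",
    reference_answer="Refunds are available within 30 days for annual subscriptions.",
    client=client,
)
print(result.score, result.criteria_scores)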

Designing Scoring Rubrics

Vague rubrics produce inconsistent scores. Anchor each score level to concrete behaviors.

RUBRIC = """
## Correctness Rubric
- 5: All facts are accurate, no hallucinations
- 4: Minor inaccuracy that does not affect the main answer
- 3: One significant error, but the core answer is correct
- 2: Multiple errors or a critical factual mistake
- 1: The answer is fundamentally wrong or fabricated

## Completeness Rubric
- 5: Addresses every part of the question with sufficient detail
- 4: Addresses all parts but one lacks detail
- 3: Misses one part of a multi-part question
- 2: Only partially addresses the question
- 1: Fails to address the question at all
"""

def build_judge_prompt(question: str, response: str, rubric: str = RUBRIC) -> str:
    return f"""Evaluate this agent response using the rubric below.

{rubric}

Question: {question}
Response: {response}

Return JSON with scores and reasoning for each criterion."""
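
The rubric-anchored prompt plugs into the same JSON-mode call shown earlier. A sketch, reusing the client from the previous example; the question and response are illustrative:

judge_prompt = build_judge_prompt(
    question="How do I rotate an API key?",
    response="Go to Settings, open API Keys, and click Rotate.",
)
raw = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": judge_prompt}],
    response_format={"type": "json_object"},
)
rubric_scores = json.loads(raw.choices[0].message.content)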

Calibration Against Human Labels

Before trusting LLM-as-Judge scores, calibrate by comparing judge scores to human ratings on a labeled subset.

import numpy as np
from scipy import stats

def calibrate_judge(
    human_scores: list[int],
    judge_scores: list[int],
) -> dict:
    """Compare judge scores against human ground truth."""
    correlation, p_value = stats.spearmanr(human_scores, judge_scores)

    exact_match = sum(h == j for h, j in zip(human_scores, judge_scores))
    within_one = sum(abs(h - j) <= 1 for h, j in zip(human_scores, judge_scores))

    return {
        "spearman_correlation": round(correlation, 3),
        "p_value": round(p_value, 4),
        "exact_match_pct": round(exact_match / len(human_scores) * 100, 1),
        "within_one_pct": round(within_one / len(human_scores) * 100, 1),
        "judge_mean": round(np.mean(judge_scores), 2),
        "human_mean": round(np.mean(human_scores), 2),
        "bias": round(np.mean(judge_scores) - np.mean(human_scores), 2),
    }

A Spearman correlation above 0.7 and within-one agreement above 85% indicate a reliable judge. If the bias is consistently positive, the judge is too lenient; tighten the rubric.
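
To make the pass/fail decision explicit, a small helper can apply those thresholds to the output of calibrate_judge. A sketch; the 0.5 bias tolerance is an assumption, not a standard:

def judge_is_reliable(report: dict) -> bool:
    """Apply reliability thresholds to a calibrate_judge() report."""
    return (
        report["spearman_correlation"] >= 0.7
        and report["within_one_pct"] >= 85.0
        and abs(report["bias"]) <= 0.5  # tolerance for leniency/harshness
    )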

Multi-Criteria Evaluation

Combine individual criteria into a weighted overall score.

def weighted_score(criteria_scores: dict[str, int], weights: dict[str, float]) -> float:
    total = sum(criteria_scores[k] * weights[k] for k in criteria_scores)
    weight_sum = sum(weights[k] for k in criteria_scores)
    return round(total / weight_sum, 2)

# For a customer support agent, correctness matters most
SUPPORT_WEIGHTS = {"correctness": 0.5, "completeness": 0.3, "clarity": 0.2}

# For a creative writing agent, clarity matters most
CREATIVE_WEIGHTS = {"correctness": 0.2, "completeness": 0.2, "clarity": 0.6}
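
A quick check of how the weights change the outcome for the same criteria scores:

scores = {"correctness": 5, "completeness": 4, "clarity": 3}
print(weighted_score(scores, SUPPORT_WEIGHTS))   # 4.3 -- pulled up by the strong correctness score
print(weighted_score(scores, CREATIVE_WEIGHTS))  # 3.6 -- pulled down by the weaker clarity score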

Running Batch Evaluations

Evaluate your full dataset efficiently with concurrent judge calls.

import asyncio

async def batch_evaluate(
    eval_cases: list[dict],
    agent_fn,
    judge_fn,
    concurrency: int = 5,
) -> list[dict]:
    semaphore = asyncio.Semaphore(concurrency)

    async def process_case(case):
        async with semaphore:
            agent_output = await agent_fn(case["input"])
            judge_result = await judge_fn(
                case["input"], agent_output, case["expected"]
            )
            return {**case, "output": agent_output, "judge": judge_result}

    tasks = [process_case(c) for c in eval_cases]
    results = await asyncio.gather(*tasks)
    return results
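
A sketch of driving the batch and summarizing the results, assuming agent_fn and judge_fn are async wrappers around your agent and evaluate_response, so each judge result exposes a .score:

async def run_evals(eval_cases: list[dict], agent_fn, judge_fn) -> list[dict]:
    results = await batch_evaluate(eval_cases, agent_fn, judge_fn, concurrency=5)
    mean_score = sum(r["judge"].score for r in results) / len(results)
    low_scoring = [r for r in results if r["judge"].score <= 2]
    print(f"mean judge score: {mean_score:.2f}, low-scoring cases: {len(low_scoring)}")
    return results

# asyncio.run(run_evals(eval_cases, agent_fn, judge_fn))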

FAQ

Does the judge model need to be stronger than the agent model?

Yes, generally. A GPT-4o judge evaluating GPT-3.5 agent outputs works well. Judging a model with an equally capable or weaker model produces unreliable scores because the judge may share the same blind spots.

How do I prevent position bias in the judge?

When comparing two responses (A vs B), run the evaluation twice — once with A first, once with B first — and average the results. This counteracts the tendency for LLMs to prefer whichever response appears first.
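
A minimal sketch of the swap-and-average trick, assuming a hypothetical judge_pair(question, first, second) that returns how strongly the judge prefers the first response on a 0-1 scale:

def debias_pairwise(question: str, resp_a: str, resp_b: str, judge_pair) -> float:
    """Score A vs B in both orders and average to cancel position bias."""
    a_shown_first = judge_pair(question, resp_a, resp_b)        # preference for A, A shown first
    b_shown_first = 1.0 - judge_pair(question, resp_b, resp_a)  # preference for A, B shown first
    return (a_shown_first + b_shown_first) / 2                  # > 0.5 means A wins after debiasing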

How much does LLM-as-Judge cost compared to human evaluation?

Evaluating 1,000 responses with GPT-4o costs roughly two to five dollars depending on response length. The same volume of human evaluation typically costs hundreds of dollars and takes days. LLM-as-Judge is roughly 50-100x cheaper.
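
A back-of-envelope estimate of the judge cost; the token counts and per-million-token prices below are illustrative assumptions, not current list prices:

N_RESPONSES = 1_000
INPUT_TOKENS_PER_EVAL = 1_200   # rubric + question + response + reference
OUTPUT_TOKENS_PER_EVAL = 200    # JSON scores plus short reasoning
PRICE_IN_PER_M = 2.50           # assumed $ per 1M input tokens
PRICE_OUT_PER_M = 10.00         # assumed $ per 1M output tokens

cost = N_RESPONSES * (
    INPUT_TOKENS_PER_EVAL * PRICE_IN_PER_M
    + OUTPUT_TOKENS_PER_EVAL * PRICE_OUT_PER_M
) / 1_000_000
print(f"estimated judge cost: ${cost:.2f}")  # about $5 with these assumptions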


#LLMasJudge #Evaluation #AIAgents #AutomatedTesting #Python #ScoringRubrics #AgenticAI #LearnAI #AIEngineering
