LLM-as-Judge: Using AI to Evaluate AI Agent Responses Automatically

Learn how to use LLMs as automated judges to evaluate AI agent responses with scoring rubrics, calibration techniques, and multi-criteria evaluation frameworks in Python.

Why Use an LLM to Judge Another LLM

Human evaluation is the gold standard for assessing agent quality, but it does not scale. Reviewing 500 agent responses manually takes days. LLM-as-Judge is a technique where you use a strong language model to score the outputs of your agent automatically, giving you scalable evaluation that correlates well with human judgment when calibrated correctly.

Research from teams at Google, Anthropic, and OpenAI shows that GPT-4-class models achieve 80-90% agreement with human raters on well-defined criteria. The key is writing precise rubrics and calibrating the judge against human labels.

Basic Judge Implementation

A judge is simply an LLM call with a structured prompt that asks for a score and a justification. The flowchart below shows where a judge typically sits in a CI evaluation gate; the code that follows implements the judge call itself.

flowchart LR
    PR(["PR opened"])
    UNIT["Unit tests"]
    EVAL["Eval harness<br/>PromptFoo or Braintrust"]
    GOLD[("Golden set<br/>200 tagged cases")]
    JUDGE["LLM as judge<br/>plus regex graders"]
    SCORE["Aggregate score<br/>and per slice"]
    GATE{"Score regress<br/>more than 2 percent?"}
    BLOCK(["Block merge"])
    MERGE(["Merge to main"])
    PR --> UNIT --> EVAL --> GOLD --> JUDGE --> SCORE --> GATE
    GATE -->|Yes| BLOCK
    GATE -->|No| MERGE
    style EVAL fill:#4f46e5,stroke:#4338ca,color:#fff
    style GATE fill:#f59e0b,stroke:#d97706,color:#1f2937
    style BLOCK fill:#dc2626,stroke:#b91c1c,color:#fff
    style MERGE fill:#059669,stroke:#047857,color:#fff

import openai
import json
from dataclasses import dataclass

@dataclass
class JudgeResult:
    score: int           # 1-5
    reasoning: str
    criteria_scores: dict[str, int]

def evaluate_response(
    question: str,
    agent_response: str,
    reference_answer: str,
    client: openai.OpenAI,
    model: str = "gpt-4o",
) -> JudgeResult:
    prompt = f"""You are an expert evaluator. Score the following agent response.

Question: {question}
Reference Answer: {reference_answer}
Agent Response: {agent_response}

Score each criterion from 1 (poor) to 5 (excellent):
1. Correctness: Is the information accurate?
2. Completeness: Does it address all parts of the question?
3. Clarity: Is the response well-organized and easy to understand?

Return JSON:
{{"correctness": <int>, "completeness": <int>, "clarity": <int>, "overall": <int>, "reasoning": "<explanation>"}}
"""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    data = json.loads(response.choices[0].message.content)
    return JudgeResult(
        score=data["overall"],
        reasoning=data["reasoning"],
        criteria_scores={
            k: data[k] for k in ["correctness", "completeness", "clarity"]
        },
    )
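
A minimal usage sketch (the question, agent response, and reference answer are made up for illustration, and the client assumes an OPENAI_API_KEY in the environment):

client = openai.OpenAI()

result = evaluate_response(
    question="What is the refund window for annual plans?",
    agent_response="Annual plans can be refunded within 30 days of purchase.",
    reference_answer="Refunds are available within 30 days for annual subscriptions.",
    client=client,
)
print(result.score, result.criteria_scores)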

Designing Scoring Rubrics

Vague rubrics produce inconsistent scores. Anchor each score level to concrete behaviors.

RUBRIC = """
## Correctness Rubric
- 5: All facts are accurate, no hallucinations
- 4: Minor inaccuracy that does not affect the main answer
- 3: One significant error, but the core answer is correct
- 2: Multiple errors or a critical factual mistake
- 1: The answer is fundamentally wrong or fabricated

## Completeness Rubric
- 5: Addresses every part of the question with sufficient detail
- 4: Addresses all parts but one lacks detail
- 3: Misses one part of a multi-part question
- 2: Only partially addresses the question
- 1: Fails to address the question at all
"""

def build_judge_prompt(question: str, response: str, rubric: str = RUBRIC) -> str:
    return f"""Evaluate this agent response using the rubric below.

{rubric}

Question: {question}
Response: {response}

Return JSON with scores and reasoning for each criterion."""
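
The rubric-anchored prompt plugs into the same JSON-mode call shown earlier. A sketch, reusing the client from the previous example; the question and response are illustrative:

judge_prompt = build_judge_prompt(
    question="How do I rotate an API key?",
    response="Go to Settings, open API Keys, and click Rotate.",
)
raw = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": judge_prompt}],
    response_format={"type": "json_object"},
)
rubric_scores = json.loads(raw.choices[0].message.content)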

Calibration Against Human Labels

Before trusting LLM-as-Judge scores, calibrate by comparing judge scores to human ratings on a labeled subset.

import numpy as np
from scipy import stats

def calibrate_judge(
    human_scores: list[int],
    judge_scores: list[int],
) -> dict:
    """Compare judge scores against human ground truth."""
    correlation, p_value = stats.spearmanr(human_scores, judge_scores)

    exact_match = sum(h == j for h, j in zip(human_scores, judge_scores))
    within_one = sum(abs(h - j) <= 1 for h, j in zip(human_scores, judge_scores))

    return {
        "spearman_correlation": round(correlation, 3),
        "p_value": round(p_value, 4),
        "exact_match_pct": round(exact_match / len(human_scores) * 100, 1),
        "within_one_pct": round(within_one / len(human_scores) * 100, 1),
        "judge_mean": round(np.mean(judge_scores), 2),
        "human_mean": round(np.mean(human_scores), 2),
        "bias": round(np.mean(judge_scores) - np.mean(human_scores), 2),
    }

A Spearman correlation above 0.7 and within-one agreement above 85% indicate a reliable judge. If the bias is consistently positive, the judge is too lenient; tighten the rubric.
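
To make the pass/fail decision explicit, a small helper can apply those thresholds to the output of calibrate_judge. A sketch; the 0.5 bias tolerance is an assumption, not a standard:

def judge_is_reliable(report: dict) -> bool:
    """Apply reliability thresholds to a calibrate_judge() report."""
    return (
        report["spearman_correlation"] >= 0.7
        and report["within_one_pct"] >= 85.0
        and abs(report["bias"]) <= 0.5  # tolerance for leniency/harshness
    )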

Multi-Criteria Evaluation

Combine individual criteria into a weighted overall score.

def weighted_score(criteria_scores: dict[str, int], weights: dict[str, float]) -> float:
    total = sum(criteria_scores[k] * weights[k] for k in criteria_scores)
    weight_sum = sum(weights[k] for k in criteria_scores)
    return round(total / weight_sum, 2)

# For a customer support agent, correctness matters most
SUPPORT_WEIGHTS = {"correctness": 0.5, "completeness": 0.3, "clarity": 0.2}

# For a creative writing agent, clarity matters most
CREATIVE_WEIGHTS = {"correctness": 0.2, "completeness": 0.2, "clarity": 0.6}
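
A quick check of how the weights change the outcome for the same criteria scores:

scores = {"correctness": 5, "completeness": 4, "clarity": 3}
print(weighted_score(scores, SUPPORT_WEIGHTS))   # 4.3 -- pulled up by the strong correctness score
print(weighted_score(scores, CREATIVE_WEIGHTS))  # 3.6 -- pulled down by the weaker clarity score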

Running Batch Evaluations

Evaluate your full dataset efficiently with concurrent judge calls.

import asyncio

async def batch_evaluate(
    eval_cases: list[dict],
    agent_fn,
    judge_fn,
    concurrency: int = 5,
) -> list[dict]:
    semaphore = asyncio.Semaphore(concurrency)

    async def process_case(case):
        async with semaphore:
            agent_output = await agent_fn(case["input"])
            judge_result = await judge_fn(
                case["input"], agent_output, case["expected"]
            )
            return {**case, "output": agent_output, "judge": judge_result}

    tasks = [process_case(c) for c in eval_cases]
    results = await asyncio.gather(*tasks)
    return results
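
A sketch of driving the batch and summarizing the results, assuming agent_fn and judge_fn are async wrappers around your agent and evaluate_response, so each judge result exposes a .score:

async def run_evals(eval_cases: list[dict], agent_fn, judge_fn) -> list[dict]:
    results = await batch_evaluate(eval_cases, agent_fn, judge_fn, concurrency=5)
    mean_score = sum(r["judge"].score for r in results) / len(results)
    low_scoring = [r for r in results if r["judge"].score <= 2]
    print(f"mean judge score: {mean_score:.2f}, low-scoring cases: {len(low_scoring)}")
    return results

# asyncio.run(run_evals(eval_cases, agent_fn, judge_fn))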

FAQ

Does the judge model need to be stronger than the agent model?

Yes, generally. A GPT-4o judge evaluating GPT-3.5 agent outputs works well. Judging a model with an equally capable or weaker model produces unreliable scores because the judge may share the same blind spots.

How do I prevent position bias in the judge?

When comparing two responses (A vs B), run the evaluation twice — once with A first, once with B first — and average the results. This counteracts the tendency for LLMs to prefer whichever response appears first.
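
A minimal sketch of the swap-and-average trick, assuming a hypothetical judge_pair(question, first, second) that returns how strongly the judge prefers the first response on a 0-1 scale:

def debias_pairwise(question: str, resp_a: str, resp_b: str, judge_pair) -> float:
    """Score A vs B in both orders and average to cancel position bias."""
    a_shown_first = judge_pair(question, resp_a, resp_b)        # preference for A, A shown first
    b_shown_first = 1.0 - judge_pair(question, resp_b, resp_a)  # preference for A, B shown first
    return (a_shown_first + b_shown_first) / 2                  # > 0.5 means A wins after debiasing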

How much does LLM-as-Judge cost compared to human evaluation?

Evaluating 1,000 responses with GPT-4o costs roughly two to five dollars depending on response length. The same volume of human evaluation typically costs hundreds of dollars and takes days. LLM-as-Judge is roughly 50-100x cheaper.
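
A back-of-envelope estimate of the judge cost; the token counts and per-million-token prices below are illustrative assumptions, not current list prices:

N_RESPONSES = 1_000
INPUT_TOKENS_PER_EVAL = 1_200   # rubric + question + response + reference
OUTPUT_TOKENS_PER_EVAL = 200    # JSON scores plus short reasoning
PRICE_IN_PER_M = 2.50           # assumed $ per 1M input tokens
PRICE_OUT_PER_M = 10.00         # assumed $ per 1M output tokens

cost = N_RESPONSES * (
    INPUT_TOKENS_PER_EVAL * PRICE_IN_PER_M
    + OUTPUT_TOKENS_PER_EVAL * PRICE_OUT_PER_M
) / 1_000_000
print(f"estimated judge cost: ${cost:.2f}")  # about $5 with these assumptions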


#LLMasJudge #Evaluation #AIAgents #AutomatedTesting #Python #ScoringRubrics #AgenticAI #LearnAI #AIEngineering
