Regression Testing for AI Agents: Catching Silent Breakage Before Users Do
Non-deterministic agents break silently when prompts, models, or tools change. Build a regression pipeline with frozen datasets, semantic diffing, and gate thresholds.
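A minimal sketch of that gate, assuming a hypothetical run_agent entry point, a frozen_cases.json file of prompt and baseline pairs, and an illustrative 0.8 threshold; the lexical ratio below is only a stand-in for a real semantic-diff scorer:

```python
import json
from difflib import SequenceMatcher


def run_agent(prompt: str) -> str:
    """Placeholder for the agent under test (hypothetical)."""
    raise NotImplementedError


def similarity(a: str, b: str) -> float:
    # Stand-in scorer; swap in an embedding- or judge-based semantic diff.
    return SequenceMatcher(None, a, b).ratio()


def regression_gate(dataset_path: str, threshold: float = 0.8) -> bool:
    """Block the release if any frozen case drifts below the threshold."""
    with open(dataset_path) as f:
        cases = json.load(f)  # [{"prompt": ..., "baseline_output": ...}, ...]

    failures = []
    for case in cases:
        score = similarity(case["baseline_output"], run_agent(case["prompt"]))
        if score < threshold:
            failures.append((case["prompt"], score))

    for prompt, score in failures:
        print(f"REGRESSION {score:.2f}: {prompt[:60]}")
    return not failures


if __name__ == "__main__":
    raise SystemExit(0 if regression_gate("frozen_cases.json") else 1)
```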
Offline evals run against a fixed dataset and catch regressions before deploy. Online evals run on live traffic and catch real-world drift. You need both; here is how we run them.
Tracing fundamentals for production AI agents — span hierarchy, session correlation, and the failure patterns that only show up when you trace every step.
Pairwise (A vs B) LLM-as-judge evaluation produces a sharper, more reliable signal than absolute scoring for non-deterministic agent outputs. Here is why and how.
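One way to make that concrete: judge each candidate output against a frozen baseline, ask twice with the answer order swapped so position bias cancels out, and treat disagreement as a tie. The call_judge function and the terse prompt below are assumptions for illustration, not the article's actual rubric:

```python
JUDGE_PROMPT = """You are comparing two answers to the same task.
Task: {task}
Answer A: {a}
Answer B: {b}
Reply with exactly "A" or "B" for the better answer."""


def call_judge(prompt: str) -> str:
    """Placeholder for an LLM judge call (hypothetical); returns "A" or "B"."""
    raise NotImplementedError


def pairwise_verdict(task: str, baseline: str, candidate: str) -> str:
    # Judge both orderings; if the two verdicts disagree, record a tie
    # rather than a win for either side.
    first = call_judge(JUDGE_PROMPT.format(task=task, a=baseline, b=candidate))
    second = call_judge(JUDGE_PROMPT.format(task=task, a=candidate, b=baseline))
    if first == "B" and second == "A":
        return "candidate"
    if first == "A" and second == "B":
        return "baseline"
    return "tie"
```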
Build a working computer-use agent with the OpenAI Computer Use tool, one that clicks, types, and scrolls in a real browser, then evaluate task success on a benchmark suite.
A single 'did the agent answer correctly?' pass/fail check hides broken tool calls, wasted tokens, and silent retries. Here is how to evaluate intermediate steps.
A principal engineer's playbook for curating, versioning, and growing a golden dataset for an agent — from production trace mining to annotation queues in LangSmith.
Eval scores alone mislead. Here is how we build a Pareto view across cost, latency, and quality so agent releases ship on signal, not vibes.
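The Pareto view itself is small to sketch: keep only the release candidates that no other candidate beats on cost, latency, and quality at once. The candidate names and numbers below are illustrative placeholders, not measurements:

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    cost_usd: float   # cost per task, lower is better
    latency_s: float  # p95 latency in seconds, lower is better
    quality: float    # eval score in [0, 1], higher is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every axis and strictly better on one."""
    no_worse = (a.cost_usd <= b.cost_usd and a.latency_s <= b.latency_s
                and a.quality >= b.quality)
    better = (a.cost_usd < b.cost_usd or a.latency_s < b.latency_s
              or a.quality > b.quality)
    return no_worse and better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Illustrative placeholder numbers, not real measurements.
releases = [
    Candidate("large-model-prompt-v2", 0.042, 6.1, 0.91),
    Candidate("small-model-prompt-v2", 0.009, 2.3, 0.86),
    Candidate("small-model-prompt-v1", 0.009, 2.4, 0.79),
]
for c in pareto_front(releases):
    print(c.name, c.cost_usd, c.latency_s, c.quality)
```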