
Agentic AI & LLM Engineering

Deep dives into agentic AI, LLM evaluation, synthetic data generation, model selection, and production AI engineering best practices.

9 of 426 articles

Agentic AI
13 min read · May 5, 2026

Regression Testing for AI Agents: Catching Silent Breakage Before Users Do

Non-deterministic agents break silently when prompts, models, or tools change. Build a regression pipeline with frozen datasets, semantic diffing, and gate thresholds.

Agentic AI
13 min read · May 5, 2026

Online vs Offline Agent Evaluation: The Pre-Deploy / Post-Deploy Split

Offline evals catch regressions before deploy on a fixed dataset. Online evals catch real-world drift on live traffic. You need both — here is how we run them.

Agentic AI
12 min read · May 5, 2026

Agent Tracing 101: Spans, Sessions, and the Hidden Failure Modes They Reveal

Tracing fundamentals for production AI agents — span hierarchy, session correlation, and the failure patterns that only show up when you trace every step.

Agentic AI
12 min read · May 5, 2026

LLM-as-Judge: Why Pairwise Evaluation Beats Reference-Based Scoring for Agents

Pairwise (A vs B) LLM-as-judge evaluation produces sharper, more reliable signal than absolute scoring for non-deterministic agent outputs. Here is why and how.

Agentic AI
13 min read · May 5, 2026

OpenAI Computer-Use Agents (CUA) in Production: Build + Evaluate a Real Workflow (2026)

Build a working computer-use agent with the OpenAI Computer Use tool, one that clicks, types, and scrolls in a real browser, then evaluate task success on a benchmark suite.

Agentic AI
13 min read · May 5, 2026

Evaluating Multi-Step Tool-Using Agents: Why End-to-End Metrics Lie

A 'did the agent answer correctly?' pass/fail check hides broken tool calls, wasted tokens, and silent retries. Here is how to evaluate intermediate steps.

Agentic AI
12 min read · May 5, 2026

How to Build a Golden Dataset for Production AI Agents

A principal engineer's playbook for curating, versioning, and growing a golden dataset for an agent — from production trace mining to annotation queues in LangSmith.

Agentic AI
12 min read · May 5, 2026

Cost-Aware Agent Evaluation: Putting Token Spend, Latency, and Quality on the Same Dashboard

Eval scores alone mislead. Here is how we build a Pareto view across cost, latency, and quality so agent releases ship on signal, not vibes.