Loading…
Loading…
An AI agent without evaluation is a Schrödinger's bug: it might be working brilliantly or silently regressing, and you only find out from customer complaints. This guide covers how to stand up a working evaluation framework from zero—the eval set design, the scoring approach, the CI integration, and the disciplines that make evaluation an actual safety net rather than a checkbox.
Written by Max Zeshut
Founder at Agentmelt
Unit tests check deterministic outputs: input A produces output B, every time. Agent evaluations check probabilistic behavior: input A should produce output in a class of acceptable outputs, most of the time. Evaluation involves rubrics (not exact-match assertions), sampling (most evals are too expensive to run on every commit), and statistical reasoning about regressions (a 3-point score drop on a 200-case eval may be noise or may be a real regression—you need to know which). Treating evals like unit tests produces flaky red builds and trains the team to ignore them.
Aim for 100-300 representative tasks before you cross the 'this set is useful' threshold; some teams have 1,000+ at maturity. Sources: real production traffic samples (anonymized), known historical failures (every incident becomes a permanent test), edge cases the team can articulate, and adversarial inputs you'd want the agent to handle gracefully. Diversity matters more than volume—50 well-chosen cases that cover the failure modes outperform 500 redundant cases that all look the same. Tag each case with metadata (difficulty, category, customer segment) so you can score by dimension.
For each case, define what 'correct' means. Options: (1) exact-match on a final answer (works for classification, extraction), (2) graded scoring against a rubric ('1-5 on helpfulness, 1-5 on safety, 0 or 1 on policy compliance'), (3) preference scoring (model A vs model B, judged by a third model or human), and (4) execution-based scoring (does the generated code pass these tests). Most production eval sets use a mix. The rubric must be specific enough that two reasonable humans applying it produce the same score 80%+ of the time—if not, the rubric is the bug.
Three options: human review (gold standard, expensive, slow), LLM-as-judge (cheap, fast, biased toward verbose answers and confident-sounding wrong answers), and code-based evaluators (deterministic, only work for well-structured outputs). Production teams use all three: code-based where possible, LLM-as-judge for everything code can't grade, and human review as the final calibration on a sample. Cross-check your LLM judge against human labels regularly—judges drift in subtle ways, and a drifting judge produces a quietly broken eval set.
Every change that could affect agent behavior runs the eval set automatically: prompt changes, model upgrades, tool changes, retrieval changes, framework upgrades. Block merges on regressions beyond a defined threshold. Many teams add 'eval review' to PRs the way 'code review' is required—the diff includes both the code change and the eval-score change. This is the practice that turns evaluation from a one-time checkpoint into [[agent-eval-driven-development]].
Pre-launch evaluation catches known failure modes. Production evaluation catches unknown ones. Sample 1-5% of real production traffic, send the output to your judge pipeline (or human review), and watch for drift in scores over time. When production scores drop, find the new failure mode, add it to your eval set, fix it, and prevent its recurrence. The eval set is a living artifact—it should grow by 10-30% over the first year of production as new failure modes are discovered and pinned down.
Don't: (1) evaluate on the same data you tuned the prompt against—you'll just measure how well you memorized the eval set, (2) optimize for the metric in ways that hurt the underlying goal (a 99% score on an eval that doesn't capture what users actually want is worse than 80% on one that does), (3) skip the rubric specificity step—'is this answer good?' is not a rubric, (4) ignore variance—one run is noise; aggregate 3-5 runs at a stable temperature. Most failed eval frameworks die from one of these four problems.
For a single agent: 2-4 weeks to a working v1 (100-case eval set, scoring rubric, CI integration). Add another 4-8 weeks to reach maturity (broader coverage, calibrated LLM judge, continuous production sampling). For teams running multiple agents, the same framework typically extends to new agents in days, not weeks—the infrastructure investment compounds.
Start with a third-party platform (Braintrust, LangSmith, Logfire, Phoenix, Arize) unless you have specialized needs. The platforms ship the boring infrastructure—dataset versioning, judge orchestration, score aggregation, regression alerts—so your team can focus on what's specific to your domain (which cases to include, how to score them, what counts as a regression). The handful of teams that build their own usually do so for compliance or for very large eval workloads where licensing costs exceed engineering investment.
Depends entirely on the task and rubric. The number on its own is meaningless; what matters is (1) the trend over time (going up = improving, flat = at ceiling, down = regression), (2) the breakdown by category (which case types are failing?), and (3) the comparison to your alternative (human-only, prior model, prior prompt). Anchor your team on relative measurements, not absolute thresholds.