The Eval Checklist

Nine pieces of LLM evaluation, in the order you build them. Each item is a testable claim about your eval implementation. The unchecked items are your gaps.

Essential items apply to every eval system. Items marked (when applicable) depend on your system type — RAG, agents, multi-turn, etc.

01 Instrument & Collect Traces

No traces, no evals. Full execution records are the raw material.

  • Capturing input + final output for every request
  • Capturing intermediate steps — tool calls, retrieved docs, reasoning
  • Structured logging — role-tagged (user / assistant / tool / system), timestamped
  • (when applicable) Multi-turn state tracking — what the system knew at each turn
  • (when applicable) Agent traces — tool selection, parameters, error handling, step count, cost
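Concretely, a trace can be as simple as a list of role-tagged, timestamped events per request. A minimal Python sketch — the class and field names are illustrative, not a standard:

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class TraceEvent:
    role: str        # "user" | "assistant" | "tool" | "system"
    content: str
    timestamp: float = field(default_factory=time.time)
    metadata: dict = field(default_factory=dict)  # tool name, params, doc ids, cost…

@dataclass
class Trace:
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    events: list = field(default_factory=list)

    def log(self, role: str, content: str, **metadata) -> None:
        self.events.append(TraceEvent(role, content, metadata=metadata))

    def to_json(self) -> str:
        # Serializable record: the raw material every later step consumes.
        return json.dumps(asdict(self), default=str)

# One request end to end: input, intermediate tool call, final output.
trace = Trace()
trace.log("user", "What's my refund status?")
trace.log("tool", '{"order_id": "A123", "status": "refunded"}', tool="lookup_order")
trace.log("assistant", "Your order A123 was refunded.")
```

The point is the shape, not the implementation: every event carries a role, a timestamp, and room for step-level metadata, so intermediate tool calls survive alongside the final answer.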

02 Error Analysis → Failure Taxonomy

Domain expert reads 50–100 traces. Categories emerge from observation. Output: 5–10 failure modes ranked by impact. This is 60–80% of real development effort.

  • Domain expert reviewed 50–100 traces personally, marking each pass/fail
  • Categories emerged from traces — not brainstormed or borrowed from another project
  • 5–10 distinct failure modes with computed rates, ranked by frequency × impact
  • Saturation reached — no new failure types in the last 20 traces
  • Refined 2–3 times — merged overlapping categories, split too-broad ones, re-labeled
  • (when applicable) Custom review interface — one trace at a time, pass/fail buttons, keyboard shortcuts
  • (when applicable) Sampling beyond random — e.g. embed traces and sample from each cluster, sample by user segment or time period, or prioritize traces where automated evaluators disagree
  • (when applicable) Multi-turn: first failure point labeled — errors cascade, fix upstream first
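Once traces carry labels, the frequency × impact ranking is a few lines of counting. A sketch with hypothetical failure-mode tags and impact weights:

```python
from collections import Counter

# Hypothetical output of a manual review pass: one list of failure-mode
# tags per trace (empty list = the trace passed every category).
labels = [
    ["wrong_retrieval"], [], ["hallucinated_fact", "wrong_retrieval"],
    [], ["ignored_instruction"], ["wrong_retrieval"], [], [],
]
# Impact weights are a judgment call by the domain expert.
impact = {"wrong_retrieval": 3, "hallucinated_fact": 5, "ignored_instruction": 2}

counts = Counter(tag for trace in labels for tag in trace)
n = len(labels)
ranked = sorted(
    ((tag, counts[tag] / n, counts[tag] / n * impact[tag]) for tag in counts),
    key=lambda row: row[2],
    reverse=True,
)
for tag, rate, priority in ranked:
    print(f"{tag:20s} rate={rate:.0%} priority={priority:.2f}")
```

The ranking, not the raw counts, decides what you fix first: a rare but severe failure can outrank a common cosmetic one.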

03 Fix the Obvious

Before building evaluators: fix what doesn't need one.

  • Prompt gaps fixed — missing instructions, ambiguous directives
  • Engineering bugs fixed — parsing errors, truncated context, wrong model version
  • (when applicable) Missing tools added — agent couldn't retrieve context or take a needed action
  • Error analysis re-run on fresh traces after fixes to confirm what remains

04 Curate Golden Dataset

Labeled examples with known-correct pass/fail per failure mode. Your ground truth. ~100 to start.

  • Binary labels per failure category for each trace — not a single overall score
  • Train / dev / test split — ~15% train (provides few-shot examples for your judges), ~40% dev (tune judge prompts against), ~45% test (held out, measured exactly once at the end)
  • No contamination — test data never appears in few-shot prompts or training data
  • Stratified by failure mode — each split has proportional representation
  • (when applicable) Dataset evolves over time — remove examples the system always gets right (they're no longer testing anything), add new edge cases found in production. Prevents overfitting to a static test set
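The stratified split fits in a short function. A sketch, assuming each golden example carries a dominant failure-mode tag to stratify on:

```python
import random

def stratified_split(traces, key, fractions=(0.15, 0.40, 0.45), seed=0):
    """Split into train/dev/test so each failure mode keeps roughly
    the same proportion in every split."""
    rng = random.Random(seed)
    buckets = {}
    for t in traces:
        buckets.setdefault(key(t), []).append(t)
    splits = ([], [], [])
    for group in buckets.values():
        rng.shuffle(group)
        a = round(len(group) * fractions[0])
        b = a + round(len(group) * fractions[1])
        splits[0].extend(group[:a])   # train: few-shot examples for judges
        splits[1].extend(group[a:b])  # dev: tune judge prompts here
        splits[2].extend(group[b:])   # test: touched exactly once
    return splits

# Hypothetical golden set: 40 hallucinations, 60 retrieval failures.
traces = [{"id": i, "mode": m} for i, m in enumerate(
    ["hallucination"] * 40 + ["wrong_retrieval"] * 60)]
train, dev, test = stratified_split(traces, key=lambda t: t["mode"])
```

Splitting per bucket rather than globally is what makes the split stratified: a mode that is 40% of the dataset stays near 40% of each split.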

05 Code-Based Checks

Deterministic assertions. Always try code before an LLM judge — fast, cheap, no hallucinations.

  • Format & schema validation — JSON structure, required fields, type checks
  • Internal consistency — totals match parts, percentages sum, date ranges valid
  • Tried code first for every failure mode before reaching for an LLM judge
  • (when applicable) Inline guardrails — real-time blocks (PII, safety) separate from async evaluators
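A deterministic check is just a function from output to a list of named failures. A sketch for a hypothetical numeric-summary output, covering both schema and consistency:

```python
def check_summary(output: dict) -> list:
    """Return the names of failed checks; empty list = pass."""
    failures = []
    # Schema: required fields with the right types.
    for name, typ in (("total", (int, float)), ("items", list)):
        if not isinstance(output.get(name), typ):
            failures.append(f"schema:{name}")
    if failures:  # can't run consistency checks on a broken schema
        return failures
    # Internal consistency: the total must match the sum of its parts.
    if abs(sum(i["amount"] for i in output["items"]) - output["total"]) > 1e-6:
        failures.append("consistency:total_mismatch")
    return failures

good = {"total": 30.0, "items": [{"amount": 10.0}, {"amount": 20.0}]}
bad  = {"total": 99.0, "items": [{"amount": 10.0}, {"amount": 20.0}]}
```

Each failure gets a name so results feed straight into the per-category rates from step 02 — no judge, no latency, no hallucination risk.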

06 LLM-as-Judge

Binary pass/fail for subjective failures. One judge per failure mode. No similarity metrics, no Likert scales.

  • One judge per failure mode — not one catch-all multi-criteria judge
  • Binary pass/fail output — no scales, grades, or numeric scores
  • Critique before verdict — judge outputs {"critique": "…", "result": "Pass/Fail"} so it must reason before deciding, like chain-of-thought for evaluation
  • Explicit pass/fail definitions with edge cases, grounded in observed failure patterns
  • Few-shot examples included in the prompt — show the judge at least one clear pass, one clear fail, and one borderline case. These come from the train split of your golden dataset (step 04), never from dev or test
  • Minimal context per judge — only the trace slice relevant to that criterion
  • (when applicable) Criteria decomposition — instead of one judge for "is this good?", split into independent sub-judges: "are the numbers accurate?", "are claims supported?", "does it answer the question?" Each scored separately, then combined
  • (when applicable) Consensus scoring — run the same judge 3–5× on the same trace, take the majority verdict. Smooths out randomness in LLM judgment
  • (when applicable) Tiered rationales — judge first does a quick pass ("is this obviously wrong?"), then only applies detailed reasoning to borderline cases. Reduces cost while improving accuracy on hard calls
  • (when applicable) Agents-as-Judge — for factual accuracy, deploy separate agents that each verify one narrow claim type (e.g. one checks plot facts, another checks award facts). Simpler context per agent = higher reliability
  • (when applicable) RAG: retrieval and generation evaluated separately — did the retriever find the right documents? (Recall@k) Then: is the generated answer faithful to those documents and does it answer the question?
  • (when applicable) Auto-fix loop — evaluator detects problem → regenerate, max retries set, both versions logged
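Critique-before-verdict and consensus scoring compose naturally. A sketch with a stubbed model call standing in for a real client — the prompt text, JSON shape, and `call_model` signature are illustrative assumptions:

```python
import json
from collections import Counter

# One judge, one failure mode, binary output, critique before verdict.
JUDGE_PROMPT = """You are checking one failure mode: unsupported claims.
Pass: every claim is supported by the provided context.
Fail: any claim lacks support.
Respond as JSON: {"critique": "...", "result": "Pass" or "Fail"}"""

def judge_once(call_model, trace_slice: str) -> str:
    """call_model is a placeholder for your LLM client: it takes a
    prompt string and returns the raw completion text."""
    raw = call_model(f"{JUDGE_PROMPT}\n\nTrace:\n{trace_slice}")
    verdict = json.loads(raw)
    assert verdict["result"] in ("Pass", "Fail")
    return verdict["result"]

def judge_consensus(call_model, trace_slice: str, runs: int = 5) -> str:
    """Majority vote over repeated runs smooths out judge randomness."""
    votes = Counter(judge_once(call_model, trace_slice) for _ in range(runs))
    return votes.most_common(1)[0][0]

# Stub standing in for a real model call, for demonstration only.
fake_replies = iter(['{"critique": "ok", "result": "Pass"}'] * 3 +
                    ['{"critique": "claim X unsupported", "result": "Fail"}'] * 2)
result = judge_consensus(lambda prompt: next(fake_replies), "trace text", runs=5)
```

Forcing the critique field before the result field is the evaluation analogue of chain-of-thought; the majority vote is the consensus-scoring item above.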

07 Validate Evaluators

Without calibration, your judge is an opinion. With it, an instrument with known error characteristics.

  • TPR and TNR measured — True Positive Rate: when a human says Pass, how often does the judge agree? True Negative Rate: when a human says Fail, how often does the judge agree? Raw accuracy is misleading — a judge that always says "Pass" gets 90% accuracy if 90% of traces pass, but catches zero failures
  • Both TPR and TNR > 80% minimum, targeting > 90%. Below 80% the judge is not reliable enough to trust
  • Iterated on dev set only — test set touched exactly once for final measurement
  • Disagreements inspected — False Pass → strengthen Fail definition; False Fail → clarify Pass definition
  • Exact model version pinned — gpt-4o-2024-05-13, not gpt-4o
  • Rogan-Gladen correction applied — your judge has known error rates (TPR/TNR). When it reports "85% of production traces pass," that raw number is biased. The correction adjusts: true_rate = (observed_rate + TNR − 1) / (TPR + TNR − 1). Report the corrected number to stakeholders
  • Confidence intervals computed — resample your test labels many times and recompute the metric each time to get error bars. A single number ("92% pass rate") without a range is false precision
  • Re-validation plan exists — triggered by prompt changes, model switches, or CI drift
  • (when applicable) User behavior validation — judge scores correlated with actual outcome metrics
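TPR/TNR, the Rogan-Gladen correction, and a bootstrap interval fit in a short module. A sketch, assuming boolean labels where True = Pass:

```python
import random

def tpr_tnr(human, judge):
    """TPR: agreement rate when the human says Pass.
    TNR: agreement rate when the human says Fail."""
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum((not h) and (not j) for h, j in zip(human, judge))
    pos = sum(human)
    neg = len(human) - pos
    return tp / pos, tn / neg

def rogan_gladen(observed_rate, tpr, tnr):
    """Correct a judge-reported pass rate for known judge error."""
    corrected = (observed_rate + tnr - 1) / (tpr + tnr - 1)
    return min(1.0, max(0.0, corrected))  # clamp to a valid proportion

def bootstrap_ci(labels, n_boot=2000, seed=0):
    """95% interval on a pass rate by resampling labels with replacement."""
    rng = random.Random(seed)
    rates = sorted(
        sum(rng.choices(labels, k=len(labels))) / len(labels)
        for _ in range(n_boot))
    return rates[int(0.025 * n_boot)], rates[int(0.975 * n_boot)]
```

For example, a judge with TPR = TNR = 0.9 reporting an 85% observed pass rate corrects to (0.85 + 0.9 − 1) / (0.9 + 0.9 − 1) ≈ 93.75% — the number stakeholders should see, with the bootstrap interval as its error bars.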

08 Generate Synthetic Data

Stress-test coverage gaps. Production data clusters around common paths; synthetic data fills the sparse regions deliberately.

  • 3 dimensions of variation defined — pick 3 axes that target where you expect failures. For a support bot: Feature (billing / returns / account) × Customer Type (new / power user / angry) × Complexity (simple / multi-step / ambiguous). Each combination is a test scenario
  • Combinations validated by domain expert — at least 20 human-reviewed (dimension₁, dimension₂, dimension₃) combinations. The expert confirms which are realistic and which are nonsense before you generate at scale
  • Two-step generation — first generate the structured combinations, then convert each to natural language in a separate step. One-step produces repetitive phrasing; two-step produces diverse, realistic queries
  • Quality-filtered — awkward phrasing, duplicates, and unrealistic scenarios removed by the domain expert
  • Run through full pipeline — synthetic inputs hit your system, produce traces, and those traces are evaluated by your validated judges from steps 05–07
  • (when applicable) RAG adversarial questions — questions designed to confuse the retriever by using terminology that appears in irrelevant documents. Tests whether the system retrieves the right chunk, not just a keyword-matching one
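Step one of two-step generation is a plain Cartesian product over the chosen axes. A sketch using the support-bot example above; step two — converting each surviving tuple to natural language — would be a separate LLM call per combination (not shown):

```python
import itertools

# Three axes of variation for the hypothetical support bot.
features = ["billing", "returns", "account"]
customer_types = ["new", "power_user", "angry"]
complexities = ["simple", "multi_step", "ambiguous"]

# Step 1: structured combinations (3 × 3 × 3 = 27 scenarios).
combinations = list(itertools.product(features, customer_types, complexities))

# The domain expert strikes unrealistic tuples here, before any
# generation happens at scale.
def to_generation_prompt(combo):
    feature, customer, complexity = combo
    return (f"Write a realistic {complexity} support question about "
            f"{feature} from a {customer} customer.")
```

Generating the structure first and the phrasing second is what keeps the queries diverse: the LLM varies the wording while the tuple pins down the scenario.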

09 CI/CD + Production Monitoring

CI catches regressions before they ship. Monitoring catches drift after. Same evaluators, different jobs.

  • Eval suite runs pre-deploy — code checks + judges on golden dataset subset
  • Deploy blocked on pass-rate drop below threshold per failure category
  • Production traffic sampled continuously and evaluated with validated judges
  • Aggregate rates bias-corrected — when reporting "X% of outputs pass quality" to stakeholders, apply the Rogan-Gladen formula from step 07 to account for known judge error
  • Prompts version-controlled in git with history, diff, and rollback
  • Drift triggers return to step 02 — declining rates, new failure modes, judge accuracy drift
  • Production insights fed back — new failure modes → taxonomy, novel traces → golden dataset
  • (when applicable) Smart sampling — beyond random: sample proportionally across user segments and time periods (stratified), prioritize traces where judges disagree (uncertainty), and pull traces that triggered guardrails or complaints (failure-driven)
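The deploy gate reduces to comparing per-category pass rates against thresholds. A sketch — the category names and threshold values are assumptions, not recommendations:

```python
def deploy_gate(pass_rates, thresholds):
    """Return the failure categories that block the deploy; empty = ship."""
    return sorted(
        cat for cat, rate in pass_rates.items()
        if rate < thresholds.get(cat, 0.90)  # assumed default threshold
    )

blockers = deploy_gate(
    pass_rates={"hallucination": 0.96, "wrong_retrieval": 0.81},
    thresholds={"hallucination": 0.95, "wrong_retrieval": 0.90},
)
# Non-empty list → CI fails the build before deploy.
```

Per-category thresholds matter: a global pass rate can hold steady while one failure mode quietly regresses.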

The loop. After significant changes or drift, return to step 02. Re-run error analysis on fresh traces. The taxonomy evolves. Evaluators sharpen. The system that evaluates the system is itself evaluated.


Sources

Hamel Husain & Shreya Shankar — "LLM Evals: Everything You Need to Know" and the evals-skills Claude Code plugin. The methodology backbone: error analysis, binary evals, judge design, TPR/TNR validation, Rogan-Gladen correction, annotation tooling.

Netflix Tech Blog — "Evaluating Show Synopses with LLM-as-a-Judge" (Alessio, Taylor, Wolfe). Production case study: tiered rationales, consensus scoring, agents-as-judge, member validation against streaming metrics.

Kwok et al. — "LLM-as-a-Verifier" (Stanford / UC Berkeley / NVIDIA). Criteria decomposition, repeated verification, scoring granularity. 86.4% SOTA on Terminal-Bench 2.

Cameron R. Wolfe — "The Anatomy of an LLM Benchmark." Contamination detection, dynamic evaluation, Item Response Theory.