The Eval Checklist

Nine pieces of LLM evaluation, in the order you build them. Each item is a testable claim about your eval implementation. The unchecked items are your gaps.

Essential items apply to every eval system. Items marked (when applicable) depend on your system type — RAG, agents, multi-turn, etc.

01 Instrument & Collect Traces

No traces, no evals. Full execution records are the raw material.

  • Capturing input + final output for every request
  • Capturing intermediate steps — tool calls, retrieved docs, reasoning
  • Structured logging — role-tagged (user / assistant / tool / system), timestamped
  • (when applicable) Multi-turn state tracking — what the system knew at each turn
  • (when applicable) Agent traces — tool selection, parameters, error handling, step count, cost
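Concretely, a trace can be as simple as a list of role-tagged, timestamped events per request. A minimal Python sketch — the class and field names are illustrative, not a standard:

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class TraceEvent:
    role: str        # "user" | "assistant" | "tool" | "system"
    content: str
    timestamp: float = field(default_factory=time.time)
    metadata: dict = field(default_factory=dict)  # tool name, params, doc ids, cost…

@dataclass
class Trace:
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    events: list = field(default_factory=list)

    def log(self, role: str, content: str, **metadata) -> None:
        self.events.append(TraceEvent(role, content, metadata=metadata))

    def to_json(self) -> str:
        # Serializable record: the raw material every later step consumes.
        return json.dumps(asdict(self), default=str)

# One request end to end: input, intermediate tool call, final output.
trace = Trace()
trace.log("user", "What's my refund status?")
trace.log("tool", '{"order_id": "A123", "status": "refunded"}', tool="lookup_order")
trace.log("assistant", "Your order A123 was refunded.")
```

The point is the shape, not the implementation: every event carries a role, a timestamp, and room for step-level metadata, so intermediate tool calls survive alongside the final answer.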

02 Error Analysis → Failure Taxonomy

Domain expert reads 50–100 traces. Categories emerge from observation. Output: 5–10 failure modes ranked by impact. This is 60–80% of real development effort.

  • Domain expert reviewed 50–100 traces personally, marking each pass/fail
  • Categories emerged from traces — not brainstormed or borrowed from another project
  • 5–10 distinct failure modes with computed rates, ranked by frequency × impact
  • Saturation reached — no new failure types in the last 20 traces
  • Refined 2–3 times — merged overlapping categories, split too-broad ones, re-labeled
  • (when applicable) Custom review interface — one trace at a time, pass/fail buttons, keyboard shortcuts
  • (when applicable) Sampling beyond random — e.g. embed traces and sample from each cluster, sample by user segment or time period, or prioritize traces where automated evaluators disagree
  • (when applicable) Multi-turn: first failure point labeled — errors cascade, fix upstream first
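Once traces carry labels, the frequency × impact ranking is a few lines of counting. A sketch with hypothetical failure-mode tags and impact weights:

```python
from collections import Counter

# Hypothetical output of a manual review pass: one list of failure-mode
# tags per trace (empty list = the trace passed every category).
labels = [
    ["wrong_retrieval"], [], ["hallucinated_fact", "wrong_retrieval"],
    [], ["ignored_instruction"], ["wrong_retrieval"], [], [],
]
# Impact weights are a judgment call by the domain expert.
impact = {"wrong_retrieval": 3, "hallucinated_fact": 5, "ignored_instruction": 2}

counts = Counter(tag for trace in labels for tag in trace)
n = len(labels)
ranked = sorted(
    ((tag, counts[tag] / n, counts[tag] / n * impact[tag]) for tag in counts),
    key=lambda row: row[2],
    reverse=True,
)
for tag, rate, priority in ranked:
    print(f"{tag:20s} rate={rate:.0%} priority={priority:.2f}")
```

The ranking, not the raw counts, decides what you fix first: a rare but severe failure can outrank a common cosmetic one.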

03 Fix the Obvious

Before building evaluators: fix what doesn't need one.

  • Prompt gaps fixed — missing instructions, ambiguous directives
  • Engineering bugs fixed — parsing errors, truncated context, wrong model version
  • (when applicable) Missing tools added — agent couldn't retrieve context or take a needed action
  • Error analysis re-run on fresh traces after fixes to confirm what remains

04 Curate Golden Dataset

Labeled examples with known-correct pass/fail per failure mode. Your ground truth. ~100 to start.

  • Binary labels per failure category for each trace — not a single overall score
  • Train / dev / test split — ~15% train (provides few-shot examples for your judges), ~40% dev (tune judge prompts against), ~45% test (held out, measured exactly once at the end)
  • No contamination — test data never appears in few-shot prompts or training data
  • Stratified by failure mode — each split has proportional representation
  • (when applicable) Dataset evolves over time — remove examples the system always gets right (they're no longer testing anything), add new edge cases found in production. Prevents overfitting to a static test set
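The stratified split fits in a short function. A sketch, assuming each golden example carries a dominant failure-mode tag to stratify on:

```python
import random

def stratified_split(traces, key, fractions=(0.15, 0.40, 0.45), seed=0):
    """Split into train/dev/test so each failure mode keeps roughly
    the same proportion in every split."""
    rng = random.Random(seed)
    buckets = {}
    for t in traces:
        buckets.setdefault(key(t), []).append(t)
    splits = ([], [], [])
    for group in buckets.values():
        rng.shuffle(group)
        a = round(len(group) * fractions[0])
        b = a + round(len(group) * fractions[1])
        splits[0].extend(group[:a])   # train: few-shot examples for judges
        splits[1].extend(group[a:b])  # dev: tune judge prompts here
        splits[2].extend(group[b:])   # test: touched exactly once
    return splits

# Hypothetical golden set: 40 hallucinations, 60 retrieval failures.
traces = [{"id": i, "mode": m} for i, m in enumerate(
    ["hallucination"] * 40 + ["wrong_retrieval"] * 60)]
train, dev, test = stratified_split(traces, key=lambda t: t["mode"])
```

Splitting per bucket rather than globally is what makes the split stratified: a mode that is 40% of the dataset stays near 40% of each split.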

05 Code-Based Checks

Deterministic assertions. Always try code before an LLM judge — fast, cheap, no hallucinations.

  • Format & schema validation — JSON structure, required fields, type checks
  • Internal consistency — totals match parts, percentages sum, date ranges valid
  • Tried code first for every failure mode before reaching for an LLM judge
  • (when applicable) Inline guardrails — real-time blocks (PII, safety) separate from async evaluators
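A deterministic check is just a function from output to a list of named failures. A sketch for a hypothetical numeric-summary output, covering both schema and consistency:

```python
def check_summary(output: dict) -> list:
    """Return the names of failed checks; empty list = pass."""
    failures = []
    # Schema: required fields with the right types.
    for name, typ in (("total", (int, float)), ("items", list)):
        if not isinstance(output.get(name), typ):
            failures.append(f"schema:{name}")
    if failures:  # can't run consistency checks on a broken schema
        return failures
    # Internal consistency: the total must match the sum of its parts.
    if abs(sum(i["amount"] for i in output["items"]) - output["total"]) > 1e-6:
        failures.append("consistency:total_mismatch")
    return failures

good = {"total": 30.0, "items": [{"amount": 10.0}, {"amount": 20.0}]}
bad  = {"total": 99.0, "items": [{"amount": 10.0}, {"amount": 20.0}]}
```

Each failure gets a name so results feed straight into the per-category rates from step 02 — no judge, no latency, no hallucination risk.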

06 LLM-as-Judge

Binary pass/fail for subjective failures. One judge per failure mode. No similarity metrics, no Likert scales.

  • One judge per failure mode — not one catch-all multi-criteria judge
  • Binary pass/fail output — no scales, grades, or numeric scores
  • Critique before verdict — judge outputs {"critique": "…", "result": "Pass/Fail"} so it must reason before deciding, like chain-of-thought for evaluation
  • Explicit pass/fail definitions with edge cases, grounded in observed failure patterns
  • Few-shot examples included in the prompt — show the judge at least one clear pass, one clear fail, and one borderline case. These come from the train split of your golden dataset (step 04), never from dev or test
  • Minimal context per judge — only the trace slice relevant to that criterion
  • (when applicable) Criteria decomposition — instead of one judge for "is this good?", split into independent sub-judges: "are the numbers accurate?", "are claims supported?", "does it answer the question?" Each scored separately, then combined
  • (when applicable) Consensus scoring — run the same judge 3–5× on the same trace, take the majority verdict. Smooths out randomness in LLM judgment
  • (when applicable) Tiered rationales — judge first does a quick pass ("is this obviously wrong?"), then only applies detailed reasoning to borderline cases. Reduces cost while improving accuracy on hard calls
  • (when applicable) Agents-as-Judge — for factual accuracy, deploy separate agents that each verify one narrow claim type (e.g. one checks plot facts, another checks award facts). Simpler context per agent = higher reliability
  • (when applicable) RAG: retrieval and generation evaluated separately — did the retriever find the right documents? (Recall@k) Then: is the generated answer faithful to those documents and does it answer the question?
  • (when applicable) Auto-fix loop — evaluator detects problem → regenerate, max retries set, both versions logged
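Critique-before-verdict and consensus scoring compose naturally. A sketch with a stubbed model call standing in for a real client — the prompt text, JSON shape, and `call_model` signature are illustrative assumptions:

```python
import json
from collections import Counter

# One judge, one failure mode, binary output, critique before verdict.
JUDGE_PROMPT = """You are checking one failure mode: unsupported claims.
Pass: every claim is supported by the provided context.
Fail: any claim lacks support.
Respond as JSON: {"critique": "...", "result": "Pass" or "Fail"}"""

def judge_once(call_model, trace_slice: str) -> str:
    """call_model is a placeholder for your LLM client: it takes a
    prompt string and returns the raw completion text."""
    raw = call_model(f"{JUDGE_PROMPT}\n\nTrace:\n{trace_slice}")
    verdict = json.loads(raw)
    assert verdict["result"] in ("Pass", "Fail")
    return verdict["result"]

def judge_consensus(call_model, trace_slice: str, runs: int = 5) -> str:
    """Majority vote over repeated runs smooths out judge randomness."""
    votes = Counter(judge_once(call_model, trace_slice) for _ in range(runs))
    return votes.most_common(1)[0][0]

# Stub standing in for a real model call, for demonstration only.
fake_replies = iter(['{"critique": "ok", "result": "Pass"}'] * 3 +
                    ['{"critique": "claim X unsupported", "result": "Fail"}'] * 2)
result = judge_consensus(lambda prompt: next(fake_replies), "trace text", runs=5)
```

Forcing the critique field before the result field is the evaluation analogue of chain-of-thought; the majority vote is the consensus-scoring item above.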

07 Validate Evaluators

Without calibration, your judge is an opinion. With it, an instrument with known error characteristics.

  • TPR and TNR measured — True Positive Rate: when a human says Pass, how often does the judge agree? True Negative Rate: when a human says Fail, how often does the judge agree? Raw accuracy is misleading — a judge that always says "Pass" gets 90% accuracy if 90% of traces pass, but catches zero failures
  • Both TPR and TNR > 80% minimum, targeting > 90%. Below 80% the judge is not reliable enough to trust
  • Iterated on dev set only — test set touched exactly once for final measurement
  • Disagreements inspected — False Pass → strengthen Fail definition; False Fail → clarify Pass definition
  • Exact model version pinned — gpt-4o-2024-05-13, not gpt-4o
  • Rogan-Gladen correction applied — your judge has known error rates (TPR/TNR). When it reports "85% of production traces pass," that raw number is biased. The correction adjusts: true_rate = (observed_rate + TNR − 1) / (TPR + TNR − 1). Report the corrected number to stakeholders
  • Confidence intervals computed — resample your test labels many times and recompute the metric each time to get error bars. A single number ("92% pass rate") without a range is false precision
  • Re-validation plan exists — triggered by prompt changes, model switches, or CI drift
  • (when applicable) User behavior validation — judge scores correlated with actual outcome metrics
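TPR/TNR, the Rogan-Gladen correction, and a bootstrap interval fit in a short module. A sketch, assuming boolean labels where True = Pass:

```python
import random

def tpr_tnr(human, judge):
    """TPR: agreement rate when the human says Pass.
    TNR: agreement rate when the human says Fail."""
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum((not h) and (not j) for h, j in zip(human, judge))
    pos = sum(human)
    neg = len(human) - pos
    return tp / pos, tn / neg

def rogan_gladen(observed_rate, tpr, tnr):
    """Correct a judge-reported pass rate for known judge error."""
    corrected = (observed_rate + tnr - 1) / (tpr + tnr - 1)
    return min(1.0, max(0.0, corrected))  # clamp to a valid proportion

def bootstrap_ci(labels, n_boot=2000, seed=0):
    """95% interval on a pass rate by resampling labels with replacement."""
    rng = random.Random(seed)
    rates = sorted(
        sum(rng.choices(labels, k=len(labels))) / len(labels)
        for _ in range(n_boot))
    return rates[int(0.025 * n_boot)], rates[int(0.975 * n_boot)]
```

For example, a judge with TPR = TNR = 0.9 reporting an 85% observed pass rate corrects to (0.85 + 0.9 − 1) / (0.9 + 0.9 − 1) ≈ 93.75% — the number stakeholders should see, with the bootstrap interval as its error bars.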

08 Generate Synthetic Data

Stress-test coverage gaps. Production data clusters around common paths; synthetic data fills the sparse regions deliberately.

  • 3 dimensions of variation defined — pick 3 axes that target where you expect failures. For a support bot: Feature (billing / returns / account) × Customer Type (new / power user / angry) × Complexity (simple / multi-step / ambiguous). Each combination is a test scenario
  • Combinations validated by domain expert — at least 20 human-reviewed (dimension₁, dimension₂, dimension₃) combinations. The expert confirms which are realistic and which are nonsense before you generate at scale
  • Two-step generation — first generate the structured combinations, then convert each to natural language in a separate step. One-step produces repetitive phrasing; two-step produces diverse, realistic queries
  • Quality-filtered — awkward phrasing, duplicates, and unrealistic scenarios removed by the domain expert
  • Run through full pipeline — synthetic inputs hit your system, produce traces, and those traces are evaluated by your validated judges from steps 05–07
  • (when applicable) RAG adversarial questions — questions designed to confuse the retriever by using terminology that appears in irrelevant documents. Tests whether the system retrieves the right chunk, not just a keyword-matching one
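Step one of two-step generation is a plain Cartesian product over the chosen axes. A sketch using the support-bot example above; step two — converting each surviving tuple to natural language — would be a separate LLM call per combination (not shown):

```python
import itertools

# Three axes of variation for the hypothetical support bot.
features = ["billing", "returns", "account"]
customer_types = ["new", "power_user", "angry"]
complexities = ["simple", "multi_step", "ambiguous"]

# Step 1: structured combinations (3 × 3 × 3 = 27 scenarios).
combinations = list(itertools.product(features, customer_types, complexities))

# The domain expert strikes unrealistic tuples here, before any
# generation happens at scale.
def to_generation_prompt(combo):
    feature, customer, complexity = combo
    return (f"Write a realistic {complexity} support question about "
            f"{feature} from a {customer} customer.")
```

Generating the structure first and the phrasing second is what keeps the queries diverse: the LLM varies the wording while the tuple pins down the scenario.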

09 CI/CD + Production Monitoring

CI catches regressions before they ship. Monitoring catches drift after. Same evaluators, different jobs.

  • Eval suite runs pre-deploy — code checks + judges on golden dataset subset
  • Deploy blocked on pass-rate drop below threshold per failure category
  • Production traffic sampled continuously and evaluated with validated judges
  • Aggregate rates bias-corrected — when reporting "X% of outputs pass quality" to stakeholders, apply the Rogan-Gladen formula from step 07 to account for known judge error
  • Prompts version-controlled in git with history, diff, and rollback
  • Drift triggers return to step 02 — declining rates, new failure modes, judge accuracy drift
  • Production insights fed back — new failure modes → taxonomy, novel traces → golden dataset
  • (when applicable) Smart sampling — beyond random: sample proportionally across user segments and time periods (stratified), prioritize traces where judges disagree (uncertainty), and pull traces that triggered guardrails or complaints (failure-driven)
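The deploy gate reduces to comparing per-category pass rates against thresholds. A sketch — the category names and threshold values are assumptions, not recommendations:

```python
def deploy_gate(pass_rates, thresholds):
    """Return the failure categories that block the deploy; empty = ship."""
    return sorted(
        cat for cat, rate in pass_rates.items()
        if rate < thresholds.get(cat, 0.90)  # assumed default threshold
    )

blockers = deploy_gate(
    pass_rates={"hallucination": 0.96, "wrong_retrieval": 0.81},
    thresholds={"hallucination": 0.95, "wrong_retrieval": 0.90},
)
# Non-empty list → CI fails the build before deploy.
```

Per-category thresholds matter: a global pass rate can hold steady while one failure mode quietly regresses.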

The loop. After significant changes or drift, return to step 02. Re-run error analysis on fresh traces. The taxonomy evolves. Evaluators sharpen. The system that evaluates the system is itself evaluated.


Sources

Hamel Husain & Shreya Shankar — "LLM Evals: Everything You Need to Know" and the evals-skills Claude Code plugin. The methodology backbone: error analysis, binary evals, judge design, TPR/TNR validation, Rogan-Gladen correction, annotation tooling.

Netflix Tech Blog — "Evaluating Show Synopses with LLM-as-a-Judge" (Alessio, Taylor, Wolfe). Production case study: tiered rationales, consensus scoring, agents-as-judge, member validation against streaming metrics.

Kwok et al. — "LLM-as-a-Verifier" (Stanford / UC Berkeley / NVIDIA). Criteria decomposition, repeated verification, scoring granularity. 86.4% SOTA on Terminal-Bench 2.

Cameron R. Wolfe — "The Anatomy of an LLM Benchmark." Contamination detection, dynamic evaluation, Item Response Theory.