Building evaluation harnesses for production AI systems
Evaluation is the single most under-invested control in enterprise AI. A practical guide to designing, building and operating an evaluation harness that catches regressions before customers do.
The thing that separates AI systems that survive a year in production from AI systems that quietly get pulled is, more often than not, the evaluation harness. Governance, observability, unit economics - all important. But none of them compound without evaluation, because without evaluation you cannot tell whether the system is getting better, worse, or neither.
This is a practical guide to building and operating an evaluation harness for a production LLM-backed system. Written for the delivery engineer who needs to ship it, not for the conference talk.
Why standard ML evaluation isn't enough
Classical ML evaluation assumes you have a labelled dataset, a trained model, and an accuracy number. None of those assumptions fit an LLM system cleanly.
- Labels are expensive. Production traffic is unlabelled. Labelling costs real money.
- The model isn't yours. You're consuming a managed model that updates on the vendor's timeline, not yours.
- Accuracy is multi-dimensional. You care about correctness, style, safety, cost, latency and refusal rate. One number doesn't summarise all of them.
- Outputs are open-ended. There often isn't one right answer - there are better and worse answers.
An evaluation harness has to handle all four. In practice, the part teams skip is not the measuring but the handling: deciding what a score means and what happens when it moves.
The four layers of an evaluation harness
A complete harness has four layers, each solving a different problem.
Layer 1: The golden set
A curated set of 100–500 representative inputs with expected behaviour. The golden set is the backbone; everything else relies on it.
How to build one:
- Sample real production inputs, with PII removed.
- Include the long tail. If 20% of your traffic is non-English, your golden set should be roughly 20% non-English.
- Label each example with expected behaviour - exact output if possible, otherwise a judgement rubric.
- Version it in Git. Review quarterly. Expand as you find new failure modes.
How it's used:
- Run every PR against it. Block merges that drop the score by more than a set threshold.
- Run after every model provider update.
- Track score trend over time; a flat score is fine, a declining score is an alarm.
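The merge-blocking check itself can be tiny. A sketch of a CI gate, assuming the baseline score is committed alongside the golden set and the threshold is an absolute 5-point drop (both choices are yours to tune):

```python
def regression_gate(baseline: float, current: float,
                    max_drop: float = 0.05) -> int:
    """Compare the current golden-set score (0.0-1.0) against the
    committed baseline. Returns a process exit code: 0 allows the
    merge, 1 blocks it."""
    drop = baseline - current
    if drop > max_drop:
        print(f"FAIL: golden-set score dropped {drop:.3f} (limit {max_drop})")
        return 1
    print(f"PASS: score {current:.3f} (baseline {baseline:.3f})")
    return 0
```

In CI you would end the job with `sys.exit(regression_gate(baseline, current))` so a regression fails the pipeline and blocks the merge.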
Layer 2: Automated scoring
Running the golden set is useless if the scoring is manual. You need automated scoring. Three common patterns:
- Exact match. For extraction tasks, classification, or structured output. Cheap, precise, narrow.
- Rubric-based LLM-as-judge. A separate model scores outputs against a rubric. Cheaper than humans, noisier, acceptable for most open-ended tasks if the rubric is well-written.
- Human review. Expensive, slow, the gold standard for high-stakes outputs. Reserved for a sample (e.g., 5% of the golden set weekly, plus anything the rubric scorer is uncertain about).
Use all three: exact match wherever ground truth is clean, LLM-as-judge for open-ended outputs, and human review for the high-stakes sample and the cases the judge is uncertain about.
Layer 3: Production sampling
The golden set is curated. Production traffic is not. Your evaluation harness needs to see real traffic.
How to implement:
- Sample 1–5% of production invocations.
- Apply the same rubric-based scoring.
- Flag low-confidence outputs for human review.
- Feed reviewed examples back into the golden set.
This is how the golden set stays representative. Without this feedback loop, the golden set goes stale in six months.
Layer 4: Red-team harness
Adversarial probes run against the system continuously.
How to build one:
- A catalogue of 50–200 probes organised by attack class (prompt injection, jailbreak, policy violation, data exfiltration, refusal bypass, etc.).
- Severity taxonomy (critical, high, medium, low) with response SLAs.
- Run nightly in staging, weekly in production.
- Coverage review quarterly.
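A probe catalogue can be as simple as structured records plus a runner. A sketch using a substring check as the failure detector, which works for detectable leaks; real harnesses usually add judge-based detection for subtler attack classes:

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"  # tied to your response SLAs
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass
class Probe:
    id: str
    attack_class: str  # "prompt_injection", "jailbreak", ...
    prompt: str
    severity: Severity
    # The probe fails if this marker appears in the output,
    # indicating leaked or forbidden content.
    forbidden_marker: str


def run_probes(probes: list[Probe], system) -> list[str]:
    """Run every probe through the system; return failing probe ids."""
    failures = []
    for p in probes:
        output = system(p.prompt)
        if p.forbidden_marker.lower() in output.lower():
            failures.append(p.id)
    return failures
```

Grouping probes by attack class is what makes the quarterly coverage review tractable: you can see at a glance which classes are thin.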
Red-team coverage is the single most visible signal of a mature evaluation practice to external auditors, and one of the core deliverables in any serious governance engagement.
What to measure
Measurement must be multi-dimensional. At minimum, track:
- Correctness. Against the rubric / golden set.
- Safety. Refusal rate on policy-violating inputs, leak rate on sensitive probes.
- Style consistency. For customer-facing systems.
- Cost per task. Tokens in, tokens out, tool calls, total spend.
- Latency. p50, p95, p99.
- Refusal rate on legitimate inputs. Over-refusal kills trust.
Put these on one dashboard. Review weekly. Set alerts on meaningful drops.
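Collapsing per-task results into those dashboard metrics is straightforward once each evaluation run emits a uniform record. A sketch using nearest-rank percentiles (the record fields here are illustrative):

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile over a batch of samples."""
    ordered = sorted(values)
    k = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[k]


def summarize(results: list[dict]) -> dict:
    """Collapse per-task eval records into dashboard metrics.

    Each record: {"correct": bool, "refused": bool, "legitimate": bool,
                  "latency_ms": float, "cost_usd": float}."""
    n = len(results)
    legit = [r for r in results if r["legitimate"]]
    latencies = [r["latency_ms"] for r in results]
    return {
        "correctness": sum(r["correct"] for r in results) / n,
        # Refusals on legitimate inputs: the over-refusal that kills trust.
        "over_refusal_rate": sum(r["refused"] for r in legit) / len(legit),
        "cost_per_task_usd": sum(r["cost_usd"] for r in results) / n,
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "latency_p99_ms": percentile(latencies, 99),
    }
```

Note that p99 is only meaningful once the batch has well over a hundred samples; on a small golden set, report p95 and leave p99 to production sampling.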
Operating the harness
Building the harness is the one-off; operating it is the permanent commitment. A running harness needs:
- A weekly review meeting, 30 minutes, with the same people each time.
- A named owner of the golden set who reviews additions and removals.
- A process for acting on regressions - rollback, fix, or accept with justification.
- Vendor model-update handling. Every time the provider updates a model, re-run the full harness before promoting the new version to production.
Systems that do these things drift slowly. Systems that don't, regress invisibly.
Common mistakes
Only evaluating happy-path inputs. The golden set over-represents the demo examples. Production fails on the inputs you didn't anticipate.
A golden set of 20. Too small to detect regression. Aim for 100 minimum; 500 for anything with a non-trivial tail.
Human review without structure. "We had someone look at outputs" doesn't scale. Rubrics, sampling rates and coverage metrics turn review into data.
Evaluation as an afterthought. Teams that build evaluation in week one ship production systems. Teams that defer it ship pilots.
Silent provider updates breaking everything. Every managed model provider updates models periodically, sometimes silently. Without a harness running in CI, you find out from a customer.
A minimum viable harness, in one page
If you have a week to stand up an evaluation harness:
- Day 1: Sample 100 real production inputs, redact PII, label expected behaviour. Commit to Git.
- Day 2–3: Write a runner that executes the golden set against your system.
- Day 4: Add exact-match scoring for the half of the set where ground truth is clean.
- Day 5: Add rubric-based LLM-as-judge scoring for the rest.
- Day 6: Wire into CI. Alert if score drops by more than 5%.
- Day 7: Add production sampling at 1%, feeding into a review queue.
That's not a complete harness - no red team, no multi-dimensional metrics - but it is the foundation everything else builds on. And it is infinitely better than no harness, which is where most production AI systems live.
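The day 2-3 runner at the heart of that plan fits in a dozen lines. A sketch, with the system call and scorer passed in as callables so the same runner serves both exact-match and judge-based scoring (the example structure is illustrative):

```python
def run_golden_set(examples: list[dict], call_system, score_fn):
    """Execute each golden example against the system and score it.

    call_system: input text -> output text (your production call path)
    score_fn: (example, output) -> score in [0, 1]
    Returns (mean_score, per-example results) for the CI gate and
    for drilling into individual regressions."""
    results = []
    for ex in examples:
        output = call_system(ex["input"])
        results.append({"id": ex["id"], "score": score_fn(ex, output)})
    mean = sum(r["score"] for r in results) / len(results)
    return mean, results
```

Keeping per-example results alongside the mean matters: when the CI gate trips, the first question is always which examples regressed, not just by how much.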
Frequently asked
What is an AI evaluation harness? An automated system that runs a curated set of inputs through your AI system, scores the outputs, and catches regressions before they reach customers. It's the single most important piece of production AI infrastructure.
How big should my AI evaluation golden set be? Minimum 100 examples; 500 for anything with a non-trivial long tail. The set should be representative of production traffic, including edge cases and non-English inputs where applicable.
How often should I run AI evaluations? On every PR, after every model-provider update, and weekly against a 1–5% sample of production traffic. Red-team probes nightly in staging and weekly in production.
Can I use an LLM to score LLM outputs? Yes - LLM-as-judge is the standard for open-ended outputs. It's noisier than human review but vastly cheaper, and a well-written rubric makes it reliable enough for production use on most tasks.
If you want a production-grade evaluation harness built alongside your AI system, our governance service includes the full harness - golden set, scorers, red-team catalogue and CI integration - as a standard deliverable.