20 February 2026·2 min readEngineeringEvaluation

Evaluation is not a phase-two upgrade

If a system can't be watched, graded and improved, it shouldn't be in production. The case for writing the harness on week one.

Ajay Dhillon

Founder

When we join an engagement mid-flight, the most common missing piece is not the model choice or the retrieval pipeline. It is evaluation. Not the ten-example notebook. The discipline. The per-workflow scoreboard, the regression suite, the production-grade evaluation harness that runs on every deploy and every model swap.

Without that, every change is a coin flip. With it, the team can move faster, swap models, tighten prompts, argue from numbers rather than intuition. And hand something defensible to the governance team when they ask.

The three layers we insist on

Offline evaluation. A curated suite of inputs with expected outputs or expected properties, run on every commit and every model swap. This is the regression net.

Online evaluation. Automated scoring of a fraction of live traffic, sampled intelligently, with human review in the loop for disputed cases. This is the production heartbeat.

Red-team evaluation. Adversarial prompts, jailbreak attempts, policy-violation probes, re-run every release. This is the governance proof.

The cost objection

Teams sometimes argue they can't afford this level of instrumentation. Our experience is the opposite. The teams that can't afford evaluation are the ones that stall in the second mile and quietly spend quarters debugging vibes.

Evaluation is cheap. Rework is expensive. Losing the trust of the business is catastrophic.

We write the harness on week one. We staff the on-call from week one. We publish the scoreboard from week one. Everything else is downstream of that discipline.

Frequently asked

Why should AI evaluation be built on day one, not later? Because every change to the system - prompt, model version, retrieval source - is a coin flip without it. Teams that defer evaluation spend quarters debugging regressions they can't reproduce. Teams that build it on week one ship with confidence and argue from numbers.

What are the three layers of a production evaluation practice? Offline evaluation (a curated golden set run on every commit), online evaluation (automated scoring of sampled live traffic with human review in the loop), and red-team evaluation (adversarial probes re-run every release). All three together; none of them are optional in a regulated deployment.

Is evaluation expensive to stand up? No. A minimum viable harness takes a week to build and catches regressions worth months of rework. The teams that claim they can't afford evaluation are the ones paying for its absence quietly, in stalled programmes and lost trust.

Who owns the evaluation harness once it's live? The delivery team that built it. Evaluation is operational, not a one-off audit - golden sets drift, rubrics need tuning, new failure modes need probes. The people who built the system are the ones who can keep the harness honest.

Written by

Ajay Dhillon · Founder

Evaluation is not a phase-two upgrade

The three layers we insist on

The cost objection

Frequently asked

Keep reading

Production AI vs pilot: the complete guide to shipping AI that survives

Enterprise AI transformation: what it actually means

Let’sbuildyoursystemnext.