Evaluation · Also: Model-graded evaluation / Rubric-based scoring

LLM-as-judge

Using a language model to score the outputs of another language model against a rubric - a cheaper, noisier substitute for human review in AI evaluation.

LLM-as-judge is the practice of using a separate language model - usually a reasoning-tier model with a carefully written rubric - to score the outputs of your production system. It sits between exact-match scoring (too narrow for open-ended tasks) and human review (accurate but expensive and slow).
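
A minimal sketch of the judge call, assuming a hypothetical `call_model(prompt) -> str` wrapper around whatever client you use; the rubric criteria and JSON shape here are illustrative, not prescriptive:

```python
import json
import re

# Illustrative rubric; real rubrics should be grounded in examples and tested
# against human labels, as noted below.
RUBRIC = """Score the answer from 1 to 5 on each criterion:
- accuracy: claims are correct and grounded in the provided context
- tone: matches the requested style
- format: follows the requested output structure
Return JSON only: {"accuracy": n, "tone": n, "format": n, "rationale": "..."}"""

def judge(question, answer, call_model):
    """Ask the judge model to score one output against the rubric.
    `call_model(prompt) -> str` is a stand-in for whatever client you use."""
    prompt = f"{RUBRIC}\n\nQuestion:\n{question}\n\nAnswer to score:\n{answer}"
    raw = call_model(prompt)
    # Judges sometimes wrap the JSON in prose; pull out the first object.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    return json.loads(match.group(0)) if match else {"error": "unparseable", "raw": raw}
```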

When to use it

  • Open-ended outputs where there isn't one right answer, only better and worse answers.
  • Style, tone and format checks where structured evaluation is hard.
  • Multi-dimensional scoring where a single output has to pass several rubric criteria.
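
For the multi-dimensional case, one common aggregation is a per-criterion threshold that every dimension must clear. The criteria names and bars below are placeholders, not a recommendation:

```python
# Illustrative per-criterion bars on a 1-5 scale; an output passes only if it clears every one.
THRESHOLDS = {"accuracy": 4, "tone": 3, "format": 4}

def passes(scores, thresholds=THRESHOLDS):
    """True only when every rubric criterion meets its threshold."""
    return all(scores.get(name, 0) >= bar for name, bar in thresholds.items())
```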

When it fails

  • Without a good rubric. LLM judges are reliable only when the rubric is specific, grounded in examples, and tested against human labels.
  • On safety-critical outputs. Safety and policy violations should be caught by the red-team harness and policy-as-code, not by a judge model.
  • Without calibration. Run a sample of judged outputs past human reviewers quarterly to confirm the judge is tracking ground truth.
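
A sketch of that quarterly calibration check, assuming the judge and the human reviewers both emit labels (e.g. pass/fail) on the same sample; Cohen's kappa corrects for the agreement you would get by chance:

```python
from collections import Counter

def cohen_kappa(pairs):
    """Chance-corrected agreement over (judge_label, human_label) pairs,
    e.g. [("pass", "pass"), ("fail", "pass"), ...] from the calibration sample."""
    if not pairs:
        return 0.0
    n = len(pairs)
    observed = sum(1 for judge, human in pairs if judge == human) / n
    judge_marginals = Counter(judge for judge, _ in pairs)
    human_marginals = Counter(human for _, human in pairs)
    labels = set(judge_marginals) | set(human_marginals)
    expected = sum(judge_marginals[l] * human_marginals[l] for l in labels) / (n * n)
    # A common rule of thumb treats kappa below ~0.6 as weak agreement:
    # a sign the rubric needs rework, not that you need more judging runs.
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)
```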

Where it fits in an evaluation harness

Typical pattern: exact-match scoring for the half of your golden set where ground truth is clean, LLM-as-judge for the open-ended rest, and human review for a 5% sample plus anything the judge scored as low-confidence. Full context: Building evaluation harnesses.
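
A sketch of that routing step, under assumed field names (`expected`, `output`, and a judge-reported `confidence`) that your harness may spell differently:

```python
import random

def route(item, judge_fn, exact_match_fn, human_sample_rate=0.05):
    """Pick a scoring path for one golden-set item, following the pattern above.
    `item` is assumed to carry `output` and, when ground truth is clean, `expected`."""
    if item.get("expected") is not None:
        # Clean ground truth: cheap, deterministic exact-match scoring.
        return {"method": "exact_match",
                "pass": exact_match_fn(item["output"], item["expected"])}
    # Open-ended item: hand it to the judge model.
    scores = judge_fn(item)
    result = {"method": "llm_judge", "scores": scores}
    # Escalate low-confidence judgments plus a flat sample of judged outputs to humans.
    if scores.get("confidence", 1.0) < 0.5 or random.random() < human_sample_rate:
        result["needs_human_review"] = True
    return result
```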
