LLM-as-judge
Using a language model to score the outputs of another language model against a rubric - a cheaper, noisier substitute for human review in AI evaluation.
LLM-as-judge is the practice of using a separate language model - usually a reasoning-tier model with a carefully written rubric - to score the outputs of your production system. It sits between exact-match scoring (too narrow for open-ended tasks) and human review (accurate but expensive and slow).
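In practice this comes down to building a rubric prompt and parsing a structured verdict out of the judge's reply. A minimal sketch, assuming a hypothetical `call_judge` stand-in for your provider's SDK (the stub below returns a canned verdict so the parsing logic is runnable):

```python
import json

def call_judge(prompt: str) -> str:
    # Hypothetical stand-in for a real judge-model call; returns a
    # canned JSON verdict so the surrounding code runs without a network.
    return '{"score": 4, "rationale": "Covers the key points but misses one edge case."}'

RUBRIC = """Score the ANSWER from 1-5 against these criteria:
- Factually consistent with the CONTEXT.
- Directly addresses the QUESTION.
Return JSON: {"score": <1-5>, "rationale": "<one sentence>"}."""

def judge(question: str, context: str, answer: str) -> dict:
    # Put the rubric first so the judge reads the criteria before the output.
    prompt = f"{RUBRIC}\n\nQUESTION: {question}\nCONTEXT: {context}\nANSWER: {answer}"
    return json.loads(call_judge(prompt))

verdict = judge("What is the refund window?", "Refunds within 30 days.", "You have 30 days.")
print(verdict["score"])  # → 4
```

Asking for JSON rather than free text keeps the verdict machine-checkable; a score the harness cannot parse is treated the same as a failed check.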
When to use it
- Open-ended outputs where there isn't one right answer, only better and worse answers.
- Style, tone and format checks where structured evaluation is hard.
- Multi-dimensional scoring where a single output has to pass several rubric criteria.
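For the multi-dimensional case, it usually works better to have the judge return one verdict per criterion and aggregate with a hard AND, rather than averaging into a single score. A sketch with hard-coded verdicts standing in for real judge output (criterion names are illustrative):

```python
CRITERIA = ["grounded", "on_topic", "correct_format"]

def passes(verdicts: dict[str, bool]) -> bool:
    # An output passes only if every rubric criterion passes;
    # an averaged score can hide one hard failure behind two soft passes.
    return all(verdicts[c] for c in CRITERIA)

print(passes({"grounded": True, "on_topic": True, "correct_format": False}))  # → False
print(passes({"grounded": True, "on_topic": True, "correct_format": True}))   # → True
```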
When it fails
- Without a good rubric. LLM judges are reliable only when the rubric is specific, grounded in examples, and tested against human labels.
- On safety-critical outputs. Safety and policy violations should be caught by the red-team harness and policy-as-code, not by a judge model.
- Without calibration. Run a sample of judged outputs past human reviewers quarterly to confirm the judge is tracking ground truth.
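The calibration check itself is cheap to automate: collect paired judge/human labels on the sample and compute a chance-corrected agreement statistic such as Cohen's kappa (raw agreement alone is misleading when one label dominates). A self-contained sketch with toy labels:

```python
from collections import Counter

def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement between judge and human labels."""
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    jc, hc = Counter(judge), Counter(human)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(jc[l] * hc[l] for l in set(judge) | set(human)) / (n * n)
    return (observed - expected) / (1 - expected)

judge_labels = ["pass", "pass", "fail", "pass", "fail", "pass"]
human_labels = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(judge_labels, human_labels), 2))  # → 0.67
```

A kappa drifting toward zero on the quarterly sample is the signal that the rubric or judge model needs revisiting.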
Where it fits in an evaluation harness
Typical pattern: exact-match scoring for the half of your golden set where ground truth is clean, LLM-as-judge for the open-ended rest, and human review for a 5% sample plus anything the judge scored as low-confidence. Full context: Building evaluation harnesses.
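The routing in that pattern can be sketched in a few lines. The `has_ground_truth` and `judge_confidence` fields are assumptions about your dataset and judge output, not a fixed schema:

```python
import random

def route(item: dict, rng: random.Random, sample_rate: float = 0.05) -> str:
    if item["has_ground_truth"]:
        return "exact_match"                    # clean label: deterministic check
    if item.get("judge_confidence", 1.0) < 0.5:
        return "human_review"                   # judge was unsure: escalate
    if rng.random() < sample_rate:
        return "human_review"                   # 5% calibration sample
    return "llm_judge"

rng = random.Random(0)
items = [
    {"has_ground_truth": True},
    {"has_ground_truth": False, "judge_confidence": 0.3},
    {"has_ground_truth": False, "judge_confidence": 0.9},
]
print([route(i, rng) for i in items])
```

Escalating low-confidence verdicts before the random sample matters: the sample keeps the judge honest on average, while the escalation catches the individual outputs it was least sure about.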