LLM-as-judge
Using a language model to score the outputs of another language model against a rubric - a cheaper, noisier substitute for human review in AI evaluation.
LLM-as-judge is the practice of using a separate language model - usually a reasoning-tier model with a carefully written rubric - to score the outputs of your production system. It sits between exact-match scoring (too narrow for open-ended tasks) and human review (accurate but expensive and slow).
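In practice this comes down to building a rubric prompt and parsing a structured verdict out of the judge's reply. A minimal sketch, assuming a hypothetical `call_judge` stand-in for your provider's SDK (the stub below returns a canned verdict so the parsing logic is runnable):

```python
import json

def call_judge(prompt: str) -> str:
    # Hypothetical stand-in for a real judge-model call; returns a
    # canned JSON verdict so the surrounding code runs without a network.
    return '{"score": 4, "rationale": "Covers the key points but misses one edge case."}'

RUBRIC = """Score the ANSWER from 1-5 against these criteria:
- Factually consistent with the CONTEXT.
- Directly addresses the QUESTION.
Return JSON: {"score": <1-5>, "rationale": "<one sentence>"}."""

def judge(question: str, context: str, answer: str) -> dict:
    # Put the rubric first so the judge reads the criteria before the output.
    prompt = f"{RUBRIC}\n\nQUESTION: {question}\nCONTEXT: {context}\nANSWER: {answer}"
    return json.loads(call_judge(prompt))

verdict = judge("What is the refund window?", "Refunds within 30 days.", "You have 30 days.")
print(verdict["score"])  # → 4
```

Asking for JSON rather than free text keeps the verdict machine-checkable; a score the harness cannot parse is treated the same as a failed check.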
When to use it
- Open-ended outputs where there isn't one right answer, only better and worse answers.
- Style, tone and format checks where structured evaluation is hard.
- Multi-dimensional scoring where a single output has to pass several rubric criteria.
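For the multi-dimensional case, it usually works better to have the judge return one verdict per criterion and aggregate with a hard AND, rather than averaging into a single score. A sketch with hard-coded verdicts standing in for real judge output (criterion names are illustrative):

```python
CRITERIA = ["grounded", "on_topic", "correct_format"]

def passes(verdicts: dict[str, bool]) -> bool:
    # An output passes only if every rubric criterion passes;
    # an averaged score can hide one hard failure behind two soft passes.
    return all(verdicts[c] for c in CRITERIA)

print(passes({"grounded": True, "on_topic": True, "correct_format": False}))  # → False
print(passes({"grounded": True, "on_topic": True, "correct_format": True}))   # → True
```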
When it fails
- Without a good rubric. LLM judges are reliable only when the rubric is specific, grounded in examples, and tested against human labels.
- On safety-critical outputs. Safety and policy violations should be caught by the red-team harness and policy-as-code, not by a judge model.
- Without calibration. Run a sample of judged outputs past human reviewers quarterly to confirm the judge is tracking ground truth.
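The calibration check itself is cheap to automate: collect paired judge/human labels on the sample and compute a chance-corrected agreement statistic such as Cohen's kappa (raw agreement alone is misleading when one label dominates). A self-contained sketch with toy labels:

```python
from collections import Counter

def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement between judge and human labels."""
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    jc, hc = Counter(judge), Counter(human)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(jc[l] * hc[l] for l in set(judge) | set(human)) / (n * n)
    return (observed - expected) / (1 - expected)

judge_labels = ["pass", "pass", "fail", "pass", "fail", "pass"]
human_labels = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(judge_labels, human_labels), 2))  # → 0.67
```

A kappa drifting toward zero on the quarterly sample is the signal that the rubric or judge model needs revisiting.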
Where it fits in an evaluation harness
Typical pattern: exact-match scoring for the half of your golden set where ground truth is clean, LLM-as-judge for the open-ended rest, and human review for a 5% sample plus anything the judge scored as low-confidence. Full context: Building evaluation harnesses.
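The routing in that pattern can be sketched in a few lines. The `has_ground_truth` and `judge_confidence` fields are assumptions about your dataset and judge output, not a fixed schema:

```python
import random

def route(item: dict, rng: random.Random, sample_rate: float = 0.05) -> str:
    if item["has_ground_truth"]:
        return "exact_match"                    # clean label: deterministic check
    if item.get("judge_confidence", 1.0) < 0.5:
        return "human_review"                   # judge was unsure: escalate
    if rng.random() < sample_rate:
        return "human_review"                   # 5% calibration sample
    return "llm_judge"

rng = random.Random(0)
items = [
    {"has_ground_truth": True},
    {"has_ground_truth": False, "judge_confidence": 0.3},
    {"has_ground_truth": False, "judge_confidence": 0.9},
]
print([route(i, rng) for i in items])
```

Escalating low-confidence verdicts before the random sample matters: the sample keeps the judge honest on average, while the escalation catches the individual outputs it was least sure about.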