
Judges & scoring

There is no single “best” evaluation method. Strong evaluation stacks combine:

  • Programmatic checks (fast, deterministic)
  • Human review (nuanced, expensive)
  • LLM judges (scalable, imperfect)

Programmatic checks (fast, deterministic)

Use them as hard gates:

  • JSON schema validation
  • required fields and types
  • forbidden phrases / required citations
  • tool-call constraints (must call tool X, must not call tool Y)

These checks are cheap and great for CI.
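A hard gate like this can be a single function that returns every failure reason, so CI can log all of them at once. The sketch below uses only the standard library; the `answer`/`citations` field names and the forbidden-phrase list are illustrative assumptions, not a required schema.

```python
import json

# Hypothetical policy list; adjust per product.
FORBIDDEN_PHRASES = ["as an AI language model"]

def hard_gate(raw_output: str) -> list[str]:
    """Return a list of failure reasons; an empty list means the output passes."""
    # JSON validity
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["invalid JSON"]
    failures = []
    # Required fields and types (illustrative schema)
    if not isinstance(data.get("answer"), str):
        failures.append("missing or non-string 'answer' field")
    citations = data.get("citations")
    if not isinstance(citations, list):
        failures.append("missing or non-list 'citations' field")
    elif not citations:
        failures.append("no citations provided")
    # Forbidden phrases
    answer = data.get("answer", "")
    for phrase in FORBIDDEN_PHRASES:
        if isinstance(answer, str) and phrase.lower() in answer.lower():
            failures.append(f"forbidden phrase: {phrase!r}")
    return failures
```

In CI, fail the build whenever `hard_gate` returns a non-empty list for any example.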

Human review (high signal)

Use human review for:

  • factual correctness in complex domains
  • brand voice and style
  • safety/policy compliance

Guidelines:

  • keep rubrics short and consistent
  • sample a fixed number of examples each run
  • track disagreement and calibrate reviewers periodically
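One concrete way to track disagreement is Cohen's kappa, which measures agreement between two reviewers corrected for chance. A minimal pure-Python sketch, assuming both reviewers label the same fixed sample:

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Agreement between two reviewers, corrected for chance agreement."""
    assert len(rater_a) == len(rater_b) and rater_a, "need paired, non-empty labels"
    n = len(rater_a)
    # Observed agreement: fraction of examples where both reviewers agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each reviewer labeled independently at their own base rates
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

Kappa near 1.0 means reviewers are consistent; a drop over time is a signal to re-calibrate on the rubric.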

LLM-as-judge (scalable)

LLM judges are best for relative comparisons when paired with clear rubrics.

Judge prompts should:

  • define a strict rubric
  • request structured JSON output
  • forbid extra text outside JSON
  • evaluate A vs B when possible (pairwise comparison avoids calibrating absolute scores)
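The requirements above can be sketched as a prompt template plus a strict parser that rejects any reply with text outside the JSON object. The rubric wording and the `winner`/`reason` schema are illustrative, and `{question}`, `{answer_a}`, `{answer_b}` are placeholders you fill per example:

```python
import json

JUDGE_PROMPT = """You are an impartial judge. Compare the two answers below.

Rubric:
- correctness: does the answer address the question accurately?
- grounding: are its claims supported by the provided context?

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Respond with ONLY a JSON object, no other text:
{{"winner": "A" | "B" | "tie", "reason": "<one sentence>"}}
"""

def parse_verdict(raw: str) -> dict:
    """Reject any judge reply that is not a bare JSON object with the expected keys."""
    verdict = json.loads(raw)  # raises if there is extra text outside the JSON
    if verdict.get("winner") not in {"A", "B", "tie"}:
        raise ValueError(f"unexpected winner: {verdict.get('winner')!r}")
    if not isinstance(verdict.get("reason"), str):
        raise ValueError("missing 'reason' string")
    return verdict
```

Failing loudly on malformed verdicts is deliberate: silently skipping them can hide a judge that has drifted off-format.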

Common judge failure modes

  • position bias: systematically prefers whichever answer appears first (or second)
  • verbosity bias: favors longer answers regardless of quality
  • overconfidence: masks uncertainty behind confident, precise-looking scores

Mitigations:

  • randomize answer order
  • enforce length constraints
  • include “must cite sources” rules when relevant
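Randomizing order can go one step further: query the judge in both orders and only accept a verdict that survives the swap. A sketch, assuming a `judge(a, b)` callable that returns `"A"`, `"B"`, or `"tie"` for the answers in the order given:

```python
def debiased_compare(judge, answer_a: str, answer_b: str) -> str:
    """Query the judge in both orders; downgrade to a tie when the verdicts disagree."""
    first = judge(answer_a, answer_b)            # original order
    second = judge(answer_b, answer_a)           # swapped order
    # Map the swapped verdict back to the original labels
    flipped = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == flipped else "tie"
```

A judge with pure position bias (always picks the first slot) collapses to "tie" under this scheme, while a judge with a genuine, order-independent preference keeps its verdict.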

Score types you should track

  • quality: correctness/helpfulness
  • safety: refusal and policy compliance
  • tooling: correct tool use and arguments
  • format: JSON validity, schema compliance
  • cost/latency: token usage and time budgets
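One way to keep these scores together per example is a single record type plus a small aggregator, so every run reports the same summary. A sketch; the field names mirror the list above but are otherwise arbitrary:

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    """Scores for one evaluated example (illustrative field names)."""
    quality: float     # correctness/helpfulness, e.g. 0.0-1.0
    safety: bool       # passed refusal/policy checks
    tooling: bool      # called the right tools with valid arguments
    format_ok: bool    # JSON valid and schema-compliant
    tokens: int        # cost proxy
    latency_s: float   # wall-clock time

def summarize(records: list[EvalRecord]) -> dict:
    """Aggregate per-example records into run-level metrics."""
    n = len(records)
    return {
        "mean_quality": sum(r.quality for r in records) / n,
        "safety_rate": sum(r.safety for r in records) / n,
        "tooling_rate": sum(r.tooling for r in records) / n,
        "format_rate": sum(r.format_ok for r in records) / n,
        "total_tokens": sum(r.tokens for r in records),
        "p50_latency_s": sorted(r.latency_s for r in records)[n // 2],
    }
```

Tracking cost and latency alongside quality in the same record keeps regressions in either from slipping through separately.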

Next steps