Judges & scoring
There is no single “best” evaluation method. Strong evaluation stacks combine:
- Programmatic checks (fast, deterministic)
- Human review (nuanced, expensive)
- LLM judges (scalable, imperfect)
Programmatic checks (always recommended)
Use them as hard gates:
- JSON schema validation
- required fields and types
- forbidden phrases / required citations
- tool-call constraints (must call tool X, must not call tool Y)
These checks are cheap to run and make excellent CI gates.
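The gates above can be sketched as a single pass/fail function. This is a minimal illustration, not a reference implementation: the field names (`answer`, `citations`) and the forbidden-phrase list are hypothetical placeholders for whatever your schema requires.

```python
import json

# Hypothetical schema and banned phrases; substitute your own.
REQUIRED_FIELDS = {"answer": str, "citations": list}
FORBIDDEN_PHRASES = ("as an AI language model",)

def passes_gates(raw: str) -> tuple[bool, list[str]]:
    """Return (ok, failure reasons) for one raw model output."""
    failures = []
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False, ["invalid JSON"]
    # Required fields with the right types.
    for field, typ in REQUIRED_FIELDS.items():
        if not isinstance(obj.get(field), typ):
            failures.append(f"missing or mistyped field: {field}")
    # Forbidden phrases, case-insensitive.
    text = obj.get("answer") if isinstance(obj.get("answer"), str) else ""
    for phrase in FORBIDDEN_PHRASES:
        if phrase.lower() in text.lower():
            failures.append(f"forbidden phrase: {phrase!r}")
    return (not failures), failures
```

Because every gate is deterministic, the same output always produces the same verdict, which is what makes these checks safe to wire into CI.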
Human review (high signal)
Use human review for:
- factual correctness in complex domains
- brand voice and style
- safety/policy compliance
Guidelines:
- keep rubrics short and consistent
- sample a fixed number of examples each run
- track disagreement and calibrate reviewers periodically
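Two of these guidelines, fixed-size sampling and disagreement tracking, are mechanical enough to script. A sketch, assuming reviewer scores are comparable labels and using raw percent agreement as the simplest disagreement metric (a chance-corrected statistic such as Cohen's kappa is a common upgrade):

```python
import random

def sample_for_review(examples, k, seed=0):
    """Draw a fixed-size, reproducible sample for human review."""
    rng = random.Random(seed)  # fixed seed keeps runs comparable
    return rng.sample(examples, min(k, len(examples)))

def agreement_rate(scores_a, scores_b):
    """Fraction of items two reviewers labeled identically.

    A falling rate over time is a signal to recalibrate reviewers
    against the rubric.
    """
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)
```

Keeping the sample size and seed fixed means week-over-week review numbers move because the model changed, not because the sample did.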
LLM-as-judge (scalable)
LLM judges are best for relative comparisons when paired with clear rubrics.
Judge prompts should:
- define a strict rubric
- request structured JSON output
- forbid extra text outside JSON
- evaluate A vs B when possible (pairwise reduces calibration pain)
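Those four requirements can be combined into one prompt template plus a strict parser that rejects anything outside the requested JSON. The rubric wording and output fields below are illustrative, not a standard:

```python
import json

# Illustrative pairwise rubric; adapt the criteria to your task.
JUDGE_TEMPLATE = """You are a strict evaluator. Rubric:
- correctness (does the answer match the facts of the question?)
- helpfulness (does it actually resolve the question?)
Compare answer A and answer B.
Respond with ONLY a JSON object, no other text:
{{"winner": "A" or "B", "reason": "<one sentence>"}}

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}"""

def build_judge_prompt(question, answer_a, answer_b):
    return JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )

def parse_verdict(raw: str) -> dict:
    """Accept only the exact schema the prompt asked for."""
    obj = json.loads(raw)  # raises if the judge added text outside JSON
    if set(obj) != {"winner", "reason"} or obj["winner"] not in ("A", "B"):
        raise ValueError("judge output violated the schema")
    return obj
```

Failing loudly in `parse_verdict` matters: a judge that drifts into free-form prose should show up as a parse error in your metrics, not as a silently dropped score.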
Common judge failure modes
- position bias: systematically prefers the first (or second) answer, regardless of content
- verbosity bias: favors longer answers even when the extra length adds nothing
- overconfidence: masks genuine uncertainty behind confidently stated scores
Mitigations:
- randomize answer order (counters position bias)
- enforce length constraints (counters verbosity bias)
- include “must cite sources” rules when relevant
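A stronger variant of order randomization is to run the judge on both orderings and only count a verdict that survives the swap. A sketch, assuming a `judge(question, a, b)` callable that returns `"A"` or `"B"`:

```python
def debiased_winner(judge, question, ans1, ans2):
    """Query the judge with both orderings; a real preference must
    survive the position swap, otherwise score the pair as a tie."""
    first = judge(question, ans1, ans2)   # ans1 shown as "A"
    second = judge(question, ans2, ans1)  # ans1 shown as "B"
    if first == "A" and second == "B":
        return "ans1"
    if first == "B" and second == "A":
        return "ans2"
    return "tie"  # verdict flipped with position: no usable signal
```

This doubles judge cost per pair, but it converts position bias from a hidden skew into an explicit, countable tie rate.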
Score types you should track
- quality: correctness/helpfulness
- safety: refusal and policy compliance
- tooling: correct tool use and arguments
- format: JSON validity, schema compliance
- cost/latency: token usage and time budgets
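One way to keep these five axes from drifting apart is to record them as one row per evaluated example and aggregate per run. The field names and 0–1 scales below are illustrative choices, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class EvalScores:
    """One row per evaluated example; field names are illustrative."""
    quality: float     # correctness/helpfulness, e.g. 0.0-1.0
    safety: bool       # passed refusal/policy checks
    tooling: bool      # correct tool choice and arguments
    format_ok: bool    # JSON validity / schema compliance
    tokens: int        # cost proxy
    latency_ms: float  # wall-clock time budget

def summarize(rows):
    """Per-run averages and pass rates across all tracked axes."""
    n = len(rows)
    return {
        "quality": sum(r.quality for r in rows) / n,
        "safety_pass": sum(r.safety for r in rows) / n,
        "tooling_pass": sum(r.tooling for r in rows) / n,
        "format_pass": sum(r.format_ok for r in rows) / n,
        "avg_tokens": sum(r.tokens for r in rows) / n,
        "avg_latency_ms": sum(r.latency_ms for r in rows) / n,
    }
```

Storing raw rows rather than only the aggregates lets you re-slice later, e.g. quality conditioned on format failures.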