Judges & scoring
There is no single “best” evaluation method. Strong evaluation stacks combine:
- Programmatic checks (fast, deterministic)
- Human review (nuanced, expensive)
- LLM judges (scalable, imperfect)
Programmatic checks (always recommended)
Use them as hard gates:
- JSON schema validation
- required fields and types
- forbidden phrases / required citations
- tool-call constraints (must call tool X, must not call tool Y)
These checks are cheap to run and make excellent CI gates.
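The gates above can be sketched as a single pass/fail function. This is a minimal illustration, not a reference implementation: the field names (`answer`, `citations`) and the forbidden-phrase list are hypothetical placeholders for whatever your schema requires.

```python
import json

# Hypothetical schema and banned phrases; substitute your own.
REQUIRED_FIELDS = {"answer": str, "citations": list}
FORBIDDEN_PHRASES = ("as an AI language model",)

def passes_gates(raw: str) -> tuple[bool, list[str]]:
    """Return (ok, failure reasons) for one raw model output."""
    failures = []
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False, ["invalid JSON"]
    # Required fields with the right types.
    for field, typ in REQUIRED_FIELDS.items():
        if not isinstance(obj.get(field), typ):
            failures.append(f"missing or mistyped field: {field}")
    # Forbidden phrases, case-insensitive.
    text = obj.get("answer") if isinstance(obj.get("answer"), str) else ""
    for phrase in FORBIDDEN_PHRASES:
        if phrase.lower() in text.lower():
            failures.append(f"forbidden phrase: {phrase!r}")
    return (not failures), failures
```

Because every gate is deterministic, the same output always produces the same verdict, which is what makes these checks safe to wire into CI.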
Human review (high signal)
Use human review for:
- factual correctness in complex domains
- brand voice and style
- safety/policy compliance
Guidelines:
- keep rubrics short and consistent
- sample a fixed number of examples each run
- track disagreement and calibrate reviewers periodically
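Two of these guidelines, fixed-size sampling and disagreement tracking, are mechanical enough to script. A sketch, assuming reviewer scores are comparable labels and using raw percent agreement as the simplest disagreement metric (a chance-corrected statistic such as Cohen's kappa is a common upgrade):

```python
import random

def sample_for_review(examples, k, seed=0):
    """Draw a fixed-size, reproducible sample for human review."""
    rng = random.Random(seed)  # fixed seed keeps runs comparable
    return rng.sample(examples, min(k, len(examples)))

def agreement_rate(scores_a, scores_b):
    """Fraction of items two reviewers labeled identically.

    A falling rate over time is a signal to recalibrate reviewers
    against the rubric.
    """
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)
```

Keeping the sample size and seed fixed means week-over-week review numbers move because the model changed, not because the sample did.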
LLM-as-judge (scalable)
LLM judges are best for relative comparisons when paired with clear rubrics.
Judge prompts should:
- define a strict rubric
- request structured JSON output
- forbid extra text outside JSON
- evaluate A vs B when possible (pairwise reduces calibration pain)
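Those four requirements can be combined into one prompt template plus a strict parser that rejects anything outside the requested JSON. The rubric wording and output fields below are illustrative, not a standard:

```python
import json

# Illustrative pairwise rubric; adapt the criteria to your task.
JUDGE_TEMPLATE = """You are a strict evaluator. Rubric:
- correctness (does the answer match the facts of the question?)
- helpfulness (does it actually resolve the question?)
Compare answer A and answer B.
Respond with ONLY a JSON object, no other text:
{{"winner": "A" or "B", "reason": "<one sentence>"}}

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}"""

def build_judge_prompt(question, answer_a, answer_b):
    return JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )

def parse_verdict(raw: str) -> dict:
    """Accept only the exact schema the prompt asked for."""
    obj = json.loads(raw)  # raises if the judge added text outside JSON
    if set(obj) != {"winner", "reason"} or obj["winner"] not in ("A", "B"):
        raise ValueError("judge output violated the schema")
    return obj
```

Failing loudly in `parse_verdict` matters: a judge that drifts into free-form prose should show up as a parse error in your metrics, not as a silently dropped score.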
Common judge failure modes
- position bias: systematically prefers the first (or second) answer, regardless of content
- verbosity bias: favors longer answers even when the extra length adds nothing
- overconfidence: masks genuine uncertainty behind confidently stated scores
Mitigations:
- randomize answer order (counters position bias)
- enforce length constraints (counters verbosity bias)
- include “must cite sources” rules when relevant
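A stronger variant of order randomization is to run the judge on both orderings and only count a verdict that survives the swap. A sketch, assuming a `judge(question, a, b)` callable that returns `"A"` or `"B"`:

```python
def debiased_winner(judge, question, ans1, ans2):
    """Query the judge with both orderings; a real preference must
    survive the position swap, otherwise score the pair as a tie."""
    first = judge(question, ans1, ans2)   # ans1 shown as "A"
    second = judge(question, ans2, ans1)  # ans1 shown as "B"
    if first == "A" and second == "B":
        return "ans1"
    if first == "B" and second == "A":
        return "ans2"
    return "tie"  # verdict flipped with position: no usable signal
```

This doubles judge cost per pair, but it converts position bias from a hidden skew into an explicit, countable tie rate.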
Score types you should track
- quality: correctness/helpfulness
- safety: refusal and policy compliance
- tooling: correct tool use and arguments
- format: JSON validity, schema compliance
- cost/latency: token usage and time budgets
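One way to keep these five axes from drifting apart is to record them as one row per evaluated example and aggregate per run. The field names and 0–1 scales below are illustrative choices, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class EvalScores:
    """One row per evaluated example; field names are illustrative."""
    quality: float     # correctness/helpfulness, e.g. 0.0-1.0
    safety: bool       # passed refusal/policy checks
    tooling: bool      # correct tool choice and arguments
    format_ok: bool    # JSON validity / schema compliance
    tokens: int        # cost proxy
    latency_ms: float  # wall-clock time budget

def summarize(rows):
    """Per-run averages and pass rates across all tracked axes."""
    n = len(rows)
    return {
        "quality": sum(r.quality for r in rows) / n,
        "safety_pass": sum(r.safety for r in rows) / n,
        "tooling_pass": sum(r.tooling for r in rows) / n,
        "format_pass": sum(r.format_ok for r in rows) / n,
        "avg_tokens": sum(r.tokens for r in rows) / n,
        "avg_latency_ms": sum(r.latency_ms for r in rows) / n,
    }
```

Storing raw rows rather than only the aggregates lets you re-slice later, e.g. quality conditioned on format failures.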