Calibration & baselines
The hardest part of evals is not computing scores; it is keeping scores comparable over time as prompts, models, and judge behavior drift.
Baselines (champions)
Pick a baseline you compare everything against:
- last production prompt version
- last release tag
- a hand-picked “golden” version
Keep the baseline stable for a defined period (e.g., a release cycle); if the baseline changes every run, deltas stop meaning anything.
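One way to enforce that stability is to make the freeze window explicit. A minimal sketch, with hypothetical names (`Baseline`, `is_frozen` are illustrative, not from any particular eval framework):

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical baseline record: the champion every candidate is compared
# against, plus a date before which it must not be swapped out.
@dataclass(frozen=True)
class Baseline:
    name: str           # e.g. a production prompt name or release tag
    version_id: str     # prompt/model version identifier
    frozen_until: date  # do not replace the champion before this date

def is_frozen(baseline: Baseline, today: date) -> bool:
    """True while the baseline must not be replaced."""
    return today < baseline.frozen_until

champion = Baseline("prod-prompt", "v12", frozen_until=date(2025, 1, 1))
```

A harness can then refuse to rotate the champion while `is_frozen(...)` holds, which keeps week-over-week deltas interpretable.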
Drift sources
- model upgrades (both target model and judge model)
- dataset changes (new labels, new examples)
- prompt rubric changes for judges
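Because any of these sources can move a score, it helps to record all of them per run so a change can be attributed to exactly one cause. A sketch of such a run manifest (function and field names are assumptions, not a standard schema):

```python
import hashlib
import json

def dataset_fingerprint(examples: list[dict]) -> str:
    """Stable short hash of the eval dataset contents."""
    blob = json.dumps(examples, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

def run_manifest(target_model: str, judge_model: str,
                 rubric_version: str, examples: list[dict]) -> dict:
    # Pin everything that can drift: both models, the judge rubric,
    # and the exact dataset contents.
    return {
        "target_model": target_model,
        "judge_model": judge_model,
        "rubric_version": rubric_version,
        "dataset_fingerprint": dataset_fingerprint(examples),
    }
```

If two runs disagree, diffing their manifests narrows the drift source before anyone debugs prompts.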
Calibration set
Maintain a small calibration dataset (20–50 examples) that:
- rarely changes
- includes obvious passes and fails
- covers your top intents
Run it frequently (e.g., on a schedule and on every judge or rubric change) to detect drift in judges and scoring.
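Since the calibration set includes obvious passes and fails, drift detection can be as simple as measuring how often the judge disagrees with those anchor verdicts. A minimal sketch (the `tolerance` threshold and verdict strings are assumptions):

```python
def calibration_drift(expected: list[str], judged: list[str],
                      tolerance: float = 0.1) -> tuple[float, bool]:
    """Compare judge verdicts against the calibration set's known answers.

    Returns (disagreement_rate, drifted) where drifted is True once the
    rate exceeds the tolerance threshold.
    """
    if len(expected) != len(judged):
        raise ValueError("expected and judged must align one-to-one")
    disagreements = sum(e != j for e, j in zip(expected, judged))
    rate = disagreements / len(expected)
    return rate, rate > tolerance
```

A run that trips the threshold should block downstream comparisons until the judge or rubric change is understood.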
Pairwise comparisons
When possible, score A vs B on the same input:
- sidesteps absolute-scale calibration issues in the judge
- produces more stable “which is better” signals
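Aggregating those head-to-head verdicts reduces to a win rate over decided comparisons. A minimal sketch, assuming the judge emits "A", "B", or "tie" per input:

```python
from collections import Counter

def win_rate(judgments: list[str]) -> dict:
    """Summarize per-input A-vs-B verdicts into a win-rate signal."""
    counts = Counter(judgments)
    decided = counts["A"] + counts["B"]
    return {
        "A_wins": counts["A"],
        "B_wins": counts["B"],
        "ties": counts["tie"],
        # Win rate of A among decided comparisons; 0.5 means no signal.
        "A_win_rate": counts["A"] / decided if decided else 0.5,
    }
```

Reporting ties separately matters: a run that is mostly ties has little signal even if the decided comparisons lean one way.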
Reporting
Recommended reporting for every run:
- aggregate metrics (pass rate, mean score)
- deltas vs baseline
- metrics sliced by metadata (e.g., intent, language, difficulty)
- top regressions list with links to traces/artifacts
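The four report items above can be assembled from per-example result rows. A sketch under assumed row fields (`id`, `score`, `passed`, `meta`) — not a prescribed schema:

```python
from collections import defaultdict
from statistics import mean

def report(rows: list[dict], baseline_pass_rate: float, top_n: int = 3) -> dict:
    """Build the recommended per-run summary from per-example rows."""
    pass_rate = sum(r["passed"] for r in rows) / len(rows)
    # Slice pass rate by a metadata key (intent, here).
    slices = defaultdict(list)
    for r in rows:
        slices[r["meta"]["intent"]].append(r["passed"])
    # Worst failures first: the top-regressions list to link to traces.
    regressions = sorted((r for r in rows if not r["passed"]),
                         key=lambda r: r["score"])[:top_n]
    return {
        "pass_rate": pass_rate,
        "mean_score": mean(r["score"] for r in rows),
        "delta_vs_baseline": pass_rate - baseline_pass_rate,
        "slice_pass_rate": {k: sum(v) / len(v) for k, v in slices.items()},
        "top_regressions": [r["id"] for r in regressions],
    }
```

In practice each entry in `top_regressions` would carry a link to the stored trace or artifact for that example.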