Calibration & baselines
The hardest part of evals is not computing scores; it is keeping scores comparable over time as prompts, models, and judge behavior drift.
Baselines (champions)
Pick a baseline you compare everything against:
- last production prompt version
- last release tag
- a hand-picked “golden” version
Keep the baseline stable for a defined period (e.g., a release cycle); if the baseline changes every run, deltas stop meaning anything.
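One way to enforce that stability is to make the freeze window explicit. A minimal sketch, with hypothetical names (`Baseline`, `is_frozen` are illustrative, not from any particular eval framework):

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical baseline record: the champion every candidate is compared
# against, plus a date before which it must not be swapped out.
@dataclass(frozen=True)
class Baseline:
    name: str           # e.g. a production prompt name or release tag
    version_id: str     # prompt/model version identifier
    frozen_until: date  # do not replace the champion before this date

def is_frozen(baseline: Baseline, today: date) -> bool:
    """True while the baseline must not be replaced."""
    return today < baseline.frozen_until

champion = Baseline("prod-prompt", "v12", frozen_until=date(2025, 1, 1))
```

A harness can then refuse to rotate the champion while `is_frozen(...)` holds, which keeps week-over-week deltas interpretable.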
Drift sources
- model upgrades (both target model and judge model)
- dataset changes (new labels, new examples)
- prompt rubric changes for judges
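Because any of these sources can move a score, it helps to record all of them per run so a change can be attributed to exactly one cause. A sketch of such a run manifest (function and field names are assumptions, not a standard schema):

```python
import hashlib
import json

def dataset_fingerprint(examples: list[dict]) -> str:
    """Stable short hash of the eval dataset contents."""
    blob = json.dumps(examples, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

def run_manifest(target_model: str, judge_model: str,
                 rubric_version: str, examples: list[dict]) -> dict:
    # Pin everything that can drift: both models, the judge rubric,
    # and the exact dataset contents.
    return {
        "target_model": target_model,
        "judge_model": judge_model,
        "rubric_version": rubric_version,
        "dataset_fingerprint": dataset_fingerprint(examples),
    }
```

If two runs disagree, diffing their manifests narrows the drift source before anyone debugs prompts.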
Calibration set
Maintain a small calibration dataset (20–50 examples) that:
- rarely changes
- includes obvious passes and fails
- covers your top intents
Run it frequently (e.g., on a schedule and on every judge or rubric change) to detect drift in judges and scoring.
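Since the calibration set includes obvious passes and fails, drift detection can be as simple as measuring how often the judge disagrees with those anchor verdicts. A minimal sketch (the `tolerance` threshold and verdict strings are assumptions):

```python
def calibration_drift(expected: list[str], judged: list[str],
                      tolerance: float = 0.1) -> tuple[float, bool]:
    """Compare judge verdicts against the calibration set's known answers.

    Returns (disagreement_rate, drifted) where drifted is True once the
    rate exceeds the tolerance threshold.
    """
    if len(expected) != len(judged):
        raise ValueError("expected and judged must align one-to-one")
    disagreements = sum(e != j for e, j in zip(expected, judged))
    rate = disagreements / len(expected)
    return rate, rate > tolerance
```

A run that trips the threshold should block downstream comparisons until the judge or rubric change is understood.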
Pairwise comparisons
When possible, score A vs B on the same input:
- sidesteps absolute-scale calibration issues in the judge
- produces more stable “which is better” signals
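Aggregating those head-to-head verdicts reduces to a win rate over decided comparisons. A minimal sketch, assuming the judge emits "A", "B", or "tie" per input:

```python
from collections import Counter

def win_rate(judgments: list[str]) -> dict:
    """Summarize per-input A-vs-B verdicts into a win-rate signal."""
    counts = Counter(judgments)
    decided = counts["A"] + counts["B"]
    return {
        "A_wins": counts["A"],
        "B_wins": counts["B"],
        "ties": counts["tie"],
        # Win rate of A among decided comparisons; 0.5 means no signal.
        "A_win_rate": counts["A"] / decided if decided else 0.5,
    }
```

Reporting ties separately matters: a run that is mostly ties has little signal even if the decided comparisons lean one way.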
Reporting
Recommended reporting for every run:
- aggregate metrics (pass rate, mean score)
- deltas vs baseline
- metrics sliced by metadata (e.g., intent, language, difficulty)
- top regressions list with links to traces/artifacts
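The four report items above can be assembled from per-example result rows. A sketch under assumed row fields (`id`, `score`, `passed`, `meta`) — not a prescribed schema:

```python
from collections import defaultdict
from statistics import mean

def report(rows: list[dict], baseline_pass_rate: float, top_n: int = 3) -> dict:
    """Build the recommended per-run summary from per-example rows."""
    pass_rate = sum(r["passed"] for r in rows) / len(rows)
    # Slice pass rate by a metadata key (intent, here).
    slices = defaultdict(list)
    for r in rows:
        slices[r["meta"]["intent"]].append(r["passed"])
    # Worst failures first: the top-regressions list to link to traces.
    regressions = sorted((r for r in rows if not r["passed"]),
                         key=lambda r: r["score"])[:top_n]
    return {
        "pass_rate": pass_rate,
        "mean_score": mean(r["score"] for r in rows),
        "delta_vs_baseline": pass_rate - baseline_pass_rate,
        "slice_pass_rate": {k: sum(v) / len(v) for k, v in slices.items()},
        "top_regressions": [r["id"] for r in regressions],
    }
```

In practice each entry in `top_regressions` would carry a link to the stored trace or artifact for that example.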