
Calibration & baselines

The hardest part of evals is not computing scores; it is keeping scores comparable over time as prompts, models, and judge behavior drift.

Baselines (champions)

Pick a baseline you compare everything against:

  • last production prompt version
  • last release tag
  • a hand-picked “golden” version

Keep the baseline stable for a defined period; if it changes with every run, comparisons churn and trends become meaningless.
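The baseline comparison above can be sketched as follows. This is a minimal illustration, not a specific framework's API; `EvalRun` and `compare_to_baseline` are hypothetical names.

```python
# Pin a baseline ("champion") run and report every candidate as a delta
# against it, rather than as an absolute number in isolation.
from dataclasses import dataclass

@dataclass
class EvalRun:
    version: str          # e.g. a prompt version or release tag
    results: list[bool]   # per-example pass/fail

    @property
    def pass_rate(self) -> float:
        return sum(self.results) / len(self.results)

def compare_to_baseline(baseline: EvalRun, candidate: EvalRun) -> dict:
    return {
        "baseline": baseline.version,
        "candidate": candidate.version,
        "delta_pass_rate": candidate.pass_rate - baseline.pass_rate,
    }

baseline = EvalRun("prompt-v12", [True, True, False, True])
candidate = EvalRun("prompt-v13", [True, True, True, True])
report = compare_to_baseline(baseline, candidate)
```

Because the baseline is pinned to a named version, the delta stays interpretable across many candidate runs.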

Drift sources

  • model upgrades (both target model and judge model)
  • dataset changes (new labels, new examples)
  • prompt rubric changes for judges

Calibration set

Maintain a small calibration dataset (20–50 examples) that:

  • rarely changes
  • includes obvious passes and fails
  • covers your top intents

Run it frequently to detect drift in judges and scoring.
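One way to run such a check, sketched below under assumed names (`CALIBRATION_SET`, `judge`, and the 95% tolerance are all illustrative): each calibration example carries an expected verdict, and a drop in judge agreement signals drift.

```python
# Drift check against a small, fixed calibration set. Because the
# examples include obvious passes and fails, a well-behaved judge
# should agree with the expected verdicts almost every time.
CALIBRATION_SET = [
    {"input": "2+2?", "output": "4", "expected_pass": True},
    {"input": "2+2?", "output": "5", "expected_pass": False},
]

def judge(example: dict) -> bool:
    # Placeholder for the real LLM-judge call.
    return example["output"] == "4"

def calibration_agreement(examples: list[dict], judge_fn) -> float:
    agree = sum(judge_fn(ex) == ex["expected_pass"] for ex in examples)
    return agree / len(examples)

agreement = calibration_agreement(CALIBRATION_SET, judge)
if agreement < 0.95:  # tolerance is a policy choice, not a standard
    print(f"Judge drift suspected: agreement {agreement:.0%}")
```

Run this on every judge-model or rubric change, and on a schedule, so drift is caught before it contaminates real comparisons.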

Pairwise comparisons

When possible, score A vs B on the same input:

  • sidesteps score-scale calibration issues in the judge
  • produces more stable “which is better” signals
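A common refinement, sketched here with a hypothetical `judge_pair` standing in for the real LLM-judge call, is to judge each pair twice with the answer order swapped, which guards against position bias:

```python
# Pairwise A/B verdict with an order swap. Only a verdict that survives
# the swap counts as a win; an order-dependent verdict becomes a tie.
def judge_pair(inp: str, first: str, second: str) -> str:
    # Placeholder judge: prefers the longer answer.
    return "first" if len(first) >= len(second) else "second"

def pairwise_verdict(inp: str, a: str, b: str) -> str:
    v1 = judge_pair(inp, a, b)  # A shown first
    v2 = judge_pair(inp, b, a)  # B shown first
    if v1 == "first" and v2 == "second":
        return "A"
    if v1 == "second" and v2 == "first":
        return "B"
    return "tie"  # judge flipped with order: no clear winner
```

Aggregating these verdicts across the dataset gives a win/loss/tie tally that is typically more stable than comparing absolute scores.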

Reporting

Recommended reporting for every run:

  • aggregate metrics (pass rate, mean score)
  • deltas vs baseline
  • slices by metadata
  • top regressions list with links to traces/artifacts
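The report above can be assembled from per-example results like this. The field names and the shape of the run dicts are assumptions for illustration; a "regression" here is any example that passed at baseline but fails in the current run.

```python
# Build a run report: aggregate pass rate, delta vs baseline,
# per-slice pass rates, and regressions linked back to traces.
def build_report(baseline: dict, current: dict, trace_url: str) -> dict:
    """Each run maps example id -> {"pass": bool, "slice": str}."""
    def pass_rate(run: dict) -> float:
        return sum(r["pass"] for r in run.values()) / len(run)

    regressions = [
        ex for ex in current
        if baseline.get(ex, {}).get("pass") and not current[ex]["pass"]
    ]
    slices: dict[str, list[bool]] = {}
    for r in current.values():
        slices.setdefault(r["slice"], []).append(r["pass"])

    return {
        "pass_rate": pass_rate(current),
        "delta_vs_baseline": pass_rate(current) - pass_rate(baseline),
        "slice_pass_rates": {k: sum(v) / len(v) for k, v in slices.items()},
        "regressions": [f"{trace_url}/{ex}" for ex in regressions],
    }
```

Linking each regression to its trace keeps the report actionable: a reviewer can jump straight from a failing example to the artifact that explains it.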

Next steps