CI gates

CI gates keep quality regressions from reaching production. Run fast, reliable checks on every change, and run deeper suites on a schedule.

Two-tier strategy

PR tier (fast)

  • 10–50 examples (stratified)
  • programmatic checks (schema, constraints)
  • quick judge (optional)
  • compare vs baseline
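
One way to draw the small stratified set for the PR tier is sketched below. The function name `stratified_sample`, the `key` callback, and the `"difficulty"` field in the usage line are hypothetical; the sketch assumes each example carries some label you can stratify on.

```python
import random
from collections import defaultdict


def stratified_sample(examples, key, n, seed=0):
    """Pick roughly n examples, spread across strata in proportion to their size.

    Every stratum contributes at least one example, so the result can be
    slightly larger or smaller than n after rounding.
    """
    rng = random.Random(seed)  # fixed seed keeps the PR subset stable across runs
    strata = defaultdict(list)
    for ex in examples:
        strata[key(ex)].append(ex)

    total = len(examples)
    picked = []
    for label in sorted(strata):  # sorted for a deterministic order
        group = strata[label]
        quota = max(1, round(n * len(group) / total))
        picked.extend(rng.sample(group, min(quota, len(group))))
    return picked


# Hypothetical dataset: 80 easy and 20 hard examples.
dataset = [{"id": i, "difficulty": "easy" if i < 80 else "hard"} for i in range(100)]
pr_subset = stratified_sample(dataset, key=lambda ex: ex["difficulty"], n=20)
```

Fixing the seed means a pass-rate change between two PR runs reflects the change under test rather than a different sample.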

Nightly tier (thorough)

  • full datasets
  • strong judges + calibrated rubrics
  • latency and cost budgets

Gate types

Use a combination:

  • hard failures: JSON invalid, policy violations, tool misuse
  • regression thresholds: pass rate cannot drop more than X% below the baseline
  • budget thresholds: p95 latency and cost per example within limits
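
The three gate types can be combined in a single check. The following is a minimal sketch, not a reference implementation: `RunSummary`, `Limits`, `p95`, and `evaluate_gates` are hypothetical names, the regression threshold is assumed to be an absolute drop in pass rate, and p95 uses the nearest-rank method.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    pass_rate: float         # fraction of examples passing, in [0, 1]
    hard_failures: int       # invalid JSON, policy violations, tool misuse
    latencies_s: list        # per-example latency in seconds
    cost_per_example: float  # average cost per example, any consistent unit


@dataclass
class Limits:
    max_pass_rate_drop: float    # assumed absolute drop vs baseline, e.g. 0.02
    max_p95_latency_s: float
    max_cost_per_example: float


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = -(-95 * len(ordered) // 100)  # ceil(0.95 * n) using integer math
    return ordered[rank - 1]


def evaluate_gates(candidate, baseline, limits):
    """Return a list of gate failures; an empty list means the gate passes."""
    failures = []
    if candidate.hard_failures > 0:
        failures.append(f"hard failures: {candidate.hard_failures}")
    drop = baseline.pass_rate - candidate.pass_rate
    if drop > limits.max_pass_rate_drop:
        failures.append(f"pass rate dropped by {drop:.3f}")
    latency = p95(candidate.latencies_s)
    if latency > limits.max_p95_latency_s:
        failures.append(f"p95 latency {latency:.2f}s over budget")
    if candidate.cost_per_example > limits.max_cost_per_example:
        failures.append(f"cost per example {candidate.cost_per_example} over budget")
    return failures
```

In a CI job, a non-empty result would typically be printed and turned into a non-zero exit code so the pipeline blocks the change.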

Handling flaky signals

If your eval has variance:

  • run multiple seeds and aggregate the results, or decode at temperature 0
  • use pairwise judge scoring
  • gate only on stable metrics; report the rest
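
To decide which metrics are stable enough to gate on, one option is to repeat the eval across seeds and look at the spread of each metric. The sketch below assumes a hypothetical `max_stdev` cutoff and hypothetical metric names; the right cutoff depends on your data.

```python
import statistics


def summarize(scores):
    """Mean and sample standard deviation of one metric across seeds."""
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, spread


def split_metrics(scores_by_metric, max_stdev):
    """Separate metrics into those to gate on and those to report only."""
    gated, reported = {}, {}
    for name, scores in scores_by_metric.items():
        mean, spread = summarize(scores)
        target = gated if spread <= max_stdev else reported
        target[name] = mean
    return gated, reported


# Hypothetical scores from three seeds per metric.
runs = {
    "schema_valid": [1.0, 1.0, 1.0],
    "judge_score": [0.6, 0.8, 0.7],
}
gated, reported = split_metrics(runs, max_stdev=0.05)
```

Here `schema_valid` has no spread and ends up gated, while `judge_score` varies too much across seeds and is only reported.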

Promotion workflow

Recommended:

  1. PR passes fast suite → merge
  2. staging promotion runs full suite → approve
  3. prod promotion → monitor traces, errors, budgets → rollback if needed
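
The workflow above can be read as a small decision table. This is an illustrative sketch only: the `Stage` enum, the `next_action` function, and the action strings are hypothetical and not tied to any particular CI system.

```python
from enum import Enum


class Stage(Enum):
    PR = "pr"
    STAGING = "staging"
    PROD = "prod"


def next_action(stage, checks_ok):
    """Map a stage and its check result to the next step in the workflow.

    checks_ok means: fast suite passed (PR), full suite passed (staging),
    or traces, errors, and budgets look healthy (prod).
    """
    if stage is Stage.PR:
        return "merge" if checks_ok else "block"
    if stage is Stage.STAGING:
        return "approve" if checks_ok else "block"
    return "keep" if checks_ok else "rollback"
```

Keeping the decision in one function makes the rollback condition explicit and easy to test outside the pipeline.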

Next steps