CI gates

CI gates keep quality regressions from reaching production. Run fast, reliable checks on every change and deeper suites on a schedule.

Two-tier strategy

PR tier (fast)

10–50 examples (stratified)
programmatic checks (schema, constraints)
quick judge (optional)
compare vs baseline
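
One way to draw the small stratified subset for the PR tier is sketched below. It is a minimal illustration, not a prescribed implementation: the `stratified_sample` helper, the `category` field, and the fixed seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(examples, key, total, seed=0):
    """Pick roughly `total` examples, spread evenly across strata.

    `key` maps an example to its stratum label (hypothetical; use whatever
    grouping your dataset has, such as task type or difficulty).
    """
    rng = random.Random(seed)  # fixed seed so every PR run sees the same subset
    groups = defaultdict(list)
    for ex in examples:
        groups[key(ex)].append(ex)
    if not groups:
        return []

    per_group = max(1, total // len(groups))
    sample = []
    for name in sorted(groups):  # sorted for a deterministic order
        members = groups[name]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


# Hypothetical dataset: 60 examples in three categories.
examples = [{"id": i, "category": c} for c in ("a", "b", "c") for i in range(20)]
pr_subset = stratified_sample(examples, key=lambda e: e["category"], total=30)
```

Fixing the seed keeps the subset identical between the baseline run and the PR run, so the comparison is not distorted by sampling differences.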

Nightly tier (thorough)

full datasets
strong judges + calibrated rubrics
latency and cost budgets

Gate types

Use a combination:

hard failures: JSON invalid, policy violations, tool misuse
regression thresholds: pass rate cannot drop more than X% below the baseline
budget thresholds: p95 latency and cost per example within limits
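
The three gate types above can be combined in a single check. The sketch below assumes the eval run has already been reduced to a summary; `EvalSummary`, `GateConfig`, `check_gates`, and every threshold value are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalSummary:
    pass_rate: float             # fraction in [0, 1]
    hard_failures: int           # invalid JSON, policy violations, tool misuse
    p95_latency_s: float
    cost_per_example_usd: float


@dataclass
class GateConfig:
    max_pass_rate_drop: float    # absolute drop allowed, e.g. 0.02
    max_p95_latency_s: float
    max_cost_per_example_usd: float


def check_gates(current, baseline, cfg):
    """Return a list of gate failures; an empty list means the run passes."""
    failures = []
    # Hard failures: any occurrence blocks the change.
    if current.hard_failures > 0:
        failures.append(f"hard failures: {current.hard_failures}")
    # Regression threshold: compare against the baseline, not an absolute bar.
    drop = baseline.pass_rate - current.pass_rate
    if drop > cfg.max_pass_rate_drop:
        failures.append(f"pass rate dropped by {drop:.3f}")
    # Budget thresholds.
    if current.p95_latency_s > cfg.max_p95_latency_s:
        failures.append(f"p95 latency {current.p95_latency_s:.2f}s over budget")
    if current.cost_per_example_usd > cfg.max_cost_per_example_usd:
        failures.append(f"cost ${current.cost_per_example_usd:.4f} over budget")
    return failures


# Illustrative numbers only.
cfg = GateConfig(max_pass_rate_drop=0.02, max_p95_latency_s=3.0,
                 max_cost_per_example_usd=0.02)
baseline = EvalSummary(0.90, 0, 2.0, 0.01)
bad_run = EvalSummary(0.85, 1, 3.5, 0.01)
good_run = EvalSummary(0.89, 0, 2.5, 0.01)
```

In a CI job, a non-empty result from `check_gates` would typically be printed and followed by a non-zero exit code so the pipeline fails.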

Handling flaky signals

If your eval has variance:

run multiple seeds and aggregate the results (or set temperature to 0)
use pairwise judge scoring
gate only on stable metrics; report the rest
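
A minimal sketch of the "gate only on stable metrics" idea: run each metric across several seeds, measure its spread, and gate only on metrics whose spread stays under a tolerance. The `split_metrics` helper, the metric names, and the tolerance are assumptions made for this example.

```python
import statistics


def summarize_seeds(scores):
    """Return (mean, sample standard deviation) for one metric's per-seed scores."""
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, spread


def split_metrics(runs, max_stdev):
    """Split metrics into those stable enough to gate on and those to report only.

    `runs` maps a metric name to its list of per-seed scores.
    """
    gated, report_only = {}, {}
    for name, scores in runs.items():
        mean, spread = summarize_seeds(scores)
        target = gated if spread <= max_stdev else report_only
        target[name] = mean
    return gated, report_only


# Hypothetical per-seed results for two metrics.
runs = {
    "schema_valid": [1.0, 1.0, 1.0],
    "judge_score": [0.6, 0.8, 0.7],
}
gated, report_only = split_metrics(runs, max_stdev=0.02)
```

Here `schema_valid` has zero spread and would be gated, while `judge_score` varies too much between seeds and would only be reported.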

Promotion workflow

Recommended:

PR passes fast suite → merge
staging promotion runs full suite → approve
prod promotion → monitor traces, errors, budgets → rollback if needed
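
The workflow above can be read as a small decision table. The sketch below is one hypothetical encoding; the stage names and the `next_action` function are illustrative, and `checks_ok` stands for "fast suite passed", "full suite passed", or "monitors healthy" depending on the stage.

```python
def next_action(stage, checks_ok):
    """Map a promotion stage and its check result to the next step."""
    actions = {
        "pr": ("merge", "block"),          # fast suite
        "staging": ("approve", "block"),   # full suite
        "prod": ("keep", "rollback"),      # traces, errors, budgets
    }
    if stage not in actions:
        raise ValueError(f"unknown stage: {stage}")
    on_pass, on_fail = actions[stage]
    return on_pass if checks_ok else on_fail
```

For example, `next_action("prod", checks_ok=False)` returns `"rollback"`.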

Next steps

Run evals in CI