CI gates
CI gates keep quality regressions from reaching production. The goal is fast, reliable checks on every change, with deeper suites run on a schedule.
Two-tier strategy
PR tier (fast)
- 10–50 examples (stratified)
- programmatic checks (schema, constraints)
- quick judge (optional)
- compare vs baseline
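One way to draw the small stratified subset for the PR tier is sketched below. It assumes each example is a dict carrying a stratum label (the `category` field here is hypothetical) and uses equal allocation across strata, which is only one possible reading of "stratified"; proportional allocation would be an equally valid choice.

```python
import random
from collections import defaultdict

def stratified_sample(examples, key, n, seed=0):
    """Pick up to n examples, spreading picks evenly across strata."""
    rng = random.Random(seed)  # fixed seed keeps the PR subset stable between runs
    strata = defaultdict(list)
    for ex in examples:
        strata[ex[key]].append(ex)

    # Shuffle inside each stratum so the picks are random but reproducible.
    pools = []
    for name in sorted(strata):
        pool = list(strata[name])
        rng.shuffle(pool)
        pools.append(pool)

    # Take one example per stratum in turn until n is reached or data runs out.
    picked = []
    while len(picked) < n and any(pools):
        for pool in pools:
            if pool and len(picked) < n:
                picked.append(pool.pop())
    return picked
```

Small strata are exhausted first, so rare categories are fully represented before common ones fill the remaining slots.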
Nightly tier (thorough)
- full datasets
- strong judges + calibrated rubrics
- latency and cost budgets
Gate types
Use a combination:
- hard failures: JSON invalid, policy violations, tool misuse
- regression thresholds: pass rate cannot drop more than X% relative to the baseline
- budget thresholds: p95 latency and cost per example within limits
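The three gate types above can be combined in a single check, sketched here. The `RunSummary` fields, the function name, and all default thresholds are illustrative assumptions, not recommended values; the regression threshold is expressed as an absolute drop in pass rate.

```python
from dataclasses import dataclass

@dataclass
class RunSummary:
    pass_rate: float             # fraction of examples passing, 0.0 to 1.0
    hard_failures: int           # invalid JSON, policy violations, tool misuse
    p95_latency_s: float         # 95th percentile latency in seconds
    cost_per_example_usd: float  # average cost per example

def evaluate_gates(candidate, baseline, max_drop=0.02,
                   max_p95_latency_s=5.0, max_cost_usd=0.01):
    """Return a list of gate failures; an empty list means the change passes."""
    failures = []

    # Hard failures: any occurrence blocks the change.
    if candidate.hard_failures > 0:
        failures.append(f"hard failures: {candidate.hard_failures}")

    # Regression threshold: compare the candidate against the baseline.
    drop = baseline.pass_rate - candidate.pass_rate
    if drop > max_drop:
        failures.append(f"pass rate dropped by {drop:.3f} (limit {max_drop})")

    # Budget thresholds: latency and cost must stay within limits.
    if candidate.p95_latency_s > max_p95_latency_s:
        failures.append(f"p95 latency {candidate.p95_latency_s}s over budget")
    if candidate.cost_per_example_usd > max_cost_usd:
        failures.append(f"cost per example ${candidate.cost_per_example_usd} over budget")

    return failures
```

A CI script could print the returned list and exit non-zero when it is non-empty, so every violated gate shows up in one run rather than only the first.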
Handling flaky signals
If your eval has variance:
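- run multiple seeds and aggregate the results, or decode at temperature 0 to reduce sampling variance
- use pairwise judge scoring (candidate vs. baseline) instead of absolute scores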
- gate only on stable metrics; report the rest
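A minimal sketch of "gate only on stable metrics" follows, assuming each metric has been measured once per seed. The function name, the metric names in the usage note, and the `max_std` cutoff are hypothetical; standard deviation across seeds is used here as one simple stability measure among several possible ones.

```python
from statistics import mean, stdev

def split_stable(metrics, max_std=0.02):
    """Split metrics into (gated, report_only) by their spread across seeds.

    metrics maps a metric name to a list of per-seed scores.
    Each returned dict maps the name to (mean, standard deviation).
    """
    gated, report_only = {}, {}
    for name, scores in metrics.items():
        # stdev needs at least two points; a single run has no measurable spread.
        spread = stdev(scores) if len(scores) > 1 else 0.0
        summary = (mean(scores), spread)
        if spread <= max_std:
            gated[name] = summary        # stable enough to block a change
        else:
            report_only[name] = summary  # too noisy: show it, do not gate on it
    return gated, report_only
```

For example, a schema validity rate of `[0.98, 0.99, 0.98]` across three seeds would be gated, while a judge score of `[0.70, 0.82, 0.61]` would be reported only.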
Promotion workflow
Recommended:
- PR passes fast suite → merge
- staging promotion runs full suite → approve
- prod promotion → monitor traces, errors, budgets → rollback if needed
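The workflow above can be read as a small state machine, sketched below. The `Stage` names and the single boolean `signal_ok` input are simplifying assumptions; a real pipeline would carry richer signals (traces, error rates, budgets) and likely a manual approval step.

```python
from enum import Enum

class Stage(Enum):
    PR = "pr"
    STAGING = "staging"
    PROD = "prod"
    ROLLED_BACK = "rolled_back"

def advance(stage, signal_ok):
    """Return the next stage given whether the current stage's signal is healthy."""
    if stage is Stage.PR:
        # The fast suite gates the merge.
        return Stage.STAGING if signal_ok else Stage.PR
    if stage is Stage.STAGING:
        # The full suite gates approval for production.
        return Stage.PROD if signal_ok else Stage.STAGING
    if stage is Stage.PROD:
        # Monitoring decides whether the release stays live.
        return Stage.PROD if signal_ok else Stage.ROLLED_BACK
    # ROLLED_BACK is terminal in this sketch.
    return stage
```

A failed check never moves a change forward: it either stays where it is or, once in production, rolls back.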