Run evals in CI
Running evals in CI catches quality regressions before they ship.
What to run in PRs vs nightly
PR (fast):
- small, representative subset (10–50 examples)
- strict programmatic checks (schema, required fields)
- compare against the current baseline
Nightly (thorough):
- full dataset(s)
- LLM judge scoring
- latency and cost analysis
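The PR-tier items above (a small fixed subset plus strict programmatic checks) can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `REQUIRED_FIELDS` schema, the function names, and the default subset size of 25 are all assumptions made for the example. The hash-based ordering keeps the subset stable between runs so that comparisons against the baseline use the same examples.

```python
import hashlib
import json

# Hypothetical output schema: field name -> expected Python type.
# Replace with the fields your own application must return.
REQUIRED_FIELDS = {"answer": str, "sources": list}


def select_pr_subset(example_ids, size=25):
    """Pick a stable subset by ordering IDs on their SHA-256 hash.

    The result does not depend on the input order, so every PR run
    evaluates the same examples.
    """
    ranked = sorted(
        example_ids,
        key=lambda example_id: hashlib.sha256(example_id.encode()).hexdigest(),
    )
    return ranked[:size]


def check_output(raw_output):
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems
```

Because these checks are deterministic and need no model calls, they are cheap enough to run on every PR; judge-based scoring can stay in the nightly run.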
Recommended gating rules
- critical failures must stay at 0
- pass rate must not drop below the baseline by more than an agreed threshold
- p95 latency must remain within budget
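As a rough sketch, the three rules above could be combined into a single gate function like the one below. The result fields (`passed`, `critical_failure`, `latency_ms`), the function names, and the default threshold and budget values are assumptions chosen for illustration; substitute whatever your eval runner actually emits and the limits your team has agreed on.

```python
def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def gate(results, baseline_pass_rate, max_regression=0.02, p95_budget_ms=2000.0):
    """Return a list of gate violations; an empty list means the run passes.

    Each result is assumed to be a dict with the hypothetical keys
    "passed" (bool), "critical_failure" (bool), and "latency_ms" (float).
    The default limits are placeholders, not recommendations.
    """
    if not results:
        return ["no results to evaluate"]

    violations = []

    # Rule 1: critical failures must stay at 0.
    critical = sum(1 for r in results if r["critical_failure"])
    if critical > 0:
        violations.append(f"{critical} critical failure(s)")

    # Rule 2: pass rate must not fall more than max_regression below baseline.
    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    if pass_rate < baseline_pass_rate - max_regression:
        violations.append(
            f"pass rate {pass_rate:.3f} is below baseline "
            f"{baseline_pass_rate:.3f} by more than {max_regression:.3f}"
        )

    # Rule 3: p95 latency must remain within budget.
    latency = p95([r["latency_ms"] for r in results])
    if latency > p95_budget_ms:
        violations.append(f"p95 latency {latency:.0f} ms exceeds {p95_budget_ms:.0f} ms")

    return violations
```

A CI job would typically print the returned violations and exit with a nonzero status when the list is non-empty, so the pipeline marks the check as failed.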
Next steps
- Read Evaluation runs.
- Read Evaluations for scoring strategies.