Run evals in CI

CI evals prevent quality regressions from shipping unnoticed.

What to run in PRs vs nightly

PR (fast):

  • small, representative subset (10–50 examples)
  • strict programmatic checks (schema, required fields)
  • compare against the current baseline
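
As an illustration of the PR-stage items above, here is a minimal Python sketch of a fixed-seed subset picker and a strict programmatic check. The names `pick_subset`, `check_output`, and `REQUIRED_FIELDS`, as well as the field names themselves, are hypothetical and not part of any particular eval tool. Seeded random sampling is only one way to get a stable subset; a stratified sample may be more representative for skewed datasets.

```python
import json
import random

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list}


def pick_subset(examples, k=25, seed=0):
    """Pick the same k examples on every run by seeding the sampler."""
    if len(examples) <= k:
        return list(examples)
    return random.Random(seed).sample(examples, k)


def check_output(raw):
    """Return a list of problems; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems
```

Keeping the seed fixed means a change in PR results reflects a change in the system under test rather than a change in which examples were sampled.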

Nightly (thorough):

  • full dataset(s)
  • LLM judge scoring
  • latency and cost analysis

Gates (fail the build when violated):

  • critical failures must stay at 0
  • pass rate must not regress beyond a threshold
  • p95 latency must remain within budget
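
The three pass/fail conditions above (zero critical failures, bounded pass-rate regression, p95 latency within budget) could be checked with a small script along these lines. This is a sketch under stated assumptions: the result record shape (`passed`, `critical`, `latency_ms`), the function names, and the default threshold and budget values are all hypothetical placeholders to adapt to your own setup. The percentile uses the nearest-rank method; other percentile definitions give slightly different values.

```python
def p95(values):
    """95th percentile by the nearest-rank method (integer math, no rounding drift)."""
    ordered = sorted(values)
    rank = max(1, (95 * len(ordered) + 99) // 100)  # ceil(0.95 * n)
    return ordered[rank - 1]


def evaluate_gates(results, baseline_pass_rate,
                   max_regression=0.02, p95_budget_ms=2000.0):
    """Return a list of gate violations; an empty list means all gates pass.

    Each result is assumed to be a dict with the hypothetical keys
    'passed' (bool), 'critical' (bool), and 'latency_ms' (float).
    """
    if not results:
        return ["no results to evaluate"]

    failures = []

    # Gate 1: critical failures must stay at 0.
    critical = sum(1 for r in results if r["critical"])
    if critical > 0:
        failures.append(f"{critical} critical failure(s)")

    # Gate 2: pass rate must not drop more than max_regression below baseline.
    pass_rate = sum(1 for r in results if r["passed"]) / len(results)
    if pass_rate < baseline_pass_rate - max_regression:
        failures.append(
            f"pass rate {pass_rate:.3f} is below baseline "
            f"{baseline_pass_rate:.3f} minus allowed regression {max_regression:.3f}"
        )

    # Gate 3: p95 latency must remain within budget.
    latency = p95([r["latency_ms"] for r in results])
    if latency > p95_budget_ms:
        failures.append(
            f"p95 latency {latency:.0f} ms exceeds budget {p95_budget_ms:.0f} ms"
        )

    return failures
```

A CI job would typically print the returned violations and exit with a non-zero status when the list is non-empty, which is what causes most CI systems to mark the run as failed.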

Next steps