Evaluation runs
An evaluation run executes a workflow (prompt + model + tools + config) against a fixed dataset. The output is a set of scores and artifacts you can compare against baselines.
What gets compared
Eval runs are most useful when you can pin the variables you care about:
- prompt version
- model + provider
- temperature and decoding parameters
- tool availability and tool implementations
- retrieval configuration (index, top-k, reranker)
Outputs of a run
A good eval run produces:
- per-example scores and rationales (from human reviewers or an LLM judge)
- aggregate metrics (pass rate, mean score, p95 latency, cost)
- links back to traces for failure investigation
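The aggregate metrics above can be rolled up from per-example records. A minimal sketch, assuming each record is a dict with `passed`, `score`, `latency_ms`, and `cost_usd` keys (the record shape is an assumption, not a fixed format):

```python
import statistics

def aggregate(results):
    """Roll per-example results up into aggregate run metrics.

    results: list of dicts with keys passed, score, latency_ms, cost_usd.
    """
    n = len(results)
    latencies = sorted(r["latency_ms"] for r in results)
    # p95 latency via the nearest-rank method, clamped to the last index
    p95 = latencies[min(n - 1, int(0.95 * n))]
    return {
        "pass_rate": sum(r["passed"] for r in results) / n,
        "mean_score": statistics.mean(r["score"] for r in results),
        "p95_latency_ms": p95,
        "total_cost_usd": sum(r["cost_usd"] for r in results),
    }
```

Keeping the per-example records alongside the rollup is what makes the trace links useful: an aggregate regression points you at the specific examples that moved it.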
Baselines and gates
Common baselines:
- last production version (“champion”)
- last release tag
- a curated “golden” prompt version
Common gates:
- pass rate must not decrease by more than X%
- critical failure categories must stay at 0
- latency/cost must remain within budgets
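The gates above can be expressed as a single check against a baseline run. This is a hedged sketch: the metric names and default thresholds are hypothetical, and real gates would come from your own budgets.

```python
def passes_gates(candidate, baseline,
                 max_pass_rate_drop=0.02,   # gate: at most a 2-point drop
                 latency_budget_ms=2000,    # gate: p95 latency budget
                 cost_budget_usd=0.05):     # gate: per-example cost budget
    """Compare a candidate run's metrics against a baseline and budgets.

    Returns (passed, per-gate results) so CI can report which gate failed.
    Metric keys are illustrative.
    """
    checks = {
        "pass_rate": candidate["pass_rate"]
                     >= baseline["pass_rate"] - max_pass_rate_drop,
        "critical_failures": candidate["critical_failures"] == 0,
        "latency": candidate["p95_latency_ms"] <= latency_budget_ms,
        "cost": candidate["cost_per_example_usd"] <= cost_budget_usd,
    }
    return all(checks.values()), checks
```

Returning the per-gate breakdown, not just a boolean, keeps a failed CI run actionable without re-running the eval.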
Next steps
- Read Evaluations for dataset and scoring setup.