
Evaluation runs

An evaluation run executes a workflow (prompt + model + tools + config) against a fixed dataset. The output is a set of scores and artifacts you can compare against baselines.

What gets compared

Evaluation runs are most useful when every variable you care about is pinned, so a score difference can be attributed to the one change under test:

  • prompt version
  • model + provider
  • temperature and decoding parameters
  • tool availability and tool implementations
  • retrieval configuration (index, top-k, reranker)

Outputs of a run

A good eval run produces:

  • per-example scores and rationales (from human raters or an LLM judge)
  • aggregate metrics (pass rate, mean score, p95 latency, cost)
  • links back to traces for failure investigation
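The aggregate metrics above can be derived directly from the per-example results. A sketch with hypothetical field names (`passed`, `score`, `latency_ms`, `cost_usd` are illustrative, not a fixed schema):

```python
import statistics

# Hypothetical per-example results from one evaluation run.
results = [
    {"passed": True,  "score": 0.9, "latency_ms": 420,  "cost_usd": 0.002},
    {"passed": True,  "score": 0.8, "latency_ms": 510,  "cost_usd": 0.002},
    {"passed": False, "score": 0.3, "latency_ms": 1900, "cost_usd": 0.004},
    {"passed": True,  "score": 1.0, "latency_ms": 460,  "cost_usd": 0.002},
]

pass_rate = sum(r["passed"] for r in results) / len(results)
mean_score = statistics.mean(r["score"] for r in results)

# p95 latency: take the 95th of 99 cut points (exclusive method).
latencies = sorted(r["latency_ms"] for r in results)
p95_latency_ms = statistics.quantiles(latencies, n=100)[94]

total_cost_usd = sum(r["cost_usd"] for r in results)
```

Computing these in one place, from the same result rows that link back to traces, keeps the aggregate numbers and the failure-investigation view consistent with each other.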

Baselines and gates

Common baselines:

  • last production version (“champion”)
  • last release tag
  • a curated “golden” prompt version

Common gates:

  • pass rate must not decrease by more than X%
  • critical failure categories must stay at 0
  • latency/cost must remain within budgets
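Gates like these are straightforward to express as a comparison between a candidate run and its baseline. A hedged sketch; the threshold values, metric names, and function shape are all assumptions for illustration:

```python
# Hypothetical gate check: compare a candidate run's aggregates against
# a baseline's and return (passed, reasons-for-failure).
def passes_gates(candidate: dict, baseline: dict,
                 max_pass_rate_drop: float = 0.02,
                 latency_budget_ms: float = 800.0,
                 cost_budget_usd: float = 0.01) -> tuple[bool, list[str]]:
    failures: list[str] = []
    if candidate["pass_rate"] < baseline["pass_rate"] - max_pass_rate_drop:
        failures.append("pass rate regressed beyond threshold")
    if candidate["critical_failures"] != 0:
        failures.append("critical failure categories must stay at 0")
    if candidate["p95_latency_ms"] > latency_budget_ms:
        failures.append("p95 latency over budget")
    if candidate["cost_per_example_usd"] > cost_budget_usd:
        failures.append("cost per example over budget")
    return (not failures, failures)

baseline = {"pass_rate": 0.90}  # e.g. the current champion's pass rate
candidate = {"pass_rate": 0.89, "critical_failures": 0,
             "p95_latency_ms": 640.0, "cost_per_example_usd": 0.004}
ok, reasons = passes_gates(candidate, baseline)
```

Returning the list of failed gates, rather than a bare boolean, is what makes a blocked release actionable: the CI log states exactly which budget or threshold was violated.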

Next steps