Evaluations
Evaluations turn “it feels better” into measurable change. You run the same workflow over a fixed dataset, apply scores (human and/or automated), and compare results across prompt versions, models, and releases.
They work best once you have basic Tracing in place—you will often export examples from traces into datasets.
When to invest in evals
- You change prompts or models frequently and need to catch regressions.
- You have high-stakes outputs (support, compliance-sensitive, medical-adjacent copy, etc.).
- Your team debates quality in Slack instead of on numbers and diffs.
Building a dataset
What belongs in a dataset
Each row should represent one scenario you care about:
- Input — user message, system context, or a fixture that reproduces the task.
- Expected signal — gold answer, rubric notes, “must include / must not include”, or a reference tool trajectory.
- Metadata — product area, difficulty, customer segment, language.
Start small (20–50 examples), then grow toward 100+ for stable trends.
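The row structure above can be sketched as plain data. This is a minimal illustration, not a required schema — field names like `must_include` and `area` are placeholders for whatever your tooling expects:

```python
# One row per scenario: input, expected signal, and metadata for slicing.
# All field names here are illustrative, not a fixed schema.
dataset = [
    {
        "input": {"user_message": "How do I reset my password?"},
        "expected": {
            "must_include": ["reset link", "email"],
            "must_not_include": ["phone support"],
        },
        "metadata": {"area": "account", "difficulty": "easy", "language": "en"},
    },
    {
        "input": {"user_message": "Cancel my subscription and refund me."},
        "expected": {"rubric": "Must route to a human for refund approval."},
        "metadata": {"area": "billing", "difficulty": "hard", "language": "en"},
    },
]
```

Keeping metadata on every row from day one pays off later: slicing results by `area` or `difficulty` is only possible if the labels exist.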
Where examples come from
- Production traces (sanitized) for real-world coverage.
- Synthetic examples for edge cases rarely seen in the wild.
- Support tickets and failure clusters from your trace explorer.
Quality of the dataset
- Balance easy and hard cases; avoid a dataset of only trivial prompts.
- Refresh periodically—product language and model behavior drift.
- Version the dataset when you change labels or schema.
Scoring: human and automated
Human review
- Use for nuance, brand voice, and safety.
- Keep a short rubric (e.g. 1–5 on correctness, helpfulness, policy).
- Sample a fixed percentage of production traffic for calibration.
LLM-as-judge
- Fast for relative comparisons (A vs B on the same input).
- Give the judge a strict rubric and ask for JSON scores + rationale.
- Watch for position bias and leniency—compare judges across runs, not as absolute truth.
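A judge call can be sketched as a strict prompt plus shape validation on the returned JSON. `call_model` here is a hypothetical wrapper around your own LLM client (prompt string in, text out) — swap in whatever SDK you use:

```python
import json

JUDGE_PROMPT = """You are a strict evaluator. Score the RESPONSE against the RUBRIC.
Return ONLY JSON: {{"correctness": 1-5, "helpfulness": 1-5, "rationale": "..."}}

RUBRIC: {rubric}
INPUT: {input}
RESPONSE: {response}"""


def judge(call_model, rubric, user_input, response):
    # call_model is a hypothetical client wrapper: str prompt -> str output.
    raw = call_model(
        JUDGE_PROMPT.format(rubric=rubric, input=user_input, response=response)
    )
    scores = json.loads(raw)
    # Validate the shape before trusting the judge's numbers.
    assert {"correctness", "helpfulness", "rationale"} <= scores.keys()
    return scores
```

For A-vs-B comparisons, one common mitigation for position bias is to judge each pair twice with the candidate order swapped and average the verdicts.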
Programmatic checks
- JSON schema validation, required fields, regex, length limits.
- Tool-call assertions (correct tool, required arguments).
- Cheap and great for CI alongside softer judges.
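The checks above compose naturally into one function that returns a list of failures, so a row passes only when the list is empty. The required fields (`answer`, `sources`) and the banned phrase are illustrative assumptions:

```python
import json
import re


def check_output(raw: str, max_len: int = 2000) -> list[str]:
    """Return the names of failed checks; an empty list means pass."""
    failures = []
    if len(raw) > max_len:
        failures.append("too_long")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return failures + ["invalid_json"]
    # Required fields — adjust to your own output schema.
    for field in ("answer", "sources"):
        if field not in data:
            failures.append(f"missing_{field}")
    # Regex check for a banned boilerplate phrase (example only).
    if "answer" in data and re.search(r"(?i)as an ai", data["answer"]):
        failures.append("boilerplate_phrase")
    return failures
```

Because these checks are deterministic and fast, they can run on every CI commit even when judge-based scoring is reserved for nightly runs.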
Running an evaluation
A typical workflow:
- Freeze the dataset version you are scoring against.
- Pin what you are testing: prompt version, model, temperature, tools.
- Run the eval (batch job or CI step).
- Compare to baseline: previous release, production snapshot, or champion prompt.
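The loop behind that workflow can be sketched as follows. `generate` and `score` are hypothetical callables standing in for your app's generation step and whichever scorer (programmatic, judge, or human label lookup) you use; `config` is the pinned prompt/model/temperature bundle:

```python
def run_eval(dataset, generate, score, config):
    """Run one frozen config over a frozen dataset; return per-row results.

    generate(input, config) -> model output (your own callable).
    score(output, expected) -> float in [0, 1] (your own callable).
    """
    results = []
    for row in dataset:
        output = generate(row["input"], config)
        results.append({
            "metadata": row["metadata"],
            "score": score(output, row["expected"]),
        })
    return results
```

Running the same function twice — once with the baseline config, once with the candidate — gives you two result sets you can diff row by row rather than only in aggregate.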
Run evals on every meaningful change to prompts/models—not only before major launches.
Using results
- Aggregate metrics — mean score, pass rate, latency p95, cost per example.
- Slice by metadata (language, intent, customer tier) to find which segments regressed.
- Inspect failures — open the worst rows and jump to traces if linked.
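Slicing by metadata is a small group-and-average, assuming per-row results shaped like the eval output above (a `metadata` dict and a numeric `score` per row):

```python
from collections import defaultdict
from statistics import mean


def slice_scores(results, key):
    """Group per-row scores by one metadata field and average each group."""
    groups = defaultdict(list)
    for row in results:
        groups[row["metadata"][key]].append(row["score"])
    return {label: mean(scores) for label, scores in groups.items()}
```

A global mean of 0.9 can hide a segment at 0.5; slicing by `language` or `intent` surfaces exactly that.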
CI and release gates
- Add a minimum pass rate or max regression vs baseline for merges to main.
- Keep CI sets fast (subset of data); run full suites nightly or pre-release.
- Store artifacts: scores, judge outputs, and links to traces for auditability.
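A release gate reduces to two thresholds: an absolute floor and a maximum regression against the baseline. The threshold values below are illustrative defaults, not recommendations; in CI you would exit nonzero when the gate fails:

```python
def gate(current_pass_rate: float, baseline_pass_rate: float,
         min_pass_rate: float = 0.85, max_regression: float = 0.02) -> bool:
    """Return True if the candidate may merge, False otherwise.

    Thresholds are example values — tune them to your product's risk.
    """
    # Absolute floor: quality must never drop below this, baseline aside.
    if current_pass_rate < min_pass_rate:
        return False
    # Relative gate: no large regression versus the baseline run.
    if baseline_pass_rate - current_pass_rate > max_regression:
        return False
    return True
```

Pairing a floor with a regression cap catches both "always been bad" and "just got worse" failures, which a single threshold misses.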
Common pitfalls
- Evaluating on the same data you tuned on — hold out a test set.
- One global score — a prompt can improve one segment and hurt another.
- Ignoring latency and cost — quality at any price is not always shippable.
Next steps
- Prompt management — version prompts and promote winners safely.
- Tracing — ensure every eval failure is traceable in production.