Dataset design
Your evaluation is only as good as your dataset. A great dataset is representative, versioned, and sliceable (so you can pinpoint which slices regressed).
What an example should contain
At minimum:
- Input: user request/message and relevant context
- Expected signal: rubric notes, constraints, or reference output
- Metadata: fields used for slicing (intent, language, customer tier, difficulty)
If your workflow uses tools or retrieval, include:
- tool allowlist or tool policy assumptions
- retrieval configuration (index name, top-k, reranker) as metadata
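Putting the fields above together, one example might look like the following sketch. The field names and nesting are illustrative assumptions, not a required schema; use whatever shape your harness expects.

```python
# A single evaluation example. Field names (input/expected/metadata,
# tools, retrieval) are illustrative, not a prescribed schema.
example = {
    "input": {
        "message": "Cancel my subscription and refund last month.",
        "context": {"plan": "pro", "signup_date": "2023-04-01"},
    },
    "expected": {
        "rubric": "Must confirm identity before any account change.",
        "constraints": ["no refund promises without checking policy"],
    },
    "metadata": {
        "intent": "agent_tooling",
        "lang": "en",
        "difficulty": "hard",
        "customerTier": "pro",
    },
    # Only present when the workflow uses tools/retrieval:
    "tools": {"allowlist": ["lookup_account", "open_ticket"]},
    "retrieval": {"index": "support-kb-v3", "top_k": 5, "reranker": None},
}
```

Keeping tool and retrieval configuration inside the example (rather than implicit in the harness) means a later config change shows up as a dataset change, not a silent shift in what you measured.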
Coverage strategy
Start by covering:
- Top intents: the 5–10 most common user tasks
- High-stakes flows: compliance-sensitive outputs, irreversible actions, financial/medical-adjacent
- Known failure modes: tool misuse, hallucinations, policy refusals, prompt injection
Then expand using production data.
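A cheap way to keep the coverage list above honest is a gap check that compares per-intent example counts against targets. This is a minimal sketch; the `metadata.intent` field and the target numbers are assumptions from this guide, not a standard API.

```python
from collections import Counter

def coverage_gaps(examples, targets):
    """Return {intent: missing_count} for intents below their target.

    `examples` follow this guide's assumed shape: each has a
    metadata.intent field. `targets` maps intent -> desired minimum.
    """
    counts = Counter(ex["metadata"]["intent"] for ex in examples)
    return {
        intent: need - counts.get(intent, 0)
        for intent, need in targets.items()
        if counts.get(intent, 0) < need
    }

# Example: 3 summarize examples, targets of 5 summarize / 2 extract.
examples = [{"metadata": {"intent": "summarize"}} for _ in range(3)]
gaps = coverage_gaps(examples, {"summarize": 5, "extract": 2})
# gaps == {"summarize": 2, "extract": 2}
```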
Sourcing examples
Best sources (in order):
- Production failures from traces (sanitized)
- Support tickets and customer feedback
- Edge cases authored by engineers/PMs (rare but critical)
- Synthetic cases (useful for coverage, but don’t let them dominate)
Train/validation/holdout split
If you iterate prompts by looking at failures, you will overfit unless you keep a holdout.
Recommended:
- Tune set: used for iteration
- Holdout set: never used during iteration; only used to validate “real” progress
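One way to keep holdout membership stable across runs is to assign each example by a deterministic hash of its id rather than a random shuffle, so re-running the split never moves examples between sets. A sketch, assuming each example has a stable string id:

```python
import hashlib

def assign_split(example_id: str, holdout_fraction: float = 0.2) -> str:
    """Deterministically assign an example to 'tune' or 'holdout'.

    Uses a stable hash of the id, so the same example always lands
    in the same set, even as the dataset grows.
    """
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "holdout" if bucket < holdout_fraction * 100 else "tune"
```

Because assignment depends only on the id, adding new examples later cannot leak previously-held-out examples into the tune set.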
Metadata that pays off
Add metadata early so you can slice results later:
- `intent`: e.g. `summarize`, `extract`, `classify`, `agent_tooling`
- `domain`: e.g. `support`, `sales`, `legal`
- `lang`: `en`, `es`, `ja`
- `difficulty`: `easy` | `medium` | `hard`
- `customerTier`: `free` | `pro` | `enterprise`
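With that metadata in place, slicing is a one-liner-sized helper: group per-example scores by a metadata field and average. A minimal sketch, assuming results carry a numeric `score` and the metadata shape above:

```python
from collections import defaultdict

def slice_scores(results, field):
    """Average per-example scores grouped by a metadata field.

    `results` is assumed to be a list of dicts with a numeric
    "score" and a "metadata" dict, as sketched in this guide.
    """
    buckets = defaultdict(list)
    for r in results:
        buckets[r["metadata"][field]].append(r["score"])
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}

results = [
    {"metadata": {"lang": "en"}, "score": 1.0},
    {"metadata": {"lang": "en"}, "score": 0.0},
    {"metadata": {"lang": "es"}, "score": 1.0},
]
by_lang = slice_scores(results, "lang")
# by_lang == {"en": 0.5, "es": 1.0}
```

An aggregate score of 0.67 here would hide that `en` is failing half the time; the slice makes it visible.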
Versioning datasets
When you change:
- labels/rubrics
- schema of examples
- the meaning of scores
…create a new dataset version. Otherwise trend lines lie and baselines are not comparable.
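A content fingerprint makes version drift detectable instead of relying on discipline: any change to labels, schema, or rubric text produces a new hash. A sketch, assuming examples are JSON-serializable:

```python
import hashlib
import json

def dataset_fingerprint(examples) -> str:
    """Stable content hash over a list of examples.

    Any edit to labels, rubrics, or schema changes the fingerprint,
    which can be stored alongside results so baselines are only
    compared when fingerprints match.
    """
    blob = json.dumps(examples, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:12]

v1 = [{"input": "x", "label": "good"}]
v2 = [{"input": "x", "label": "bad"}]
```

Refuse to draw a trend line between two runs whose fingerprints differ; they measured different things.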
Size guidance
- 20–50: first regression signal
- 100–300: stable metrics + slicing
- 1k+: use stratified sampling in CI; run the full set nightly
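The stratified sampling mentioned for the 1k+ tier can be sketched as: cap the number of examples drawn per metadata stratum, with a fixed seed so CI runs are reproducible. The stratum key and cap are assumptions to tune per project.

```python
import random
from collections import defaultdict

def stratified_sample(examples, key, per_stratum, seed=0):
    """Sample up to `per_stratum` examples from each metadata stratum.

    Keeps a small CI run covering every slice (intent, lang, ...)
    instead of letting common strata crowd out rare ones. The fixed
    seed makes the CI subset reproducible across runs.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for ex in examples:
        strata[ex["metadata"][key]].append(ex)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

# 30 examples across 3 intents -> CI subset of 2 per intent = 6 total.
pool = [{"metadata": {"intent": name}, "id": f"{name}-{i}"}
        for name in ("summarize", "extract", "classify")
        for i in range(10)]
ci_subset = stratified_sample(pool, "intent", per_stratum=2)
```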