Dataset design
Your evaluation is only as good as your dataset. A great dataset is representative, versioned, and sliceable (so you can pinpoint which slices regressed).
What an example should contain
At minimum:
- Input: user request/message and relevant context
- Expected signal: rubric notes, constraints, or reference output
- Metadata: fields used for slicing (intent, language, customer tier, difficulty)
If your workflow uses tools or retrieval, include:
- tool allowlist or tool policy assumptions
- retrieval configuration (index name, top-k, reranker) as metadata
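Putting the fields above together, one example might look like the following sketch. The field names and nesting are illustrative assumptions, not a required schema; use whatever shape your harness expects.

```python
# A single evaluation example. Field names (input/expected/metadata,
# tools, retrieval) are illustrative, not a prescribed schema.
example = {
    "input": {
        "message": "Cancel my subscription and refund last month.",
        "context": {"plan": "pro", "signup_date": "2023-04-01"},
    },
    "expected": {
        "rubric": "Must confirm identity before any account change.",
        "constraints": ["no refund promises without checking policy"],
    },
    "metadata": {
        "intent": "agent_tooling",
        "lang": "en",
        "difficulty": "hard",
        "customerTier": "pro",
    },
    # Only present when the workflow uses tools/retrieval:
    "tools": {"allowlist": ["lookup_account", "open_ticket"]},
    "retrieval": {"index": "support-kb-v3", "top_k": 5, "reranker": None},
}
```

Keeping tool and retrieval configuration inside the example (rather than implicit in the harness) means a later config change shows up as a dataset change, not a silent shift in what you measured.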
Coverage strategy
Start by covering:
- Top intents: the 5–10 most common user tasks
- High-stakes flows: compliance-sensitive outputs, irreversible actions, financial/medical-adjacent
- Known failure modes: tool misuse, hallucinations, policy refusals, prompt injection
Then expand using production data.
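A cheap way to keep the coverage list above honest is a gap check that compares per-intent example counts against targets. This is a minimal sketch; the `metadata.intent` field and the target numbers are assumptions from this guide, not a standard API.

```python
from collections import Counter

def coverage_gaps(examples, targets):
    """Return {intent: missing_count} for intents below their target.

    `examples` follow this guide's assumed shape: each has a
    metadata.intent field. `targets` maps intent -> desired minimum.
    """
    counts = Counter(ex["metadata"]["intent"] for ex in examples)
    return {
        intent: need - counts.get(intent, 0)
        for intent, need in targets.items()
        if counts.get(intent, 0) < need
    }

# Example: 3 summarize examples, targets of 5 summarize / 2 extract.
examples = [{"metadata": {"intent": "summarize"}} for _ in range(3)]
gaps = coverage_gaps(examples, {"summarize": 5, "extract": 2})
# gaps == {"summarize": 2, "extract": 2}
```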
Sourcing examples
Best sources (in order):
- Production failures from traces (sanitized)
- Support tickets and customer feedback
- Edge cases authored by engineers/PMs (rare but critical)
- Synthetic cases (useful for coverage, but don’t let them dominate)
Train/validation/holdout split
If you iterate prompts by looking at failures, you will overfit unless you keep a holdout.
Recommended:
- Tune set: used for iteration
- Holdout set: never used during iteration; only used to validate “real” progress
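One way to keep holdout membership stable across runs is to assign each example by a deterministic hash of its id rather than a random shuffle, so re-running the split never moves examples between sets. A sketch, assuming each example has a stable string id:

```python
import hashlib

def assign_split(example_id: str, holdout_fraction: float = 0.2) -> str:
    """Deterministically assign an example to 'tune' or 'holdout'.

    Uses a stable hash of the id, so the same example always lands
    in the same set, even as the dataset grows.
    """
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "holdout" if bucket < holdout_fraction * 100 else "tune"
```

Because assignment depends only on the id, adding new examples later cannot leak previously-held-out examples into the tune set.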
Metadata that pays off
Add metadata early so you can slice results later:
- `intent`: e.g. `summarize`, `extract`, `classify`, `agent_tooling`
- `domain`: e.g. `support`, `sales`, `legal`
- `lang`: `en`, `es`, `ja`
- `difficulty`: `easy` | `medium` | `hard`
- `customerTier`: `free` | `pro` | `enterprise`
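With that metadata in place, slicing is a one-liner-sized helper: group per-example scores by a metadata field and average. A minimal sketch, assuming results carry a numeric `score` and the metadata shape above:

```python
from collections import defaultdict

def slice_scores(results, field):
    """Average per-example scores grouped by a metadata field.

    `results` is assumed to be a list of dicts with a numeric
    "score" and a "metadata" dict, as sketched in this guide.
    """
    buckets = defaultdict(list)
    for r in results:
        buckets[r["metadata"][field]].append(r["score"])
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}

results = [
    {"metadata": {"lang": "en"}, "score": 1.0},
    {"metadata": {"lang": "en"}, "score": 0.0},
    {"metadata": {"lang": "es"}, "score": 1.0},
]
by_lang = slice_scores(results, "lang")
# by_lang == {"en": 0.5, "es": 1.0}
```

An aggregate score of 0.67 here would hide that `en` is failing half the time; the slice makes it visible.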
Versioning datasets
When you change:
- labels/rubrics
- schema of examples
- the meaning of scores
…create a new dataset version. Otherwise trend lines lie and baselines are not comparable.
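A content fingerprint makes version drift detectable instead of relying on discipline: any change to labels, schema, or rubric text produces a new hash. A sketch, assuming examples are JSON-serializable:

```python
import hashlib
import json

def dataset_fingerprint(examples) -> str:
    """Stable content hash over a list of examples.

    Any edit to labels, rubrics, or schema changes the fingerprint,
    which can be stored alongside results so baselines are only
    compared when fingerprints match.
    """
    blob = json.dumps(examples, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:12]

v1 = [{"input": "x", "label": "good"}]
v2 = [{"input": "x", "label": "bad"}]
```

Refuse to draw a trend line between two runs whose fingerprints differ; they measured different things.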
Size guidance
- 20–50: first regression signal
- 100–300: stable metrics + slicing
- 1k+: use stratified sampling in CI; run the full set nightly
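The stratified sampling mentioned for the 1k+ tier can be sketched as: cap the number of examples drawn per metadata stratum, with a fixed seed so CI runs are reproducible. The stratum key and cap are assumptions to tune per project.

```python
import random
from collections import defaultdict

def stratified_sample(examples, key, per_stratum, seed=0):
    """Sample up to `per_stratum` examples from each metadata stratum.

    Keeps a small CI run covering every slice (intent, lang, ...)
    instead of letting common strata crowd out rare ones. The fixed
    seed makes the CI subset reproducible across runs.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for ex in examples:
        strata[ex["metadata"][key]].append(ex)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

# 30 examples across 3 intents -> CI subset of 2 per intent = 6 total.
pool = [{"metadata": {"intent": name}, "id": f"{name}-{i}"}
        for name in ("summarize", "extract", "classify")
        for i in range(10)]
ci_subset = stratified_sample(pool, "intent", per_stratum=2)
```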