Dataset design

Your evaluation is only as good as your dataset. A good dataset is representative of real usage, versioned, and sliceable, so that when a score drops you can find which segment regressed.

What an example should contain

At minimum:

  • Input: user request/message and relevant context
  • Expected signal: rubric notes, constraints, or reference output
  • Metadata: fields used for slicing (intent, language, customer tier, difficulty)

If your workflow uses tools or retrieval, include:

  • tool allowlist or tool policy assumptions
  • retrieval configuration (index name, top-k, reranker) as metadata
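
As a rough illustration, one example record could be modeled like this. This is a minimal sketch: the class name, field names, tool names, and index name are all hypothetical, not a required schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class EvalExample:
    """One dataset row. Every field name here is illustrative."""
    example_id: str
    input: dict[str, Any]      # user request/message plus relevant context
    expected: dict[str, Any]   # rubric notes, constraints, or reference output
    metadata: dict[str, Any]   # slicing fields and, if used, retrieval config
    tool_allowlist: list[str] = field(default_factory=list)  # tool-using workflows only


example = EvalExample(
    example_id="ex-0001",
    input={"message": "Cancel my subscription", "context": {"plan": "pro"}},
    expected={"rubric": "Asks for confirmation before cancelling."},
    metadata={
        "intent": "agent_tooling",
        "lang": "en",
        "difficulty": "medium",
        "customerTier": "pro",
        # Hypothetical retrieval settings, stored as metadata.
        "retrieval": {"index": "help-center", "top_k": 5, "reranker": "none"},
    },
    tool_allowlist=["lookup_subscription", "cancel_subscription"],
)
```

Keeping the retrieval settings next to the slicing fields means a change in top-k or reranker shows up as a difference in the data rather than as an unexplained score shift.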

Coverage strategy

Start by covering:

  • Top intents: the 5–10 most common user tasks
  • High-stakes flows: compliance-sensitive outputs, irreversible actions, financial or medical-adjacent requests
  • Known failure modes: tool misuse, hallucinations, policy refusals, prompt injection

Then expand using production data.

Sourcing examples

Best sources (in order):

  1. Production failures from traces (sanitized)
  2. Support tickets and customer feedback
  3. Edge cases authored by engineers/PMs (rare but critical)
  4. Synthetic cases (useful for coverage, but don’t let them dominate)
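
One way to keep synthetic cases from dominating is to track the share of each source. The sketch below assumes each example carries a hypothetical "source" field; the 30% cap is an arbitrary illustration, not a recommended threshold.

```python
from collections import Counter


def source_shares(examples):
    """Return {source: fraction of the dataset} for a list of example dicts."""
    counts = Counter(ex["source"] for ex in examples)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {source: n / total for source, n in counts.items()}


examples = (
    [{"source": "production"}] * 6
    + [{"source": "support_ticket"}] * 2
    + [{"source": "authored"}] * 1
    + [{"source": "synthetic"}] * 1
)
shares = source_shares(examples)

MAX_SYNTHETIC_SHARE = 0.3  # assumed cap, pick your own
synthetic_ok = shares.get("synthetic", 0.0) <= MAX_SYNTHETIC_SHARE
```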

Tune/holdout split

If you iterate prompts by looking at failures, you will overfit unless you keep a holdout.

Recommended:

  • Tune set: used for iteration
  • Holdout set: never used during iteration; only used to validate “real” progress
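
A simple way to keep the holdout stable as the dataset grows is to assign each example by hashing its ID, so membership never depends on row order or on when the example was added. This is a minimal sketch; the function name, the ID format, and the 20% holdout fraction are assumptions.

```python
import hashlib


def assign_split(example_id: str, holdout_fraction: float = 0.2) -> str:
    """Return "tune" or "holdout" from a stable hash of the example ID."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "holdout" if bucket < holdout_fraction else "tune"


ids = [f"ex-{i:04d}" for i in range(1000)]
splits = {example_id: assign_split(example_id) for example_id in ids}
holdout_ids = [i for i, s in splits.items() if s == "holdout"]
```

Because the assignment is a pure function of the ID, an example can never drift from the holdout set into the tune set between runs.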

Metadata that pays off

Add metadata early so you can slice results later:

  • intent: e.g. summarize, extract, classify, agent_tooling
  • domain: e.g. support, sales, legal
  • lang: en, es, ja
  • difficulty: easy|medium|hard
  • customerTier: free|pro|enterprise
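
With these fields in place, slicing is a small group-by. The sketch below assumes each result is a dict with a "metadata" dict and a boolean "passed" flag; both names are hypothetical.

```python
from collections import defaultdict


def pass_rate_by(results, key):
    """Group results by one metadata field and return {value: pass rate}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for result in results:
        value = result["metadata"].get(key, "unknown")
        totals[value] += 1
        if result["passed"]:
            passes[value] += 1
    return {value: passes[value] / totals[value] for value in totals}


results = [
    {"metadata": {"intent": "summarize", "lang": "en"}, "passed": True},
    {"metadata": {"intent": "summarize", "lang": "ja"}, "passed": False},
    {"metadata": {"intent": "extract", "lang": "en"}, "passed": True},
    {"metadata": {"intent": "extract", "lang": "en"}, "passed": True},
]

by_intent = pass_rate_by(results, "intent")
by_lang = pass_rate_by(results, "lang")
```

Examples missing a field fall into an "unknown" slice, which makes metadata gaps visible instead of silently dropping rows.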

Versioning datasets

When you change:

  • labels/rubrics
  • schema of examples
  • the meaning of scores

…create a new dataset version. Otherwise trend lines lie and baselines are not comparable.
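
One possible guard is to fingerprint the things that define what a score means and compare that fingerprint against the one stored with the current version. This is a sketch under assumptions: the function names and the shape of the rubric and schema inputs are hypothetical.

```python
import hashlib
import json


def definition_fingerprint(rubric: dict, schema: dict, score_meaning: str) -> str:
    """Hash the rubric, example schema, and score definition together."""
    payload = json.dumps(
        {"rubric": rubric, "schema": schema, "score_meaning": score_meaning},
        sort_keys=True,  # key order must not change the fingerprint
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def needs_new_version(stored: str, current: str) -> bool:
    """A changed fingerprint means old and new scores are not comparable."""
    return stored != current


v1 = definition_fingerprint(
    rubric={"tone": "polite", "accuracy": "no invented facts"},
    schema={"fields": ["input", "expected", "metadata"]},
    score_meaning="1 = meets all rubric items, 0 = otherwise",
)
v1_reordered = definition_fingerprint(
    rubric={"accuracy": "no invented facts", "tone": "polite"},
    schema={"fields": ["input", "expected", "metadata"]},
    score_meaning="1 = meets all rubric items, 0 = otherwise",
)
v2 = definition_fingerprint(
    rubric={"tone": "polite", "accuracy": "no invented facts"},
    schema={"fields": ["input", "expected", "metadata"]},
    score_meaning="fraction of rubric items met",
)
```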

Size guidance

  • 20–50 examples: first regression signal
  • 100–300 examples: metrics stable enough to slice
  • 1k+ examples: run a stratified sample in CI and the full dataset nightly
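
For large datasets, a stratified CI subset can be drawn by sampling a fixed number of examples per metadata value. This is a minimal sketch; the function name, the choice of "intent" as the stratum key, and the per-stratum count are assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(examples, key, per_stratum, seed=0):
    """Pick up to per_stratum examples for each value of one metadata field."""
    rng = random.Random(seed)  # fixed seed keeps the CI subset reproducible
    groups = defaultdict(list)
    for ex in examples:
        groups[ex["metadata"][key]].append(ex)
    sample = []
    for value in sorted(groups):  # sorted so results do not depend on input order of strata
        members = groups[value]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


examples = (
    [{"id": f"s{i}", "metadata": {"intent": "summarize"}} for i in range(900)]
    + [{"id": f"e{i}", "metadata": {"intent": "extract"}} for i in range(90)]
    + [{"id": f"c{i}", "metadata": {"intent": "classify"}} for i in range(10)]
)
ci_subset = stratified_sample(examples, key="intent", per_stratum=20)
```

Sampling per stratum rather than uniformly keeps rare intents in the CI run; a uniform sample of the same size would be dominated by the most common intent.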
