Documentation/Concepts/Datasets
1 min read

Datasets

A dataset is a versioned collection of examples you evaluate against. Each example should represent a scenario you care about and should remain stable so that comparisons across time are meaningful.

What an example should include

At minimum:

  • Input: user message / request payload / conversation state
  • Expected signal: rubric notes, constraints, or a reference answer
  • Metadata: tags for slicing results (intent, language, customer tier)

How to build datasets that stay useful

  • Start with production failures (sanitized) from traces.
  • Add edge cases that matter to your product (security, refusal, tools).
  • Maintain a holdout set so you don’t overfit while iterating.
  • Version changes to labels/rubrics; otherwise trend lines lie.

Dataset size guidance

  • 20–50 examples: enough to start catching obvious regressions
  • 100–300 examples: stable trend lines and segment slices
  • 1k+: use stratified sampling + automation to keep runs fast

Next steps