Datasets
A dataset is a versioned collection of examples you evaluate against. Each example should represent a scenario you care about, and the set should remain stable so that comparisons across time are meaningful.
What an example should include
At minimum:
- Input: user message / request payload / conversation state
- Expected signal: rubric notes, constraints, or a reference answer
- Metadata: tags for slicing results (intent, language, customer tier)
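The three parts above can be sketched as a simple record. This is an illustrative schema, not a fixed API — the field names (`input`, `expected`, `metadata`) and the tag keys are assumptions:

```python
from dataclasses import dataclass, field

# Illustrative example record; field names are assumptions, not a fixed API.
@dataclass
class Example:
    input: dict                # user message / request payload / conversation state
    expected: dict             # rubric notes, constraints, or a reference answer
    metadata: dict = field(default_factory=dict)  # tags for slicing results

ex = Example(
    input={"messages": [{"role": "user", "content": "Cancel my subscription"}]},
    expected={"rubric": "Must confirm identity before cancelling"},
    metadata={"intent": "cancellation", "language": "en", "tier": "pro"},
)
```

Keeping metadata as flat key–value tags makes it easy to slice results by intent, language, or customer tier later.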
How to build datasets that stay useful
- Start with production failures (sanitized) from traces.
- Add edge cases that matter to your product (security, refusal, tools).
- Maintain a holdout set so you don’t overfit while iterating.
- Version changes to labels and rubrics; an unversioned label edit silently shifts scores and makes trend lines lie.
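One way to keep a holdout set stable while the dataset grows is to assign each example to a split by hashing a stable ID, rather than shuffling. A minimal sketch, assuming each example carries a persistent string ID:

```python
import hashlib

def assign_split(example_id: str, holdout_fraction: float = 0.2) -> str:
    """Deterministically assign an example to 'dev' or 'holdout' by hashing
    its stable ID. The split survives re-runs and new examples: an example's
    assignment never changes, so the holdout set stays untouched while you
    iterate on the dev set."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map hash to [0, 1]
    return "holdout" if bucket < holdout_fraction else "dev"
```

Because assignment depends only on the ID, adding examples later never reshuffles existing ones into or out of the holdout set.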
Dataset size guidance
- 20–50 examples: enough to start catching obvious regressions
- 100–300 examples: stable trend lines and segment slices
- 1k+: use stratified sampling + automation to keep runs fast
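For the 1k+ case, stratified sampling keeps fast runs representative: cap the number of examples drawn from each metadata slice instead of sampling uniformly. A sketch, assuming examples are dicts tagged as in the schema above:

```python
import random
from collections import defaultdict

def stratified_sample(examples, key, per_group, seed=0):
    """Sample up to `per_group` examples from each metadata slice (e.g. each
    intent) so a fast run still covers every segment. A fixed seed keeps the
    subset stable across runs."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for ex in examples:
        groups[ex[key]].append(ex)
    sample = []
    for group in groups.values():
        sample.extend(rng.sample(group, min(per_group, len(group))))
    return sample
```

Uniform sampling from a skewed dataset can miss small but important slices entirely; stratifying guarantees each segment appears in every run.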
Next steps
- Read Evaluation runs to see how datasets are executed.
- Read Evaluations for scoring strategies.