Datasets
A dataset is a versioned collection of examples you evaluate against. Each example should represent a scenario you care about, and the set should remain stable so that comparisons across time are meaningful.
What an example should include
At minimum:
- Input: user message / request payload / conversation state
- Expected signal: rubric notes, constraints, or a reference answer
- Metadata: tags for slicing results (intent, language, customer tier)
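The three parts above can be sketched as a simple record. This is an illustrative schema, not a fixed API — the field names (`input`, `expected`, `metadata`) and the tag keys are assumptions:

```python
from dataclasses import dataclass, field

# Illustrative example record; field names are assumptions, not a fixed API.
@dataclass
class Example:
    input: dict                # user message / request payload / conversation state
    expected: dict             # rubric notes, constraints, or a reference answer
    metadata: dict = field(default_factory=dict)  # tags for slicing results

ex = Example(
    input={"messages": [{"role": "user", "content": "Cancel my subscription"}]},
    expected={"rubric": "Must confirm identity before cancelling"},
    metadata={"intent": "cancellation", "language": "en", "tier": "pro"},
)
```

Keeping metadata as flat key–value tags makes it easy to slice results by intent, language, or customer tier later.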
How to build datasets that stay useful
- Start with production failures (sanitized) from traces.
- Add edge cases that matter to your product (security, refusal, tools).
- Maintain a holdout set so you don’t overfit while iterating.
- Version changes to labels and rubrics; an unversioned label edit silently shifts scores and makes trend lines lie.
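One way to keep a holdout set stable while the dataset grows is to assign each example to a split by hashing a stable ID, rather than shuffling. A minimal sketch, assuming each example carries a persistent string ID:

```python
import hashlib

def assign_split(example_id: str, holdout_fraction: float = 0.2) -> str:
    """Deterministically assign an example to 'dev' or 'holdout' by hashing
    its stable ID. The split survives re-runs and new examples: an example's
    assignment never changes, so the holdout set stays untouched while you
    iterate on the dev set."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map hash to [0, 1]
    return "holdout" if bucket < holdout_fraction else "dev"
```

Because assignment depends only on the ID, adding examples later never reshuffles existing ones into or out of the holdout set.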
Dataset size guidance
- 20–50 examples: enough to start catching obvious regressions
- 100–300 examples: stable trend lines and segment slices
- 1k+: use stratified sampling + automation to keep runs fast
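For the 1k+ case, stratified sampling keeps fast runs representative: cap the number of examples drawn from each metadata slice instead of sampling uniformly. A sketch, assuming examples are dicts tagged as in the schema above:

```python
import random
from collections import defaultdict

def stratified_sample(examples, key, per_group, seed=0):
    """Sample up to `per_group` examples from each metadata slice (e.g. each
    intent) so a fast run still covers every segment. A fixed seed keeps the
    subset stable across runs."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for ex in examples:
        groups[ex[key]].append(ex)
    sample = []
    for group in groups.values():
        sample.extend(rng.sample(group, min(per_group, len(group))))
    return sample
```

Uniform sampling from a skewed dataset can miss small but important slices entirely; stratifying guarantees each segment appears in every run.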
Next steps
- Read Evaluation runs to see how datasets are executed.
- Read Evaluations for scoring strategies.