Introduction
Boson is an LLM engineering platform for observability, evaluation, and prompt management.
Boson Cloud, SDKs, and what is open source
- Boson Cloud (the hosted backend, web app, and related services) is proprietary—it is not open source.
- Boson’s SDKs are open source so you can audit how instrumentation works and integrate them into your codebase.
- SDKs are built to send telemetry and use platform features only against Boson's official backend: Boson Cloud, or a self-hosted Boson instance that Boson provides for your organization. They are not intended to be pointed at a custom or third-party observability backend.
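In practice this means the SDK resolves its backend from configuration rather than accepting arbitrary endpoints. A minimal sketch of that pattern in plain Python; the variable name `BOSON_HOST`, the default URL, and the function are illustrative assumptions, not the SDK's actual settings:

```python
import os

# Placeholder for the Boson Cloud endpoint; the real default is
# whatever the SDK ships with, not this example URL.
DEFAULT_HOST = "https://cloud.boson.example"

def resolve_backend_host() -> str:
    """Return the Boson backend the SDK should talk to.

    By default this is Boson Cloud; a self-hosted deployment for your
    organization can be selected via an environment variable. Arbitrary
    third-party observability backends are not supported.
    """
    return os.environ.get("BOSON_HOST", DEFAULT_HOST)
```

Check the SDK's own configuration reference for the real setting names; the point is only that the target backend is a deployment choice, not a free-form parameter.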
Use this documentation to:
- Instrument your app and inspect traces
- Create evaluation datasets and compare prompt/model changes
- Track cost, latency, and quality over time
Who this is for
Boson is built for teams shipping LLM features in production:
- AI engineers debugging multi-step agent workflows
- Backend engineers instrumenting model calls alongside app telemetry
- Product teams that need repeatable, measurable quality improvements
If you’re prototyping, you can still use Boson—just start with tracing and add evals once you have representative examples.
Motivations
LLM apps are hard to debug: failures are often non-deterministic and buried inside multi-step chains. Boson makes issues visible with end-to-end traces and repeatable evaluation runs.
Common problems Boson helps with:
- “Why did the agent choose this tool?” → Inspect spans, inputs/outputs, and intermediate reasoning artifacts you store.
- “A prompt change improved one case but broke others.” → Run evals on a dataset and compare runs.
- “Latency spiked after we changed models.” → Break down timings per step and per provider call.
Core concepts
Boson’s docs assume a few concepts:
- Project: a workspace boundary (team/environment) for traces, datasets, and prompts.
- Trace: a single end-to-end run (e.g. one user request).
- Span: a step inside a trace (e.g. “retrieve context”, “call model”, “tool execution”).
- Dataset: a collection of examples used for evaluation.
- Evaluation run: executing the same workflow against a dataset to measure regressions/improvements.
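The concepts above nest: a project contains traces, and a trace contains spans. A sketch of those relationships as plain Python dataclasses; the field names are illustrative, not Boson's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One step inside a trace, e.g. "retrieve context" or "call model"."""
    name: str
    input: str
    output: str
    duration_ms: float

@dataclass
class Trace:
    """One end-to-end run, e.g. a single user request."""
    trace_id: str
    project: str                 # workspace boundary for traces, datasets, prompts
    spans: list[Span] = field(default_factory=list)

    def total_duration_ms(self) -> float:
        # Latency can be broken down per step, which is how Boson
        # helps answer "latency spiked after we changed models".
        return sum(s.duration_ms for s in self.spans)
```

An evaluation run then amounts to executing the same workflow once per dataset example and comparing the resulting traces across runs.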
Features
- Tracing & observability: inspect inputs/outputs, timings, metadata, and failures
- Evaluations: run evals on datasets and compare releases before shipping
- Prompt management: version prompts, roll back safely, and standardize reuse
- Cost & usage: analyze spend and performance across models and projects
How to read these docs
- Start with Installation to send your first trace.
- Add Tracing next so your team can debug and iterate quickly.
- When you’re shipping changes, move to Evaluations to quantify improvements.
Next steps
- Go to Installation → send your first trace.
- Read Product / Tracing → learn what to capture and how to structure spans.