Introduction
Boson is an LLM engineering platform for observability, evaluation, and prompt management.
Boson Cloud, SDKs, and what is open source
- Boson Cloud (the hosted backend, web app, and related services) is proprietary—it is not open source.
- Boson’s SDKs are open source so you can audit how instrumentation works and integrate them into your codebase.
- SDKs are built to send telemetry and use platform features only against Boson's official backend: Boson Cloud, or a self-hosted Boson instance that Boson provides for your organization. They are not intended to be pointed at a custom or third-party observability backend.
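In practice this means the SDK resolves its backend from configuration rather than accepting arbitrary endpoints. A minimal sketch of that pattern in plain Python; the variable name `BOSON_HOST`, the default URL, and the function are illustrative assumptions, not the SDK's actual settings:

```python
import os

# Placeholder for the Boson Cloud endpoint; the real default is
# whatever the SDK ships with, not this example URL.
DEFAULT_HOST = "https://cloud.boson.example"

def resolve_backend_host() -> str:
    """Return the Boson backend the SDK should talk to.

    By default this is Boson Cloud; a self-hosted deployment for your
    organization can be selected via an environment variable. Arbitrary
    third-party observability backends are not supported.
    """
    return os.environ.get("BOSON_HOST", DEFAULT_HOST)
```

Check the SDK's own configuration reference for the real setting names; the point is only that the target backend is a deployment choice, not a free-form parameter.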
Use this documentation to:
- Instrument your app and inspect traces
- Create evaluation datasets and compare prompt/model changes
- Track cost, latency, and quality over time
Who this is for
Boson is built for teams shipping LLM features in production:
- AI engineers debugging multi-step agent workflows
- Backend engineers instrumenting model calls alongside app telemetry
- Product teams that need repeatable, measurable quality improvements
If you’re prototyping, you can still use Boson—just start with tracing and add evals once you have representative examples.
Motivations
LLM apps are hard to debug: failures are often non-deterministic and buried inside multi-step chains. Boson makes issues visible with end-to-end traces and repeatable evaluation runs.
Common problems Boson helps with:
- “Why did the agent choose this tool?” → Inspect spans, inputs/outputs, and intermediate reasoning artifacts you store.
- “A prompt change improved one case but broke others.” → Run evals on a dataset and compare runs.
- “Latency spiked after we changed models.” → Break down timings per step and per provider call.
Core concepts
Boson’s docs assume a few concepts:
- Project: a workspace boundary (team/environment) for traces, datasets, and prompts.
- Trace: a single end-to-end run (e.g. one user request).
- Span: a step inside a trace (e.g. “retrieve context”, “call model”, “tool execution”).
- Dataset: a collection of examples used for evaluation.
- Evaluation run: executing the same workflow against a dataset to measure regressions/improvements.
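The concepts above nest: a project contains traces, and a trace contains spans. A sketch of those relationships as plain Python dataclasses; the field names are illustrative, not Boson's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One step inside a trace, e.g. "retrieve context" or "call model"."""
    name: str
    input: str
    output: str
    duration_ms: float

@dataclass
class Trace:
    """One end-to-end run, e.g. a single user request."""
    trace_id: str
    project: str                 # workspace boundary for traces, datasets, prompts
    spans: list[Span] = field(default_factory=list)

    def total_duration_ms(self) -> float:
        # Latency can be broken down per step, which is how Boson
        # helps answer "latency spiked after we changed models".
        return sum(s.duration_ms for s in self.spans)
```

An evaluation run then amounts to executing the same workflow once per dataset example and comparing the resulting traces across runs.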
Features
- Tracing & observability: inspect inputs/outputs, timings, metadata, and failures
- Evaluations: run evals on datasets and compare releases before shipping
- Prompt management: version prompts, roll back safely, and standardize reuse
- Cost & usage: analyze spend and performance across models and projects
How to read these docs
- Start with Installation to send your first trace.
- Add Tracing next so your team can debug and iterate quickly.
- When you’re shipping changes, move to Evaluations to quantify improvements.
Next steps
- Go to Installation → send your first trace.
- Read Product / Tracing → learn what to capture and how to structure spans.