# Tracing
Tracing in Boson gives you an end-to-end view of a single run—from the first user message through retrieval, tool calls, model completions, and errors. It is the fastest way to answer “what happened?” when behavior is non-deterministic or spread across many steps.
If you have not sent a trace yet, start with Installation.
## Why tracing matters for LLM apps
- Multi-step flows hide failures: a bad retrieval or tool argument may only show up deep in the chain.
- Non-determinism makes logs alone insufficient—you need structured context (inputs, outputs, timings) per step.
- Production debugging needs the same shape as development: one trace per request, comparable across environments.
## Core concepts
| Concept | Meaning |
|---|---|
| Trace | One logical run (e.g. one API request, one chat turn, one job). |
| Span | A unit of work inside that trace (e.g. `retrieval`, `llm.completion`, `tool.execute`). |
| Event | A point-in-time payload on a span (e.g. input, output, metadata). |
Aim for fewer, meaningful spans rather than a span per line of code.
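The three concepts above can be sketched as a minimal data model. These dataclasses are illustrative only, not the Boson SDK's actual types:

```python
# Minimal sketch of the trace / span / event model described above.
# Illustrative dataclasses only -- not the Boson SDK's actual types.
from __future__ import annotations

import time
import uuid
from dataclasses import dataclass, field


@dataclass
class Event:
    name: str                      # e.g. "input", "output", "metadata"
    payload: dict
    timestamp: float = field(default_factory=time.time)


@dataclass
class Span:
    name: str                      # e.g. "retrieval", "llm.completion"
    trace_id: str
    parent_id: str | None = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    events: list[Event] = field(default_factory=list)

    def add_event(self, name: str, payload: dict) -> None:
        self.events.append(Event(name, payload))


@dataclass
class Trace:
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list[Span] = field(default_factory=list)

    def span(self, name: str, parent: Span | None = None) -> Span:
        s = Span(name, self.trace_id, parent.span_id if parent else None)
        self.spans.append(s)
        return s
```

One trace per request, a handful of spans per trace, and events carrying the structured payloads you will want to filter on later.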
## What to capture
### Always worth recording
- Model and provider (e.g. model id, API version if relevant).
- Latency per span (SDKs often do this automatically when you wrap calls).
- Errors and retries with the exception message and whether the call was retried.
- High-signal metadata: environment, release or git SHA, feature flags, and `user_id`/`session_id` only if policy allows.
### Inputs and outputs
- Store enough to debug (prompts, tool arguments, retrieved-chunk summaries)—but redact or hash secrets, tokens, PII, and full document bodies if your policy requires it.
- Prefer structured fields (JSON) over huge opaque strings when you need to filter later.
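As a concrete example of redacting before data leaves your process, here is a small sketch that hashes known-sensitive fields and truncates long bodies. The field names and limits are assumptions; adapt them to your own schema and policy:

```python
# Redaction sketch: hash sensitive fields, truncate long document bodies.
# SENSITIVE_FIELDS and MAX_BODY_CHARS are example values, not recommendations.
import hashlib

SENSITIVE_FIELDS = {"api_key", "authorization", "email", "user_id"}
MAX_BODY_CHARS = 500


def redact(payload: dict) -> dict:
    out = {}
    for key, value in payload.items():
        if key.lower() in SENSITIVE_FIELDS:
            # Keep a short, stable hash so you can still correlate
            # occurrences without storing the raw value.
            digest = hashlib.sha256(str(value).encode()).hexdigest()[:12]
            out[key] = "sha256:" + digest
        elif isinstance(value, str) and len(value) > MAX_BODY_CHARS:
            out[key] = value[:MAX_BODY_CHARS] + "…[truncated]"
        else:
            out[key] = value
    return out
```

Because the output stays structured JSON-shaped, the redacted payload remains filterable in the trace UI.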
### Agents and tools
For each tool call, capture:
- Tool name and arguments (sanitized).
- Result or error (truncated if large).
- Parent span so the tree matches the mental model of your agent.
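A hedged sketch of a wrapper that records these three things per invocation. `record_span` is a hypothetical stand-in for your tracing SDK's span API; a real integration would also attach the parent span id:

```python
# Tool-call wrapper: records name, sanitized arguments, and a truncated
# result or error. `record_span` is a hypothetical callback, not a real API.
MAX_RESULT_CHARS = 200
DROPPED_ARGS = {"api_key", "authorization"}   # example sanitization


def traced_tool_call(record_span, tool, name: str, **kwargs):
    entry = {
        "tool": name,
        "args": {k: v for k, v in kwargs.items() if k not in DROPPED_ARGS},
    }
    try:
        result = tool(**kwargs)
        entry["status"] = "ok"
        entry["result"] = str(result)[:MAX_RESULT_CHARS]
        return result
    except Exception as exc:
        entry["status"] = "error"
        entry["error"] = f"{type(exc).__name__}: {exc}"
        raise
    finally:
        record_span(entry)   # runs on both success and failure
```

The `finally` block guarantees the span is recorded on the error path too, which matters for the pitfalls listed below.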
## Recommended trace shape
A typical chat or agent request might look like:
- Root span — `trace` or `request` with request-level metadata.
- `retrieval` — query, source, number of hits, latency.
- `llm.completion` — model, token usage if available, finish reason.
- `tool.*` — one span per tool invocation.
- Follow-up LLM — another completion span if the agent loops.
Keep stable span names (`llm.completion`, not `callOpenAI`) so dashboards and filters stay useful.
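One way to keep names stable is to define them once as constants instead of ad-hoc strings at each call site. The names below follow this page's convention; the module itself is illustrative:

```python
# Span-name constants, defined once so every call site uses the same strings.
RETRIEVAL = "retrieval"
LLM_COMPLETION = "llm.completion"


def tool_span_name(tool_name: str) -> str:
    # "tool.*" pattern: one name per tool, e.g. "tool.search", "tool.execute"
    return f"tool.{tool_name}"
```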
## Naming and correlation
- Use a single trace id per user-facing request and propagate it across services (header or async context).
- Reuse `session_id` across turns in the same conversation.
- Add `request_id` from your edge layer when you have it—makes joining to HTTP logs trivial.
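A minimal sketch of propagating one trace id per request across async boundaries using Python's `contextvars`. The header name and helper names are assumptions, not a fixed API:

```python
# Propagate a single trace id per request: reuse the caller's id from a
# header if present, otherwise mint one at the edge, and forward it on
# every downstream call. Header name is an example convention.
import contextvars
import uuid

TRACE_HEADER = "x-trace-id"
_current_trace_id: contextvars.ContextVar = contextvars.ContextVar(
    "trace_id", default=None
)


def start_request(headers: dict) -> str:
    # Reuse the upstream trace id when provided; otherwise create one.
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    _current_trace_id.set(trace_id)
    return trace_id


def outgoing_headers() -> dict:
    # Attach the same id to downstream calls so all services join one trace.
    tid = _current_trace_id.get()
    return {TRACE_HEADER: tid} if tid else {}
```

Because `ContextVar` values follow the async context, concurrent requests in the same process keep separate trace ids.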
## Privacy and compliance
- Redact at the source before data leaves your process when possible.
- Define allowlists for fields that may contain user content.
- For EU/health/financial workloads, align with your DPA: tracing can be as sensitive as application logs.
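As an illustration of the allowlist approach, a sketch in which only explicitly approved fields are forwarded to the tracing backend. The field set is an example, not a recommendation:

```python
# Allowlist sketch: anything not explicitly approved for export is dropped
# at the source, before it leaves the process.
CONTENT_ALLOWLIST = {"model", "finish_reason", "latency_ms", "query"}


def filter_for_export(payload: dict) -> dict:
    return {k: v for k, v in payload.items() if k in CONTENT_ALLOWLIST}
```

An allowlist fails closed: a new field added to the payload is invisible to tracing until someone deliberately approves it.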
## Common pitfalls
- Logging entire prompts in unstructured logs while tracing nothing structured—use spans so you can navigate the tree.
- One giant span for the whole request—you lose timing and blame assignment.
- Missing failed paths—always end spans with error state on exceptions, not only on happy paths.
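The last pitfall can be avoided with a context manager that marks the span as errored before re-raising. `traced` below is a minimal stand-in; real tracing SDKs usually ship an equivalent:

```python
# Ensure spans end with an error state on exceptions, not only on the
# happy path. Spans are plain dicts here for illustration.
from contextlib import contextmanager


@contextmanager
def traced(spans: list, name: str):
    span = {"name": name, "status": "ok"}
    spans.append(span)
    try:
        yield span
    except Exception as exc:
        span["status"] = "error"   # record the failed path too
        span["error"] = f"{type(exc).__name__}: {exc}"
        raise                      # never swallow the original exception
```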
## Next steps
- Evaluations — turn representative traces into dataset examples and regression tests.
- Prompt management — tie traces to specific prompt versions.