
Tracing

Tracing in Boson gives you an end-to-end view of a single run—from the first user message through retrieval, tool calls, model completions, and errors. It is the fastest way to answer “what happened?” when behavior is non-deterministic or spread across many steps.

If you have not sent a trace yet, start with Installation.

Why tracing matters for LLM apps

  • Multi-step flows hide failures: a bad retrieval or tool argument may only show up deep in the chain.
  • Non-determinism makes logs alone insufficient—you need structured context (inputs, outputs, timings) per step.
  • Production debugging needs the same shape as development: one trace per request, comparable across environments.

Core concepts

  • Trace — one logical run (e.g. one API request, one chat turn, one job).
  • Span — a unit of work inside that trace (e.g. retrieval, llm.completion, tool.execute).
  • Event — a point-in-time payload on a span (e.g. input, output, metadata).

Aim for fewer, meaningful spans rather than a span per line of code.
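Concretely, the three concepts nest: a trace owns a tree of spans, and each span carries events. A minimal sketch of that data model in Python (illustrative only; Boson's SDK defines its own types):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Event:
    name: str                  # e.g. "input", "output", "metadata"
    payload: dict[str, Any]    # structured, filterable data

@dataclass
class Span:
    name: str                  # e.g. "retrieval", "llm.completion"
    events: list[Event] = field(default_factory=list)
    children: list["Span"] = field(default_factory=list)

@dataclass
class Trace:
    trace_id: str              # one logical run
    root: Span
```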

What to capture

Always worth recording

  • Model and provider (e.g. model id, API version if relevant).
  • Latency per span (SDKs often do this automatically when you wrap calls).
  • Errors and retries with the exception message and whether the call was retried.
  • High-signal metadata: environment, release or git SHA, feature flags, user_id / session_id only if policy allows.

Inputs and outputs

  • Store enough to debug (prompts, tool args, summaries of retrieved chunks)—but redact or hash secrets, tokens, PII, and full document bodies if your policy requires it.
  • Prefer structured fields (JSON) over huge opaque strings when you need to filter later.
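One way to redact at write time is to pass every payload through a small sanitizer before attaching it to a span. The field names and hashing policy below are assumptions for illustration, not Boson defaults; hashing (rather than dropping) keeps values joinable across spans without storing the secret:

```python
import hashlib

# Example policy: which keys count as secrets is yours to define.
SECRET_KEYS = {"api_key", "token", "password", "authorization"}

def sanitize(payload: dict) -> dict:
    """Hash secret-looking values; keep everything else structured."""
    clean = {}
    for key, value in payload.items():
        if key.lower() in SECRET_KEYS:
            digest = hashlib.sha256(str(value).encode()).hexdigest()[:12]
            clean[key] = f"sha256:{digest}"   # joinable, not reversible
        else:
            clean[key] = value
    return clean
```

A production version would also recurse into nested objects and lists.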

Agents and tools

For each tool call, capture:

  • Tool name and arguments (sanitized).
  • Result or error (truncated if large).
  • Parent span so the tree matches the mental model of your agent.
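A hedged sketch of that capture shape, with a `record` callback standing in for whatever span-event API your tracing SDK exposes:

```python
MAX_RESULT_CHARS = 2_000  # truncate large results before recording

def traced_tool_call(record, tool_name, fn, **args):
    """Run a tool and record its name, arguments, and a truncated
    result or error. Always records, even when the tool raises."""
    entry = {"tool": tool_name, "args": args}
    try:
        result = fn(**args)
        entry["result"] = str(result)[:MAX_RESULT_CHARS]
        return result
    except Exception as exc:
        entry["error"] = repr(exc)
        raise
    finally:
        record(entry)
```

Argument sanitization (see Inputs and outputs above) would happen before `args` goes into `entry`.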

A typical chat or agent request might look like:

  1. Root span — trace or request, with request-level metadata.
  2. retrieval — query, source, number of hits, latency.
  3. llm.completion — model, token usage if available, finish reason.
  4. tool.* — one span per tool invocation.
  5. Follow-up LLM — another completion span if the agent loops.

Keep span names stable (llm.completion, not callOpenAI) so dashboards and filters stay useful.
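With a context-manager style API, the five steps above nest into exactly this tree. The `start_span` helper below is a toy in-memory tracer for illustration, not Boson's API:

```python
from contextlib import contextmanager

spans = []   # (depth, name) pairs, in start order
_depth = 0

@contextmanager
def start_span(name):
    """Record a span at the current nesting depth."""
    global _depth
    spans.append((_depth, name))
    _depth += 1
    try:
        yield
    finally:
        _depth -= 1

with start_span("request"):              # root span, request-level metadata
    with start_span("retrieval"):
        pass                             # query, source, hits, latency
    with start_span("llm.completion"):
        pass                             # model, token usage, finish reason
    with start_span("tool.execute"):
        pass                             # one span per tool invocation
    with start_span("llm.completion"):
        pass                             # follow-up completion if the agent loops
```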

Naming and correlation

  • Use a single trace id per user-facing request and propagate it across services (header or async context).
  • Reuse session_id across turns in the same conversation.
  • Add request_id from your edge layer when you have it—it makes joining to HTTP logs trivial.

Privacy and compliance

  • Redact at the source before data leaves your process when possible.
  • Define allowlists for fields that may contain user content.
  • For EU/health/financial workloads, align with your DPA: tracing can be as sensitive as application logs.

Common pitfalls

  • Logging entire prompts in unstructured logs while tracing nothing structured—use spans so you can navigate the tree.
  • One giant span for the whole request—you lose timing and blame assignment.
  • Missing failed paths—always end spans with error state on exceptions, not only on happy paths.
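The last pitfall is worth a pattern: wrap the work in try/finally so the span is always ended, and mark error state before re-raising. A minimal sketch, with a `record` callback standing in for your SDK:

```python
import time

def run_span(record, name, fn, *args):
    """Run fn inside a span that always ends, with ok/error status
    and duration recorded on both the happy and failed paths."""
    span = {"name": name, "status": "ok", "start": time.monotonic()}
    try:
        return fn(*args)
    except Exception as exc:
        span["status"] = "error"
        span["error"] = repr(exc)
        raise
    finally:
        span["duration_s"] = time.monotonic() - span.pop("start")
        record(span)
```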

Next steps

  • Evaluations — turn representative traces into dataset examples and regression tests.
  • Prompt management — tie traces to specific prompt versions.