Prompt management
Prompts are part of your system: they should be versioned, reviewable, and deployable with the same discipline as backend code. Boson prompt management helps you store templates, attach them to environments, and roll forward—or roll back—when evals or production signal a problem.
This pairs directly with Tracing (know which version ran) and Evaluations (prove a version is better).
Why version prompts
- Reproducibility — “what did we send to the model last Tuesday?”
- Collaboration — PR-style review instead of shared Google Docs.
- Safe rollback — one click when a deploy hurts quality.
- Audit — who changed what, when, and in which environment.
Prompts as templates
Treat each prompt as a template with variables, not a hard-coded string in code.
Variables
- User content — message, attachment metadata.
- System context — locale, plan tier, feature flags.
- Retrieved context — RAG snippets (with clear delimiters in the template).
Keep variable names stable across versions so code does not churn when copy changes.
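As a minimal sketch (the template text and variable names here are illustrative, not a Boson API), a prompt template is just a string with named slots that rendering fills in:

```python
# Illustrative template: variables are named slots, not hard-coded strings.
SUPPORT_TEMPLATE = (
    "You are a support assistant for {plan_tier} customers.\n"
    "Context:\n<context>\n{retrieved_context}\n</context>\n"
    "User message: {user_message}"
)

def render(template: str, **variables: str) -> str:
    """Fill the template; raises KeyError if a variable is missing,
    which catches template/code drift early."""
    return template.format(**variables)

prompt = render(
    SUPPORT_TEMPLATE,
    plan_tier="Pro",
    retrieved_context="Refund policy: 30 days.",
    user_message="Can I get a refund?",
)
```

Because the variable names (`plan_tier`, `retrieved_context`, `user_message`) stay stable across versions, the copy inside the template can change freely without touching the calling code.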
Structure
- Prefer a system block for rules and constraints.
- Use a user block for the actual task and dynamic content.
- Document invariants (“never invent URLs”, “cite sources”) in system text reviewers can see.
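A sketch of that split, assuming a chat-style message list (the helper name and invariant text are illustrative): rules and invariants live in the system block, the task and dynamic content in the user block, with delimiters around retrieved context.

```python
def build_messages(system_rules: str, task: str, context: str) -> list[dict]:
    """Separate stable rules (system) from the dynamic task (user).
    Reviewers can audit invariants in system_rules without reading code."""
    return [
        {"role": "system", "content": system_rules},
        {
            "role": "user",
            "content": f"<context>\n{context}\n</context>\n\n{task}",
        },
    ]

messages = build_messages(
    system_rules="Never invent URLs. Cite sources from the provided context.",
    task="Answer the customer's question.",
    context="Refund policy: 30 days.",
)
```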
Environments
Map prompts to environments the same way you map APIs:
| Environment | Typical use |
|---|---|
| Development | Fast iteration, noisy experiments OK. |
| Staging | Mirror prod config; run full eval suites here. |
| Production | Only promoted, reviewed versions. |
Promotion should be explicit (button or API), not “whoever edited last.”
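One way to keep promotion explicit is to enforce the environment order in code. This is a sketch under the assumption of the three environments in the table above, not a Boson API:

```python
# Promotion must move one step forward through the pipeline, never skip.
ENVIRONMENTS = ("development", "staging", "production")

def can_promote(current_env: str, target_env: str) -> bool:
    """A version reaches production only via staging."""
    return ENVIRONMENTS.index(target_env) == ENVIRONMENTS.index(current_env) + 1
```

With a rule like this, "whoever edited last" cannot push a development draft straight to production.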
Lifecycle: draft → review → promote
- Draft a new version (copy change, model instruction tweak, new variable).
- Run evals against your pinned dataset; compare to the current production prompt.
- Review with a teammate for policy, brand, and failure modes.
- Promote to staging, then production after checks pass.
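The eval step above can be expressed as a simple gate: promote only when the candidate beats (or at least matches) the current production version on the pinned dataset. A minimal sketch, assuming per-example scores in [0, 1]; the function name and threshold are illustrative:

```python
def passes_eval_gate(
    candidate_scores: list[float],
    production_scores: list[float],
    min_delta: float = 0.0,
) -> bool:
    """Block promotion unless the candidate's mean eval score is at
    least min_delta above the current production prompt's mean."""
    candidate_mean = sum(candidate_scores) / len(candidate_scores)
    production_mean = sum(production_scores) / len(production_scores)
    return candidate_mean >= production_mean + min_delta
```

A CI job can call this after running both versions over the same dataset and fail the pipeline when the gate returns False.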
Connecting prompts to traces
When you log traces, record the prompt version ID (and optionally a template hash) on the LLM span. That makes it possible to:
- Correlate spikes in errors with a specific prompt deploy.
- Slice metrics by version in analytics.
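As a sketch, the span attributes might look like this (the attribute keys are illustrative conventions, not a fixed Boson schema); a template hash cheaply detects when the same version ID was edited in place:

```python
import hashlib

def prompt_span_attributes(version_id: str, template: str) -> dict:
    """Attributes to attach to the LLM span for forensics:
    the deployed version ID plus a short hash of the rendered template."""
    template_hash = hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]
    return {
        "prompt.version_id": version_id,
        "prompt.template_hash": template_hash,
    }

attrs = prompt_span_attributes("support-v3", "You are a support assistant...")
```

With these attributes on every span, an error spike can be grouped by `prompt.version_id` and lined up against the deploy timeline.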
Rollback
If production quality drops after a prompt change:
- Revert to the last known-good version immediately.
- Preserve the bad version for postmortem and eval additions.
- Add failing cases to your dataset so the mistake cannot recur silently.
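The revert-and-preserve steps can be sketched as a tiny registry (hypothetical class, not a Boson API): rollback restores the last known-good version and hands back the bad one for the postmortem instead of discarding it.

```python
class PromptRegistry:
    """Tracks the production version and its promotion history."""

    def __init__(self) -> None:
        self.production: str | None = None
        self.history: list[str] = []

    def promote(self, version_id: str) -> None:
        if self.production is not None:
            self.history.append(self.production)
        self.production = version_id

    def rollback(self) -> str:
        """Revert to the last known-good version; return the bad one
        so it can be preserved for postmortem and eval additions."""
        bad_version = self.production
        self.production = self.history.pop()
        return bad_version

registry = PromptRegistry()
registry.promote("support-v2")
registry.promote("support-v3")   # quality drops after this deploy
bad = registry.rollback()        # production is support-v2 again
```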
Access and governance
- Restrict who can edit production-linked prompts.
- Keep read access broad for support and PMs where appropriate.
- Export or backup critical templates for disaster recovery.
Common pitfalls
- Prompts only in code — you lose history and non-engineer collaboration.
- No eval gate — copy changes ship without measurement.
- Unbounded variables — missing validation lets oversized context blow past token limits or leak data.
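For the last pitfall, a minimal guard is to bound every variable before rendering. A sketch with an illustrative character budget (a real system would budget in tokens and may also redact sensitive patterns):

```python
MAX_VARIABLE_CHARS = 8000  # illustrative per-variable budget

def bound_variable(value: str, limit: int = MAX_VARIABLE_CHARS) -> str:
    """Truncate an oversized variable value and mark the cut, so a
    runaway RAG snippet cannot blow past the model's context window."""
    if len(value) <= limit:
        return value
    return value[:limit] + "\n[truncated]"
```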
Next steps
- Evaluations — gate every prompt promotion on metrics.
- Tracing — tag spans with prompt version for production forensics.