USL · Watch · Operating Manual · INTERNAL

How we deliver Lighthouse Watch.

Audience: Studio engineers

Doc class: Internal runbook

Cadence: Monthly

Status: v1.0 · Live

This manual defines exactly what a Watch engagement looks like from inside the engineering team — what gets done, when, by whom, against what definition of done. It is the single source of truth for delivering Watch consistently across clients.

If a client interaction or piece of work isn't in this manual, it is either (a) outside Watch scope, or (b) a candidate to be added in v1.1. Both cases trigger a Slack thread to #studio-watch-ops before the engineer commits time.

Companion to the Watch sales sheet (lighthouse-watch-sales-sheet.html) and the product spec (studio-assets/lighthouse-watch/00-product-spec.md). Sales sheet is what clients see; this is what we do.

Purpose. Why Watch is a productized service, not a retainer.

The previous "Observatory Retainer" framing positioned the post-Sprint engagement as discretionary, ad-hoc, and bespoke per client. That framing produced three failure modes: (a) low conversion (clients defaulted to "we'll handle it ourselves" and silently let systems drift), (b) non-uniform delivery (each client got a different version of the work, which is impossible to scale past 3–4 accounts), and (c) margin compression (we kept saying yes to scope creep because the contract didn't draw a perimeter).

Watch fixes all three by being productized: a fixed shape, a fixed cadence, three priced tiers, a perimeter that's written down. The trade-off is less per-client customization in exchange for the ability to deliver consistently across many clients while building toward the pattern-library moat.

This means: do not customize Watch deliverables for individual clients beyond what's allowed in the scope ladder below. If a client asks for something outside Watch, propose a Beacon Workshop or a follow-on Sprint, then escalate the request to the principal so we can decide whether it should become part of Watch v1.1.

Scope ladder · what client requests get

Request type	Disposition
Maintenance (eval, model, runbook, governance)	Inside Watch · do it as part of monthly cadence
Improvement to existing agent	Inside Watch · channel as a candidate +1 improvement
New agent or new workflow migration	Outside Watch · scope a follow-on Sprint
Data infra repair (CDP, warehouse, identity)	Outside Watch · diagnose, refer to client's data team or a partner
One-off advisory beyond Pro tier hours	Outside Watch · scope a Beacon Workshop ($18k flat)
Crisis / outage (P0)	Inside Watch · honor SLA. If sustained >5 days, escalate to principal for re-scoping conversation

Engagement org. Who does what on a Watch account.

Each Watch account has three roles. On Pro tier, the lead engineer is named publicly and committed; on Standard the lead engineer is consistent but not contractually named.

Roles

Role	Allocation	Responsibilities
Lead engineer	8–10 hrs/mo (Std), 14 hrs/mo (Pro)	The +1 improvement build. Monthly review call. Stream A and B execution. Health-report draft and signature. Single point of contact for the client in Slack.
Eval/ops automation	Async, ~2 hrs/mo per client	Scheduled eval runs, drift detection, model-swap test orchestration. Same engineer covers all clients (multi-tenant).
Studio principal	1 hr/mo per client	Quarterly architecture review attendance. Renewal conversation. Escalation point for P0 sustained >48 hrs. Sign-off on +1 candidates that touch governance.

Capacity rule

A single lead engineer can carry 4 Standard or 2 Pro Watch clients simultaneously without quality slipping. Pro tier consumes more capacity because of the named-engineer commitment + advisory hours + tighter SLA.

Mixed loads convert at 1 Pro = 2 Standard. So a "full" engineer is 4 Standard, or 2 Pro, or 1 Pro + 2 Standard.

Hard ceiling: when an engineer hits capacity, the next sale either triggers a hire or is queued. Do not stack a 5th account. The quality of the +1 improvement is the first thing to slip and it's also the part clients renew for.
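The capacity rule above reduces to simple unit arithmetic. A minimal sketch, assuming the manual's conversion (1 Pro = 2 Standard, ceiling of 4 Standard-equivalent units); the function names are hypothetical, and Essentials is left out because the manual does not assign it a capacity weight.

```python
# Capacity units per account, using the manual's conversion: 1 Pro = 2 Standard.
UNITS = {"standard": 1, "pro": 2}
ENGINEER_CEILING = 4  # 4 Standard, or 2 Pro, or 1 Pro + 2 Standard


def load_units(standard: int, pro: int) -> int:
    """Return an engineer's current load in Standard-equivalent units."""
    if standard < 0 or pro < 0:
        raise ValueError("account counts must be non-negative")
    return standard * UNITS["standard"] + pro * UNITS["pro"]


def can_take(standard: int, pro: int, new_tier: str) -> bool:
    """True if one more account of new_tier keeps the engineer at or under the ceiling."""
    return load_units(standard, pro) + UNITS[new_tier] <= ENGINEER_CEILING
```

If `can_take` returns False, the sale triggers a hire or goes in the queue; it does not get stacked.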

The Watch month. What the calendar looks like, week by week.

Every Watch month follows the same shape regardless of tier. The Pro tier adds 4 hrs of advisory available on demand inside the month; the Essentials tier removes Stream C (the +1).

Week	Owner	Output	Definition of done
Week 01	Lead engineer + client sponsor	Health report delivered (1st of month). Standing review call (45 min). Last month's +1 confirmed. This month's +1 confirmed in scope.	Report in client inbox · review call held · +1 ticket open in Linear
Week 02	Lead engineer	+1 improvement build. Eval suite expanded to cover the +1.	+1 PR open · evals committed · review requested
Week 03	Lead engineer + reviewer	+1 ships behind gate. Reviewer feedback loop opens. Foundation-layer scan (model releases, deprecation watch).	+1 in production behind gate · reviewer test underway · scan log dated this cycle
Week 04	Lead engineer	+1 promoted (if eval passes). Governance Runbook updated. Next-month report drafted.	+1 promoted or kicked back · runbook commit · next report 75% complete
Quarterly add	Studio principal + lead engineer + client sponsor	90-min architecture review. Output: written recommendation if any structural change is warranted.	Recommendation document in client folder · agreed action items in Linear
Annual add	Studio principal + client sponsor	Renewal conversation. Roll into next 12-mo, step down to Essentials, step up to Pro, or scope a new Sprint.	Decision recorded · contract addendum issued or new SOW kicked off

Why the cadence is fixed

Predictability is the value prop. The client should never wonder when the next thing happens. If your week slips, it slips publicly: send a Slack message before the date. Do not let the client discover the slip in the report.

Stream A. Foundation maintenance.

Stream A is the boring part. The point is that the client never has to think about it. Four sub-tasks, all instrumented and largely automated.

A.1 · Model swap testing

Trigger: any major foundation-model release (Claude N+1, GPT N+1, Gemini N+1). Run the swap test within 7 business days of GA. Procedure:

Subscribe the studio to the relevant changelog feeds. We monitor: Anthropic, OpenAI, Google, Mistral, Meta. Each RSS hit posts to the #studio-foundation-watch Slack channel.
For each agent in the client's deployment, generate a side-by-side eval run: incumbent model vs. candidate. Eval set = the production eval suite. Run via the model-swap harness (tools/model-swap/).
Score: pass-rate delta, cost delta, latency delta. Produce a 1-page swap brief per agent. File in client folder under watch/swap-briefs/YYYY-MM/.
Recommendation: stay, swap, or side-by-side (run both for one cycle to gather more data). Default to stay unless candidate model wins on at least two of the three axes by >5%.
Surface the recommendation in the next health report. If the recommendation is swap and it impacts evals or cost by >15%, schedule a 30-min call before the report goes out.
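The default rule in the procedure above (stay unless the candidate wins on at least two of three axes by >5%) can be sketched as follows. This is an illustration, not the model-swap harness: the `EvalRun` type and function names are hypothetical, and ">5%" is read here as a relative improvement over the incumbent, which is an assumption. Side-by-side remains a judgment call and is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    pass_rate: float  # fraction of checks passed, 0..1 (higher is better)
    cost_usd: float   # cost of the eval run (lower is better)
    latency_s: float  # mean latency in seconds (lower is better)


WIN_MARGIN = 0.05  # ">5%", interpreted as relative improvement (assumption)


def axes_won(incumbent: EvalRun, candidate: EvalRun) -> list[str]:
    """Return the axes on which the candidate beats the incumbent by more than 5%.

    Assumes all incumbent values are positive.
    """
    gains = {
        "pass_rate": (candidate.pass_rate - incumbent.pass_rate) / incumbent.pass_rate,
        "cost": (incumbent.cost_usd - candidate.cost_usd) / incumbent.cost_usd,
        "latency": (incumbent.latency_s - candidate.latency_s) / incumbent.latency_s,
    }
    return [axis for axis, gain in gains.items() if gain > WIN_MARGIN]


def default_recommendation(incumbent: EvalRun, candidate: EvalRun) -> str:
    """'swap' if the candidate wins at least two axes, otherwise 'stay'."""
    return "swap" if len(axes_won(incumbent, candidate)) >= 2 else "stay"
```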

A.2 · Eval suite re-run

Trigger: scheduled, monthly, week 03. Procedure:

Run the production eval suite for every agent in deployment. Capture pass-rate per check and aggregate per agent.
Compare to last month's results. Threshold: a pass-rate regression of more than 3 points on any check automatically opens a P1 ticket.
Capture results in watch/evals/YYYY-MM.json and surface the trend in the health report.
If any check has been failing for 3 consecutive months, propose an eval-set update (the check might be obsolete) or an agent change (the check might be revealing real drift).
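The two thresholds in this procedure can be checked mechanically. A minimal sketch, assuming pass rates are stored as percentages per check and that "failing" is a per-month boolean; both are assumptions about the data shape, not the actual layout of watch/evals/YYYY-MM.json.

```python
REGRESSION_POINTS = 3.0  # ">3 points pass-rate" opens a P1
STALE_MONTHS = 3         # failing for 3 consecutive months


def regressed_checks(previous: dict[str, float], current: dict[str, float]) -> list[str]:
    """Checks whose pass rate (percentage points) fell by more than 3 since last month."""
    return sorted(
        name
        for name, rate in current.items()
        if name in previous and previous[name] - rate > REGRESSION_POINTS
    )


def persistently_failing(history: list[dict[str, bool]]) -> list[str]:
    """Checks that failed in each of the last 3 monthly runs.

    history is oldest-first; each entry maps check name -> passed that month.
    """
    if len(history) < STALE_MONTHS:
        return []
    recent = history[-STALE_MONTHS:]
    names = set(recent[0]).intersection(*recent[1:])
    return sorted(n for n in names if not any(month[n] for month in recent))
```

Anything returned by `regressed_checks` is a P1; anything returned by `persistently_failing` needs a decision between an eval-set update and an agent change.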

A.3 · Prompt-drift detection

This catches subtle output regressions that pass evals but feel different to reviewers. Procedure:

Sample 50 agent outputs from the last 30 days at random, weighted by reviewer-flagged frequency.
Score against the original baseline samples (captured at Sprint handoff) using the drift harness. Output: drift score 0–1.
Threshold: drift score >0.25 triggers a root-cause review. Common causes: model version silently auto-updated, prompt-template change in shared library, upstream data-shape shift.
Surface drift score in health report regardless of threshold.
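The sampling and threshold steps above can be sketched as follows. This is not the drift harness: the function names are hypothetical, and the weighting scheme (flag count + 1, so unflagged outputs can still be drawn) is an assumption. Sampling without replacement uses Efraimidis-Spirakis keys.

```python
import random
from typing import Optional

SAMPLE_SIZE = 50
DRIFT_THRESHOLD = 0.25


def weighted_sample(
    outputs: list[tuple[str, int]], k: int = SAMPLE_SIZE, seed: Optional[int] = None
) -> list[str]:
    """Sample up to k output IDs without replacement, weighted by reviewer flag count.

    outputs: (output_id, reviewer_flag_count) pairs.
    Each item gets a key u ** (1 / weight) with u uniform in [0, 1);
    the k largest keys are the sample (Efraimidis-Spirakis).
    """
    rng = random.Random(seed)
    keyed = [(rng.random() ** (1.0 / (flags + 1)), oid) for oid, flags in outputs]
    keyed.sort(reverse=True)
    return [oid for _, oid in keyed[:k]]


def needs_root_cause_review(drift_score: float) -> bool:
    """True when the drift score exceeds the 0.25 threshold."""
    if not 0.0 <= drift_score <= 1.0:
        raise ValueError("drift score must be in [0, 1]")
    return drift_score > DRIFT_THRESHOLD
```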

A.4 · Provider deprecation tracking

Most outages of agent systems aren't model failures — they're providers deprecating endpoints with insufficient notice. Procedure:

Maintain a per-client dependency manifest (watch/deps/<client>.yml): every external API the agents call, version pinned where possible.
Watch the deprecation pages of each provider weekly. Auto-poll where APIs are available.
When a deprecation date is announced for a dependency in scope, immediately open a P1 ticket and schedule the migration within the deprecation window minus 30 days. Never finish a migration in the last 2 weeks before deprecation.
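The scheduling rule above is date arithmetic. A minimal sketch with hypothetical function names; it treats a completion exactly 14 days before deprecation as already too late, which is a conservative reading of "the last 2 weeks", not something the manual states.

```python
from datetime import date, timedelta

TARGET_BUFFER = timedelta(days=30)  # "deprecation window minus 30 days"
HARD_BUFFER = timedelta(days=14)    # never finish in the last 2 weeks


def target_completion(deprecation: date) -> date:
    """Date by which the migration should be scheduled to complete."""
    return deprecation - TARGET_BUFFER


def completion_ok(planned: date, deprecation: date) -> bool:
    """True if the planned completion falls before the final 2 weeks."""
    return planned < deprecation - HARD_BUFFER
```

For a deprecation on 2025-06-30, `target_completion` gives 2025-05-31, and any plan landing on or after 2025-06-16 fails `completion_ok`.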

Stream B. Governance & observability.

Stream B is what keeps the system trustworthy to the humans who sign off on its outputs. It's also the part that compounds: every governance decision becomes a runbook line, every runbook line becomes a candidate pattern.

B.1 · Governance Runbook revisions

The Governance Runbook (delivered at Sprint handoff, lives in the client's repo) is a living document. It updates whenever:

An incident happens — output that should have been blocked got through, or output that should have shipped was incorrectly held.
A policy changes — client legal or compliance team issues a new constraint that touches agent behavior.
A regulatory ask arrives — auditor, regulator, internal audit asks for a control we should formalize.

Revision process: PR against the runbook, reviewer (the client's GC or designee) signs off, runbook merged, all gated agents re-tested against the new check, reviewers notified to re-read.

B.2 · Acceptance-rate monitoring

Every gated agent has a target acceptance rate of ≥80%. Acceptance = reviewer approves the agent's output without modification. Below 80% means the agent is producing more friction than value, even if its evals pass.

Capture acceptance rate per agent, per week, in watch/metrics/acceptance.csv.
If 4-week rolling avg drops below 80%, open a P1 ticket: schedule a 30-min review with the client's reviewers to identify the friction.
Common causes: prompt outdated for the use case, evals easier than the real review, reviewer UI mismatch, scope drift in the agent's tool selection.
Resolution may itself become a +1 improvement. If so, scope it explicitly with the client at the next monthly review.
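The P1 trigger above is a 4-week rolling average against the 80% target. A minimal sketch, assuming weekly acceptance rates for one agent are already loaded as fractions, oldest first; the column layout of watch/metrics/acceptance.csv is not specified here, so parsing is left out.

```python
from typing import Optional

TARGET = 0.80
WINDOW = 4  # weeks


def rolling_average(weekly_rates: list[float], window: int = WINDOW) -> Optional[float]:
    """Mean of the most recent `window` weekly rates, or None if there is too little data."""
    if len(weekly_rates) < window:
        return None
    return sum(weekly_rates[-window:]) / window


def should_open_p1(weekly_rates: list[float]) -> bool:
    """True when the 4-week rolling average is below the 80% target."""
    avg = rolling_average(weekly_rates)
    return avg is not None and avg < TARGET
```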

B.3 · Reviewer UI feedback loop

The Reviewer UI logs every reviewer action: approve, modify, reject, escalate. Modifications and rejections are signal — not for the agent's eval suite directly, but for the eval-set composition.

Monthly: extract the last 30 days of reviewer modifications and rejections. Cluster by failure mode.
For any cluster with ≥5 instances, propose adding a new check to the eval suite that catches that failure mode.
If the new check fails on the current agent, the next +1 improvement is the fix.
Rejection categories that recur across multiple clients become candidates for the canonical Studio eval set, contributing to the library moat.
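Once modifications and rejections carry a failure-mode label, the ≥5 rule above is a count. A minimal sketch: the row schema (`action`, `failure_mode` keys) is hypothetical, and the clustering step itself is assumed to have happened upstream.

```python
from collections import Counter

MIN_CLUSTER_SIZE = 5
SIGNAL_ACTIONS = {"modify", "reject"}  # approvals and escalations are ignored


def proposed_checks(rows: list[dict]) -> list[str]:
    """Failure modes with at least 5 modify/reject instances in the extract.

    rows: reviewer log rows with 'action' and 'failure_mode' keys (assumed schema).
    """
    counts = Counter(
        row["failure_mode"] for row in rows if row["action"] in SIGNAL_ACTIONS
    )
    return sorted(mode for mode, n in counts.items() if n >= MIN_CLUSTER_SIZE)
```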

B.4 · Quarterly architecture review

Every third month: 90-min call with client CMO/CTO sponsor + Studio principal + lead engineer. Agenda:

Outcome metrics — has the agent moved the original Sprint metric?
Pattern-library updates available — what could be deployed in the next quarter?
Architecture concerns — any structural change warranted given new client priorities or new infra?
Forward roadmap — next quarter's anchor improvements, in writing.

Output: 1-page recommendation memo, filed in client folder, referenced at the next renewal conversation.

Stream C. The +1 improvement.

The +1 improvement is the differentiator and the renewal hook. The client signs Watch · Standard or Pro because of this stream. Treating it casually destroys the entire product.

What qualifies as a +1

A new pattern from @umbra/patterns deployed into the client's existing agent system. Example: replacing a hand-rolled gating primitive with the canonical governance.human-in-the-loop-gate pattern.
A workflow optimization that cuts a measured friction. Example: collapsing a two-step approval into a single-step approval with a deferred audit trail, where reviewers have already signaled the second step is rubber-stamping.
An eval-set expansion that catches a real-world failure mode that recurred in the reviewer feedback loop. Counts only if it produces a measurable improvement in acceptance rate.

What does NOT qualify

A new agent or new workflow — that's a Sprint.
A bug fix — that's maintenance (Stream A or B).
A purely cosmetic change: UI polish, copy tweaks, log formatting. Real improvements affect outputs, gates, or trust.
An untestable change. If we can't write an eval that distinguishes the +1 from the baseline, the +1 doesn't ship.

Pipeline

Month-minus-1 review. In the previous month's review call, propose 1–3 candidate +1's with rough effort and target metric. Client picks one. Default candidate sources: reviewer feedback loop, library updates, quarterly architecture recommendations.
Week 01 confirm. First week of the new month: confirm scope of the chosen +1 in writing (Linear ticket + Slack message to client sponsor).
Week 02 build. Engineer implements. Eval check expanded to cover the +1's specific behavior.
Week 03 ship-behind-gate. Deploy with feature flag or behind a gate. Reviewers see both old and new behavior in shadow mode.
Week 04 promote-or-kick. If eval passes the threshold AND reviewer feedback signals improvement, promote. If either fails, kick to next month with a written diagnosis.
Library contribution check. If the +1 looks like a recurring pattern (already seen in 2+ clients), tag it for promotion to @umbra/patterns. Three matches = canonical.
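The Week 04 decision in the pipeline above is a two-condition AND. A minimal sketch with a hypothetical function name; it assumes "passes the threshold" means at or above it, and it returns the reasons so the written diagnosis has a starting point.

```python
def promote_or_kick(
    eval_pass_rate: float,
    eval_threshold: float,
    reviewers_signal_improvement: bool,
) -> tuple[str, list[str]]:
    """Return ('promote', []) only if both conditions hold; otherwise ('kick', reasons)."""
    reasons = []
    if eval_pass_rate < eval_threshold:
        reasons.append(
            f"eval pass rate {eval_pass_rate:.0%} below threshold {eval_threshold:.0%}"
        )
    if not reviewers_signal_improvement:
        reasons.append("reviewer feedback does not signal improvement")
    return ("kick" if reasons else "promote"), reasons
```

Either failure alone is enough to kick; there is no override path.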

The eval-gate is non-negotiable

If a +1 doesn't pass its eval, do not ship it because the client wants the count. Send a Slack message owning the slip, schedule a 15-min call, propose a smaller scope or kick to next month. Shipping a failing +1 destroys the trust that the cadence is meaningful. Take the L.

Stream D. The monthly health report.

The health report is the only deliverable many client stakeholders ever read. It is also the artifact procurement uses to justify the renewal. It must be sharp, scannable, and signed.

Format spec

3–4 pages. PDF generated from the canonical template (lighthouse-watch-monthly-report.html). Sections, in order:

Header — client name, month, version, signature line for lead engineer.
This month at a glance — 4 KPIs (acceptance avg, evals pass-rate, +1 status, incidents). Single line each.
Model status — versions in production, last evaluation, recommended action (per agent).
Acceptance & eval trends — 90-day chart per agent, brief commentary on movements.
Incidents this month — 1 line each, link to full RCA if applicable.
The +1 improvement delivered — what we shipped, what changed, before/after metric.
Pattern-library updates available — list of new patterns since last report, with one-line recommendations.
Next month preview — the +1 on deck, any model migrations expected, any quarterly review scheduled.

Voice

Direct, dry, no fluff. The client's CFO might read this. No marketing language, no "we're excited to announce." If something didn't ship, say so on the first line of the relevant section. Examples:

Yes: "+1 improvement missed eval threshold (passing 81% vs. target 90%); deferred to next cycle. Diagnosis: eval set under-covers edge case X; expanding."
No: "We're working hard on enhancing the system and look forward to delivering even more value next month."

Cadence

Draft by Week 04, signed off by Studio principal.
Delivered via email to client sponsor on the 1st of the month, before 9am client-local time. PDF attached + permanent link in client folder.
Acknowledged in Slack within 48 hours. If no acknowledgment, ping. If still no acknowledgment after 5 days, principal calls — silent disengagement is a renewal-risk signal.

Escalation & SLA. When and how things escalate.

Severity	Definition	Response	Resolve	Escalation
P0 · Critical	System down, evals failing across the board, governance breach, agent emitting outputs that violate policy	4 hrs (Pro) · 1 BD (Std/Ess)	24 hrs	Lead engineer immediately; principal notified within 4 hrs
P1 · High	Single agent failing, acceptance <80% for 4 weeks, model deprecation imminent	1 BD	5 BD	Lead engineer; principal notified at 5 BD if unresolved
P2 · Standard	Drift detected (within thresholds), runbook update needed, +1 scoping	Next monthly review	Within current cycle	None unless escalated by client
P3 · Backlog	Nice-to-haves, future tier upgrades, research items	Quarterly review	Quarterly review	None

Channel hierarchy

Slack channel (shared, named #watch-<client-shortname>) — first line. All routine traffic.
Email to lead engineer — escalation if Slack unread >SLA.
Phone (Pro tier only) — direct line to lead engineer. P0 reserved.
Principal escalation — when SLA is at risk, when a request is outside Watch scope, when client is signaling churn.

Sustained P0

If a P0 is unresolved at 48 hours, the principal joins the bridge. If unresolved at 5 days, we issue a written incident summary to the client and a service credit equal to the prorated days lost. Sustained P0s are a Watch failure, not a feature; they trigger a post-mortem in #studio-watch-ops.
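The sustained-P0 milestones and the credit can be sketched as follows. Two assumptions are labeled in the code: "5 days" is read as calendar days, and the prorated credit is computed as monthly fee × days lost ÷ days in the month. The manual does not spell out either.

```python
def sustained_p0_actions(hours_unresolved: float) -> list[str]:
    """Actions due for a P0 that has been open for the given number of hours."""
    actions = []
    if hours_unresolved >= 48:
        actions.append("principal joins the bridge")
    if hours_unresolved >= 5 * 24:  # assumption: calendar days
        actions.append("issue written incident summary")
        actions.append("issue prorated service credit")
    return actions


def service_credit(monthly_fee: float, days_lost: int, days_in_month: int) -> float:
    """Prorated credit for days lost (assumed formula: fee * days_lost / days_in_month)."""
    if not 0 <= days_lost <= days_in_month:
        raise ValueError("days_lost must be between 0 and days_in_month")
    return round(monthly_fee * days_lost / days_in_month, 2)
```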

Renewal & offboarding. What happens at month 11 and month 13.

The renewal call (month 11)

One month before contract end. Studio principal + client sponsor. Agenda:

Review the year's outcomes against the original Sprint metric.
Show the +1 improvement ledger — every one delivered, before/after metric.
Show the patterns inherited from the library during the year — value the client got "for free" via Watch.
Propose the renewal: same tier, step-up to Pro, step-down to Essentials, or a new Sprint.

Default outcome target: 85% of clients renew at or above the same tier. Track in CRM.

If the client downgrades or doesn't renew

This is signal, not failure (sometimes). Diagnosis questions:

Did we deliver the cadence reliably? (If not — that's our miss.)
Did the +1 improvements actually move metrics? (If not — that's a scope or selection problem.)
Is the agent system itself still in production, or did the client retire the use case? (If retired — Watch was successful but the underlying need ended.)
Did the client hire internal capacity to take over? (If yes — that's an outcome we should celebrate even though revenue ends.)

Document the answer. If the cause is a Studio failure, post-mortem in #studio-watch-ops within 14 days.

Offboarding procedure

Final health report covers the full final month.
Repository handoff: client gets the canonical eval suite, the runbook, the swap briefs, all health reports, and the pattern-library deployment manifest. No ongoing access to the canonical @umbra/patterns updates after offboard date — those are a Watch benefit.
Slack channel archived 14 days after final report. Email contact preserved for emergency questions for 90 days.
Client added to win-back outreach 90 days post-offboard with a check-in note from the principal.

Capacity & economics. The numbers behind the price.

This section exists so that engineers running Watch accounts understand the economic constraints behind the pricing and don't quietly let scope expand to commercially unsustainable levels.

Per Watch · Standard, monthly · Cost

Lead engineer: 8–10 hrs · Eval/ops automation: ~$200/mo · Other API and observability: ~$300/mo · Total marginal cost: ≈ $3.5k at fully-loaded engineer cost.

Per Watch · Standard, monthly · Margin

Revenue: $24k · Marginal cost: ≈ $3.5k · Gross margin: ≈ 85%. Each Standard account funds about 22% of one engineer's full-time cost.

Per engineer, monthly capacity · Standard load

4 Standard clients × $24k = $96k/mo gross revenue · ≈ $14k/mo marginal cost · ≈ $82k/mo gross profit per engineer at full Standard load.

Per engineer, monthly capacity · Pro load

2 Pro clients × $38k = $76k/mo gross revenue. Lower headline than Std-full, but higher per-client renewal probability (named engineer = stickier).
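The figures above can be reproduced from the stated inputs. A minimal sketch with hypothetical names; it uses only numbers from this section, and it does not compute Pro margin because no Pro marginal cost is given.

```python
STANDARD_PRICE = 24_000  # monthly revenue per Standard account
STANDARD_COST = 3_500    # approximate marginal cost per Standard account
PRO_PRICE = 38_000       # monthly revenue per Pro account


def gross_margin(revenue: float, marginal_cost: float) -> float:
    """Gross margin as a fraction of revenue."""
    return (revenue - marginal_cost) / revenue


def engineer_month(clients: int, price: int, cost_per_client: int) -> tuple[int, int, int]:
    """Return (gross revenue, marginal cost, gross profit) for one engineer's load."""
    revenue = clients * price
    cost = clients * cost_per_client
    return revenue, cost, revenue - cost


standard_margin = gross_margin(STANDARD_PRICE, STANDARD_COST)          # about 0.854
standard_full = engineer_month(4, STANDARD_PRICE, STANDARD_COST)       # (96000, 14000, 82000)
pro_full_revenue = 2 * PRO_PRICE                                       # 76000
```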

When to add capacity

Hire a new lead engineer when sustained Watch revenue clears $60k/mo for 3 consecutive months (≈ 2.5 Standard accounts). The new engineer is brought into the rotation by shadowing the principal on 1 client for 1 quarter, then taking it over.

Do not hire ahead of the curve. Watch capacity is one of the few places we can absorb churn — under-hire by 1 client and use principal time to fill the gap.
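The hiring trigger is a streak check on monthly Watch revenue. A minimal sketch with a hypothetical function name, reading "clears $60k/mo" as strictly above $60k.

```python
HIRE_THRESHOLD = 60_000
CONSECUTIVE_MONTHS = 3


def should_hire(monthly_revenue: list[float]) -> bool:
    """True when each of the last 3 months of Watch revenue is above $60k.

    monthly_revenue is oldest-first.
    """
    recent = monthly_revenue[-CONSECUTIVE_MONTHS:]
    return len(recent) == CONSECUTIVE_MONTHS and all(r > HIRE_THRESHOLD for r in recent)
```

A single month under the threshold resets the streak, which is the point of "do not hire ahead of the curve".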

Tooling. What instruments Watch runs on.

Watch is delivered against a small, sharp toolset. Adding new tools requires a write-up and principal sign-off — every tool added is a tool we have to maintain across all clients.

Layer	Tool	Use
Issue tracking	Linear	Per-client project. P0..P3 ticket discipline. The +1 improvement is always a single Linear ticket linked to its eval check.
Communication	Slack shared channel	Channel name: `#watch-<client-shortname>`. Lead engineer is always present; principal is on every channel but only speaks when escalated.
Repository	Client GitHub or Studio-managed	Canonical eval suite + runbook + agent code. Watch deploys via the same pipeline that the Sprint set up.
Eval orchestration	LangSmith + Studio harness	All evals run in LangSmith projects scoped per-client. Studio harness wraps for cron + drift detection.
Observability	LangSmith traces + client's own observability	Step-level visibility per the 8 working principles. Mirror critical traces to client's tools where contractually required.
Model swap	Studio model-swap harness	Side-by-side eval runner. Lives at `tools/model-swap/`. Pluggable provider clients.
Reports	Pandoc / WeasyPrint	Health report PDF generated from the canonical HTML template. Client folder stores all historical reports.
Pattern library	`@umbra/patterns`	Versioned source of truth for inheritable patterns. Watch +1's are deployments of (or contributions to) this library.
CRM	Airtable (Watch base)	Per-client renewal status, NPS, +1 ledger, churn signals.

What we deliberately don't use

No client-facing dashboard. The health report is the dashboard. A live dashboard creates expectation of real-time monitoring that Watch doesn't promise.
No bespoke per-client eval frameworks. Everything goes through the Studio harness. Bespoke = unscalable.
No multi-channel client comms. Slack + scheduled email + scheduled call. Anything else is creep.