How we deliver Lighthouse Watch.
This manual defines exactly what a Watch engagement looks like from inside the engineering team — what gets done, when, by whom, against what definition of done. It is the single source of truth for delivering Watch consistently across clients.
If a client interaction or piece of work isn't in this manual, it is either (a) outside Watch scope, or (b) a candidate to be added in v1.1. Both cases trigger a Slack thread to #studio-watch-ops before the engineer commits time.
Companion to the Watch sales sheet (lighthouse-watch-sales-sheet.html) and the product spec (studio-assets/lighthouse-watch/00-product-spec.md). Sales sheet is what clients see; this is what we do.
Purpose. Why Watch is a productized service, not a retainer.
The previous "Observatory Retainer" framing positioned the post-Sprint engagement as discretionary, ad-hoc, and bespoke per client. That framing produced three failure modes: (a) low conversion (clients defaulted to "we'll handle it ourselves" and silently let systems drift), (b) non-uniform delivery (each client got a different version of the work, which is impossible to scale past 3–4 accounts), and (c) margin compression (we kept saying yes to scope creep because the contract didn't draw a perimeter).
Watch fixes all three by being productized: a fixed shape, a fixed cadence, three priced tiers, a perimeter that's written down. The trade-off is less per-client customization in exchange for the ability to deliver consistently across many clients while building toward the pattern-library moat.
This means: do not customize Watch deliverables for individual clients beyond what's allowed in the scope ladder below. If a client asks for something outside Watch, propose a Beacon Workshop or a follow-on Sprint, then escalate the request to the principal so we can decide whether it should become part of Watch v1.1.
| Request type | Disposition |
|---|---|
| Maintenance (eval, model, runbook, governance) | Inside Watch · do it as part of monthly cadence |
| Improvement to existing agent | Inside Watch · channel as a candidate +1 improvement |
| New agent or new workflow migration | Outside Watch · scope a follow-on Sprint |
| Data infra repair (CDP, warehouse, identity) | Outside Watch · diagnose, refer to client's data team or a partner |
| One-off advisory beyond Pro tier hours | Outside Watch · scope a Beacon Workshop ($18k flat) |
| Crisis / outage (P0) | Inside Watch · honor SLA. If sustained >5 days, escalate to principal for re-scoping conversation |
Engagement org. Who does what on a Watch account.
Each Watch account has three roles. On Pro tier, the lead engineer is named publicly and committed; on Standard the lead engineer is consistent but not contractually named.
| Role | Allocation | Responsibilities |
|---|---|---|
| Lead engineer | 8–10 hrs/mo (Std), 14 hrs/mo (Pro) | The +1 improvement build. Monthly review call. Stream A and B execution. Health-report draft and signature. Single point of contact for the client in Slack. |
| Eval/ops automation | Async, ~2 hrs/mo per client | Scheduled eval runs, drift detection, model-swap test orchestration. Same engineer covers all clients (multi-tenant). |
| Studio principal | 1 hr/mo per client | Quarterly architecture review attendance. Renewal conversation. Escalation point for P0 sustained >48 hrs. Sign-off on +1 candidates that touch governance. |
A single lead engineer can carry 4 Standard or 2 Pro Watch clients simultaneously without quality slipping. Pro tier consumes more capacity because of the named-engineer commitment + advisory hours + tighter SLA.
Mixed loads convert at 1 Pro = 2 Standard. So a "full" engineer is 4 Standard, or 2 Pro, or 1 Pro + 2 Standard.
Hard ceiling: when an engineer hits capacity, the next sale either triggers a hire or is queued. Do not stack a 5th account. The quality of the +1 improvement is the first thing to slip and it's also the part clients renew for.
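The conversion rule above reduces to simple unit arithmetic. A minimal sketch, assuming each account is tracked by a tier string; the names `LOAD_UNITS`, `load_units`, and `can_take` are illustrative, not part of any existing Studio tool. Essentials has no stated weight in this manual, so it is left out rather than guessed.

```python
# Load per account in Standard-equivalent units, per the manual:
# 1 Pro = 2 Standard, and a full engineer carries 4 Standard.
# Essentials has no stated weight, so it is intentionally absent (assumption).
LOAD_UNITS = {"standard": 1, "pro": 2}
ENGINEER_CAPACITY_UNITS = 4


def load_units(tiers: list[str]) -> int:
    """Total load, in Standard-equivalent units, for a list of account tiers."""
    return sum(LOAD_UNITS[tier] for tier in tiers)


def can_take(current: list[str], new_tier: str) -> bool:
    """True if one more account keeps the engineer at or under the ceiling."""
    return load_units(current) + LOAD_UNITS[new_tier] <= ENGINEER_CAPACITY_UNITS
```

If `can_take` returns False, the sale is queued or triggers a hire; it is never absorbed.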
The Watch month. What the calendar looks like, week by week.
Every Watch month follows the same shape regardless of tier. The Pro tier adds 4 hrs of advisory available on demand inside the month; the Essentials tier removes Stream C (the +1).
| Week | Owner | Output | Definition of done |
|---|---|---|---|
| Week 01 | Lead engineer + client sponsor | Health report delivered (1st of month). Standing review call (45 min). Last month's +1 confirmed. This month's +1 confirmed in scope. | Report in client inbox · review call held · +1 ticket open in Linear |
| Week 02 | Lead engineer | +1 improvement build. Eval suite expanded to cover the +1. | +1 PR open · evals committed · review requested |
| Week 03 | Lead engineer + reviewer | +1 ships behind gate. Reviewer feedback loop opens. Foundation-layer scan (model releases, deprecation watch). | +1 in production behind gate · reviewer test underway · scan log dated this cycle |
| Week 04 | Lead engineer | +1 promoted (if eval passes). Governance Runbook updated. Next-month report drafted. | +1 promoted or kicked back · runbook commit · next report 75% complete |
| Quarterly add | Studio principal + lead engineer + client sponsor | 90-min architecture review. Output: written recommendation if any structural change is warranted. | Recommendation document in client folder · agreed action items in Linear |
| Annual add | Studio principal + client sponsor | Renewal conversation. Roll into next 12-mo, step down to Essentials, step up to Pro, or scope a new Sprint. | Decision recorded · contract addendum issued or new SOW kicked off |
Why the cadence is fixed
Predictability is the value prop. The client should never wonder when the next thing happens. If your week slips, it slips publicly: send a Slack message before the date, do not let the client discover the slip in the report.
Stream A. Foundation maintenance.
Stream A is the boring part. The point is that the client never has to think about it. Four sub-tasks, all instrumented and largely automated.
Trigger: any major foundation-model release (Claude N+1, GPT N+1, Gemini N+1). Complete the evaluation within 7 business days of GA. Procedure:
- Subscribe the studio to the relevant changelog feeds. We monitor: Anthropic, OpenAI, Google, Mistral, Meta. The #studio-foundation-watch Slack channel posts on each RSS hit.
- For each agent in the client's deployment, generate a side-by-side eval run: incumbent model vs. candidate. Eval set = the production eval suite. Run via the model-swap harness (tools/model-swap/).
- Score: pass-rate delta, cost delta, latency delta. Produce a 1-page swap brief per agent. File in client folder under watch/swap-briefs/YYYY-MM/.
- Recommendation: stay, swap, or side-by-side (run both for one cycle to gather more data). Default to stay unless candidate model wins on at least two of the three axes by >5%.
- Surface the recommendation in the next health report. If the recommendation is swap and it impacts evals or cost by >15%, schedule a 30-min call before the report goes out.
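The default-to-stay rule can be sketched as follows. This is an illustration, not the model-swap harness itself: `EvalRun`, `relative_gains`, `recommend`, and `needs_call` are hypothetical names, and reading ">5%" and ">15%" as relative change against the incumbent is an assumption. The manual gives no rule for when to pick side-by-side, so that option stays a judgment call and is not modeled.

```python
from dataclasses import dataclass


@dataclass
class EvalRun:
    pass_rate: float  # fraction of checks passed, 0..1
    cost: float       # cost per eval run, any consistent unit
    latency: float    # seconds per response


WIN_MARGIN = 0.05   # candidate must win an axis by >5% (relative; assumption)
CALL_MARGIN = 0.15  # swap impact >15% on evals or cost triggers a call


def relative_gains(incumbent: EvalRun, candidate: EvalRun) -> dict[str, float]:
    """Relative improvement per axis; positive means the candidate is better."""
    return {
        "pass_rate": (candidate.pass_rate - incumbent.pass_rate) / incumbent.pass_rate,
        "cost": (incumbent.cost - candidate.cost) / incumbent.cost,
        "latency": (incumbent.latency - candidate.latency) / incumbent.latency,
    }


def recommend(incumbent: EvalRun, candidate: EvalRun) -> str:
    """Stay unless the candidate wins at least two of three axes by the margin."""
    gains = relative_gains(incumbent, candidate)
    wins = sum(1 for gain in gains.values() if gain > WIN_MARGIN)
    return "swap" if wins >= 2 else "stay"


def needs_call(recommendation: str, incumbent: EvalRun, candidate: EvalRun) -> bool:
    """A swap that moves evals or cost by >15% gets a 30-min call before the report."""
    gains = relative_gains(incumbent, candidate)
    big_move = abs(gains["pass_rate"]) > CALL_MARGIN or abs(gains["cost"]) > CALL_MARGIN
    return recommendation == "swap" and big_move
```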
Trigger: scheduled, monthly, week 03. Procedure:
- Run the production eval suite for every agent in deployment. Capture pass-rate per check and aggregate per agent.
- Compare to last month's results. Threshold: a regression of >3 points pass-rate on any check is an automatic P1 ticket.
- Capture results in watch/evals/YYYY-MM.json and surface the trend in the health report.
- If any check has been failing for 3 consecutive months, propose an eval-set update (the check might be obsolete) or an agent change (the check might be revealing real drift).
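The two thresholds in this procedure can be sketched as below. The data shapes are assumptions: per-check pass-rates in percentage points keyed by check name, and a month-by-month pass/fail history with the oldest month first. The schema of watch/evals/YYYY-MM.json is not specified in this manual, and `regressions` and `persistently_failing` are illustrative names.

```python
REGRESSION_POINTS = 3.0  # >3 points pass-rate drop on any check is a P1
STALE_MONTHS = 3         # failing 3 consecutive months prompts a proposal


def regressions(previous: dict[str, float], current: dict[str, float]) -> list[str]:
    """Checks whose pass-rate fell by more than 3 points since last month."""
    return sorted(
        check
        for check, rate in current.items()
        if check in previous and previous[check] - rate > REGRESSION_POINTS
    )


def persistently_failing(history: list[dict[str, bool]]) -> list[str]:
    """Checks that failed in each of the last 3 monthly runs (oldest first).

    A check missing from a month is treated as passing (assumption).
    """
    recent = history[-STALE_MONTHS:]
    if len(recent) < STALE_MONTHS:
        return []
    return sorted(
        check
        for check in recent[-1]
        if all(not month.get(check, True) for month in recent)
    )
```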
This catches subtle output regressions that pass evals but feel different to reviewers. Procedure:
- Sample 50 agent outputs from the last 30 days, randomly weighted by reviewer-flagged frequency.
- Score against the original baseline samples (captured at Sprint handoff) using the drift harness. Output: drift score 0–1.
- Threshold: drift score >0.25 triggers a root-cause review. Common causes: model version silently auto-updated, prompt-template change in shared library, upstream data-shape shift.
- Surface drift score in health report regardless of threshold.
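One way to read "randomly weighted by reviewer-flagged frequency" is a weighted sample without replacement. The sketch below uses Efraimidis-Spirakis keys for that, which is a swapped-in technique, not a description of the drift harness. The weight of 1 + flag count (so unflagged outputs can still be drawn) and the names `sample_outputs` and `needs_root_cause_review` are assumptions.

```python
import random
from typing import Optional

SAMPLE_SIZE = 50        # outputs sampled from the last 30 days
DRIFT_THRESHOLD = 0.25  # drift score above this triggers a root-cause review


def sample_outputs(
    flag_counts: dict[str, int],
    k: int = SAMPLE_SIZE,
    seed: Optional[int] = None,
) -> list[str]:
    """Weighted sample without replacement using Efraimidis-Spirakis keys.

    flag_counts maps output id -> number of reviewer flags.
    Each output gets key u ** (1 / weight); the k largest keys are kept.
    """
    rng = random.Random(seed)
    keyed = [
        (rng.random() ** (1.0 / (1 + flags)), output_id)
        for output_id, flags in flag_counts.items()
    ]
    keyed.sort(reverse=True)
    return [output_id for _, output_id in keyed[:k]]


def needs_root_cause_review(drift_score: float) -> bool:
    """The score is reported either way; only >0.25 opens a review."""
    return drift_score > DRIFT_THRESHOLD
```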
Most outages of agent systems aren't model failures — they're providers deprecating endpoints with insufficient notice. Procedure:
- Maintain a per-client dependency manifest (watch/deps/<client>.yml): every external API the agents call, version pinned where possible.
- Watch the deprecation pages of each provider weekly. Auto-poll where APIs are available.
- When a deprecation date is announced for a dependency in scope, immediately open a P1 ticket and schedule the migration within the deprecation window minus 30 days. Never finish a migration in the last 2 weeks before deprecation.
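The scheduling rule is date arithmetic. A minimal sketch, assuming calendar days for both the 30-day buffer and the 2-week exclusion window; `migration_deadline` and `finishes_too_late` are illustrative names.

```python
from datetime import date, timedelta

MIGRATION_BUFFER_DAYS = 30  # schedule within the deprecation window minus 30 days
NO_FINISH_WINDOW_DAYS = 14  # never finish in the last 2 weeks before deprecation


def migration_deadline(deprecation: date) -> date:
    """Latest date the migration should be scheduled to complete."""
    return deprecation - timedelta(days=MIGRATION_BUFFER_DAYS)


def finishes_too_late(planned_finish: date, deprecation: date) -> bool:
    """True if the planned finish lands inside the final 2 weeks."""
    return planned_finish > deprecation - timedelta(days=NO_FINISH_WINDOW_DAYS)
```

For a dependency deprecated on 2025-06-30, the migration deadline is 2025-05-31, and any plan finishing after 2025-06-16 breaks the 2-week rule.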
Stream B. Governance & observability.
Stream B is what keeps the system trustworthy to the humans who sign off on its outputs. It's also the part that compounds: every governance decision becomes a runbook line, every runbook line becomes a candidate pattern.
The Governance Runbook (delivered at Sprint handoff, lives in the client's repo) is a living document. It updates whenever:
- An incident happens — output that should have been blocked got through, or output that should have shipped was incorrectly held.
- A policy changes — client legal or compliance team issues a new constraint that touches agent behavior.
- A regulatory ask arrives — auditor, regulator, internal audit asks for a control we should formalize.
Revision process: PR against the runbook, reviewer (the client's GC or designee) signs off, runbook merged, all gated agents re-tested against the new check, reviewers notified to re-read.
Every gated agent has a target acceptance rate of ≥80%. Acceptance = reviewer approves the agent's output without modification. Below 80% means the agent is producing more friction than value, even if its evals pass.
- Capture acceptance rate per agent, per week, in watch/metrics/acceptance.csv.
- If 4-week rolling avg drops below 80%, open a P1 ticket: schedule a 30-min review with the client's reviewers to identify the friction.
- Common causes: prompt outdated to the use case, evals are easier than the real review, reviewer UI mismatch, scope drift in the agent's tool selection.
- Resolution may itself become a +1 improvement. If so, scope it explicitly with the client at the next monthly review.
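The rolling-average trigger can be sketched as below, assuming weekly acceptance rates for one agent are available as fractions, oldest first. The column layout of watch/metrics/acceptance.csv is not specified here, so the sketch takes a plain list; `rolling_average` and `should_open_p1` are illustrative names.

```python
from typing import Optional

ACCEPTANCE_TARGET = 0.80  # target acceptance rate per gated agent
WINDOW_WEEKS = 4          # rolling window for the P1 trigger


def rolling_average(weekly_rates: list[float], window: int = WINDOW_WEEKS) -> Optional[float]:
    """Mean of the most recent `window` weeks, or None with too little history."""
    if len(weekly_rates) < window:
        return None
    recent = weekly_rates[-window:]
    return sum(recent) / window


def should_open_p1(weekly_rates: list[float]) -> bool:
    """True when the 4-week rolling average is below the 80% target."""
    average = rolling_average(weekly_rates)
    return average is not None and average < ACCEPTANCE_TARGET
```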
The Reviewer UI logs every reviewer action: approve, modify, reject, escalate. Modifications and rejections are signal — not for the agent's eval suite directly, but for the eval-set composition.
- Monthly: extract the last 30 days of reviewer modifications and rejections. Cluster by failure mode.
- For any cluster with ≥5 instances, propose adding a new check to the eval suite that catches that failure mode.
- If the new check fails on the current agent, the next +1 improvement is the fix.
- Rejection categories that recur across multiple clients become candidates for the canonical Studio eval set, contributing to the library moat.
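Once modifications and rejections carry a failure-mode label, the ≥5 threshold is a count. The sketch assumes labeling has already happened (by hand or otherwise) and that each logged action is a dict with `action` and `failure_mode` keys. That record shape is hypothetical, not the Reviewer UI's actual log format.

```python
from collections import Counter

MIN_CLUSTER_SIZE = 5                      # clusters with >=5 instances get a proposed check
SIGNAL_ACTIONS = {"modify", "reject"}     # approvals and escalations are not counted


def candidate_checks(actions: list[dict]) -> list[str]:
    """Failure modes with enough modifications/rejections to justify a new eval check."""
    counts = Counter(
        entry["failure_mode"]
        for entry in actions
        if entry["action"] in SIGNAL_ACTIONS
    )
    return sorted(mode for mode, n in counts.items() if n >= MIN_CLUSTER_SIZE)
```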
Every third month: 90-min call with client CMO/CTO sponsor + Studio principal + lead engineer. Agenda:
- Outcome metrics — has the agent moved the original Sprint metric?
- Pattern-library updates available — what could be deployed in the next quarter?
- Architecture concerns — any structural change warranted given new client priorities or new infra?
- Forward roadmap — next quarter's anchor improvements, in writing.
Output: 1-page recommendation memo, filed in client folder, referenced at the next renewal conversation.
Stream C. The +1 improvement.
The +1 improvement is the differentiator and the renewal hook. The client signs Watch · Standard or Pro because of this stream. Treating it casually destroys the entire product.
- A new pattern from @umbra/patterns deployed into the client's existing agent system. Example: replacing a hand-rolled gating primitive with the canonical governance.human-in-the-loop-gate pattern.
- A workflow optimization that cuts a measured friction. Example: collapsing a two-step approval into a single-step approval with a deferred audit trail, where reviewers have already signaled the second step is rubber-stamping.
- An eval-set expansion that catches a real-world failure mode that recurred in the reviewer feedback loop. Counts only if it produces a measurable improvement in acceptance rate.
- A new agent or new workflow — that's a Sprint.
- A bug fix — that's maintenance (Stream A or B).
- A purely cosmetic change: UI polish, copy tweaks, log formatting. Real improvements affect outputs, gates, or trust.
- An untestable change. If we can't write an eval that distinguishes the +1 from the baseline, the +1 doesn't ship.
- Month-minus-1 review. In the previous month's review call, propose 1–3 candidate +1's with rough effort and target metric. Client picks one. Default candidate sources: reviewer feedback loop, library updates, quarterly architecture recommendations.
- Week 01 confirm. First week of the new month: confirm scope of the chosen +1 in writing (Linear ticket + Slack message to client sponsor).
- Week 02 build. Engineer implements. Eval check expanded to cover the +1's specific behavior.
- Week 03 ship-behind-gate. Deploy with feature flag or behind a gate. Reviewers see both old and new behavior in shadow mode.
- Week 04 promote-or-kick. If eval passes the threshold AND reviewer feedback signals improvement, promote. If either fails, kick to next month with a written diagnosis.
- Library contribution check. If the +1 looks like a recurring pattern (already seen in 2+ clients), tag it for promotion to @umbra/patterns. Three matches = canonical.
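The Week 04 decision and the library check are both small decision rules. A sketch under stated assumptions: the eval threshold is set per +1, reviewer feedback is reduced to a single boolean, and the function names and return strings are illustrative.

```python
def week4_decision(eval_pass_rate: float, eval_threshold: float,
                   reviewer_signal_improved: bool) -> str:
    """Promote only if the eval passes AND reviewers signal improvement."""
    if eval_pass_rate >= eval_threshold and reviewer_signal_improved:
        return "promote"
    return "kick"  # goes to next month with a written diagnosis


def library_status(client_matches: int) -> str:
    """Seen in 2+ clients: tag for promotion. Three matches: canonical."""
    if client_matches >= 3:
        return "canonical"
    if client_matches >= 2:
        return "tag-for-promotion"
    return "none"
```

A +1 passing 81% against a 90% target is kicked even when reviewers like it: `week4_decision(0.81, 0.90, True)` returns `"kick"`.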
The eval-gate is non-negotiable
If a +1 doesn't pass its eval, do not ship it because the client wants the count. Send a Slack message owning the slip, schedule a 15-min call, propose a smaller scope or kick to next month. Shipping a failing +1 destroys the trust that the cadence is meaningful. Take the L.
Stream D. The monthly health report.
The health report is the only deliverable many client stakeholders ever read. It is also the artifact procurement uses to justify the renewal. It must be sharp, scannable, and signed.
3–4 pages. PDF generated from the canonical template (lighthouse-watch-monthly-report.html). Sections, in order:
- Header — client name, month, version, signature line for lead engineer.
- This month at a glance — 4 KPIs (acceptance avg, evals pass-rate, +1 status, incidents). Single line each.
- Model status — versions in production, last evaluation, recommended action (per agent).
- Acceptance & eval trends — 90-day chart per agent, brief commentary on movements.
- Incidents this month — 1 line each, link to full RCA if applicable.
- The +1 improvement delivered — what we shipped, what changed, before/after metric.
- Pattern-library updates available — list of new patterns since last report, with one-line recommendations.
- Next month preview — the +1 on deck, any model migrations expected, any quarterly review scheduled.
Direct, dry, no fluff. The client's CFO might read this. No marketing language, no "we're excited to announce." If something didn't ship, say so on the first line of the relevant section. Examples:
- Yes: "+1 improvement missed eval threshold (passing 81% vs. target 90%); deferred to next cycle. Diagnosis: eval set under-covers edge case X; expanding."
- No: "We're working hard on enhancing the system and look forward to delivering even more value next month."
- Draft by Week 04, signed off by Studio principal.
- Delivered via email to client sponsor on the 1st of the month, before 9am client-local time. PDF attached + permanent link in client folder.
- Acknowledged in Slack within 48 hours. If no acknowledgment, ping. If still no acknowledgment after 5 days, principal calls — silent disengagement is a renewal-risk signal.
Escalation & SLA. When and how things escalate.
| Severity | Definition | Response | Resolve | Escalation |
|---|---|---|---|---|
| P0 · Critical | System down, evals failing across the board, governance breach, agent emitting outputs that violate policy | 4 hrs (Pro) · 1 BD (Std/Ess) | 24 hrs | Lead engineer immediately; principal notified within 4 hrs |
| P1 · High | Single agent failing, acceptance <80% for 4 weeks, model deprecation imminent | 1 BD | 5 BD | Lead engineer; principal notified at 5 BD if unresolved |
| P2 · Standard | Drift detected (within thresholds), runbook update needed, +1 scoping | Next monthly review | Within current cycle | None unless escalated by client |
| P3 · Backlog | Nice-to-haves, future tier upgrades, research items | Quarterly review | Quarterly review | None |
- Slack channel (shared, named #watch-<client-shortname>): first line. All routine traffic.
- Email to lead engineer: escalation if Slack unread >SLA.
- Phone (Pro tier only): direct line to lead engineer. P0 reserved.
- Principal escalation: when SLA is at risk, when a request is outside Watch scope, when client is signaling churn.
Sustained P0
If a P0 is unresolved at 48 hours, the principal joins the bridge. If unresolved at 5 days, we issue a written incident summary to the client and a service credit equal to the prorated days lost. Sustained P0s are a Watch failure, not a feature; they trigger a post-mortem in #studio-watch-ops.
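The sustained-P0 thresholds and the credit can be sketched as below. The proration basis is an assumption (monthly fee times days lost, divided by calendar days in that month); the contract language governs. `p0_actions` and `service_credit` are illustrative names.

```python
import calendar


def p0_actions(hours_unresolved: float) -> list[str]:
    """Actions owed once a P0 has been open for the given number of hours."""
    actions = []
    if hours_unresolved >= 48:
        actions.append("principal joins the bridge")
    if hours_unresolved >= 5 * 24:
        actions.append("written incident summary")
        actions.append("service credit")
    return actions


def service_credit(monthly_fee: float, days_lost: int, year: int, month: int) -> float:
    """Prorated credit: fee x days lost / calendar days in the month (assumption)."""
    days_in_month = calendar.monthrange(year, month)[1]
    return round(monthly_fee * days_lost / days_in_month, 2)
```

Under that assumption, 5 days lost on a $24k Standard account in a 30-day month is a $4,000 credit.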
Renewal & offboarding. What happens at month 11 and month 13.
One month before contract end. Studio principal + client sponsor. Agenda:
- Review the year's outcomes against the original Sprint metric.
- Show the +1 improvement ledger — every one delivered, before/after metric.
- Show the patterns inherited from the library during the year — value the client got "for free" via Watch.
- Propose the renewal: same tier, step-up to Pro, step-down to Essentials, or a new Sprint.
Default outcome target: 85% of clients renew at or above the same tier. Track in CRM.
A non-renewal is signal, not always failure. Diagnosis questions:
- Did we deliver the cadence reliably? (If not — that's our miss.)
- Did the +1 improvements actually move metrics? (If not — that's a scope or selection problem.)
- Is the agent system itself still in production, or did the client retire the use case? (If retired — Watch was successful but the underlying need ended.)
- Did the client hire internal capacity to take over? (If yes — that's an outcome we should celebrate even though revenue ends.)
Document the answer. If the cause is a Studio failure, post-mortem in #studio-watch-ops within 14 days.
- Final health report covers the full final month.
- Repository handoff: client gets the canonical eval suite, the runbook, the swap briefs, all health reports, and the pattern-library deployment manifest. No ongoing access to the canonical @umbra/patterns updates after offboard date; those are a Watch benefit.
- Slack channel archived 14 days after final report. Email contact preserved for emergency questions for 90 days.
- Client added to win-back outreach 90 days post-offboard with a check-in note from the principal.
Capacity & economics. The numbers behind the price.
This section exists so that engineers running Watch accounts understand the economic constraints behind the pricing and don't quietly let scope expand to commercially unsustainable levels.
| Scope | Figure | Detail |
|---|---|---|
| Per Watch · Standard, monthly | Cost | Lead engineer: 8–10 hrs · Eval/ops automation: ~$200/mo · Other API and observability: ~$300/mo · Total marginal cost: ≈ $3.5k at fully-loaded engineer cost. |
| Per Watch · Standard, monthly | Margin | Revenue: $24k · Marginal cost: ≈ $3.5k · Gross margin: ≈ 85%. Each Standard account funds about 22% of one engineer's full-time cost. |
| Per engineer, monthly capacity | Standard load | 4 Standard clients × $24k = $96k/mo gross revenue · ≈ $14k/mo marginal cost · ≈ $82k/mo gross profit per engineer at full Standard load. |
| Per engineer, monthly capacity | Pro load | 2 Pro clients × $38k = $76k/mo gross revenue. Lower headline than Std-full, but higher per-client renewal probability (named engineer = stickier). |
Hire a new lead engineer when sustained Watch revenue clears $60k/mo for 3 consecutive months (≈ 2.5 Standard accounts). The new engineer is brought into the rotation by shadowing the principal on 1 client for 1 quarter, then taking it over.
Do not hire ahead of the curve. Watch capacity is one of the few places we can absorb churn — under-hire by 1 client and use principal time to fill the gap.
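The margin arithmetic and the hire trigger can be checked with a few lines. The dollar figures come from this section; reading "clears $60k" as strictly above $60k is an assumption, and `gross_margin` and `should_hire` are illustrative names.

```python
STANDARD_REVENUE = 24_000       # per Standard account, monthly
STANDARD_MARGINAL_COST = 3_500  # approximate, at fully-loaded engineer cost
HIRE_THRESHOLD = 60_000         # sustained Watch revenue per month
HIRE_MONTHS = 3                 # consecutive months required


def gross_margin(revenue: float, cost: float) -> float:
    """Gross margin as a fraction of revenue."""
    return (revenue - cost) / revenue


def should_hire(monthly_watch_revenue: list[float]) -> bool:
    """True when the last 3 months (oldest first) all clear the threshold."""
    recent = monthly_watch_revenue[-HIRE_MONTHS:]
    return len(recent) == HIRE_MONTHS and all(r > HIRE_THRESHOLD for r in recent)
```

`gross_margin(24_000, 3_500)` is about 0.854, matching the ≈ 85% above, and four Standard accounts give 4 × ($24k − $3.5k) = $82k/mo gross profit.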
Tooling. What instruments Watch runs on.
Watch is delivered against a small, sharp toolset. Adding new tools requires a write-up and principal sign-off — every tool added is a tool we have to maintain across all clients.
| Layer | Tool | Use |
|---|---|---|
| Issue tracking | Linear | Per-client project. P0..P3 ticket discipline. The +1 improvement is always a single Linear ticket linked to its eval check. |
| Communication | Slack shared channel | Channel name: #watch-<client-shortname>. Lead engineer is always present; principal is on every channel but only speaks when escalated. |
| Repository | Client GitHub or Studio-managed | Canonical eval suite + runbook + agent code. Watch deploys via the same pipeline that the Sprint set up. |
| Eval orchestration | LangSmith + Studio harness | All evals run in LangSmith projects scoped per-client. Studio harness wraps for cron + drift detection. |
| Observability | LangSmith traces + client's own observability | Step-level visibility per the 8 working principles. Mirror critical traces to client's tools where contractually required. |
| Model swap | Studio model-swap harness | Side-by-side eval runner. Lives at tools/model-swap/. Pluggable provider clients. |
| Reports | Pandoc / WeasyPrint | Health report PDF generated from the canonical HTML template. Client folder stores all historical reports. |
| Pattern library | @umbra/patterns | Versioned source of truth for inheritable patterns. Watch +1's are deployments of (or contributions to) this library. |
| CRM | Airtable (Watch base) | Per-client renewal status, NPS, +1 ledger, churn signals. |
- No client-facing dashboard. The health report is the dashboard. A live dashboard creates expectation of real-time monitoring that Watch doesn't promise.
- No bespoke per-client eval frameworks. Everything goes through the Studio harness. Bespoke = unscalable.
- No multi-channel client comms. Slack + scheduled email + scheduled call. Anything else is creep.