Governance Operations.
Cross-phase operational manual for Lighthouse Sprint governance. Covers the six governance components, the spreadsheet-as-state-machine pattern, severity levels, alerting and escalation, rollback procedures, audit trail hygiene, and incident response discipline — the machinery that makes agent systems safe to run in production.
Paired with the Governance Runbook template. Read on Day 1 of the sprint; kept open through Handoff and the 30-day support window. Owned by the Governance Architect; read by every Studio role and the client Technical Counterpart.
How to use this manual.
Governance Operations is cross-phase. Unlike the Discovery, Redesign, Build, and Handoff playbooks — each of which covers a specific phase — this manual runs horizontally across every phase of every sprint. It is the discipline that makes agent systems safe to run on a client’s production data.
The manual is written for the Governance Architect, who owns the work described here end to end, and is mandatory reading for the Sprint Lead (who approves and audits it), the Agent Engineer (who instruments what it specifies), and the client Technical Counterpart (who inherits the instrumentation at Gate 4).
It is not a design manual — the Redesign Playbook covers the design of the wrapper. This manual is the operational side: how the six components behave, how severity is reasoned about, how rollback is performed, how the audit log is kept clean. Everything in this manual should still be true six months after Gate 4.
Where this manual shows up in the sprint
| Phase | Governance Operations is used for |
|---|---|
| Discovery | Reviewed by the Governance Architect to anticipate which components the system will need; informs the Fit Assessment Rubric’s governance criteria. |
| Redesign | Read alongside the Redesign Playbook §07. Defines the vocabulary for the Governance Wrapper design; informs per-agent governance mapping. |
| Build | Primary operational reference during the 4-week governance build-out (Weeks 5–8). Defines what “instrumented” means for each component. |
| Handoff | Basis of the Week-10 governance drills; becomes the Workflow Owner’s reference after Gate 4. The Operations Manual cites this document for deep governance questions. |
| Support window | When a client asks a question about governance in the 30-day support window, the answer lives somewhere in this manual. |
Governance at a glance.
Umbra Studio’s governance architecture has three layers: the agent layer, the state layer, and the governance wrapper layer. The wrapper is composed of six components, each with a specific responsibility. The state layer — the spreadsheet-as-state-machine (USP-009) — is the shared surface between humans and agents.
The three layers
The agent layer.
Every agent has a spec — mission, trigger, inputs, decision rules, outputs, failure modes, success criteria, governance. The agent layer is the work; it is everything that writes outputs.
The state spreadsheet.
The canonical record of what exists, what is pending, and what was approved. Writable in narrow columns by humans; writable in broader columns by agents. Editable in the same UI both parties already use (pattern USP-009).
The governance wrapper.
Six components that sit around the agents: Monitoring, Alerting, Audit Trail, Escalation, Override, Rollback. The wrapper has no business logic; it has only supervisory logic.
The six components
| Component | Answers the question | Primary audience |
|---|---|---|
| Monitoring | How is the system behaving, right now? | Governance Architect, Workflow Owner |
| Alerting | Who needs to know, and how quickly? | Workflow Owner, Technical Counterpart |
| Audit Trail | What happened, and when, and why? | Workflow Owner (day 1 forensics) |
| Escalation | Who handles this when the first responder is stuck? | Executive Sponsor, Sprint Lead |
| Override | How do I stop a specific agent without stopping the system? | Workflow Owner |
| Rollback | How do I restore a known-good state? | Workflow Owner, Technical Counterpart |
The six components.
Monitoring answers: “How is the system behaving right now?” It reads the audit trail and surfaces the signal humans need to judge system health without reading raw logs. Its output is dashboards, not alerts; alerts are a separate component.
What Monitoring tracks
- Per-agent throughput — invocations per hour, with trailing 24-hour comparison.
- Per-agent error rate — failed / total, rolling 24-hour window.
- Per-agent latency — p50 / p95 invocation duration.
- End-to-end pipeline latency — from the first agent’s input to the last agent’s output, if the roster runs as a pipeline.
- State spreadsheet growth rate — rows added per day per sheet.
- Quality score — if the system has a scorable output (e.g. editorial review score), rolling average and delta from baseline.
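The per-agent error rate and latency tiles above can be derived directly from audit rows. The following is a minimal sketch, assuming each row is a dict with `outcome` and `duration_ms` keys (hypothetical field names) and that the caller has already filtered rows to the rolling 24-hour window. The nearest-rank percentile is one common choice, not a Studio standard.

```python
import math

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already-sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def agent_health(rows):
    """Summarise one agent's audit rows for a window the caller has filtered."""
    total = len(rows)
    if total == 0:
        return {"invocations": 0, "error_rate": None, "p50_ms": None, "p95_ms": None}
    failed = sum(1 for r in rows if r["outcome"] == "failure")
    durations = sorted(r["duration_ms"] for r in rows)
    return {
        "invocations": total,
        "error_rate": failed / total,          # failed / total, as in the tile definition
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
    }

# Illustrative data: nine successes (100..900 ms) and one failure (1000 ms).
rows = [{"outcome": "success", "duration_ms": d} for d in range(100, 1000, 100)]
rows.append({"outcome": "failure", "duration_ms": 1000})
summary = agent_health(rows)
```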
“Done” definition
Monitoring is done when: (a) every agent has throughput, error, and latency tiles visible on a dashboard; (b) the dashboard is bookmarked on the Workflow Owner’s home tab; (c) the Workflow Owner can answer “how did the system do yesterday?” in under 60 seconds.
Common pitfalls
- Dashboards only the Governance Architect can read. If the Workflow Owner can’t read it, it is not a dashboard — it is vanity. Build the view Workflow-Owner-first.
- Too many tiles. Twelve tiles is a dashboard; sixty is a maze. Cut mercilessly.
Alerting answers: “Who needs to know, and how quickly?” It is the active-notification complement to Monitoring’s passive dashboards. Alerts are severity-routed (SEV-1 / SEV-2 / SEV-3; see §05) and go through specific channels to specific humans.
Alert rule anatomy
- Name — human-legible; “Editorial Writer quality score fell below 0.70 for 1 hour” not “ALR-014.”
- Condition — the measurable rule the alert fires on. Specific thresholds, windowed.
- Severity — SEV-1, 2, or 3 per §05.
- Routing — which human(s) receive it, through which channel.
- Suppression — any conditions under which the rule should be silenced (maintenance window, known-issue flag).
- Runbook link — a link into this manual or the Operations Manual describing the response procedure.
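The six parts of a rule map naturally onto a small record type. This is a sketch only: the class name `AlertRule`, the metric key `quality_score_1h_avg`, the `maintenance_window` flag, and the runbook link are illustrative assumptions, not part of any Studio tooling.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

SEVERITIES = ("SEV-1", "SEV-2", "SEV-3")

def never_suppress(context: dict) -> bool:
    """Default suppression: the rule is never silenced."""
    return False

@dataclass(frozen=True)
class AlertRule:
    name: str                              # human-legible, not an ID
    condition: Callable[[dict], bool]      # measurable, windowed threshold
    severity: str                          # SEV-1, SEV-2, or SEV-3
    route_to_role: str                     # a role, resolved to a person elsewhere
    channels: Tuple[str, ...]
    runbook_link: str
    suppress_when: Callable[[dict], bool] = never_suppress

    def __post_init__(self):
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")

    def should_fire(self, metrics: dict, context: dict) -> bool:
        return self.condition(metrics) and not self.suppress_when(context)

quality_rule = AlertRule(
    name="Editorial Writer quality score fell below 0.70 for 1 hour",
    condition=lambda m: m["quality_score_1h_avg"] < 0.70,
    severity="SEV-2",
    route_to_role="Workflow Owner",
    channels=("primary_chat", "email"),
    runbook_link="governance-runbook#quality-breach",  # placeholder link
    suppress_when=lambda ctx: ctx.get("maintenance_window", False),
)
```

Rejecting unknown severities at construction time keeps a fourth level from creeping in through configuration.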
“Done” definition
Alerting is done when: (a) every agent has at least one SEV-2 rule; (b) every SEV level has been fired end-to-end at least once during Build (test or real); (c) the Workflow Owner has personally received at least one test alert and confirmed the delivery channel.
Common pitfalls
- Alert fatigue. More than 5 alerts per day desensitises the responder. Tune thresholds or consolidate rules. An ignored alert is worse than no alert.
- Routing to a team channel that no one reads. Alerts go to specific humans, not group inboxes. Group inboxes are for summaries, not alerts.
Audit Trail answers: “What happened, and when, and why?” It is the immutable, append-only record of every agent invocation and every material human intervention. It is the ground truth that Monitoring summarises and that post-mortems refer to.
Required fields per invocation
- Timestamp — ISO-8601 with timezone.
- Agent name — human-legible.
- Agent version — git SHA or semver.
- Input hash — deterministic hash of the input payload.
- Output hash — deterministic hash of the output payload.
- Outcome — one of success, failure, partial, or skipped.
- Duration — end-to-end wall-clock in ms.
- Trigger — what caused this invocation (schedule, spreadsheet change, manual, upstream agent).
- Error — if outcome is failure, the error class and one-line message.
- Correlation ID — a shared ID across agents in the same pipeline execution, so multi-agent flows can be reconstructed.
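A row builder that enforces the required fields might look like the sketch below. The key names and the choice of SHA-256 over canonical JSON are assumptions made for illustration; any deterministic hash satisfies the requirement.

```python
import hashlib
import json
from datetime import datetime, timezone

OUTCOMES = {"success", "failure", "partial", "skipped"}

def payload_hash(payload):
    """Deterministic hash: canonical JSON (sorted keys, fixed separators), then SHA-256."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def audit_row(agent, version, trigger, correlation_id,
              input_payload, output_payload, outcome, duration_ms,
              error=None, now=None):
    """Build one audit row; `now` should be a timezone-aware datetime (defaults to current UTC)."""
    if outcome not in OUTCOMES:
        raise ValueError(f"outcome must be one of {sorted(OUTCOMES)}")
    if outcome == "failure" and not error:
        raise ValueError("a failure row needs an error class and message")
    stamp = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "timestamp": stamp.isoformat(),   # ISO-8601, always UTC in the raw record
        "agent": agent,
        "agent_version": version,
        "input_hash": payload_hash(input_payload),
        "output_hash": payload_hash(output_payload),
        "outcome": outcome,
        "duration_ms": duration_ms,
        "trigger": trigger,
        "error": error,
        "correlation_id": correlation_id,
    }

row = audit_row("Editorial Writer", "1.4.2", "schedule", "run-0001",
                {"brief": "a", "id": 7}, {"draft": "text"}, "success", 8200,
                now=datetime(2025, 1, 7, 14, 7, tzinfo=timezone.utc))
```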
“Done” definition
Audit Trail is done when: (a) every agent writes every required field every invocation; (b) the Workflow Owner can answer “show me every invocation of agent X last Tuesday” in under 3 minutes, unassisted; (c) the log is append-only — no edits or deletions without a Studio-principal signature.
Common pitfalls
- Missing correlation IDs. A multi-agent flow without a correlation ID is unreconstructible. Wire it on Day 22.
- Timezone drift. Always log UTC in the raw record, render client-local in the UI. Mixing timezones is a post-mortem-killer.
- Mutable logs. A log that can be edited is not an audit trail. Use append-only storage — typically a spreadsheet tab with no delete permissions.
Escalation answers: “Who handles this when the first responder is stuck?” It is the path a live incident takes when the initial responder cannot resolve it within a bounded window. Escalation is a human protocol; the system merely enforces it.
The three-tier escalation ladder
- Tier 1 (First responder) — typically Workflow Owner. Window: 15 minutes for SEV-2, 5 minutes for SEV-1. If unresolved, escalate to Tier 2.
- Tier 2 (Technical escalate) — Technical Counterpart (in-engagement) or Executive Sponsor (post-handoff). Window: 60 minutes. If unresolved, escalate to Tier 3.
- Tier 3 (Strategic escalate) — Executive Sponsor and Sprint Lead (in-engagement) or Executive Sponsor and Studio (post-handoff). No further tier — this is the decision authority.
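The ladder's timing can be expressed as a small lookup. This sketch assumes the Tier 2 clock starts when the Tier 1 window expires, which the ladder does not state explicitly, and it covers only SEV-1 and SEV-2 because those are the only levels the ladder gives windows for.

```python
TIER1_WINDOW_MIN = {"SEV-1": 5, "SEV-2": 15}
TIER2_WINDOW_MIN = 60

def current_tier(severity, minutes_unresolved):
    """Return the tier (1, 2, or 3) that should own the incident right now."""
    tier1 = TIER1_WINDOW_MIN[severity]        # KeyError for levels without a window
    if minutes_unresolved < tier1:
        return 1
    if minutes_unresolved < tier1 + TIER2_WINDOW_MIN:
        return 2
    return 3                                  # decision authority; no further tier
```

For example, a SEV-2 still unresolved after 20 minutes sits with Tier 2, and after 75 minutes with Tier 3.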
“Done” definition
Escalation is done when: (a) every SEV level has a named first responder and an explicit escalation path written down in the Runbook; (b) every person on the ladder has explicitly acknowledged they are on the ladder; (c) at least one SEV-2 has been escalated end-to-end during Build (test or real).
Common pitfalls
- Escalation ladder with the same person at two tiers. Not an escalation ladder. Fix.
- Tier 2 who didn’t know they were Tier 2. Escalation requires consent; get it on the record.
Override answers: “How do I stop a specific agent without stopping the system?” It gives the Workflow Owner fine-grained control to pause, resume, or redirect individual agents, without taking down the whole roster. Implemented through the state-spreadsheet pattern (USP-009) as a human-writable column.
Override primitives
- Pause — agent stops accepting new invocations; in-flight invocations finish.
- Resume — agent accepts invocations again.
- Force-failure — in-flight invocation marked failed; audit row written; downstream agents respect the failure.
- Skip-next — next scheduled invocation skipped without pausing the agent longer-term.
- Redirect — route to a different output destination (e.g. draft status instead of published) for a bounded window.
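One way an agent could honour Pause, Resume, and Skip-next before each invocation is sketched below. The row shape (`id`, `agent`, `directive`), the "latest row wins" rule, and tracking consumed skip directives on the agent side (so the agent never writes to the human-only Overrides tab) are all assumptions for illustration.

```python
def should_run(agent, overrides, consumed_skip_ids):
    """Decide whether `agent` may start a new invocation.

    overrides: list of dicts {id, agent, directive} in sheet order.
    consumed_skip_ids: ids of skip-next directives already honoured.
    Returns (run, skip_id); when skip_id is not None the caller should
    record it as consumed and write an audit row for the skip.
    """
    latest = None
    for row in overrides:
        if row["agent"] == agent:
            latest = row                      # later rows override earlier ones
    if latest is None or latest["directive"] == "resume":
        return True, None
    if latest["directive"] == "pause":
        return False, None                    # in-flight work is not touched here
    if latest["directive"] == "skip-next":
        if latest["id"] in consumed_skip_ids:
            return True, None                 # the one skip was already used
        return False, latest["id"]
    raise ValueError(f"unknown directive: {latest['directive']}")

overrides = [
    {"id": "ov-1", "agent": "Editorial Writer", "directive": "pause"},
    {"id": "ov-2", "agent": "Editorial Writer", "directive": "resume"},
    {"id": "ov-3", "agent": "Cover Art", "directive": "skip-next"},
]
```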
“Done” definition
Override is done when: (a) every agent is individually pausable and resumable from the state spreadsheet; (b) the pause is reflected in the audit trail as a human intervention with timestamp and signer; (c) the Workflow Owner has demonstrably paused and resumed at least one agent, solo, during Week 10.
Common pitfalls
- Global kill switch masquerading as Override. A big red button that stops everything is crisis tooling, not Override. Override is targeted.
- Pause that isn’t logged. Every override is an audit event. No exceptions.
Rollback answers: “How do I restore a known-good state?” It is the most consequential of the six components — the only one that can destroy work if misused, and also the only one that can save an engagement from a catastrophic state event. Procedures in §07.
Rollback primitives
- Snapshot — a point-in-time copy of the state spreadsheet plus all dependent stores.
- Restore — replace current state with a named snapshot; preserves audit trail.
- Replay — re-run agent invocations from a timestamp, given the restored state (rarely used; high-cost).
- Partial rollback — restore only specific tabs / rows / rows-matching-criterion; requires Studio-principal signature.
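Snapshot and Restore can be illustrated with an in-memory model. This is a sketch under stated assumptions: state is a dict of tab name to rows, the snapshot store is a separate dict standing in for a separate location, and "preserves audit trail" is read as keeping the current History tab and appending a restore event to it.

```python
import copy

def take_snapshot(state, snapshot_store, name):
    """Store an independent point-in-time copy of the state under `name`."""
    snapshot_store[name] = copy.deepcopy(state)

def restore(state, snapshot_store, name, signer, history_tab="History"):
    """Replace current state with the named snapshot, keeping the audit trail."""
    snapshot = copy.deepcopy(snapshot_store[name])
    preserved = list(state.get(history_tab, []))
    preserved.append({"event": "restore", "snapshot": name, "signer": signer})
    state.clear()
    state.update(snapshot)
    state[history_tab] = preserved            # history is never rolled back

state = {"Queue": [{"id": 1, "status": "queued"}], "History": []}
store = {}
take_snapshot(state, store, "snap-1400")

# Simulate a bad write after the snapshot, then roll back.
state["Queue"][0]["status"] = "corrupted"
state["History"].append({"event": "status_change", "id": 1})
restore(state, store, "snap-1400", signer="Workflow Owner")
```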
“Done” definition
Rollback is done when: (a) at least hourly snapshots run automatically during active hours; (b) snapshots are stored in a separate location from the primary state; (c) restore is demonstrated live, end-to-end, in under 10 minutes during the Week-8 incident response drill; (d) the Workflow Owner can initiate a restore, solo, using the Operations Manual as the only reference.
Common pitfalls
- Snapshots stored in the same spreadsheet they back up. A fire destroys both. Separate store, separate credentials.
- Snapshots never tested. An untested snapshot is not a snapshot — it is a hope. Drill.
The spreadsheet as state machine.
USP-009 is Umbra Studio’s highest-utility pattern. It says: use a structured spreadsheet as the canonical state store for a multi-agent system that must share authority with humans. Not a database with an admin UI; not a backing store behind a custom panel — the spreadsheet is the UI, and it is the store, and the agents read and write it directly through its API.
The pattern works because humans already know how to read and edit spreadsheets. Building a custom admin panel is expensive, takes design time, requires training, and creates a second source of truth. USP-009 short-circuits all of that. It is not appropriate for every engagement, but where it applies it can collapse weeks of admin-surface work into hours.
When to reach for USP-009
| Fit signal | Explanation |
|---|---|
| Workflow Owner is already editing a spreadsheet | If the pre-agent workflow uses a spreadsheet as the source of truth, continue the metaphor rather than replace it. |
| State fits a tabular model | Rows = items, columns = attributes, one sheet per category. If the data is a graph or deeply nested, USP-009 is the wrong pattern. |
| Data volume < ~100K rows | Spreadsheets slow down beyond ~100K rows. For high-volume systems, use a database with a thin admin layer. |
| Human edit frequency is moderate | If humans never edit the state, pure-database is simpler. If they edit constantly, a custom UI is probably needed. USP-009 is for moderate, context-aware editing. |
| Governance needs visibility | A spreadsheet you can see beats a black box you cannot. Regulators, auditors, and sponsors understand spreadsheets. |
Column discipline — the rules that make USP-009 work
- Human-only columns — agents never write these. Approval flags, priority overrides, notes, pauses. Typically 3–7 columns per sheet.
- Agent-only columns — humans never write these (outside documented edge cases). Status, timestamps, hashes, URLs. Typically 8–15 columns per sheet.
- Schema lock by Week 5 end. Post-Week-5 schema changes require Studio-principal signature and an audit-log entry naming the change and rationale.
- No merged cells, no formulas agents depend on. Agents parse the sheet deterministically; merged cells and stateful formulas break parsers.
- Every row has a stable ID. UUID or monotonic sequence — never derive identity from a mutable field (like a URL that might change).
- Status columns use a closed vocabulary, never free text. Document the vocabulary in the Operations Manual.
- History preserved via append, not update. Status changes append to a history tab; the main sheet shows current state only.
- Read-only columns are hard-read-only. Protected via sheet permissions, not just convention.
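Sheet permissions are the real enforcement, but agents can also check their own writes before sending them. The sketch below shows the column-ownership and closed-vocabulary rules as a guard function; every column name and status value in it is hypothetical.

```python
HUMAN_ONLY = frozenset({"approved", "priority_override", "notes"})
AGENT_ONLY = frozenset({"status", "updated_at", "output_hash", "url"})
STATUS_VOCABULARY = frozenset(
    {"queued", "in_progress", "awaiting_approval", "published", "failed"}
)

def check_write(actor, column, value):
    """Return None if the write is allowed, otherwise a reason string."""
    if column not in HUMAN_ONLY | AGENT_ONLY:
        return "unknown column: schema is locked"
    if actor == "agent" and column in HUMAN_ONLY:
        return "agents never write human-only columns"
    if actor == "human" and column in AGENT_ONLY:
        return "humans do not write agent-only columns"
    if column == "status" and value not in STATUS_VOCABULARY:
        return "status must come from the closed vocabulary"
    return None
```

For instance, `check_write("agent", "status", "done")` is rejected because "done" is not in the vocabulary, while `check_write("agent", "status", "published")` passes.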
Typical sheet layout
| Tab | Contents | Editable by |
|---|---|---|
| Queue | The live work items — one row per item, columns for status, priority, timestamps, URLs | Mixed (column-bounded) |
| Config | Agent configuration — rate limits, thresholds, feature flags | Workflow Owner + Agent Engineer |
| History | Append-only event log — every status change, agent invocation, override | Agents (append) + read-only for humans |
| Overrides | Per-agent pause / resume / skip directives | Workflow Owner only |
| Alerts | Recent alerts for dashboard context | Agents (append) + read-only for humans |
| Archive | Closed items moved out of Queue after N days | Agents (append) + read-only for humans |
Severity levels.
Three severity levels, each with a specific definition, a specific response window, and a specific routing path. Never invent a fourth level — if an event doesn’t fit one of three, it is not an incident. The entire rest of the governance stack (alerting, escalation, rollback, audit) depends on these definitions being stable.
SEV-1: Data at risk.
Definition: The system has produced or is about to produce output that would be unrecoverable or reputation-damaging if not intercepted. Published content that is wrong. Data writes that cannot be undone without manual labour.
Response: Immediate. Pause the responsible agent; initiate rollback if state was written; notify Sprint Lead and Workflow Owner within 5 minutes; handled by the Tier-1 first responder.
Examples: Review published with factual errors. State spreadsheet rows deleted or corrupted. Incorrect API call to a write-only third-party surface.
SEV-2: Quality breach.
Definition: The system is producing output below the agreed quality threshold but state is intact and nothing has shipped publicly. Rework is possible without labour explosion.
Response: Within 15 minutes. Notify Workflow Owner through primary channel; pause affected agent if quality drop is sustained; investigation may wait for next business hour.
Examples: Editorial quality score drops below baseline for 1 hour. Cover art retrieval returning wrong pressings 40% of the time.
SEV-3: Degradation.
Definition: The system is operational but behaving sub-optimally. Elevated failure rates, latency spikes, partial outages of dependencies with retry-succeeded recovery.
Response: Within 60 minutes or next business hour. Log as a support ticket; no agent pause required; investigated as part of routine operations.
Examples: Spotify API 5xx rate elevated but retries succeed. Cycle time degraded from 8 min to 18 min without other quality impact.
Severity classification rules
- Classify on evidence, not intuition. The alert rule names the severity; human reclassification is logged in the audit trail with rationale.
- Escalate, never de-escalate silently. A SEV-2 can be promoted to SEV-1 mid-incident; de-escalation happens only after the incident closes and the post-mortem confirms the lesser severity.
- One severity per incident. If two agents are failing simultaneously, that is two incidents, not one SEV-1-plus-a-SEV-3. Each has its own audit trail.
- SEV-1 triggers a post-mortem regardless of outcome; SEV-2 triggers a post-mortem if response exceeded 30 minutes; SEV-3 gets a post-mortem only at monthly review unless a pattern of repeat SEV-3s is observed.
- False positives are tracked. Every alert that fires but turns out not to be real is logged; alert thresholds are tuned at monthly review.
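Two of these rules are mechanical enough to encode: promotion is always allowed but de-escalation is gated, and the post-mortem trigger depends on severity and response time. A minimal sketch, with function and parameter names chosen for illustration:

```python
RANK = {"SEV-1": 1, "SEV-2": 2, "SEV-3": 3}   # lower rank = more severe

def reclassify(current, proposed, incident_closed=False, postmortem_confirms=False):
    """Return the new severity, or raise if this would be a silent de-escalation.

    The caller is still responsible for logging the rationale in the audit trail.
    """
    if RANK[proposed] <= RANK[current]:
        return proposed                       # unchanged, or promoted mid-incident
    if incident_closed and postmortem_confirms:
        return proposed                       # de-escalation after the fact
    raise ValueError("de-escalation requires a closed incident and post-mortem confirmation")

def needs_postmortem(severity, response_minutes):
    """Per-incident trigger; SEV-3 patterns are handled at monthly review instead."""
    if severity == "SEV-1":
        return True
    if severity == "SEV-2":
        return response_minutes > 30
    return False
```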
Alerting & escalation matrix.
The matrix is the single artifact that ties severity levels, components, humans, and channels together. It must be readable in under a minute, kept up to date, and referenced by every alert rule. Re-read the matrix at every Friday demo and update if any line item has changed.
Default matrix (adapt per engagement)
| Severity | First responder | Channel | ACK window | Escalates to | Esc. window |
|---|---|---|---|---|---|
| SEV-1 | Workflow Owner | SMS + primary chat + email | 5 min | Technical Counterpart → Executive Sponsor | 15 / 60 min |
| SEV-2 | Workflow Owner | Primary chat + email | 15 min | Technical Counterpart | 60 min |
| SEV-3 | Workflow Owner | Email (digest) | 60 min | Monthly review | 30 days |
Channel discipline
- SMS only for SEV-1. Using SMS for SEV-2s or SEV-3s trains the responder to ignore SMS.
- Primary chat is defined per engagement. Slack, Teams, WhatsApp — whatever the client already checks. Don’t invent a new channel for alerts.
- Email for digest and trail. Every alert creates an email record even if other channels are faster — email is the durable trail.
- Never group inbox. Alerts go to a person; group inboxes handle summaries.
- Test channels on a schedule. During the sprint, test every channel end-to-end weekly. Post-Gate-4, test monthly during the first office-hour window.
Routing configuration
Every alert rule specifies the human role it routes to — not a name. The name resolves through a routing table maintained in the Governance Runbook. When someone goes on leave, the Workflow Owner updates one row of the routing table; no alert rule changes. This separation is what keeps the alert system maintainable past Gate 4.
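The role-to-name indirection can be sketched as two lookups: the matrix maps severity to a role and channels, and the routing table maps the role to a person. The channel identifiers and e-mail addresses below are placeholders, and the matrix values mirror the default matrix above.

```python
MATRIX = {
    "SEV-1": {"role": "Workflow Owner", "channels": ("sms", "primary_chat", "email"), "ack_min": 5},
    "SEV-2": {"role": "Workflow Owner", "channels": ("primary_chat", "email"), "ack_min": 15},
    "SEV-3": {"role": "Workflow Owner", "channels": ("email_digest",), "ack_min": 60},
}

def resolve_route(severity, routing_table):
    """Resolve a severity to a concrete person; alert rules only ever name the role."""
    entry = MATRIX[severity]
    return {
        "to": routing_table[entry["role"]],
        "channels": entry["channels"],
        "ack_min": entry["ack_min"],
    }

routing_table = {"Workflow Owner": "owner@example.com"}
# Leave cover: one row of the routing table changes, no alert rule does.
cover_table = {**routing_table, "Workflow Owner": "cover@example.com"}
```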
Rollback & recovery.
Rollback is the most consequential action in the governance stack. Done well, it saves an engagement from a state disaster. Done poorly, it creates a worse state than the one it was trying to fix. This section is the procedure that keeps rollbacks recoverable.
The seven-step rollback procedure
- Confirm the problem. Read the audit trail; verify that a rollback is the right response (vs. an override or a code fix). A rollback is the right response when state was corrupted or catastrophic writes happened; it is the wrong response when an agent has a bug that would recur under any state.
- Pause the responsible agents. Via Override. Prevents the problem from compounding during the rollback.
- Identify the target snapshot. Typically the most recent snapshot before the incident. Verify snapshot integrity by reading its audit tab.
- Announce the rollback. Post in the shared channel: intent, target snapshot timestamp, expected duration, humans affected. Never roll back silently.
- Execute. Run the restore procedure. Document timestamps at each step.
- Verify. Read the post-restore audit tab; confirm state matches the target snapshot; confirm no agent invocations happened during the restore window.
- Resume. Unpause agents one at a time; monitor dashboards for 30 minutes; if green, post all-clear. If red, re-pause and escalate.
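Step 3, choosing the target snapshot, reduces to "latest verified snapshot strictly before the incident". A small sketch, assuming each snapshot record carries a `taken_at` datetime and a `verified` flag set by the integrity check (both hypothetical field names):

```python
from datetime import datetime, timezone

def pick_target_snapshot(snapshots, incident_start):
    """Return the most recent verified snapshot taken before the incident, or None."""
    candidates = [
        s for s in snapshots
        if s["verified"] and s["taken_at"] < incident_start
    ]
    if not candidates:
        return None                           # precondition failed: do not start a rollback
    return max(candidates, key=lambda s: s["taken_at"])

def at(hour, minute=0):
    """Helper for the example: a UTC time on an arbitrary illustrative date."""
    return datetime(2025, 1, 7, hour, minute, tzinfo=timezone.utc)

snapshots = [
    {"name": "snap-1200", "taken_at": at(12), "verified": True},
    {"name": "snap-1300", "taken_at": at(13), "verified": True},
    {"name": "snap-1400", "taken_at": at(14), "verified": True},
    {"name": "snap-1500", "taken_at": at(15), "verified": True},
]
target = pick_target_snapshot(snapshots, incident_start=at(14, 7))
```

Returning `None` rather than falling back to a later or unverified snapshot forces the responder to escalate instead of restoring a state that may already contain the damage.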
Rollback preconditions
| Precondition | Must be true before starting |
|---|---|
| Snapshot available | At least one snapshot from before the incident window exists and has been verified. |
| Snapshot integrity | The snapshot’s audit tab is coherent — same schema, no parse errors, row count within expected range. |
| Agent pauses active | Every agent that could write during the restore is paused. |
| Downstream notified | Any third-party system that consumes state (public website, RSS feed, external API) is aware rollback is imminent. |
| Signer identified | For partial rollbacks, a Studio-principal signature is required; signer is named before starting. |
| Full rollback signed off | For full rollbacks during the engagement, Sprint Lead approves; post-Gate-4, the Workflow Owner is the sole authority. |
Data that rollback does not restore
Rollback restores the state spreadsheet and dependent stores. It does not restore: third-party systems (once a tweet is posted it is posted); external inboxes (emails already sent stay sent); human memory (stakeholders who saw the bad output will remember it). Rollback is a technical action; the communication around a rolled-back incident is its own, non-technical workstream. Do not skip it.
Audit trail hygiene.
An audit trail is only as useful as it is searchable, trustworthy, and kept clean. Hygiene is the daily / weekly / monthly discipline that prevents the log from degrading into a noise field.
Daily hygiene (during sprint and post-Gate-4)
- Spot-check invocation completeness — every afternoon, read a random 5-row sample; confirm all fields populated; confirm correlation IDs present where expected.
- Check for silent failures — a day without a single failure row is suspicious. Zero failures means either perfect operation (rare) or logging broken (common).
- Verify append-only property — confirm yesterday’s row count is still present today. A drop in row count is an integrity event.
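The three daily checks can be partly automated. The sketch below assumes audit rows are dicts keyed by the field names shown (hypothetical) and that yesterday's row count was recorded somewhere outside the log itself.

```python
import random

REQUIRED_FIELDS = ("timestamp", "agent", "agent_version", "input_hash",
                   "output_hash", "outcome", "duration_ms", "trigger",
                   "correlation_id")

def spot_check(rows, sample_size=5, rng=None):
    """Return the sampled rows that have a missing or empty required field."""
    rng = rng or random.Random()
    sample = rng.sample(rows, min(sample_size, len(rows)))
    return [r for r in sample
            if any(r.get(f) in (None, "") for f in REQUIRED_FIELDS)]

def daily_flags(rows_today, count_yesterday, count_now):
    """Return human-readable flags for silent failures and row-count drops."""
    flags = []
    if not any(r.get("outcome") == "failure" for r in rows_today):
        flags.append("no failure rows today: confirm logging is not broken")
    if count_now < count_yesterday:
        flags.append("row count dropped: integrity event")
    return flags
```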
Weekly hygiene (Friday demo ritual)
- Review alert-to-log correlation — confirm that every alert that fired this week has a corresponding log entry that pre-dates the alert. Unmatched alerts are a routing bug.
- Review override-to-log correlation — confirm that every override in the state spreadsheet has a log entry. Unlogged overrides mean the human intervention signal is not reaching the audit surface.
- Archive the prior week — log rows older than N days move to the Archive tab (keeps the primary log performant; archives remain queryable).
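The alert-to-log review can be scripted if alerts carry the correlation ID of the invocation that triggered them, which is an assumption of this sketch rather than a stated requirement. Timestamps are assumed to be ISO-8601 strings with explicit offsets.

```python
from datetime import datetime

def unmatched_alerts(alerts, log_rows):
    """Return alerts with no log entry at or before the time the alert fired.

    alerts:   dicts with 'rule', 'correlation_id', 'fired_at'
    log_rows: dicts with 'correlation_id', 'timestamp'
    """
    earliest = {}
    for row in log_rows:
        ts = datetime.fromisoformat(row["timestamp"])
        cid = row["correlation_id"]
        if cid not in earliest or ts < earliest[cid]:
            earliest[cid] = ts
    return [
        a for a in alerts
        if a["correlation_id"] not in earliest
        or earliest[a["correlation_id"]] > datetime.fromisoformat(a["fired_at"])
    ]

log_rows = [{"correlation_id": "run-1", "timestamp": "2025-01-07T14:09:00+00:00"}]
alerts = [
    {"rule": "quality drop", "correlation_id": "run-1", "fired_at": "2025-01-07T14:10:00+00:00"},
    {"rule": "latency spike", "correlation_id": "run-2", "fired_at": "2025-01-07T15:00:00+00:00"},
]
```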
Monthly hygiene (post-Gate-4)
- Run the forensics drill — pick a random event from the last 30 days; answer “what happened?” using only the audit log. If the answer requires context that isn’t in the log, patch the log to capture it.
- Tune alert thresholds — review false-positive rate; adjust thresholds; document changes.
- Review schema drift — confirm no columns were added or removed without audit-log entries; if drift is present, reconstruct and note.
Incident response & post-mortem.
Every SEV-1 triggers a post-mortem, as does any SEV-2 whose response exceeded 30 minutes. The post-mortem is not a blame exercise; it is a structured reconstruction of what happened, why, and what will be changed. Incident response also follows a predictable rhythm: detect, contain, recover, communicate, remediate.
The live response flow
| Phase | Duration target | Activities |
|---|---|---|
| Detect | 0 – 5 min | Alert fires; first responder acknowledges; triage call made (is this real? SEV level?). |
| Contain | 5 – 30 min | Affected agents paused; downstream surfaces isolated; state locked. Nothing new breaks while investigation proceeds. |
| Recover | 30 min – 2 hrs | Diagnose the root cause; restore a known-good state (rollback if warranted); resume agents carefully. |
| Communicate | During & after | Stakeholders informed in real time; all-clear announced; post-mortem scheduled for within 72 hours. |
| Remediate | 72 hrs – 2 wks | Post-mortem written; remediation actions owned and dated; Risk Register updated. |
Post-mortem template
▸ Incident summary
One paragraph. What happened, when, how long it lasted, what impact it had.
▸ Timeline
Minute-by-minute reconstruction from the audit log. Every alert, every human action, every system response. Timestamps to the minute.
▸ Root cause(s)
What actually caused the incident? Distinguish trigger (what fired it) from cause (what made it possible). Usually there are 2–3 contributing causes, not one.
▸ What went well
Catches. Good calls. Tools that worked. Remediations written on the spot. This section is mandatory — do not skip it.
▸ What went poorly
Gaps that were exposed. Missed alerts. Confused routing. Documentation that was wrong. No blame on people; focus on systems.
▸ Remediation actions
Specific, owned, dated changes that will be made to prevent recurrence. Each becomes a Risk Register item until completed.
▸ Did this match an existing pattern?
If the incident is a known failure mode, tag the relevant pattern card (USP-###). If it’s novel and recurs in a future engagement, it becomes a new pattern candidate.
Blameless discipline
- Describe actions in the passive voice where human names attach. “The agent was resumed at 14:17” not “Abe resumed the agent at 14:17.” Names in the timeline identify responders; they do not assign fault.
- Distinguish decision from outcome. A reasonable decision that led to a bad outcome is still a reasonable decision; the post-mortem critiques the decision environment, not the decider.
- Attack the system, not the human. If a human made a mistake, the post-mortem asks what in the system made the mistake possible or likely — not who to blame.
- Share the post-mortem broadly. Post-mortems are shared internally and with the client; they are evidence of rigour, not of failure.
Drill PM
Week-8 drill post-mortem excerpt.
Incident: seeded corrupt research brief, Scenario A.
Timeline: injection 14:07; quality-score drop detected by Monitoring 14:09; SEV-2 fired 14:10; Workflow Owner acknowledged 14:11; Editorial Writer paused via Override 14:13; rollback initiated 14:15; restore complete 14:19; Editorial Writer resumed 14:22; all-clear posted 14:24.
Went well: Detect-to-pause under 6 minutes. Rollback under 4 minutes. State integrity preserved.
Went poorly: Initial alert went to a team channel and to Workflow Owner directly; team channel mirror caused a brief routing-confusion moment.
Remediation: consolidated SEV-2 routing to Workflow Owner only + email digest to team; change made Day 38.
Pattern: confirmed USP-014 (Pre-Condition Guards on LLM-Class Agents).
Governance Architect checklist.
Discovery
- Fit Assessment governance criteria reviewed; scores recorded.
- Any client compliance requirement (data residency, retention, auditability) captured in the Discovery Report.
- Pattern candidates from prior Studio engagements that might apply here noted in the Discovery appendix.
Redesign
- Each of the six components has a design — not just a mention — in the Redesign Blueprint.
- Per-agent governance mapping table completed (which component covers which agent, how).
- Risk Register v1 built with governance-originated risks.
- Gate 2 deck includes governance slides; I can present them.
Build
- Weeks 5–8 instrumentation schedule (see Build §06) followed; readiness matrix updated weekly.
- Every Friday demo includes a governance segment — what was instrumented, what was tested.
- Incident response drill executed Week 8 Thursday; transcript authored same week.
- Rollback demonstrated live; Workflow Owner can initiate.
- All alerts have been fired at least once in test; routing verified.
Handoff
- Governance drill on Day 48 exercised all six components; log signed by Workflow Owner.
- Governance Runbook is current and legible to the Workflow Owner without Studio assistance.
- Alert routing table transferred to Workflow Owner; they know how to update it.
- Post-Handoff monthly governance review scheduled for Day 85.
Support window & beyond
- Client has executed at least one real alert-to-resolution cycle unassisted within 30 days of Gate 4.
- Audit log hygiene has been reviewed at least once during the support window.
- Any post-mortem generated in the support window has been archived with the engagement record.
- Patterns observed post-Handoff are noted as candidates for future-sprint confirmation.