Document: Governance Operations

Classification: Internal

Version: v1.0 · 2026.04

Owner: Umbra Studio

Cross-phase · Governance · Operational Manual

Governance Operations.

Cross-phase operational manual for Lighthouse Sprint governance. Covers the six governance components, the spreadsheet-as-state-machine pattern, severity levels, alerting and escalation, rollback procedures, audit trail hygiene, and incident response discipline — the machinery that makes agent systems safe to run in production.

Paired with the Governance Runbook template. Read on Day 1 of the sprint; kept open through Handoff and the 30-day support window. Owned by the Governance Architect; read by every Studio role and the client Technical Counterpart.

§01

How to use this manual.

Audience · reading path · relationship to the phase playbooks

Governance Operations is cross-phase. Unlike the Discovery, Redesign, Build, and Handoff playbooks — each of which covers a specific phase — this manual runs horizontally across every phase of every sprint. It is the discipline that makes agent systems safe to run on a client’s production data.

The manual is written for the Governance Architect, who owns the work described here end to end, and is mandatory reading for the Sprint Lead (who approves and audits it), the Agent Engineer (who instruments what it specifies), and the client Technical Counterpart (who inherits the instrumentation at Gate 4).

It is not a design manual — the Redesign Playbook covers the design of the wrapper. This manual is the operational side: how the six components behave, how severity is reasoned about, how rollback is performed, how the audit log is kept clean. Everything in this manual should still be true six months after Gate 4.

Where this manual shows up in the sprint

Phase	Governance Operations is used for
Discovery	Reviewed by the Governance Architect to anticipate which components the system will need; informs the Fit Assessment Rubric’s governance criteria.
Redesign	Read alongside the Redesign Playbook §07. Defines the vocabulary for the Governance Wrapper design; informs per-agent governance mapping.
Build	Primary operational reference during the 4-week governance build-out (Weeks 5–8). Defines what “instrumented” means for each component.
Handoff	Basis of the Week-10 governance drills; becomes the Workflow Owner’s reference after Gate 4. The Operations Manual cites this document for deep governance questions.
Support window	When a client asks a question about governance in the 30-day support window, the answer lives somewhere in this manual.

§ Author’s note

Governance is the single highest-leverage investment in a Lighthouse Sprint. A sprint with weak agents and strong governance fails gracefully; a sprint with strong agents and weak governance fails catastrophically. When in doubt, over-instrument.

§02

Governance at a glance.

The architecture · the six components · the state pattern

Umbra Studio’s governance architecture has three layers: the agent layer, the state layer, and the governance wrapper layer. The wrapper is composed of six components, each with a specific responsibility. The state layer — the spreadsheet-as-state-machine (USP-009) — is the shared surface between humans and agents.

The three layers

Layer 01 ▸ agents

The agent layer.

Every agent has a spec — mission, trigger, inputs, decision rules, outputs, failure modes, success criteria, governance. The agent layer is the work; it is everything that writes outputs.

Layer 02 ▸ state

The state spreadsheet.

The canonical record of what exists, what is pending, what was approved. Writable in narrow columns by humans; writable in broader columns by agents. Editable in the same UI both parties already use. USP-009.

Layer 03 ▸ wrapper

The governance wrapper.

Six components that sit around the agents: Monitoring, Alerting, Audit Trail, Escalation, Override, Rollback. The wrapper has no business logic; it has only supervisory logic.

The six components

Component	Answers the question	Primary audience
Monitoring	How is the system behaving, right now?	Governance Architect, Workflow Owner
Alerting	Who needs to know, and how quickly?	Workflow Owner, Technical Counterpart
Audit Trail	What happened, and when, and why?	Workflow Owner (day 1 forensics)
Escalation	Who handles this when the first responder is stuck?	Executive Sponsor, Sprint Lead
Override	How do I stop a specific agent without stopping the system?	Workflow Owner
Rollback	How do I restore a known-good state?	Workflow Owner, Technical Counterpart

§ Completeness

If an agent system is missing any of the six components, it is not a production system. It is a prototype. A sprint that ships agents and stops before the wrapper is complete has shipped a liability.

§03

The six components.

Responsibilities · data model · “done” definition

▸ 01 · MON Monitoring. System behaviour · real time

Monitoring answers: “How is the system behaving right now?” It reads the audit trail and surfaces the signal humans need to judge system health without reading raw logs. It produces dashboards, not alerts; alerts are a separate component.

What Monitoring tracks

Per-agent throughput — invocations per hour, with trailing 24-hour comparison.
Per-agent error rate — failed / total, rolling 24-hour window.
Per-agent latency — p50 / p95 invocation duration.
End-to-end pipeline latency — from the first agent’s input to the last agent’s output, if the roster runs as a pipeline.
State spreadsheet growth rate — rows added per day per sheet.
Quality score — if the system has a scorable output (e.g. editorial review score), rolling average and delta from baseline.
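As a rough illustration of how the error-rate and latency tiles could be derived from audit rows, here is a minimal Python sketch. The row keys (outcome, duration_ms) and the nearest-rank percentile method are assumptions made for the example, not something this manual prescribes; only rows with outcome "failure" are counted as failed.

```python
from math import ceil

def error_rate(outcomes):
    """Failed / total over the outcomes supplied; the caller applies the 24-hour window."""
    if not outcomes:
        return 0.0
    failed = sum(1 for outcome in outcomes if outcome == "failure")
    return failed / len(outcomes)

def percentile(durations_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not durations_ms:
        raise ValueError("no samples in window")
    ordered = sorted(durations_ms)
    rank = max(1, ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical audit rows for one agent, already filtered to the last 24 hours.
window = [
    {"outcome": "success", "duration_ms": 100},
    {"outcome": "failure", "duration_ms": 400},
    {"outcome": "success", "duration_ms": 200},
    {"outcome": "success", "duration_ms": 300},
]
tile = {
    "error_rate": error_rate([row["outcome"] for row in window]),
    "p50_ms": percentile([row["duration_ms"] for row in window], 50),
    "p95_ms": percentile([row["duration_ms"] for row in window], 95),
}
```

With so few samples the p95 is simply the slowest invocation; a real dashboard would compute these over the full rolling window.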

“Done” definition

Monitoring is done when: (a) every agent has throughput, error, and latency tiles visible on a dashboard; (b) the dashboard is bookmarked on the Workflow Owner’s home tab; (c) the Workflow Owner can answer “how did the system do yesterday?” in under 60 seconds.

Common pitfalls

Dashboards only the Governance Architect can read. If the Workflow Owner can’t read it, it is not a dashboard — it is vanity. Build the view Workflow-Owner-first.
Too many tiles. Twelve tiles is a dashboard; sixty is a maze. Cut mercilessly.

▸ 02 · ALR Alerting. Active notification · severity-routed

Alerting answers: “Who needs to know, and how quickly?” It is the active-notification complement to Monitoring’s passive dashboards. Alerts are severity-routed (SEV-1 / SEV-2 / SEV-3; see §05) and go through specific channels to specific humans.

Alert rule anatomy

Name — human-legible; “Editorial Writer quality score fell below 0.70 for 1 hour” not “ALR-014.”
Condition — the measurable rule the alert fires on. Specific thresholds, windowed.
Severity — SEV-1, 2, or 3 per §05.
Routing — which human(s) receive it, through which channel.
Suppression — any conditions under which the rule should be silenced (maintenance window, known-issue flag).
Runbook link — a link into this manual or the Operations Manual describing the response procedure.
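One way to keep the six-part anatomy honest is to hold each rule as a small record and lint it before it goes live. The sketch below is illustrative only: the class name, field types, and the runbook reference string are assumptions, and the example rule reuses the name given above.

```python
from dataclasses import dataclass

SEVERITIES = ("SEV-1", "SEV-2", "SEV-3")

@dataclass(frozen=True)
class AlertRule:
    name: str                 # human-legible description of what fired
    condition: str            # measurable, windowed threshold
    severity: str             # SEV-1, SEV-2, or SEV-3
    routing: tuple            # (role, channel) pairs; roles, not names
    suppression: tuple = ()   # conditions under which the rule is silenced
    runbook_link: str = ""    # where the response procedure lives

    def problems(self):
        """Return a list of anatomy gaps; an empty list means the rule is complete."""
        issues = []
        if self.severity not in SEVERITIES:
            issues.append("severity must be SEV-1, SEV-2, or SEV-3")
        if not self.routing:
            issues.append("rule must route to at least one role")
        if not self.runbook_link:
            issues.append("rule needs a runbook link")
        return issues

# Hypothetical rule; the routing and runbook values are placeholders.
rule = AlertRule(
    name="Editorial Writer quality score fell below 0.70 for 1 hour",
    condition="quality_score < 0.70 over a 60-minute window",
    severity="SEV-2",
    routing=(("Workflow Owner", "primary chat"), ("Workflow Owner", "email")),
    suppression=("maintenance window", "known-issue flag"),
    runbook_link="Governance Operations, section 05",
)
```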

“Done” definition

Alerting is done when: (a) every agent has at least one SEV-2 rule; (b) every SEV level has been fired end-to-end at least once during Build (test or real); (c) the Workflow Owner has personally received at least one test alert and confirmed the delivery channel.

Common pitfalls

Alert fatigue. More than 5 alerts per day desensitises the responder. Tune thresholds or consolidate rules. An ignored alert is worse than no alert.
Routing to a team channel that no one reads. Alerts go to specific humans, not group inboxes. Group inboxes are for summaries, not alerts.

▸ 03 · AUD Audit trail. Immutable log · forensic-ready

Audit Trail answers: “What happened, and when, and why?” It is the immutable, append-only record of every agent invocation and every material human intervention. It is the ground truth that Monitoring summarises and that post-mortems refer to.

Required fields per invocation

Timestamp — ISO-8601 with timezone.
Agent name — human-legible.
Agent version — git SHA or semver.
Input hash — deterministic hash of the input payload.
Output hash — deterministic hash of the output payload.
Outcome — success, failure, partial, skipped.
Duration — end-to-end wall-clock in ms.
Trigger — what caused this invocation (schedule, spreadsheet change, manual, upstream agent).
Error — if outcome is failure, the error class and one-line message.
Correlation ID — a shared ID across agents in the same pipeline execution, so multi-agent flows can be reconstructed.
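A minimal sketch of how one invocation row might be assembled, assuming SHA-256 over canonical JSON as the deterministic hash and a plain dict as the row. The function names and key names are hypothetical; the manual fixes the fields, not the implementation.

```python
import hashlib
import json
from datetime import datetime, timezone

OUTCOMES = {"success", "failure", "partial", "skipped"}

def payload_hash(payload):
    """SHA-256 over canonical JSON, so equal payloads hash equally regardless of key order."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def audit_row(agent, version, inputs, outputs, outcome, duration_ms,
              trigger, correlation_id, error=None, now=None):
    """Build one append-only audit row containing every required field."""
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome}")
    if outcome == "failure" and not error:
        raise ValueError("failure rows need an error class and one-line message")
    moment = now or datetime.now(timezone.utc)
    return {
        "timestamp": moment.astimezone(timezone.utc).isoformat(),  # raw record is UTC
        "agent": agent,
        "version": version,
        "input_hash": payload_hash(inputs),
        "output_hash": payload_hash(outputs),
        "outcome": outcome,
        "duration_ms": duration_ms,
        "trigger": trigger,
        "error": error,
        "correlation_id": correlation_id,
    }

# Hypothetical invocation; agent version and correlation ID are made up.
row = audit_row("Editorial Writer", "1.2.0", {"b": 2, "a": 1}, {"text": "ok"},
                "success", 840, "schedule", "run-001",
                now=datetime(2026, 4, 1, 14, 7, tzinfo=timezone.utc))
```

Sorting keys before hashing is what makes the hash deterministic; without it, two identical payloads serialised in a different key order would produce different hashes.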

“Done” definition

Audit Trail is done when: (a) every agent writes every required field every invocation; (b) the Workflow Owner can answer “show me every invocation of agent X last Tuesday” in under 3 minutes, unassisted; (c) the log is append-only — no edits or deletions without a Studio-principal signature.

Common pitfalls

Missing correlation IDs. A multi-agent flow without a correlation ID is unreconstructible. Wire it on Day 22.
Timezone drift. Always log UTC in the raw record, render client-local in the UI. Mixing timezones is a post-mortem-killer.
Mutable logs. A log that can be edited is not an audit trail. Use append-only storage — typically a spreadsheet tab with no delete permissions.

▸ 04 · ESC Escalation. Human-to-human routing · time-bounded

Escalation answers: “Who handles this when the first responder is stuck?” It is the path a live incident takes when the initial responder cannot resolve it within a bounded window. Escalation is a human protocol; the system merely enforces it.

The three-tier escalation ladder

Tier 1 (First responder) — typically Workflow Owner. Window: 15 minutes for SEV-2, 5 minutes for SEV-1. If unresolved, escalate to Tier 2.
Tier 2 (Technical escalate) — Technical Counterpart (in-engagement) or Executive Sponsor (post-handoff). Window: 60 minutes. If unresolved, escalate to Tier 3.
Tier 3 (Strategic escalate) — Executive Sponsor and Sprint Lead (in-engagement) or Executive Sponsor and Studio (post-handoff). No further tier — this is the decision authority.
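The ladder's windows can be expressed as a small lookup. This sketch assumes the 60-minute Tier 2 window starts when Tier 2 is engaged rather than at incident start; the text above does not say which, so treat that as an assumption to settle in the Runbook. SEV-3 is left out because it routes to monthly review rather than up the ladder.

```python
# Tier 1 windows in minutes, taken from the ladder above.
TIER1_WINDOW_MIN = {"SEV-1": 5, "SEV-2": 15}
# Tier 2 window in minutes; assumed to start when Tier 2 is engaged.
TIER2_WINDOW_MIN = 60

def current_tier(severity, minutes_unresolved):
    """Return which tier should own an unresolved incident after the given elapsed minutes."""
    if severity not in TIER1_WINDOW_MIN:
        raise ValueError("the ladder covers SEV-1 and SEV-2 only")
    tier1 = TIER1_WINDOW_MIN[severity]
    if minutes_unresolved < tier1:
        return 1
    if minutes_unresolved < tier1 + TIER2_WINDOW_MIN:
        return 2
    return 3  # decision authority; there is no further tier
```

For example, a SEV-2 that is still open after 20 minutes sits with Tier 2, and after 75 minutes with Tier 3.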

“Done” definition

Escalation is done when: (a) every SEV level has a named first responder and an explicit escalation path written down in the Runbook; (b) every person on the ladder has explicitly acknowledged they are on the ladder; (c) at least one SEV-2 has been escalated end-to-end during Build (test or real).

Common pitfalls

Escalation ladder with the same person at two tiers. Not an escalation ladder. Fix.
Tier 2 who didn’t know they were Tier 2. Escalation requires consent; get it on the record.

▸ 05 · OVR Override. Targeted pause · per-agent authority

Override answers: “How do I stop a specific agent without stopping the system?” It gives the Workflow Owner fine-grained control to pause, resume, or redirect individual agents, without taking down the whole roster. Implemented through the state-spreadsheet pattern (USP-009) as a human-writable column.

Override primitives

Pause — agent stops accepting new invocations; in-flight invocations finish.
Resume — agent accepts invocations again.
Force-failure — in-flight invocation marked failed; audit row written; downstream agents respect the failure.
Skip-next — next scheduled invocation skipped without pausing the agent longer-term.
Redirect — route to a different output destination (e.g. draft status instead of published) for a bounded window.

“Done” definition

Override is done when: (a) every agent is individually pausable and resumable from the state spreadsheet; (b) the pause is reflected in the audit trail as a human intervention with timestamp and signer; (c) the Workflow Owner has demonstrably paused and resumed at least one agent, solo, during Week 10.
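A minimal sketch of criteria (a) and (b), assuming the Overrides tab has been read into a dict keyed by agent name and that the only directive handled here is "pause". The function names, the event shape, and the second agent name are hypothetical; the other primitives are left out for brevity.

```python
from datetime import datetime, timezone

def may_start(agent, overrides):
    """True unless the human-written directive for this agent is 'pause'."""
    return overrides.get(agent, "") != "pause"

def override_event(agent, directive, signer, now=None):
    """Audit row for a human intervention: who paused or resumed what, and when."""
    moment = now or datetime.now(timezone.utc)
    return {
        "timestamp": moment.astimezone(timezone.utc).isoformat(),
        "kind": "human-intervention",
        "agent": agent,
        "directive": directive,
        "signer": signer,
    }

# Hypothetical Overrides tab contents: one agent paused, everything else untouched.
overrides = {"Editorial Writer": "pause"}
event = override_event("Editorial Writer", "pause", "Workflow Owner",
                       now=datetime(2026, 4, 1, 14, 13, tzinfo=timezone.utc))
```

Because the check is per agent, pausing the Editorial Writer leaves every other agent free to run, which is the difference between Override and a global kill switch.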

Common pitfalls

Global kill switch masquerading as Override. A big red button that stops everything is crisis tooling, not Override. Override is targeted.
Pause that isn’t logged. Every override is an audit event. No exceptions.

▸ 06 · RBK Rollback. Known-good restoration · bounded window

Rollback answers: “How do I restore a known-good state?” It is the most consequential of the six components — the only one that can destroy work if misused, and also the only one that can save an engagement from a catastrophic state event. Procedures in §07.

Rollback primitives

Snapshot — a point-in-time copy of the state spreadsheet plus all dependent stores.
Restore — replace current state with a named snapshot; preserves audit trail.
Replay — re-run agent invocations from a timestamp, given the restored state (rarely used; high-cost).
Partial rollback — restore only specific tabs / rows / rows-matching-criterion; requires Studio-principal signature.
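To make the Snapshot and Restore primitives concrete, the following in-memory sketch shows the one property that matters most: restore replaces state but only ever appends to the audit trail. The class and its attributes are hypothetical, and the snapshots dict merely stands in for what should be a separate store with separate credentials.

```python
import copy

class StateStore:
    """Toy state store: tabs of rows, an append-only audit list, and named snapshots."""

    def __init__(self, tabs):
        self.tabs = tabs        # e.g. {"Queue": [row, row, ...]}
        self.audit = []         # never overwritten by a restore
        self.snapshots = {}     # stand-in for a separate snapshot location

    def snapshot(self, name):
        """Take a point-in-time deep copy of every tab."""
        self.snapshots[name] = copy.deepcopy(self.tabs)
        self.audit.append(("snapshot", name))

    def restore(self, name):
        """Replace current state with a named snapshot; the audit trail is preserved."""
        if name not in self.snapshots:
            raise KeyError(f"no snapshot named {name}")
        self.tabs = copy.deepcopy(self.snapshots[name])
        self.audit.append(("restore", name))

store = StateStore({"Queue": [{"id": 1, "status": "draft"}]})
store.snapshot("hourly-1400")
store.tabs["Queue"][0]["status"] = "corrupted"   # simulated bad write
store.restore("hourly-1400")
```

Deep copies are used in both directions so that later edits to live state cannot silently alter a stored snapshot.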

“Done” definition

Rollback is done when: (a) at least hourly snapshots run automatically during active hours; (b) snapshots are stored in a separate location from the primary state; (c) restore is demonstrated live, end-to-end, in under 10 minutes during the Week-8 incident response drill; (d) the Workflow Owner can initiate a restore, solo, using the Operations Manual as the only reference.

Common pitfalls

Snapshots stored in the same spreadsheet they back up. A fire destroys both. Separate store, separate credentials.
Snapshots never tested. An untested snapshot is not a snapshot — it is a hope. Drill.

§04

The spreadsheet as state machine.

USP-009 · the shared surface · column discipline

USP-009 is Umbra Studio’s highest-utility pattern. It says: use a structured spreadsheet as the canonical state store for a multi-agent system that must share authority with humans. Not a database with an admin UI; not a backing store behind a custom panel — the spreadsheet is the UI, and it is the store, and the agents read and write it directly through its API.

The pattern works because humans already know how to read and edit spreadsheets. Building a custom admin panel is expensive, takes design time, requires training, and creates a second source of truth. USP-009 short-circuits all of that. It is not appropriate for every engagement, but where it applies it can collapse weeks of admin-surface work into hours.

When to reach for USP-009

Fit signal	Explanation
Workflow Owner is already editing a spreadsheet	If the pre-agent workflow uses a spreadsheet as the source of truth, continue the metaphor rather than replace it.
State fits a tabular model	Rows = items, columns = attributes, one sheet per category. If the data is a graph or deeply nested, USP-009 is the wrong pattern.
Data volume < ~100K rows	Spreadsheets slow down beyond ~100K rows. For high-volume systems, use a database with a thin admin layer.
Human edit frequency is moderate	If humans never edit the state, pure-database is simpler. If they edit constantly, a custom UI is probably needed. USP-009 is for moderate, context-aware editing.
Governance needs visibility	A spreadsheet you can see beats a black box you cannot. Regulators, auditors, and sponsors understand spreadsheets.

Column discipline — the rules that make USP-009 work

Human-only columns — agents never write these. Approval flags, priority overrides, notes, pauses. Typically 3–7 columns per sheet.
Agent-only columns — humans never write these (outside documented edge cases). Status, timestamps, hashes, URLs. Typically 8–15 columns per sheet.
Schema lock by Week 5 end. Post-Week-5 schema changes require Studio-principal signature and an audit-log entry naming the change and rationale.
No merged cells, no formulas agents depend on. Agents parse the sheet deterministically; merged cells and stateful formulas break parsers.
Every row has a stable ID. UUID or monotonic sequence — never derive identity from a mutable field (like a URL that might change).
Status columns use a closed vocabulary, never free text. Document the vocabulary in the Operations Manual.
History preserved via append, not update. Status changes append to a history tab; the main sheet shows current state only.
Read-only columns are hard-read-only. Protected via sheet permissions, not just convention.
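Sheet permissions are the real enforcement, but an agent-side guard can catch violations before a write is attempted. The sketch below is an assumption-laden example: the column names and the status vocabulary are invented for illustration and would come from the locked schema in practice.

```python
# Hypothetical column ownership and status vocabulary for one sheet.
HUMAN_COLUMNS = {"approved", "priority_override", "notes", "pause"}
AGENT_COLUMNS = {"status", "updated_at", "output_hash", "url"}
STATUS_VOCABULARY = {"queued", "in_progress", "awaiting_approval", "published", "failed"}

def check_write(writer, column, value):
    """Return None if the write respects column discipline, else the reason it does not."""
    if column not in HUMAN_COLUMNS | AGENT_COLUMNS:
        return "unknown column; the schema is locked"
    if writer == "agent" and column in HUMAN_COLUMNS:
        return "agents never write human-only columns"
    if writer == "human" and column in AGENT_COLUMNS:
        return "humans never write agent-only columns"
    if column == "status" and value not in STATUS_VOCABULARY:
        return "status outside the closed vocabulary"
    return None
```

For instance, check_write("agent", "approved", True) is rejected because approval flags belong to humans, while check_write("agent", "status", "queued") passes.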

Typical sheet layout

Tab	Contents	Editable by
Queue	The live work items — one row per item, columns for status, priority, timestamps, URLs	Mixed (column-bounded)
Config	Agent configuration — rate limits, thresholds, feature flags	Workflow Owner + Agent Engineer
History	Append-only event log — every status change, agent invocation, override	Agents (append) + read-only for humans
Overrides	Per-agent pause / resume / skip directives	Workflow Owner only
Alerts	Recent alerts for dashboard context	Agents (append) + read-only for humans
Archive	Closed items moved out of Queue after N days	Agents (append) + read-only for humans

§ Why it holds

USP-009 works because it collapses three usually-separate surfaces — state store, admin UI, audit view — into one. Most of the patterns in the Pattern Library compose with it, which is why it is the most-referenced entry in the library.

§05

Severity levels.

SEV-1 / SEV-2 / SEV-3 · definitions and response expectations

Three severity levels, each with a specific definition, a specific response window, and a specific routing path. Never invent a fourth level: if an event doesn’t fit one of the three, it is not an incident. The entire rest of the governance stack (alerting, escalation, rollback, audit) depends on these definitions being stable.

▸ SEV-1

Data at risk.

Definition: The system has produced or is about to produce output that would be unrecoverable or reputation-damaging if not intercepted. Published content that is wrong. Data writes that cannot be undone without manual labour.

Response: Immediate. Pause the responsible agent; initiate rollback if state was written; notify Sprint Lead and Workflow Owner within 5 minutes. The Tier 1 first responder leads the response.

Examples: Review published with factual errors. State spreadsheet rows deleted or corrupted. Incorrect API call to a write-only third-party surface.

5-min acknowledge · full paper trail

▸ SEV-2

Quality breach.

Definition: The system is producing output below the agreed quality threshold but state is intact and nothing has shipped publicly. Rework is possible without labour explosion.

Response: Within 15 minutes. Notify Workflow Owner through primary channel; pause affected agent if quality drop is sustained; investigation may wait for next business hour.

Examples: Editorial quality score drops below baseline for 1 hour. Cover art retrieval returning wrong pressings 40% of the time.

15-min acknowledge · same-day resolve

▸ SEV-3

Degradation.

Definition: The system is operational but behaving sub-optimally. Elevated failure rates, latency spikes, partial outages of dependencies with retry-succeeded recovery.

Response: Within 60 minutes or next business hour. Log as a support ticket; no agent pause required; investigated as part of routine operations.

Examples: Spotify API 5xx rate elevated but retries succeed. Cycle time degraded from 8 min to 18 min without other quality impact.

60-min acknowledge · rolling resolve

Severity classification rules

Classify on evidence, not intuition. The alert rule names the severity; human reclassification is logged in the audit trail with rationale.
Escalate, never de-escalate silently. A SEV-2 can be promoted to SEV-1 mid-incident; de-escalation happens only after the incident closes and the post-mortem confirms the lesser severity.
One severity per incident. If two agents are failing simultaneously, that is two incidents, not one SEV-1-plus-a-SEV-3. Each has its own audit trail.
SEV-1 triggers a post-mortem regardless of outcome; SEV-2 triggers a post-mortem if response exceeded 30 minutes; SEV-3 is post-mortemed only at monthly review unless a pattern of repeat SEV-3s is observed.
False positives are tracked. Every alert that fires but turns out not to be real is logged; alert thresholds are tuned at monthly review.
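Two of the rules above are mechanical enough to sketch in code: promotion is allowed mid-incident while de-escalation is not, and the post-mortem trigger depends on severity and response time. The function names and the shape of the returned log entry are hypothetical, and the sketch simplifies by treating incident closure as sufficient for de-escalation, whereas the rule also requires post-mortem confirmation.

```python
RANK = {"SEV-1": 1, "SEV-2": 2, "SEV-3": 3}   # lower number means more severe

def reclassify(current, proposed, incident_open, rationale):
    """Return the audit entry for a human reclassification, or raise if the rules forbid it."""
    if current not in RANK or proposed not in RANK:
        raise ValueError("only SEV-1, SEV-2, and SEV-3 exist")
    if not rationale:
        raise ValueError("reclassification must be logged with a rationale")
    if incident_open and RANK[proposed] > RANK[current]:
        raise ValueError("de-escalation waits until the incident has closed")
    return {"event": "reclassification", "from": current, "to": proposed,
            "rationale": rationale}

def needs_postmortem(severity, response_minutes):
    """SEV-1 always; SEV-2 when response exceeded 30 minutes; SEV-3 goes to monthly review."""
    if severity == "SEV-1":
        return True
    if severity == "SEV-2":
        return response_minutes > 30
    return False
```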

§06

Alerting & escalation matrix.

Who gets what · when · via which channel

The matrix is the single artifact that ties severity levels, components, humans, and channels together. It must be readable in under a minute, kept up to date, and referenced by every alert rule. Re-read the matrix at every Friday demo and update it if any line item has changed.

Default matrix (adapt per engagement)

Severity	First responder	Channel	ACK window	Escalates to	Esc. window
SEV-1	Workflow Owner	SMS + primary chat + email	5 min	Technical Counterpart → Executive Sponsor	15 / 60 min
SEV-2	Workflow Owner	Primary chat + email	15 min	Technical Counterpart	60 min
SEV-3	Workflow Owner	Email (digest)	60 min	Monthly review	30 days

Channel discipline

SMS only for SEV-1. Using SMS for SEV-2s or SEV-3s trains the responder to ignore SMS.
Primary chat is defined per engagement. Slack, Teams, WhatsApp — whatever the client already checks. Don’t invent a new channel for alerts.
Email for digest and trail. Every alert creates an email record even if other channels are faster — email is the durable trail.
Never group inbox. Alerts go to a person; group inboxes handle summaries.
Test channels on a schedule. During the sprint, test every channel end-to-end weekly. Post-Gate-4, test monthly during the first office-hour window.

Routing configuration

Every alert rule specifies the human role it routes to — not a name. The name resolves through a routing table maintained in the Governance Runbook. When someone goes on leave, the Workflow Owner updates one row of the routing table; no alert rule changes. This separation is what keeps the alert system maintainable past Gate 4.
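The role-to-name separation can be sketched as two lookups: a fixed map from severity to role and channels (taken from the default matrix), and a routing table that resolves the role to a person. The person labels and function name below are placeholders for illustration.

```python
# First-responder role and channels per severity, from the default matrix.
FIRST_RESPONSE = {
    "SEV-1": ("Workflow Owner", ("sms", "primary chat", "email")),
    "SEV-2": ("Workflow Owner", ("primary chat", "email")),
    "SEV-3": ("Workflow Owner", ("email digest",)),
}

def deliveries(severity, routing_table):
    """Resolve a severity to (channel, person) pairs via the role-based routing table."""
    role, channels = FIRST_RESPONSE[severity]
    person = routing_table[role]
    return [(channel, person) for channel in channels]

# Hypothetical routing table as kept in the Governance Runbook.
routing_table = {"Workflow Owner": "Person A", "Technical Counterpart": "Person B"}
before = deliveries("SEV-2", routing_table)

# Cover for leave: one row of the table changes, no alert rule is touched.
routing_table["Workflow Owner"] = "Person C"
after = deliveries("SEV-2", routing_table)
```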

§ On-call by name

On-call rotations are a post-sprint optimisation. In-sprint, the Workflow Owner is always the first responder; there is no rotation to manage. Rotations become interesting when the engagement matures past 90 days post-Handoff.

§07

Rollback & recovery.

Procedures · preconditions · safety

Rollback is the most consequential action in the governance stack. Done well, it saves an engagement from a state disaster. Done poorly, it creates a worse state than the one it was trying to fix. This section is the procedure that keeps rollbacks recoverable.

The seven-step rollback procedure

Confirm the problem. Read the audit trail; verify that a rollback is the right response (vs. an override or a code fix). A rollback is the right response when state was corrupted or catastrophic writes happened; it is the wrong response when an agent has a bug that would recur under any state.
Pause the responsible agents. Via Override. Prevents the problem from compounding during the rollback.
Identify the target snapshot. Typically the most recent snapshot before the incident. Verify snapshot integrity by reading its audit tab.
Announce the rollback. Post in the shared channel: intent, target snapshot timestamp, expected duration, humans affected. Never roll back silently.
Execute. Run the restore procedure. Document timestamps at each step.
Verify. Read the post-restore audit tab; confirm state matches the target snapshot; confirm no agent invocations happened during the restore window.
Resume. Unpause agents one at a time; monitor dashboards for 30 minutes; if green, post all-clear. If red, re-pause and escalate.

Rollback preconditions

Precondition	Must be true before starting
Snapshot available	At least one snapshot from before the incident window exists and has been verified.
Snapshot integrity	The snapshot’s audit tab is coherent — same schema, no parse errors, row count within expected range.
Agent pauses active	Every agent that could write during the restore is paused.
Downstream notified	Any third-party system that consumes state (public website, RSS feed, external API) is aware rollback is imminent.
Signer identified	For partial rollbacks, a Studio-principal signature is required; signer is named before starting.
Full rollback signed off	For full rollbacks during the engagement, Sprint Lead approves; post-Gate-4, the Workflow Owner is the sole authority.
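The preconditions table lends itself to a pre-flight check that names whatever is still unmet. This is a sketch under assumptions: the plan is a plain dict and all of its keys are invented for the example.

```python
def unmet_preconditions(plan):
    """Return the names of rollback preconditions that are not yet satisfied."""
    unmet = []
    if not plan["snapshot_predates_incident"]:
        unmet.append("Snapshot available")
    if not plan["snapshot_integrity_ok"]:
        unmet.append("Snapshot integrity")
    if plan["unpaused_writers"]:               # any agent still able to write
        unmet.append("Agent pauses active")
    if not plan["downstream_notified"]:
        unmet.append("Downstream notified")
    if plan["scope"] == "partial":
        if plan["approver_role"] != "Studio principal":
            unmet.append("Signer identified")
    else:
        required = "Workflow Owner" if plan["post_gate_4"] else "Sprint Lead"
        if plan["approver_role"] != required:
            unmet.append("Full rollback signed off")
    return unmet

# Hypothetical in-engagement full rollback with everything in place.
plan = {
    "snapshot_predates_incident": True,
    "snapshot_integrity_ok": True,
    "unpaused_writers": [],
    "downstream_notified": True,
    "scope": "full",
    "post_gate_4": False,
    "approver_role": "Sprint Lead",
}
```

An empty list means the rollback may start; anything else is a reason to stop and fix the gap first.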

Data that rollback does not restore

Rollback restores the state spreadsheet and dependent stores. It does not restore: third-party systems (once a tweet is posted it is posted); external inboxes (emails already sent stay sent); human memory (stakeholders who saw the bad output will remember it). Rollback is a technical action; the communication around a rolled-back incident is its own, non-technical workstream. Do not skip it.

§ Rollback drill

The Week-8 incident response drill must include a rollback — either as the drill itself (Scenario C) or as the remediation for Scenarios A / B. A sprint that has never performed a live rollback has not yet instrumented rollback.

§08

Audit trail hygiene.

Keeping the log useful — forever

An audit trail is only as useful as it is searchable, trustworthy, and kept clean. Hygiene is the daily / weekly / monthly discipline that prevents the log from degrading into a noise field.

Daily hygiene (during sprint and post-Gate-4)

Spot-check invocation completeness — every afternoon, read a random 5-row sample; confirm all fields populated; confirm correlation IDs present where expected.
Check for silent failures — a day without a single failure row is suspicious. Zero failures means either perfect operation (rare) or logging broken (common).
Verify append-only property — confirm yesterday’s row count is still present today. A drop in row count is an integrity event.
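The three daily checks are simple enough to script. The sketch below assumes audit rows are dicts with the key names shown; the conditional fields (error, correlation ID) are deliberately left out of the completeness check because they are only expected on some rows.

```python
import random

# Fields expected on every row; key names are assumptions for this example.
ALWAYS_REQUIRED = ("timestamp", "agent", "version", "input_hash",
                   "output_hash", "outcome", "duration_ms", "trigger")

def incomplete_in_sample(rows, sample_size=5, rng=None):
    """Spot-check a random sample and return the rows with a missing or blank field."""
    rng = rng or random.Random()
    sample = rng.sample(rows, min(sample_size, len(rows)))
    return [row for row in sample
            if any(row.get(name) in (None, "") for name in ALWAYS_REQUIRED)]

def no_failures_logged(rows):
    """True when the day has no failure rows, which is a prompt to check logging itself."""
    return not any(row.get("outcome") == "failure" for row in rows)

def append_only_holds(yesterday_count, today_count):
    """A drop in total row count is an integrity event."""
    return today_count >= yesterday_count
```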

Weekly hygiene (Friday demo ritual)

Review alert-to-log correlation — every alert that fired this week has a corresponding log entry that pre-dates the alert. Unmatched alerts are a routing bug.
Review override-to-log correlation — every override in the state spreadsheet has a log entry. Unlogged overrides mean the human intervention signal is not reaching the audit surface.
Archive the prior week — log rows older than N days move to the Archive tab (keeps the primary log performant; archives remain queryable).
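The alert-to-log review could be automated along these lines. The sketch assumes each alert carries the correlation ID of the invocation that triggered it and that timestamps are timezone-aware datetimes; neither is stated in this manual, so adapt the join key to whatever the alert rows actually record.

```python
from datetime import datetime, timezone

def unmatched_alerts(alerts, log_rows):
    """Return alerts with no log entry for the same correlation ID at or before the alert time."""
    earliest = {}
    for row in log_rows:
        cid = row["correlation_id"]
        if cid not in earliest or row["timestamp"] < earliest[cid]:
            earliest[cid] = row["timestamp"]
    return [
        alert for alert in alerts
        if alert["correlation_id"] not in earliest
        or earliest[alert["correlation_id"]] > alert["timestamp"]
    ]

def at(minute):
    """Helper for the example data: a UTC timestamp on a made-up day."""
    return datetime(2026, 4, 1, 14, minute, tzinfo=timezone.utc)

# Hypothetical week: one alert backed by a log row, one with nothing behind it.
log_rows = [{"correlation_id": "run-001", "timestamp": at(9)}]
alerts = [
    {"name": "quality drop", "correlation_id": "run-001", "timestamp": at(10)},
    {"name": "orphan alert", "correlation_id": "run-002", "timestamp": at(12)},
]
orphans = unmatched_alerts(alerts, log_rows)
```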

Monthly hygiene (post-Gate-4)

Run the forensics drill — pick a random event from the last 30 days; answer “what happened?” using only the audit log. If the answer requires context that isn’t in the log, patch the log to capture it.
Tune alert thresholds — review false-positive rate; adjust thresholds; document changes.
Review schema drift — confirm no columns were added or removed without audit-log entries; if drift is present, reconstruct and note.

§ Forensics first

The test of an audit trail is: can you reconstruct what happened, six months from now, using only the log? If the answer is no, the log is incomplete — fix it while you can still remember what was missing.

§09

Incident response & post-mortem.

The 72-hour loop · blameless writeups · remediation discipline

Every SEV-1 and most SEV-2s trigger a post-mortem. The post-mortem is not a blame exercise — it is a structured reconstruction of what happened, why, and what will be changed. Incident response also follows a predictable rhythm: detect, contain, recover, communicate, remediate.

The live response flow

Phase	Duration target	Activities
Detect	0 – 5 min	Alert fires; first responder acknowledges; triage call made (is this real? SEV level?).
Contain	5 – 30 min	Affected agents paused; downstream surfaces isolated; state locked. Nothing new breaks while investigation proceeds.
Recover	30 min – 2 hrs	Diagnose the root cause; restore a known-good state (rollback if warranted); resume agents carefully.
Communicate	During & after	Stakeholders informed in real time; all-clear announced; post-mortem scheduled for within 72 hours.
Remediate	72 hrs – 2 wks	Post-mortem written; remediation actions owned and dated; Risk Register updated.

Post-mortem template

▸ Incident summary

One paragraph. What happened, when, how long it lasted, what impact it had.

▸ Timeline

Minute-by-minute reconstruction from the audit log. Every alert, every human action, every system response. Timestamps to the minute.

▸ Root cause(s)

What actually caused the incident? Distinguish trigger (what fired it) from cause (what made it possible). Usually there are 2–3 contributing causes, not one.

▸ What went well

Catches. Good calls. Tools that worked. Remediations written on the spot. This section is mandatory — do not skip it.

▸ What went poorly

Gaps that were exposed. Missed alerts. Confused routing. Documentation that was wrong. No blame on people; focus on systems.

▸ Remediation actions

Specific, owned, dated changes that will be made to prevent recurrence. Each becomes a Risk Register item until completed.

▸ Did this match an existing pattern?

If the incident is a known failure mode, tag the relevant pattern card (USP-###). If it’s novel and recurs in a future engagement, it becomes a new pattern candidate.

Blameless discipline

Describe actions in the passive voice where human names attach. “The agent was resumed at 14:17” not “Abe resumed the agent at 14:17.” Names in the timeline identify responders; they do not assign fault.
Distinguish decision from outcome. A reasonable decision that led to a bad outcome is still a reasonable decision; the post-mortem critiques the decision environment, not the decider.
Attack the system, not the human. If a human made a mistake, the post-mortem asks what in the system made the mistake possible or likely — not who to blame.
Share the post-mortem broadly. Post-mortems are shared internally and with the client; they are evidence of rigour, not of failure.

Indietheka
Drill PM

Week-8 drill post-mortem excerpt. Incident: seeded corrupt research brief, Scenario A. Timeline: injection 14:07; quality-score drop detected by Monitoring 14:09; SEV-2 fired 14:10; Workflow Owner acknowledged 14:11; Editorial Writer paused via Override 14:13; rollback initiated 14:15; restore complete 14:19; Editorial Writer resumed 14:22; all-clear posted 14:24.

Went well: Detect-to-pause under 6 minutes. Rollback under 4 minutes. State integrity preserved. Went poorly: Initial alert went to a team channel and to Workflow Owner directly; team channel mirror caused a brief routing-confusion moment. Remediation: consolidated SEV-2 routing to Workflow Owner only + email digest to team; change made Day 38. Pattern: confirmed USP-014 (Pre-Condition Guards on LLM-Class Agents).

§10

Governance Architect checklist.

Per-phase · ongoing · post-Gate-4

Discovery

Fit Assessment governance criteria reviewed; scores recorded.
Any client compliance requirement (data residency, retention, auditability) captured in the Discovery Report.
Pattern candidates from prior Studio engagements that might apply here noted in the Discovery appendix.

Redesign

Each of the six components has a design — not just a mention — in the Redesign Blueprint.
Per-agent governance mapping table completed (which component covers which agent, how).
Risk Register v1 built with governance-originated risks.
Gate 2 deck includes governance slides; I can present them.

Build

Weeks 5–8 instrumentation schedule (see Build §06) followed; readiness matrix updated weekly.
Every Friday demo includes a governance segment — what was instrumented, what was tested.
Incident response drill executed Week 8 Thursday; transcript authored same week.
Rollback demonstrated live; Workflow Owner can initiate.
All alerts have been fired at least once in test; routing verified.

Handoff

Governance drill on Day 48 exercised all six components; log signed by Workflow Owner.
Governance Runbook is current and legible to the Workflow Owner without Studio assistance.
Alert routing table transferred to Workflow Owner; they know how to update it.
Post-Handoff monthly governance review scheduled for Day 85.

Support window & beyond

Client has executed at least one real alert-to-resolution cycle unassisted within 30 days of Gate 4.
Audit log hygiene has been reviewed at least once during the support window.
Any post-mortem generated in the support window has been archived with the engagement record.
Patterns observed post-Handoff are noted as candidates for future-sprint confirmation.

§ Closing principle

Governance is the discipline that lets a system outlive the sprint. Every hour invested in wrapper, audit, rollback, and post-mortem is an hour the client will thank Studio for, long after the engagement has closed.