Governance Runbook.
Incident and override playbook covering the severity model, alert response, escalation, override, rollback, audit trail, and the incident log. Governance is not a feature; it is the system's spine.
Paired with the Agent Spec Sheet (one per agent). Rehearsed as a drill in Week 7.
File: lighthouse-governance-runbook.docx. To fill in the runbook for a live engagement, open the DOCX file.
GOVERNANCE RUNBOOK
Lighthouse Sprint — Stage 4: Handoff
How to respond to alerts, handle escalations, execute overrides, and roll back. Every deployed agent ships with this governance wrapper.
| Field | Value |
|---|---|
| Client | [Client Name] |
| Workflow | [Workflow Name] |
| Prepared by | [Governance Architect Name] |
| Date | [YYYY-MM-DD] |
| Version | [1.0] |
1. Severity Levels
Every alert, incident, and escalation is classified by severity. This table defines response expectations.
| Severity | Definition | Examples | Response Time | Escalation |
|---|---|---|---|---|
| SEV-1 Critical | System down or producing harmful output. Active data loss or regulatory exposure. | Agent publishing incorrect data to production, system-wide outage, data corruption | Immediate (< 15 min) | Executive Sponsor + Governance Architect + Sprint Lead |
| SEV-2 High | Major degradation. Workflow partially broken or producing elevated errors. | One agent failing repeatedly, throughput down by more than 50%, integration broken | < 1 hour | Governance Architect + Workflow Owner |
| SEV-3 Medium | Noticeable issue but workflow still functional. Quality below target. | Error rate above threshold, slow response times, minor data quality issues | < 4 hours | Primary Operator + Team Lead |
| SEV-4 Low | Cosmetic or minor issue. No impact on output quality or throughput. | UI glitch in dashboard, non-critical log warning, minor formatting issue | Next business day | Primary Operator |
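If the severity table needs to be machine-readable (for example, to compute response deadlines in tooling), it could be encoded roughly as follows. This is a minimal sketch: the names `Severity`, `ResponsePolicy`, `POLICIES`, and `response_deadline` are illustrative assumptions and not part of any Lighthouse tooling, and "next business day" is represented as no fixed deadline.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    SEV1 = 1  # Critical
    SEV2 = 2  # High
    SEV3 = 3  # Medium
    SEV4 = 4  # Low


@dataclass(frozen=True)
class ResponsePolicy:
    # None stands for "next business day" (no fixed duration in this sketch).
    respond_within: Optional[timedelta]
    escalate_to: tuple


# Values copied from the severity table; the structure itself is hypothetical.
POLICIES = {
    Severity.SEV1: ResponsePolicy(
        timedelta(minutes=15),
        ("Executive Sponsor", "Governance Architect", "Sprint Lead"),
    ),
    Severity.SEV2: ResponsePolicy(
        timedelta(hours=1), ("Governance Architect", "Workflow Owner")
    ),
    Severity.SEV3: ResponsePolicy(
        timedelta(hours=4), ("Primary Operator", "Team Lead")
    ),
    Severity.SEV4: ResponsePolicy(None, ("Primary Operator",)),
}


def response_deadline(severity: Severity, opened_at: datetime) -> Optional[datetime]:
    """Return the latest acceptable response time, or None for SEV-4."""
    policy = POLICIES[severity]
    if policy.respond_within is None:
        return None
    return opened_at + policy.respond_within
```

For example, a SEV-1 opened at 09:00 would have a response deadline of 09:15 under this encoding.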
2. Monitoring
2.1 Dashboard Access
| Field | Value |
|---|---|
| Dashboard URL | [URL] |
| Access Credentials | [How to log in — role-based, SSO, etc.] |
| Refresh Rate | [Real-time / 1 min / 5 min] |
2.2 Metrics Monitored
List every metric the governance wrapper tracks, with its normal range, warning and critical thresholds, and alert channel.
| Metric | Normal Range | Warning Threshold | Critical Threshold | Alert Channel |
|---|---|---|---|---|
| Agent Success Rate | > 98% | < 95% | < 90% | [Slack / email / PagerDuty] |
| End-to-End Latency | [baseline] | [1.5x baseline] | [3x baseline] | [channel] |
| Error Rate | [baseline] | [2x baseline] | [5x baseline] | [channel] |
| Queue Depth | [normal] | [2x normal] | [5x normal] | [channel] |
| Cost per Cycle | [baseline] | [1.5x baseline] | [3x baseline] | [channel] |
| Human Override Rate | < 5% | > 10% | > 25% | [channel] |
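The threshold columns above can be read as a simple classification rule. The sketch below is one possible reading, not the wrapper's actual logic: `Threshold`, `classify`, and `baseline_threshold` are hypothetical names, comparisons are assumed to be strict, and a value that sits between the normal range and the warning threshold is treated as "ok" because the table defines no alert for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float
    higher_is_worse: bool  # False for metrics such as success rate


def classify(value: float, t: Threshold) -> str:
    """Map a metric reading to 'ok', 'warning', or 'critical'."""
    if t.higher_is_worse:
        if value > t.critical:
            return "critical"
        if value > t.warning:
            return "warning"
    else:
        if value < t.critical:
            return "critical"
        if value < t.warning:
            return "warning"
    return "ok"


def baseline_threshold(baseline: float, warning_factor: float,
                       critical_factor: float) -> Threshold:
    """Build thresholds for rows expressed as multiples of a baseline."""
    return Threshold(baseline * warning_factor, baseline * critical_factor, True)


# Rows with fixed percentages in the table.
SUCCESS_RATE = Threshold(warning=95.0, critical=90.0, higher_is_worse=False)
OVERRIDE_RATE = Threshold(warning=10.0, critical=25.0, higher_is_worse=True)
```

For a latency baseline of 200 ms (an invented figure), `baseline_threshold(200.0, 1.5, 3.0)` gives a warning threshold of 300 ms and a critical threshold of 600 ms.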
3. Alert Response Procedures
Step-by-step responses for each type of alert, written for operators, not engineers.
3.1 Agent Failure Alert
Symptoms
[Agent stops producing output, error count spikes, queue depth growing]
Immediate Actions
[1. Check monitoring dashboard — identify which agent and what error]
[2. Check integration status — are source systems reachable?]
[3. Check recent changes — was anything deployed or changed in the last 24 hours?]
[4. Attempt restart if agent is stateless and error is transient]
Escalation Trigger
[Escalate if: the agent does not recover after a restart, or the same failure has occurred more than 3 times in 24 hours]
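The example escalation trigger for agent failures can be expressed as a sliding-window check. This is a sketch under stated assumptions: `should_escalate` is a hypothetical helper, and `failure_times` is assumed to hold the timestamps of occurrences of the same failure, which the caller is responsible for grouping.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)
MAX_REPEATS = 3  # escalate when the count is strictly greater than this


def should_escalate(failure_times, now: datetime,
                    recovered_after_restart: bool) -> bool:
    """Apply the example trigger: no recovery, or > 3 repeats in 24 hours."""
    if not recovered_after_restart:
        return True
    # Count only failures that fall inside the trailing 24-hour window.
    recent = [t for t in failure_times if t <= now and now - t <= WINDOW]
    return len(recent) > MAX_REPEATS
```

Three failures inside the window do not trigger escalation on their own; the fourth does.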
3.2 Quality Degradation Alert
Symptoms
[Error rate above threshold, output quality flagged by human reviewers]
Immediate Actions
[1. Pause agent output pipeline — stop publishing/sending until reviewed]
[2. Review last 10 outputs — identify pattern in quality failures]
[3. Check input data quality — has the source data changed format or content?]
[4. Compare to baseline — is this a regression or a new failure mode?]
Escalation Trigger
[Escalate if: root cause not identified within 2 hours, or error rate > 2x threshold]
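The quality-degradation trigger combines a time limit with a rate limit. A minimal sketch follows; `should_escalate_quality` and its parameters are illustrative names, and the 2-hour limit is assumed to be reached at exactly two hours after the alert opened.

```python
from datetime import datetime, timedelta

ROOT_CAUSE_DEADLINE = timedelta(hours=2)


def should_escalate_quality(alert_opened_at: datetime, now: datetime,
                            root_cause_identified: bool,
                            error_rate: float,
                            error_threshold: float) -> bool:
    """Escalate if the root cause is overdue or errors exceed 2x threshold."""
    overdue = (not root_cause_identified
               and now - alert_opened_at >= ROOT_CAUSE_DEADLINE)
    return overdue or error_rate > 2 * error_threshold
```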
3.3 Performance Degradation Alert
Symptoms
[Latency exceeding threshold, throughput below target, queue backlog growing]
Immediate Actions
[1. Check system resources — CPU, memory, API rate limits]
[2. Check external dependencies — are integrations responding normally?]
[3. Check volume — is input volume within expected range?]
Escalation Trigger
[Escalate if: performance does not return to normal within 1 hour]
4. Escalation Procedures
4.1 Escalation Matrix
| Level | Who | When | How | SLA |
|---|---|---|---|---|
| L1 | Primary Operator | First responder for all alerts | Dashboard + runbook | Immediate |
| L2 | Team Lead / Workflow Owner | L1 cannot resolve within SLA | Slack / phone | < 30 min response |
| L3 | Governance Architect | Technical root cause needed | Slack / video call | < 1 hour response |
| L4 | Executive Sponsor | Business impact, SEV-1, or decisions needed | Phone / in-person | Immediate for SEV-1 |
| L5 | Umbra Studio (30-day window) | System design issue or bug | Email / Slack | < 4 hours response |
4.2 Escalation Communication Template
Use this format when escalating an issue.
| Field | Value |
|---|---|
| Severity | [SEV-1 / SEV-2 / SEV-3 / SEV-4] |
| Summary | [One-sentence description of the issue] |
| Impact | [What is broken? Who is affected? What is the business impact?] |
| Timeline | [When did it start? What has been tried?] |
| Request | [What do you need from the escalation target?] |
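Teams that send escalations through chat or email may find it convenient to render the template from structured fields. The sketch below assumes a plain-text, one-line-per-field layout; `EscalationMessage` and `render` are hypothetical names, not an existing tool.

```python
from dataclasses import dataclass

VALID_SEVERITIES = ("SEV-1", "SEV-2", "SEV-3", "SEV-4")


@dataclass(frozen=True)
class EscalationMessage:
    severity: str
    summary: str
    impact: str
    timeline: str
    request: str

    def render(self) -> str:
        """Return the five template fields as plain text, one per line."""
        if self.severity not in VALID_SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        return "\n".join([
            f"Severity: {self.severity}",
            f"Summary: {self.summary}",
            f"Impact: {self.impact}",
            f"Timeline: {self.timeline}",
            f"Request: {self.request}",
        ])
```

Rejecting an unknown severity up front keeps malformed escalations from reaching the next level.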
5. Override Procedures
How to manually override an agent when it needs to be stopped or paused, or when its output needs to be corrected.
5.1 Emergency Stop
[How to immediately halt all agent activity. Step-by-step instructions.]
| Field | Value |
|---|---|
| Method | [Dashboard kill switch / API call / system command] |
| Access Required | [Who has permission to execute emergency stop] |
| Side Effects | [What happens to in-flight work, queued items, scheduled runs] |
| Recovery | [How to restart after an emergency stop] |
5.2 Pause & Resume
[How to temporarily pause a specific agent without stopping the entire workflow.]
| Field | Value |
|---|---|
| Pause Method | [How to pause — dashboard button, API, config change] |
| Queue Behavior | [What happens to items that arrive while paused] |
| Resume Method | [How to resume, and whether it processes the queued items or skips them] |
| Max Pause Duration | [How long it can stay paused before there is a risk of data loss] |
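The Queue Behavior and Resume Method rows describe a design choice that is easier to pin down with a concrete model. The sketch below shows one possible behavior, in which items that arrive while paused are buffered and the operator chooses on resume whether to process or discard them. `PausableAgent` is a hypothetical class for illustration; a real agent's queue semantics may differ and should be documented in the table above.

```python
from collections import deque


class PausableAgent:
    """Toy agent that buffers work while paused."""

    def __init__(self, handler):
        self._handler = handler   # function applied to each item
        self._paused = False
        self._backlog = deque()   # items received while paused
        self.results = []

    def submit(self, item):
        if self._paused:
            self._backlog.append(item)
        else:
            self.results.append(self._handler(item))

    def pause(self):
        self._paused = True

    def resume(self, process_backlog=True):
        """Resume; either drain the backlog in arrival order or drop it."""
        self._paused = False
        if process_backlog:
            while self._backlog:
                self.results.append(self._handler(self._backlog.popleft()))
        else:
            self._backlog.clear()
```

Whichever option is chosen, an unbounded backlog is what makes the Max Pause Duration row matter: the longer the pause, the more work is at risk.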
5.3 Manual Correction
[How to correct an agent's output after it has been produced but before it has downstream impact.]
| Field | Value |
|---|---|
| Correction Window | [How long after an output is produced before it becomes irreversible] |
| Correction Method | [Direct edit, reprocess, manual replacement] |
| Audit Trail | [How corrections are logged — who, when, what, why] |
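The correction window and the audit trail row can be enforced together: a correction is accepted only inside the window, and accepting it always produces a who/when/what/why record. This is a sketch, assuming the window is measured from the time the output was produced; `CorrectionRecord` and `correct_output` are illustrative names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CorrectionRecord:
    who: str
    when: datetime
    what: str
    why: str


def correct_output(produced_at: datetime, now: datetime, window: timedelta,
                   who: str, what: str, why: str) -> CorrectionRecord:
    """Return an audit record if the correction is still inside the window."""
    if now - produced_at > window:
        raise ValueError("correction window has closed")
    return CorrectionRecord(who=who, when=now, what=what, why=why)
```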
6. Rollback Procedures
How to undo what an agent did and restore the previous state.
6.1 When to Roll Back
[Agent produced incorrect output that reached production]
[Integration failure caused corrupted data in downstream systems]
[Agent operated outside its intended scope or guardrails]
6.2 Rollback Steps
Fill in specific rollback procedures for each agent and integration.
| Agent | Rollback Method | Data Recovery | Verification |
|---|---|---|---|
| [Agent Name] | [Step-by-step rollback procedure] | [How to restore previous data state] | [How to confirm rollback succeeded] |
| [Agent Name] | [Step-by-step rollback procedure] | [How to restore previous data state] | [How to confirm rollback succeeded] |
| [Agent Name] | [Step-by-step rollback procedure] | [How to restore previous data state] | [How to confirm rollback succeeded] |
| [Agent Name] | [Step-by-step rollback procedure] | [How to restore previous data state] | [How to confirm rollback succeeded] |
6.3 Post-Rollback
[1. Confirm all affected systems are back to known-good state]
[2. Notify all stakeholders of the rollback and its scope]
[3. Log the incident — what happened, why, what was rolled back, impact assessment]
[4. Conduct root cause analysis within 48 hours]
[5. Update this runbook if the incident revealed a gap]
7. Audit Trail
7.1 What Is Logged
Every agent action is logged. This section defines what the audit trail captures.
| Event Type | Data Captured | Retention Period |
|---|---|---|
| Agent Invocation | Timestamp, agent ID, trigger, inputs, outputs, duration, status | [30/60/90 days] |
| Human Approval | Timestamp, reviewer, decision (approve/reject), reason if rejected | [90 days] |
| Override / Stop | Timestamp, who, reason, affected items, resolution | [1 year] |
| Escalation | Timestamp, severity, escalation path, resolution, time to resolve | [1 year] |
| Rollback | Timestamp, who, scope, data affected, verification | [1 year] |
| Configuration Change | Timestamp, who, what changed, previous value, new value | [1 year] |
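One way to make the retention column actionable is to stamp each audit record with an expiry date when it is written. The sketch below assumes JSON records, treats "1 year" as 365 days, and picks 90 days for agent invocations purely for illustration, since the table leaves those values as placeholders. The event-type keys and the `audit_record` function are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone

# Placeholder retention periods in days; replace with the agreed values.
RETENTION_DAYS = {
    "agent_invocation": 90,
    "human_approval": 90,
    "override_stop": 365,
    "escalation": 365,
    "rollback": 365,
    "configuration_change": 365,
}


def audit_record(event_type: str, timestamp: datetime, **data) -> str:
    """Serialize one audit event as a JSON line with an expiry timestamp."""
    retention = timedelta(days=RETENTION_DAYS[event_type])  # KeyError if unknown
    record = {
        "event_type": event_type,
        "timestamp": timestamp.isoformat(),
        "expires_at": (timestamp + retention).isoformat(),
        "data": data,
    }
    return json.dumps(record, sort_keys=True)
```

A rollback event would then be logged as `audit_record("rollback", when, who=..., scope=..., verification=...)`, mirroring the Data Captured column.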
7.2 Accessing Audit Logs
| Field | Value |
|---|---|
| Log Location | [Dashboard / database / log management system] |
| Access Method | [URL, query tool, API endpoint] |
| Access Permissions | [Who can read logs? Who can export?] |
| Search Capabilities | [By agent, by date, by severity, by event type] |
8. Incident Log
Track all incidents here. Review monthly to identify patterns and update procedures.
| Date | SEV | Summary | Root Cause | Resolution | Duration |
|---|---|---|---|---|---|
| [Date] | [1-4] | [What happened] | [Why] | [What fixed it] | [Time to resolve] |
| [Date] | [1-4] | [What happened] | [Why] | [What fixed it] | [Time to resolve] |
| [Date] | [1-4] | [What happened] | [Why] | [What fixed it] | [Time to resolve] |
| [Date] | [1-4] | [What happened] | [Why] | [What fixed it] | [Time to resolve] |
| [Date] | [1-4] | [What happened] | [Why] | [What fixed it] | [Time to resolve] |
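For the monthly review, a simple count of root causes is often enough to surface repeat offenders. The sketch below assumes the incident log has been exported into records shaped like the table columns; `Incident` and `recurring_root_causes` are hypothetical names, and root causes are matched by exact string, so consistent wording in the log matters.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Incident:
    day: date
    sev: int
    summary: str
    root_cause: str
    resolution: str
    duration_minutes: int


def recurring_root_causes(incidents, year: int, month: int, min_count: int = 2):
    """Return (root_cause, count) pairs seen at least min_count times."""
    causes = Counter(
        i.root_cause for i in incidents
        if i.day.year == year and i.day.month == month
    )
    # most_common() sorts by count, highest first.
    return [(cause, n) for cause, n in causes.most_common() if n >= min_count]
```

Any root cause that appears in the result is a candidate for a procedure update in this runbook.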
9. Change Log
| Date | Version | Changed By | Description |
|---|---|---|---|
| [YYYY-MM-DD] | [1.0] | [Governance Architect] | Initial version — created during Handoff stage |