Agent Runtime.
The technical substrate on which Umbra Studio builds and runs agent systems. Claude API and model tier selection, the Agent SDK, MCP server inventory and patterns, prompt versioning, state-machine infrastructure, observability, deployment targets, and cost caps. This is Layer 3 of the stack map — the machinery between a spec and a production run.
Paired with the Lighthouse Agent Spec Sheet template (USO-LH-11) and the Governance Runbook (USO-LH-07). Owned by the Agent Engineer; read by everyone who touches agent code, specs, or deploys.
How to use this doc.
This is the L3 operating manual. It is read front-to-back once per quarter and consulted point-by-point during every sprint. It tells the Agent Engineer which model to call, where state lives, how prompts are versioned, how observability lands on the state sheet, and which deployment target to use. It is the first doc to update when the underlying tooling changes — Anthropic ships a new model, a new MCP server enters the inventory, a workflow graduates from Sheets to a real database.
This doc is not a tutorial on Claude or the API — Anthropic’s documentation is the canonical reference for that. This is the Studio’s opinionated slice: how we choose, what we pair things with, and what we refuse to do.
What this doc covers · what it does not
| Covered here | Lives elsewhere |
|---|---|
| Claude model selection (Opus / Sonnet / Haiku) | The actual prompt content for a given agent — lives in the client repo & agent spec |
| Agent SDK usage patterns we standardize on | API reference & SDK changelog — Anthropic docs |
| MCP server inventory & “build vs. buy” rules | Internal knowledge about a client’s specific integration — client runbook |
| Google Sheets + Apps Script as state machine | When to hand off governance to the client — Handoff Playbook (USO-LH-06) |
| Prompt versioning & regression testing | Prompt engineering craft — Anthropic’s prompt engineering guide |
| Observability (what to log, where) | The six governance components in depth — Governance Operations (USO-LH-07) |
| Deployment targets & when to use each | Vercel / GCP provider docs |
| Cost caps & rate limits we enforce | Budget envelope & per-sprint variable — Master (USO-ST-01) §04 |
Runtime at a glance.
Every Lighthouse Sprint deploys the same shape, regardless of the client’s domain. There are three runtime layers and they never vary: the model layer, the orchestration layer, and the state layer. Clients change; wrapper shapes change; this does not.
Model layer.
Where reasoning happens. Claude API, three tiers. All agent decisions, extractions, and synthesis run here. No reasoning gets embedded in glue code.
Orchestration layer.
Where the model calls tools, loops, and hands control around. Claude Agent SDK + MCP servers + a small, boring harness of our own. This layer is where we earn the right to own patterns.
State layer.
Where the workflow lives between runs. Google Sheet + Apps Script, until a named pressure graduates the workflow to Airtable or a real database. USP-009 governs when to move.
The default sprint runtime — what gets deployed on every engagement
| Component | Default | Purpose | When we swap |
|---|---|---|---|
| Model | Claude Opus / Sonnet / Haiku | Reasoning for every agent run | Never swap the vendor; do swap the tier per run. |
| Orchestration | Claude Agent SDK (TypeScript) | Tool calls, loops, safety rails | Raw-API harness only for throwaway scripts. |
| Tools | MCP servers (official first, custom second) | External system access (Slack, Notion, Sheets, HTTP, etc.) | Custom MCP when no official server exists. |
| State | Google Sheets + Apps Script | Workflow queue, governance columns, audit log | Airtable / Supabase when §05 pressures fire. |
| Code hosting | GitHub private repo per client | Source, review, CI | Never mixed across clients. |
| Deployment | Vercel Functions (simple) · Cloud Run (long-running) | Where agent code runs | Per §08 criteria. |
| Secrets | 1Password + Vercel / GCP secret manager | API keys, client creds | Never env vars committed to repo. |
| Observability | State sheet columns + Cloud Logging + email alerts | Monitoring, alerting, audit trail | External SaaS only if client-mandated. |
Model tiers.
Every agent spec declares a default model tier in its front matter. That choice is reviewed at the Redesign gate. Tier is the single largest cost and latency lever in the entire runtime — the wrong default blows the variable-cost line within a week.
Planning & synthesis (Opus).
Highest reasoning ceiling. Use for workflows where the model must hold a lot in its head at once and produce a single high-stakes output. Slower. Most expensive.
Execution & agents (Sonnet).
Default for most production agent runs. Strong reasoning with usable latency and reasonable cost. Handles structured output, tool calls, long chains. The Studio’s default answer when an engineer is not sure.
Routing & guards (Haiku).
Fast and cheap. Use for tight-loop, high-volume calls: classification, routing, format checks, pre-condition guards (USP-014). If you are about to call Opus 10 times in sequence, a Haiku router in front of it is usually the right move.
Decision table — which tier at spec-write time
| If the agent… | Default tier | Why |
|---|---|---|
| Runs at most a few times per week | Sonnet | Cost pressure is low; reasoning headroom matters more. |
| Makes dozens of model calls per run (fan-out inside one invocation) | Haiku for the fan-out, Sonnet for the orchestrator | Cost multiplies with fan-out. |
| Writes a document a human will read | Sonnet (default) or Opus (stakes) | Quality bar matters; not cost-bound. |
| Classifies / routes / picks one-of-N | Haiku | Classification does not need Sonnet’s reasoning. |
| Produces structured JSON from messy input | Sonnet | Has the judgment for schema repair; Haiku is often too brittle. |
| Holds > 40 pages of context | Opus | Long-context reasoning is a reason to pay for Opus. |
| Needs to be audited by a reviewer agent | Sonnet for both | Matched-tier reviewers catch each other’s drift; mismatched ones thrash. |
| Runs inside a pre-condition guard (USP-014) | Haiku | Guards run constantly; cost pressure is real. |
| Writes code a human will deploy | Opus | The one place we refuse to trade quality for cost. |
| Synthesizes interview transcripts into themes | Opus | Planning/synthesis territory. |
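A minimal sketch of how part of the decision table could be encoded as a spec-write helper. The `TierHints` shape and the `defaultTier` name are hypothetical, and letting the context-size rule win over the task rule is our assumption, not something the table states.

```typescript
type ModelTier = "haiku" | "sonnet" | "opus";

// Hypothetical spec hints; field names are illustrative, not part of the spec sheet template.
interface TierHints {
  task: "classify" | "guard" | "structured-json" | "document" | "code" | "synthesis" | "other";
  contextPages?: number; // approximate pages of context the agent must hold
  highStakes?: boolean;  // a human-read document where the stakes justify Opus
}

// Encodes a subset of the decision table; Sonnet is the fallback when nothing else applies.
function defaultTier(hints: TierHints): ModelTier {
  // Assumption: long context outranks the task-based rows.
  if ((hints.contextPages ?? 0) > 40) return "opus";
  switch (hints.task) {
    case "classify":
    case "guard":
      return "haiku";
    case "code":
    case "synthesis":
      return "opus";
    case "document":
      return hints.highStakes ? "opus" : "sonnet";
    default:
      return "sonnet"; // structured JSON and anything unclassified
  }
}
```

The function only proposes a default; the Redesign gate still reviews the choice.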
Cost-to-latency-to-quality intuition — rough numbers
| Tier | Relative cost | Typical latency | Quality ceiling |
|---|---|---|---|
| Haiku 4.5 | 1× | ~1–3 s | Good-enough for structured tasks; brittle on ambiguity. |
| Sonnet 4.6 | ~5–10× | ~3–8 s | Studio default — rarely the wrong choice. |
| Opus 4.6 | ~25–50× | ~6–20 s | Highest reasoning; stop and ask whether it’s needed. |
Actual numbers depend on prompt length, output length, tool-call fan-out, and Anthropic’s current pricing — check the docs before rebuilding a variable-cost model. The shape of the tradeoff is what matters for spec-write decisions.
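To make the shape of the tradeoff concrete, here is a rough arithmetic sketch for a fan-out run. It assumes every call uses a similar number of tokens, and it takes the midpoints of the ranges in the table as the Sonnet and Opus multipliers; both are simplifying assumptions, and `relativeRunCost` is a hypothetical helper.

```typescript
// Relative cost units per call. Haiku = 1 per the table above;
// the Sonnet and Opus figures are assumed midpoints of the quoted ranges.
const RELATIVE_COST = { haiku: 1, sonnet: 7.5, opus: 37.5 } as const;
type CostTier = keyof typeof RELATIVE_COST;

// Total relative cost of one invocation, given how many calls go to each tier.
function relativeRunCost(calls: Partial<Record<CostTier, number>>): number {
  let total = 0;
  for (const tier of Object.keys(calls) as CostTier[]) {
    total += RELATIVE_COST[tier] * (calls[tier] ?? 0);
  }
  return total;
}

// One orchestrator call plus 30 fan-out calls, all on Sonnet: 232.5 units.
const allSonnet = relativeRunCost({ sonnet: 31 });
// Same run with a Sonnet orchestrator and Haiku fan-out: 37.5 units.
const mixedTiers = relativeRunCost({ sonnet: 1, haiku: 30 });
```

Under these assumptions the mixed-tier run costs roughly a sixth of the all-Sonnet run, which is why the fan-out row in the decision table splits tiers.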
MCP servers.
MCP is how agents reach the world. A Studio agent almost never calls an external API directly — it calls a Model Context Protocol server that wraps the API with a tool interface, an auth scope, and observable boundaries. This section names the servers we rely on, the rules for when to build a custom one, and where they run.
Standard inventory — servers we reach for first
| Server | Purpose | Source | Auth |
|---|---|---|---|
| slack | Post messages, read threads, react, scheduled sends | Official | OAuth, per workspace |
| gmail | Read, search, draft, label | Official | OAuth, per account |
| google-calendar | Read / create / move events | Official | OAuth, per account |
| google-drive | Read doc/sheet/slide metadata & content | Official | OAuth, per account |
| sheets-state | Our own wrapper over Sheets API — state-machine primitives (lock row, write status, append audit) | Custom (Studio) | Service account |
| notion | Page create/edit, database query | Official | Integration token |
| linear | Issue create/update, cycle read | Official | API key |
| github | PR, issue, branch, file | Official | PAT or app |
| fetch | Constrained HTTP GET with allow-list of hosts | Official | None (host allow-list only) |
| wp-publish | WordPress draft / publish — used by Indietheka pipelines | Custom (Studio) | App password |
| album-art | Cover-art fetch (Album of the Year lookup) | Custom (Studio) | None |
Build vs. buy — when we write a custom MCP server
We default to an official or community-published MCP server whenever one exists and fits. We write a custom server only when the task is load-bearing enough to own or when no public server covers it.
- No server exists. The target system has no official or community MCP server, and wrapping the API is cheaper than having a human do the task by hand every day.
- The existing server is too broad. Official servers often expose hundreds of tools; an agent works best with 5–15 scoped tools. A custom wrapper pares down.
- We need Studio-specific primitives on top. Example: sheets-state exists because raw Sheets access does not have “lock row + update status + append audit” as a single atomic operation.
- Audit / observability requirements are not met. If we need to log every call with a correlation ID and the existing server will not, we wrap it.
- Three sprints have reused the custom wrapper we built ad hoc. Promote to a shared internal MCP server in the Studio catalog.
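The "too broad" rule above can be sketched as a small scoping step in front of an existing server. This is illustrative only: `ToolDescriptor` is a simplified stand-in for a real MCP tool definition, `scopeTools` is a hypothetical helper, and the tool names in the usage lines are invented.

```typescript
// Minimal tool descriptor; real MCP tool definitions carry more fields (description, input schema).
interface ToolDescriptor {
  name: string;
}

// Keep only the tools named in the allow-list, and refuse a surface above the 15-tool ceiling.
function scopeTools(available: ToolDescriptor[], allowList: string[], maxTools = 15): ToolDescriptor[] {
  const scoped = available.filter((tool) => allowList.indexOf(tool.name) !== -1);
  if (scoped.length > maxTools) {
    throw new Error(`Scoped surface has ${scoped.length} tools; limit is ${maxTools}`);
  }
  return scoped;
}

// Usage with invented tool names: four tools exposed, two kept.
const broadSurface: ToolDescriptor[] = [
  { name: "post_message" },
  { name: "read_thread" },
  { name: "add_reaction" },
  { name: "delete_channel" },
];
const scopedSurface = scopeTools(broadSurface, ["post_message", "read_thread"]);
```

Failing loudly when the scoped surface is still too large keeps the ceiling a design decision rather than a silent truncation.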
Where MCP servers run
Co-located.
The MCP server ships in the same repo as the agent harness and runs on the same host, typically as a stdio-transport server that the harness spawns as a child process. Simple; no network hop; easy to debug.
Use when
- The server is lightweight and stateless.
- Only one agent uses it.
- The agent runs on a single deployment target.
Shared service.
The MCP server runs as its own HTTP service (Cloud Run, Vercel Function, or a long-running container). Multiple agents across multiple repos connect over HTTP(S).
Use when
- Two or more agents need the same tool surface.
- The server has meaningful state or caches.
- Credentials are scoped to the server, not the agents.
Operator-local.
The MCP server runs on a human operator’s machine, e.g., Claude Desktop with a local MCP server for file system access or Apple Notes. Production agents do not use this pattern.
Use when
- A human is driving the session interactively.
- The tool requires access to local-only resources (clipboard, native app state).
- The risk envelope makes remote execution undesirable.
State-machine infrastructure.
The spreadsheet-as-state-machine pattern (USP-009) is not a nostalgia play. Google Sheets + Apps Script is the Studio’s default workflow substrate because it gets four things right that bespoke databases get wrong on sprint timelines: the client can see the state live, atomic-enough row locking exists, audit history is free, and rollback is a copy-paste.
The canonical sheet shape — six tabs, every time
| Tab | Purpose | Who writes |
|---|---|---|
| queue | Live work items — each row is one workflow instance; status column drives agent selection | Humans + agents |
| config | Flags, thresholds, cost caps, allow-lists. Agents read; humans edit. | Humans only |
| audit | Append-only. Every agent run appends a row with correlation ID, agent name, model, tokens, dollars, outcome. | Agents only |
| alerts | Active alerts (severity, first-seen, last-ack). Humans clear rows. | Agents write, humans clear |
| metrics | Rolling counters: runs today, errors today, dollars today, dollars MTD. | Agents only |
| readme | One-page README explaining the sheet. For the client. Never empty. | Humans only |
Row-locking & concurrency discipline
Apps Script has row-level concurrency gotchas. The Studio’s sheets-state MCP server wraps a lock-row primitive that uses LockService.getDocumentLock() plus a lock_token column. Every agent that claims a row writes its token; every release clears it. Stale tokens older than a timeout are reaped by a janitor trigger.
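The claim / release / reap cycle can be modelled in memory as below. This is a sketch of the token discipline only, not the sheets-state implementation: in production the lock_token column lives in the sheet and each mutation would happen while the Apps Script document lock is held. The `LockableRow` shape and function names are hypothetical.

```typescript
// In-memory stand-in for a queue row's locking columns.
interface LockableRow {
  id: string;
  lockToken: string | null;
  lockedAtMs: number | null;
}

// Claim a row for one agent run. Returns false if another token already holds it.
function claimRow(row: LockableRow, token: string, nowMs: number): boolean {
  if (row.lockToken !== null) return false;
  row.lockToken = token;
  row.lockedAtMs = nowMs;
  return true;
}

// Release only succeeds for the token that made the claim.
function releaseRow(row: LockableRow, token: string): boolean {
  if (row.lockToken !== token) return false;
  row.lockToken = null;
  row.lockedAtMs = null;
  return true;
}

// Janitor pass: clear tokens older than the timeout and report how many were reaped.
function reapStaleLocks(rows: LockableRow[], nowMs: number, timeoutMs: number): number {
  let reaped = 0;
  for (const row of rows) {
    if (row.lockToken !== null && row.lockedAtMs !== null && nowMs - row.lockedAtMs > timeoutMs) {
      row.lockToken = null;
      row.lockedAtMs = null;
      reaped += 1;
    }
  }
  return reaped;
}
```

Passing the clock in as `nowMs` keeps the logic deterministic, which makes the lock-row smoke test straightforward to write.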
Graduation path — when to leave Sheets
Graduation is a move to Airtable (hosted, richer API, still visible to humans) or Supabase / Postgres (real DB, structured schemas, SQL, RLS). One of the five §07 pressures from the master must apply. We do not graduate because a spreadsheet “feels unprofessional.”
Airtable keeps the “humans can see the state” property while giving better concurrency and an API that does not suffer from Apps Script’s execution-time limits. Most workflows stay here.
Graduate to Airtable when
- Queue tab exceeds ~5,000 rows and interactive editing becomes slow.
- More than 3 agents need to write concurrently.
- A trigger-driven flow has hit Apps Script quota limits.
Alternatives to Airtable
- Notion databases — if the client already lives in Notion, acceptable substitute.
- Baserow / NocoDB — open-source alt if data residency is a requirement.
Supabase / Postgres is the move when the workflow has outgrown what a human-editable table can hold: strict schemas, constraints, transactional semantics, row-level security, real query needs. The state machine loses its “client can see it live” property unless we pair it with a thin dashboard (a Lovable or Retool page pointed at the DB).
Graduate to Supabase when
- You need real foreign-key constraints and multi-table transactions.
- Query latency must be sub-second.
- Compliance mandates data residency, row-level security, or audit at the DB level.
- Volume is past ~100k rows and growing.
Alternatives to Supabase
- Neon — Postgres with branching; nicer for dev / preview environments.
- PlanetScale — MySQL-compatible if the client is in that ecosystem.
- Turso — edge SQLite if the workload is read-heavy and distributed.
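The graduation criteria above can be read as a simple check. The sketch below treats the approximate thresholds (~5,000 rows, ~100k rows) as hard cut-offs, which is a simplification, and it does not replace the requirement that a named pressure from the master applies. The `StatePressure` fields and `recommendBackend` are hypothetical names.

```typescript
type StateBackend = "sheets" | "airtable" | "supabase";

// Hypothetical snapshot of a workflow's state-layer pressure; field names are illustrative.
interface StatePressure {
  queueRows: number;
  concurrentWriterAgents: number;
  hitAppsScriptQuota: boolean;
  needsTransactionsOrForeignKeys: boolean;
  needsSubSecondQueries: boolean;
  dbLevelComplianceMandate: boolean;
}

// Supabase criteria are checked first because they are the stronger requirement.
function recommendBackend(p: StatePressure): StateBackend {
  if (
    p.needsTransactionsOrForeignKeys ||
    p.needsSubSecondQueries ||
    p.dbLevelComplianceMandate ||
    p.queueRows > 100000
  ) {
    return "supabase";
  }
  if (p.queueRows > 5000 || p.concurrentWriterAgents > 3 || p.hitAppsScriptQuota) {
    return "airtable";
  }
  return "sheets";
}
```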
queue table exports back to a Sheet in under 30 minutes. Reversibility is cheap to design up front, expensive to retrofit.
Prompt versioning.
Prompts are code. We treat them like code — they live in the repo, they version with semver, they have tests, they land through PRs. The Studio’s refusal to let prompts live in a prompt-store SaaS is deliberate: vendor lock-in on the most valuable artifact in an agent system is unacceptable (P-01).
File layout · one prompt per file, one file per version
Each versioned file has front matter: model_tier, temperature, max_tokens, tools, created_at, author, changelog. The body is the prompt itself — in Markdown, with sectioned instructions and examples as headings.
The CURRENT.md symlink is what the agent runtime loads. Rolling back is ln -sf v1.1.0.md CURRENT.md and a deploy. This is the cheapest rollback we have.
Semver discipline for prompts
| Change | Bump | Example |
|---|---|---|
| Fix a typo, clarify a sentence | patch | 1.1.0 → 1.1.1 |
| Add a new section or capability without removing any | minor | 1.1.1 → 1.2.0 |
| Remove or restructure instructions, change output schema, change model tier | major | 1.2.0 → 2.0.0 |
| Swap model vendor (we don’t, but hypothetically) | major | always major |
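The bump rules in the table reduce to ordinary semver arithmetic. A minimal sketch, with `bumpPromptVersion` as a hypothetical helper name:

```typescript
type PromptChange = "patch" | "minor" | "major";

// Compute the next prompt version from the current one and the kind of change.
function bumpPromptVersion(current: string, change: PromptChange): string {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(current);
  if (!match) throw new Error(`Not a semver string: ${current}`);
  const major = Number(match[1]);
  const minor = Number(match[2]);
  const patch = Number(match[3]);
  if (change === "major") return `${major + 1}.0.0`; // resets minor and patch
  if (change === "minor") return `${major}.${minor + 1}.0`; // resets patch
  return `${major}.${minor}.${patch + 1}`;
}
```

Applied to the table's examples, a patch takes 1.1.0 to 1.1.1, a minor takes 1.1.1 to 1.2.0, and a major takes 1.2.0 to 2.0.0.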
Regression discipline — evals against golden cases
Every prompt with ≥ 10 production runs has at least five golden cases in evals/<agent>/cases/. A golden case is an input + an expected shape of output (not a string match — a judge-prompt or a structural check). The eval runner replays every golden case against CURRENT.md and reports pass/fail.
Rules: major bumps require all goldens to pass. Minor bumps require goldens in the affected section to pass. Patch bumps require no regressions on any golden. Any golden marked load-bearing in its front matter is non-negotiable at every bump.
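One way to read these rules as a gate is sketched below. It assumes the eval runner can report, per golden case, the result against both the previous and the candidate version; the `GoldenResult` fields and `blockingGoldens` are hypothetical, not the runner's actual output format.

```typescript
interface GoldenResult {
  id: string;
  section: string;       // prompt section the case exercises (hypothetical field)
  loadBearing: boolean;  // marked load-bearing in the case's front matter
  passedBefore: boolean; // result against the previous version
  passesNow: boolean;    // result against the candidate version
}

type BumpKind = "patch" | "minor" | "major";

// Returns the ids of golden cases that block the bump; an empty list means the bump may land.
function blockingGoldens(bump: BumpKind, results: GoldenResult[], affectedSections: string[] = []): string[] {
  return results
    .filter((r) => {
      if (r.passesNow) return false;
      if (r.loadBearing) return true; // non-negotiable at every bump
      if (bump === "major") return true; // all goldens must pass
      if (bump === "minor") return affectedSections.indexOf(r.section) !== -1; // affected section only
      return r.passedBefore; // patch: only regressions block
    })
    .map((r) => r.id);
}
```

Returning the blocking ids rather than a boolean gives the PR reviewer something concrete to act on.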
Observability.
Observability is P-04 in action: governance is a column, not a tool. Every agent run produces signals that land in one of three places — the state sheet, the cloud logs, and an alert channel. We instrument every agent against this three-target discipline; nothing gets promoted to production without it.
The three observability targets
State sheet.
Every run appends a row to the audit tab. Every rollup refreshes the metrics tab. Every alert writes to alerts. The client opens one URL and sees the system.
Cloud logs.
Google Cloud Logging (or Vercel logs) captures raw request/response, stack traces, and latencies that do not belong in a sheet. Queryable by correlation ID. For engineers, not for clients.
Alert channel.
Email (primary) + Slack DM (secondary) for SEV-1/2 conditions. Silent SEV-3 rows append to the alerts tab for the next review. Nothing SEV-1 is silent.
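The routing described above is small enough to state as a lookup. This sketch assumes email and Slack DM are both sent for SEV-1/2 (rather than Slack being a fallback) and that every alert is also recorded on the alerts tab; the type and function names are hypothetical.

```typescript
type AlertSeverity = "sev-1" | "sev-2" | "sev-3";
type AlertChannel = "email" | "slack-dm" | "alerts-tab";

// Every alert is recorded on the alerts tab; SEV-1 and SEV-2 also notify a human.
function channelsFor(sev: AlertSeverity): AlertChannel[] {
  if (sev === "sev-3") return ["alerts-tab"]; // silent: waits for the next review
  return ["email", "slack-dm", "alerts-tab"];
}
```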
Canonical audit row schema
| Field | Type | Required | Example |
|---|---|---|---|
| run_id | string | yes | 2026-04-21T14:03:11.r7qk |
| agent | string | yes | album-reviewer |
| version | string | yes | 1.1.1 |
| model | string | yes | claude-sonnet-4-6 |
| input_tokens | int | yes | 4,218 |
| output_tokens | int | yes | 1,903 |
| usd | decimal | yes | 0.0382 |
| duration_ms | int | yes | 5,441 |
| outcome | enum | yes | ok · retry · fail · skip |
| sev | enum | if not ok | sev-1 · sev-2 · sev-3 |
| error | string | if fail | rate_limit_429 |
| notes | string | optional | Free text. |
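The schema translates directly into a type plus a check for the two conditional fields. A minimal sketch: the `AuditRow` interface mirrors the table, while `auditRowProblems` and its messages are hypothetical.

```typescript
type RunOutcome = "ok" | "retry" | "fail" | "skip";
type RunSeverity = "sev-1" | "sev-2" | "sev-3";

interface AuditRow {
  run_id: string;
  agent: string;
  version: string;
  model: string;
  input_tokens: number;
  output_tokens: number;
  usd: number;
  duration_ms: number;
  outcome: RunOutcome;
  sev?: RunSeverity; // required when outcome is not "ok"
  error?: string;    // required when outcome is "fail"
  notes?: string;
}

// Returns a list of problems; an empty list means the row satisfies the conditional rules.
function auditRowProblems(row: AuditRow): string[] {
  const problems: string[] = [];
  if (row.outcome !== "ok" && row.sev === undefined) problems.push("sev is required when outcome is not ok");
  if (row.outcome === "fail" && !row.error) problems.push("error is required when outcome is fail");
  if (row.input_tokens < 0 || row.output_tokens < 0) problems.push("token counts must be non-negative");
  return problems;
}

// The example values from the table above.
const exampleAuditRow: AuditRow = {
  run_id: "2026-04-21T14:03:11.r7qk",
  agent: "album-reviewer",
  version: "1.1.1",
  model: "claude-sonnet-4-6",
  input_tokens: 4218,
  output_tokens: 1903,
  usd: 0.0382,
  duration_ms: 5441,
  outcome: "ok",
};
```

Running a check like this before the append is one way to enforce the "no silent runs" rule at the harness boundary.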
Deployment targets.
Three runtime targets handle every workload we build. Each has a narrow niche. Misplacing a workload between them is the second most common cost-drift cause after model-tier choice.
Vercel Functions.
Short-lived, stateless, HTTP-triggered. Co-located with web properties. Trivially cheap. The Studio’s first reach.
Use when
- Agent invocations complete in under ~60 seconds.
- Triggered by HTTP, cron, or webhook.
- Stateless between runs (state lives on the sheet / DB).
Alternatives
- Cloudflare Workers — lower latency edge, tighter resource ceilings.
- Deno Deploy — if a workload is strictly Deno-native.
Cloud Run.
Container-based, scales to zero, request timeouts up to 60 minutes, memory configurable up to 32 GiB (files written to local disk count against instance memory), HTTP or Pub/Sub triggers. The Studio’s pick when Vercel’s 60-second ceiling is not enough.
Use when
- Agent invocations may run several minutes (Opus on long contexts; fan-out chains).
- Custom MCP servers running as HTTP services.
- Workloads needing more than ~1 GB memory.
Alternatives
- Fly.io — similar niche; better for persistent-connection workloads.
- AWS Lambda + Fargate — if the client already lives in AWS.
- Render / Railway — simpler DX at the cost of some observability depth.
Apps Script.
Google-owned, runs inside the sheet, perfect for cheap triggers. Never put reasoning here: Apps Script is glue, not compute. A trigger calls out to a Vercel Function or Cloud Run service that calls Claude.
Use when
- You need an onEdit / onFormSubmit / time-based trigger to bootstrap an agent run.
- A thin web-app endpoint is needed to accept writes back into the sheet.
- Cheap cron inside the sheet’s own quota envelope.
Alternatives
- Zapier / Make — if a non-developer client owns the trigger fabric.
- Vercel cron — when the trigger is not sheet-anchored.
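The "use when" criteria for the two compute targets can be sketched as a picker. Apps Script is left out on purpose, since it only triggers runs. The `WorkloadProfile` fields and `pickDeployTarget` are hypothetical, and treating the approximate thresholds (~60 seconds, ~1 GB) as hard limits is a simplification.

```typescript
type DeployTarget = "vercel-functions" | "cloud-run";

// Hypothetical workload profile; field names are illustrative.
interface WorkloadProfile {
  maxDurationSeconds: number; // worst-case length of one agent invocation
  memoryGb: number;           // peak memory the workload needs
  servesMcpOverHttp: boolean; // true for a custom MCP server exposed as an HTTP service
}

// Anything that breaks a Vercel Functions criterion goes to Cloud Run.
function pickDeployTarget(w: WorkloadProfile): DeployTarget {
  if (w.maxDurationSeconds >= 60 || w.memoryGb > 1 || w.servesMcpOverHttp) return "cloud-run";
  return "vercel-functions";
}
```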
Cost caps & rate limits.
Runaway agent loops are the single most expensive accidental behavior in the Studio’s runtime. Every agent is wrapped in three levels of cap: per-run, per-day, per-sprint. When any cap fires, the harness halts or pauses the affected agent and raises an alert at the severity the table below assigns to that cap. Caps are non-negotiable.
| Cap | Scope | Default | What happens on breach |
|---|---|---|---|
| Tokens per run | Single invocation | max 200k in + 32k out | Harness aborts; audit row with outcome=fail, error=token_cap. |
| Dollars per run | Single invocation | $3.00 (Sonnet) · $12 (Opus) | Harness aborts; audit row + SEV-3 alert. |
| Tool calls per run | Single invocation | 40 | Harness aborts; usually indicates agent is stuck in a loop. |
| Dollars per agent per day | Agent-scoped rolling 24 h | $50 default, per-spec override | Pause new invocations; existing runs complete; SEV-2 alert. |
| Dollars per sprint total | All agents in a client repo | Per §04 variable envelope × 1.25 | Pause all agents; SEV-1 alert to Sprint Lead; manual release required. |
| Concurrent runs per agent | Agent-scoped | 3 | Queue incoming; reject after queue depth 10. |
| API rate limit (Anthropic) | Account-wide | Per tier | Honor 429; exponential backoff with jitter; SEV-3 if sustained > 5 min. |
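For the rate-limit row, a minimal sketch of exponential backoff with jitter. It uses the "full jitter" variant (a uniform delay between zero and the exponential ceiling), which is our choice of variant; the base and cap defaults are assumed values, and the function names are hypothetical.

```typescript
// Full-jitter exponential backoff: a uniform delay in [0, min(capMs, baseMs * 2^attempt)).
// baseMs and capMs are assumed defaults, not values taken from the Studio harness.
function backoffDelayMs(
  attempt: number,
  baseMs = 1000,
  capMs = 60000,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// A 429 condition that has persisted for more than five minutes is escalated to SEV-3.
function isSustainedRateLimit(firstSeenMs: number, nowMs: number): boolean {
  return nowMs - firstSeenMs > 5 * 60 * 1000;
}
```

The random source is injectable so the delay schedule can be tested deterministically.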
Circuit-breaker pattern — the harness enforces; the sheet records
The harness checks caps before each API call and after each tool call. On breach, it halts, writes the audit row with outcome=fail and an error that names the cap, appends a row to alerts, and sends the email. A human clears the alert only after the root cause is understood — caps do not auto-reset.
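The per-run part of the circuit breaker can be sketched as a pure check plus a halting wrapper. This is illustrative, not the harness itself: the `RunCaps` and `RunUsage` shapes are hypothetical, only `token_cap` appears in the table as an error name (the other two are invented here), and writing the audit row, alert row, and email is left out.

```typescript
interface RunCaps {
  maxInputTokens: number;
  maxOutputTokens: number;
  maxUsd: number;
  maxToolCalls: number;
}

interface RunUsage {
  inputTokens: number;
  outputTokens: number;
  usd: number;
  toolCalls: number;
}

// Per-run defaults from the table above, using the Sonnet dollar cap.
const SONNET_RUN_CAPS: RunCaps = { maxInputTokens: 200000, maxOutputTokens: 32000, maxUsd: 3, maxToolCalls: 40 };

// Returns the name of the first breached cap, or null if the run may continue.
function breachedCap(usage: RunUsage, caps: RunCaps): string | null {
  if (usage.inputTokens > caps.maxInputTokens || usage.outputTokens > caps.maxOutputTokens) return "token_cap";
  if (usage.usd > caps.maxUsd) return "usd_cap"; // hypothetical error name
  if (usage.toolCalls > caps.maxToolCalls) return "tool_call_cap"; // hypothetical error name
  return null;
}

// Intended to run before each API call and after each tool call; throwing halts the run.
function enforceCaps(usage: RunUsage, caps: RunCaps): void {
  const cap = breachedCap(usage, caps);
  if (cap !== null) throw new Error(`cap breached: ${cap}`);
}
```

Keeping `breachedCap` free of side effects makes the Gate 3 cap drill easy to rehearse in a unit test before triggering it against a live agent.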
Agent Engineer checklist.
Pre-sprint (before Day 1)
- Claude API key provisioned for the client’s scope; billing alarm at 50% / 80% / 100% of sprint envelope.
- GitHub private repo created; CI configured; CURRENT.md symlink pattern scaffolded.
- State sheet template forked; six canonical tabs in place; README populated.
- sheets-state MCP server connected; lock-row smoke test passes.
- Deployment target chosen (Vercel vs. Cloud Run) per §08 criteria.
During Build (Weeks 5–9)
- Each new agent has a spec with explicit model_tier declaration before first run.
- Each agent has at least five golden eval cases before Week 7.
- Cost caps configured per §09 before the first production invocation.
- Audit row schema validated end-to-end for every agent (no silent runs).
- Weekly review of the metrics tab: variable cost tracking within the envelope.
At Gate 3 (Production Readiness)
- All agents green on their golden evals at CURRENT.md.
- Cost caps firing cleanly in a drill (deliberately trigger one; confirm harness halts and alert fires).
- Rollback drill: ln -sf v-prev.md CURRENT.md, deploy, verify previous behavior. Under 15 minutes.
- Secrets rotation: every key used in sprint was created or rotated inside the sprint window; nothing inherited from before.
- Observability signoff: three targets confirmed populating for every agent.
At Gate 4 (Independence Verified)
- Client holds their own Anthropic account; Studio key revoked from production path.
- Client holds credentials for every MCP server the system uses; Studio credentials revoked.
- Deployment target transferred to client’s org (Vercel team, GCP project).
- Repo ownership transferred or mirrored per contract.
- Post-mortem filed for every SEV-2+ event that occurred during the sprint.
- One candidate pattern nominated for USP library inclusion.