Document: Agent Runtime

Classification: Internal

Version: v1.0 · 2026.04

Owner: Agent Engineer

Studio Ops Kit · L3 Build & Runtime · Internal

Agent Runtime.

The technical substrate on which Umbra Studio builds and runs agent systems. Claude API and model tier selection, the Agent SDK, MCP server inventory and patterns, prompt versioning, state-machine infrastructure, observability, deployment targets, and cost caps. This is Layer 3 of the stack map — the machinery between a spec and a production run.

Paired with the Lighthouse Agent Spec Sheet template (USO-LH-11) and the Governance Runbook (USO-LH-07). Owned by the Agent Engineer; read by everyone who touches agent code, specs, or deploys.

§01

How to use this doc.

Audience · reading path · what this doc does not cover

This is the L3 operating manual. It is read front-to-back once per quarter and consulted point-by-point during every sprint. It tells the Agent Engineer which model to call, where state lives, how prompts are versioned, how observability lands on the state sheet, and which deployment target to use. It is the first doc to update when the underlying tooling changes — Anthropic ships a new model, a new MCP server enters the inventory, a workflow graduates from Sheets to a real database.

This doc is not a tutorial on Claude or the API — Anthropic’s documentation is the canonical reference for that. This is the Studio’s opinionated slice: how we choose, what we pair things with, and what we refuse to do.

What this doc covers · what it does not

Covered here	Lives elsewhere
Claude model selection (Opus / Sonnet / Haiku)	The actual prompt content for a given agent — lives in the client repo & agent spec
Agent SDK usage patterns we standardize on	API reference & SDK changelog (Anthropic docs)
MCP server inventory & “build vs. buy” rules	Internal knowledge about a client’s specific integration — client runbook
Google Sheets + Apps Script as state machine	When to hand off governance to the client — Handoff Playbook (USO-LH-06)
Prompt versioning & regression testing	Prompt engineering craft — Anthropic’s prompt engineering guide
Observability (what to log, where)	The six governance components in depth — Governance Operations (USO-LH-07)
Deployment targets & when to use each	Vercel / GCP provider docs
Cost caps & rate limits we enforce	Budget envelope & per-sprint variable — Master (USO-ST-01) §04

§ Principle refresh

L3 decisions inherit every principle from the master. When a choice here contradicts P-01 (own primitives, rent commodities), P-04 (governance is a column), or P-05 (reversible by default), the principle wins.

§02

Runtime at a glance.

Three layers · same substrate across every sprint

Every Lighthouse Sprint deploys the same shape, regardless of the client’s domain. There are three runtime layers and they never vary: the model layer, the orchestration layer, and the state layer. Clients change; wrapper shapes change; this does not.

Model · L3 / top

Model layer.

Where reasoning happens. Claude API, three tiers. All agent decisions, extractions, and synthesis run here. No reasoning gets embedded in glue code.

Orchestration · L3 / mid

Orchestration layer.

Where the model calls tools, loops, and hands control around. Claude Agent SDK + MCP servers + a small, boring harness of our own. This layer is where we earn the right to own patterns.

State · L3 / bottom

State layer.

Where the workflow lives between runs. Google Sheet + Apps Script, until a named pressure graduates the workflow to Airtable or a real database. USP-009 governs when to move.

The default sprint runtime — what gets deployed on every engagement

Component	Default	Purpose	When we swap
Model	Claude Opus / Sonnet / Haiku	Reasoning for every agent run	Never swap the vendor; do swap the tier per run.
Orchestration	Claude Agent SDK (TypeScript)	Tool calls, loops, safety rails	Raw-API harness only for throwaway scripts.
Tools	MCP servers (official first, custom second)	External system access (Slack, Notion, Sheets, HTTP, etc.)	Custom MCP when no official server exists.
State	Google Sheets + Apps Script	Workflow queue, governance columns, audit log	Airtable / Supabase when the §05 graduation criteria fire.
Code hosting	GitHub private repo per client	Source, review, CI	Never mixed across clients.
Deployment	Vercel Functions (simple) · Cloud Run (long-running)	Where agent code runs	Per §08 criteria.
Secrets	1Password + Vercel / GCP secret manager	API keys, client creds	Never env vars committed to repo.
Observability	State sheet columns + Cloud Logging + email alerts	Monitoring, alerting, audit trail	External SaaS only if client-mandated.

§ Opinionated default

A new sprint starts with this stack already assumed. We only deviate if a sprint produces a written exception citing a named pressure. Deviations are one-per-sprint, not one-per-agent.

§03

Model tiers.

Opus · Sonnet · Haiku · when each one · how to decide at spec-write time

Every agent spec declares a default model tier in its front matter. That choice is reviewed at the Redesign gate. Tier is the single largest cost and latency lever in the entire runtime — the wrong default blows the variable-cost line within a week.

Tier A · Opus 4.6

Planning & synthesis.

Highest reasoning ceiling. Use for workflows where the model must hold a lot in its head at once and produce a single high-stakes output. Slower. Most expensive.

Best for · spec drafting · ICPs · post-mortems · research synthesis

Tier B · Sonnet 4.6

Execution & agents.

Default for most production agent runs. Strong reasoning with usable latency and reasonable cost. Handles structured output, tool calls, long chains. The Studio’s default answer when an engineer is not sure.

Best for · end-to-end agents · orchestration · reviewers · most client work

Tier C · Haiku 4.5

Routing & guards.

Fast and cheap. Use for tight-loop, high-volume calls: classification, routing, format checks, pre-condition guards (USP-014). If you are about to call Opus 10 times in sequence, a Haiku router in front of it is usually the right move.

Best for · classifiers · routers · format-check guards · sanity filters

Decision table — which tier at spec-write time

If the agent…	Default tier	Why
Runs at most a few times per week	Sonnet	Cost pressure is low; reasoning headroom matters more.
Makes dozens of model calls per run (fan-out inside one invocation)	Haiku for the fan-out, Sonnet for the orchestrator	Cost multiplies with fan-out.
Writes a document a human will read	Sonnet (default) or Opus (stakes)	Quality bar matters; not cost-bound.
Classifies / routes / picks one-of-N	Haiku	Classification does not need Sonnet’s reasoning.
Produces structured JSON from messy input	Sonnet	Has the judgment for schema repair; Haiku is often too brittle.
Holds > 40 pages of context	Opus	Long-context reasoning is a reason to pay for Tier A.
Needs to be audited by a reviewer agent	Sonnet for both	Matched-tier reviewers catch each other’s drift; mismatched ones thrash.
Runs inside a pre-condition guard (USP-014)	Haiku	Guards run constantly; cost pressure is real.
Writes code a human will deploy	Opus	The one place we refuse to trade quality for cost.
Synthesizes interview transcripts into themes	Opus	Planning/synthesis territory.
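The table reads naturally as a lookup that runs once at spec-write time. A minimal sketch follows; the `AgentTraits` field names and the precedence between rows are assumptions for illustration, not part of the spec template.

```typescript
type Tier = "haiku" | "sonnet" | "opus";

// Hypothetical trait flags, one per group of rows in the decision table.
interface AgentTraits {
  writesDeployableCode?: boolean;
  synthesizesTranscripts?: boolean;
  highStakesDocument?: boolean;
  contextPages?: number;
  isGuard?: boolean;       // pre-condition guard (USP-014)
  isClassifier?: boolean;  // classifies, routes, or picks one-of-N
}

// Assumed precedence: Opus rows first, then Haiku rows, otherwise the Sonnet default.
function defaultTier(t: AgentTraits): Tier {
  if (t.writesDeployableCode || t.synthesizesTranscripts || t.highStakesDocument) return "opus";
  if ((t.contextPages ?? 0) > 40) return "opus";
  if (t.isGuard || t.isClassifier) return "haiku";
  return "sonnet";
}
```

The fan-out row is deliberately left out: it assigns two tiers to one invocation, so it does not fit a single return value.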

§ Tier escalation

If a Haiku agent fails its evals, escalate to Sonnet first rather than spending a week tweaking prompts. If a Sonnet agent misses the quality bar in production, escalate to Opus as a diagnostic: if Opus succeeds, the task was under-tiered; if Opus also fails, it is a spec problem, not a model problem.

Cost-to-latency-to-quality intuition — rough numbers

Tier	Relative cost	Typical latency	Quality ceiling
Haiku 4.5	1×	~1–3 s	Good-enough for structured tasks; brittle on ambiguity.
Sonnet 4.6	~5–10×	~3–8 s	Studio default — rarely the wrong choice.
Opus 4.6	~25–50×	~6–20 s	Highest reasoning; stop and ask whether it’s needed.

Actual numbers depend on prompt length, output length, tool-call fan-out, and Anthropic’s current pricing — check the docs before rebuilding a variable-cost model. The shape of the tradeoff is what matters for spec-write decisions.
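To make the shape of the tradeoff concrete, here is a rough worked example of the fan-out case. It uses the low end of the relative-cost ranges in the table as multipliers and assumes every call consumes a similar number of tokens; neither assumption reflects current pricing or real runs.

```typescript
// Relative cost per call in "Haiku units". Low end of the rough ranges above,
// not Anthropic pricing.
const RELATIVE_COST = { haiku: 1, sonnet: 5, opus: 25 } as const;
type TierName = keyof typeof RELATIVE_COST;

// One orchestrator call plus `workerCalls` fan-out calls, all of similar size.
function fanOutCost(orchestrator: TierName, worker: TierName, workerCalls: number): number {
  return RELATIVE_COST[orchestrator] + RELATIVE_COST[worker] * workerCalls;
}

const allSonnet = fanOutCost("sonnet", "sonnet", 30); // 5 + 5 * 30 = 155 units
const mixed = fanOutCost("sonnet", "haiku", 30);      // 5 + 1 * 30 = 35 units
```

Under these assumptions the Haiku fan-out costs less than a quarter of the all-Sonnet run, which is why the decision table splits the tiers.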

§04

MCP servers.

Inventory · build-vs-buy rules · scope · where each server runs

MCP is how agents reach the world. A Studio agent almost never calls an external API directly — it calls a Model Context Protocol server that wraps the API with a tool interface, an auth scope, and observable boundaries. This section names the servers we rely on, the rules for when to build a custom one, and where they run.

Standard inventory — servers we reach for first

Server	Purpose	Source	Auth
slack	Post messages, read threads, react, scheduled sends	Official	OAuth, per workspace
gmail	Read, search, draft, label	Official	OAuth, per account
google-calendar	Read / create / move events	Official	OAuth, per account
google-drive	Read doc/sheet/slide metadata & content	Official	OAuth, per account
sheets-state	Our own wrapper over Sheets API — state-machine primitives (lock row, write status, append audit)	Custom (Studio)	Service account
notion	Page create/edit, database query	Official	Integration token
linear	Issue create/update, cycle read	Official	API key
github	PR, issue, branch, file	Official	PAT or app
fetch	Constrained HTTP GET with allow-list of hosts	Official	None (host allow-list only)
wp-publish	WordPress draft / publish — used by Indietheka pipelines	Custom (Studio)	App password
album-art	Cover-art fetch (Album of the Year lookup)	Custom (Studio)	None
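The host allow-list on the fetch server is the only control it has, so the check is worth spelling out. The sketch below is an assumption about how such a check could look, not the server's actual code; the host names are placeholders.

```typescript
// Hypothetical allow-list; in practice it would be read from the config tab.
const ALLOWED_HOSTS: readonly string[] = ["api.example.com", "docs.example.com"];

function parseUrl(raw: string): URL | null {
  try {
    return new URL(raw);
  } catch (_err) {
    return null; // not a parseable absolute URL
  }
}

// True only for https URLs whose exact hostname is on the allow-list.
function isAllowedUrl(raw: string, allowed: readonly string[] = ALLOWED_HOSTS): boolean {
  const parsed = parseUrl(raw);
  if (parsed === null) return false;
  return parsed.protocol === "https:" && allowed.indexOf(parsed.hostname) !== -1;
}
```

Matching the exact hostname, rather than a prefix or suffix, keeps look-alike hosts such as `api.example.com.evil.io` off the list.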

Build vs. buy — when we write a custom MCP server

We default to an official or community-published MCP server whenever one exists and fits. We write a custom server only when the task is load-bearing enough to own or when no public server covers it.

No server exists. The target system has no official or community MCP, and wrapping the API is cheaper than having a human do the task by hand every day.
The existing server is too broad. Official servers often expose hundreds of tools; an agent works best with 5–15 scoped tools. A custom wrapper pares down.
We need Studio-specific primitives on top. Example: sheets-state exists because raw Sheets access does not have “lock row + update status + append audit” as a single atomic operation.
Audit / observability requirements are not met. If we need to log every call with a correlation ID and the existing server will not, we wrap it.
Three sprints have reused the custom wrapper we built ad hoc. Promote to a shared internal MCP server in the Studio catalog.

§ Anti-pattern

Do not build a custom MCP server for a one-off. If an agent will use a tool twice in its life, put the HTTP call in a single TypeScript helper and move on. Custom MCP servers have a maintenance tax — only pay it for load-bearing integrations.

Where MCP servers run

Pattern A

Co-located with the agent.

Default

The MCP server ships in the same repo and runs on the same host as the agent harness, typically as a stdio-transport server that the harness launches as a child process. Simple; no network hop; easy to debug.

Use when

The server is lightweight and stateless.
Only one agent uses it.
The agent runs on a single deployment target.

Pattern B

Remote HTTP MCP.

For shared or multi-agent

The MCP server runs as its own HTTP service — Cloud Run, Vercel Function, or a long-running container. Multiple agents across multiple repos connect over HTTP(S).

Use when

Two or more agents need the same tool surface.
The server has meaningful state or caches.
Credentials are scoped to the server, not the agents.

Pattern C

Client-side MCP.

For human-in-the-loop only

The MCP server runs on a human operator’s machine — e.g., Claude Desktop with a local MCP server for file system access or Apple Notes. Production agents do not use this pattern.

Use when

A human is driving the session interactively.
The tool requires access to local-only resources (clipboard, native app state).
The risk envelope makes remote execution undesirable.
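The three patterns reduce to a short decision at spec-write time. A sketch, with hypothetical field names standing in for the "use when" bullets:

```typescript
type McpPlacement = "co-located" | "remote-http" | "client-side";

// Hypothetical profile of an MCP server, mirroring the "use when" lists above.
interface McpServerProfile {
  humanDriven: boolean;             // a human operator drives the session interactively
  consumerAgents: number;           // how many agents need this tool surface
  holdsStateOrCache: boolean;
  serverScopedCredentials: boolean; // credentials belong to the server, not the agents
}

function placementFor(p: McpServerProfile): McpPlacement {
  if (p.humanDriven) return "client-side"; // Pattern C, never for production agents
  if (p.consumerAgents >= 2 || p.holdsStateOrCache || p.serverScopedCredentials) {
    return "remote-http";                  // Pattern B
  }
  return "co-located";                     // Pattern A, the default
}
```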

§05

State-machine infrastructure.

USP-009 in practice · Sheets + Apps Script as the default substrate · graduation path

The spreadsheet-as-state-machine pattern (USP-009) is not a nostalgia play. Google Sheets + Apps Script is the Studio’s default workflow substrate because it gets four things right that bespoke databases get wrong on sprint timelines: the client can see the state live, atomic-enough row locking exists, audit history is free, and rollback is a copy-paste.

The canonical sheet shape — six tabs, every time

Tab	Purpose	Who writes
queue	Live work items — each row is one workflow instance; status column drives agent selection	Humans + agents
config	Flags, thresholds, cost caps, allow-lists. Agents read; humans edit.	Humans only
audit	Append-only. Every agent run appends a row with correlation ID, agent name, model, tokens, dollars, outcome.	Agents only
alerts	Active alerts (severity, first-seen, last-ack). Humans clear rows.	Agents write, humans clear
metrics	Rolling counters: runs today, errors today, dollars today, dollars MTD.	Agents only
readme	One-page README explaining the sheet. For the client. Never empty.	Humans only

Row-locking & concurrency discipline

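Apps Script has no native row-level lock: LockService locks are script-wide or document-wide. The Studio’s sheets-state MCP server therefore builds its lock-row primitive from LockService.getDocumentLock() plus a lock_token column. The document lock serializes the read-check-write on the token cell; the token is what holds the row between calls. Every agent that claims a row writes its token; every release clears it. Stale tokens older than a timeout are reaped by a janitor trigger.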

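// canonical row-claim pattern (Apps Script)
// TOKEN_COL and STALE_MS are illustrative values; set them per sheet.
const TOKEN_COL = 8;              // 1-based column index of lock_token
const STALE_MS = 10 * 60 * 1000;  // tokens older than this may be reclaimed

// Tokens look like "agentId:epochMillis"; unparseable tokens count as stale.
function isStale(token) {
  const claimedAt = Number(String(token).split(':').pop());
  return !isFinite(claimedAt) || Date.now() - claimedAt > STALE_MS;
}

// Returns the new token on success, or null if another agent holds the row.
function claimRow(sheet, rowIdx, agentId) {
  const lock = LockService.getDocumentLock();
  lock.waitLock(5000);            // throws if the lock is not acquired within 5 s
  try {
    const cell = sheet.getRange(rowIdx, TOKEN_COL);
    const token = cell.getValue();
    if (token && !isStale(token)) return null;
    const newToken = `${agentId}:${Date.now()}`;
    cell.setValue(newToken);
    SpreadsheetApp.flush();       // commit the write before releasing the lock
    return newToken;
  } finally {
    lock.releaseLock();
  }
}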
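The janitor side of the pattern can be sketched as a pure function over the lock_token column. This is an illustration on the harness side in TypeScript, not the Studio's Apps Script janitor; the ten-minute timeout and the assumption that row 1 is a header are placeholders.

```typescript
// Assumed timeout; the text above only says "older than a timeout".
const STALE_AFTER_MS = 10 * 60 * 1000;

// Tokens are written as `${agentId}:${epochMillis}`; anything unparseable counts as stale.
function isStaleToken(token: string, nowMs: number, maxAgeMs: number = STALE_AFTER_MS): boolean {
  const claimedAt = Number(token.slice(token.lastIndexOf(":") + 1));
  if (!isFinite(claimedAt)) return true;
  return nowMs - claimedAt > maxAgeMs;
}

// Returns the 1-based sheet rows whose lock_token should be cleared.
// tokens[i] is the lock_token cell of sheet row i + firstRow.
function rowsToReap(tokens: string[], nowMs: number, firstRow: number = 2): number[] {
  const stale: number[] = [];
  tokens.forEach((token, i) => {
    if (token !== "" && isStaleToken(token, nowMs)) stale.push(i + firstRow);
  });
  return stale;
}
```

Passing `nowMs` in, rather than calling `Date.now()` inside, keeps the reaping rule testable without a live sheet.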

Graduation path — when to leave Sheets

Graduation is a move to Airtable (hosted, richer API, still visible to humans) or Supabase / Postgres (real DB, structured schemas, SQL, RLS). One of the five pressures in the master (USO-ST-01 §07) must apply. We do not graduate because a spreadsheet “feels unprofessional.”

Step 1

Airtable as intermediate.

Most common first graduation

Airtable keeps the “humans can see the state” property while giving better concurrency and an API that does not suffer from Apps Script’s execution-time limits. Most workflows stay here.

Graduate to Airtable when

Queue tab exceeds ~5,000 rows and interactive editing becomes slow.
More than 3 agents need to write concurrently.
A trigger-driven flow has hit Apps Script quota limits.

Alternatives to Airtable

Notion databases — if the client already lives in Notion, acceptable substitute.
Baserow / NocoDB — open-source alt if data residency is a requirement.

Step 2

Supabase / Postgres.

Terminal graduation

When the workflow has outgrown what a human-editable table can hold — strict schemas, constraints, transactional semantics, row-level security, real query needs. The state machine loses its “client can see it live” property unless we pair it with a thin dashboard (a Lovable or Retool page pointed at the DB).

Graduate to Supabase when

You need real foreign-key constraints and multi-table transactions.
Query latency must be sub-second.
Compliance mandates data residency, row-level security, or audit at the DB level.
Volume is past ~100k rows and growing.

Alternatives to Supabase

Neon — Postgres with branching; nicer for dev / preview environments.
PlanetScale — MySQL-compatible if the client is in that ecosystem.
Turso — edge SQLite if the workload is read-heavy and distributed.
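The thresholds in the two steps can be collected into one advisory check. A sketch with hypothetical signal names; it only suggests a target, and a named pressure from the master still has to apply before anything moves.

```typescript
type Substrate = "sheets" | "airtable" | "supabase";

// Hypothetical signal names; the thresholds are the ones listed in Steps 1 and 2.
interface StateSignals {
  queueRows: number;
  concurrentWriterAgents: number;
  hitAppsScriptQuota: boolean;
  needsForeignKeysOrTransactions: boolean;
  needsSubSecondQueries: boolean;
  complianceNeedsDbLevelControls: boolean;
}

function recommendedSubstrate(s: StateSignals): Substrate {
  if (
    s.needsForeignKeysOrTransactions ||
    s.needsSubSecondQueries ||
    s.complianceNeedsDbLevelControls ||
    s.queueRows > 100000
  ) {
    return "supabase";
  }
  if (s.queueRows > 5000 || s.concurrentWriterAgents > 3 || s.hitAppsScriptQuota) {
    return "airtable";
  }
  return "sheets";
}
```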

§ Reversal rule

Every graduation must have a documented reversal path. If Supabase vanishes tomorrow, the queue table exports back to a Sheet in under 30 minutes. Reversibility is cheap to design up front, expensive to retrofit.

§06

Prompt versioning.

Where prompts live · how we version them · regression discipline

Prompts are code. We treat them like code — they live in the repo, they version with semver, they have tests, they land through PRs. The Studio’s refusal to let prompts live in a prompt-store SaaS is deliberate: vendor lock-in on the most valuable artifact in an agent system is unacceptable (P-01).

File layout · one prompt per file, one file per version

// repo layout
prompts/
  album-reviewer/
    v1.0.0.md        // frozen once promoted
    v1.1.0.md
    v1.1.1.md
    CURRENT.md       // symlink -> active version
  seo-brief/
    v2.0.0.md
    CURRENT.md
evals/
  album-reviewer/
    cases/
      golden-001.md
      golden-002.md
  runner.ts

Each versioned file has front matter: model_tier, temperature, max_tokens, tools, created_at, author, changelog. The body is the prompt itself — in Markdown, with sectioned instructions and examples as headings.
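A typed view of that front matter makes CI checks straightforward. The sketch below assumes the front matter has already been parsed into an object; the example values are invented, and the 0 to 1 temperature range is the one Anthropic documents for the Messages API.

```typescript
interface PromptFrontMatter {
  model_tier: "haiku" | "sonnet" | "opus";
  temperature: number;
  max_tokens: number;
  tools: string[];
  created_at: string; // assumed to be an ISO 8601 date
  author: string;
  changelog: string;
}

// Returns a list of human-readable problems; an empty list means the file is acceptable.
function frontMatterProblems(fm: PromptFrontMatter): string[] {
  const problems: string[] = [];
  if (fm.temperature < 0 || fm.temperature > 1) problems.push("temperature must be between 0 and 1");
  if (fm.max_tokens <= 0 || Math.floor(fm.max_tokens) !== fm.max_tokens) {
    problems.push("max_tokens must be a positive integer");
  }
  if (isNaN(Date.parse(fm.created_at))) problems.push("created_at must be a parseable date");
  if (fm.author.trim() === "" || fm.changelog.trim() === "") {
    problems.push("author and changelog must not be empty");
  }
  return problems;
}

// Invented example values, for illustration only.
const exampleFrontMatter: PromptFrontMatter = {
  model_tier: "sonnet",
  temperature: 0.2,
  max_tokens: 4096,
  tools: ["sheets-state"],
  created_at: "2026-04-21",
  author: "agent-engineer",
  changelog: "Clarified the scoring rubric.",
};
```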

The CURRENT.md symlink is what the agent runtime loads. Rolling back is ln -sf v1.1.0.md CURRENT.md and a deploy. This is the cheapest rollback we have.

Semver discipline for prompts

Change	Bump	Example
Fix a typo, clarify a sentence	patch	1.1.0 → 1.1.1
Add a new section or capability without removing any	minor	1.1.1 → 1.2.0
Remove or restructure instructions, change output schema, change model tier	major	1.2.0 → 2.0.0
Swap model vendor (we don’t, but hypothetically)	major	always major
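The bump rules are mechanical enough to script. A minimal sketch; the `ChangeKind` labels are hypothetical shorthand for the table rows, not names the Studio uses.

```typescript
type Bump = "patch" | "minor" | "major";

// Hypothetical change labels, one per row of the table above.
type ChangeKind = "typo-or-clarification" | "additive-section" | "breaking-or-tier-change" | "vendor-swap";

const BUMP_FOR: Record<ChangeKind, Bump> = {
  "typo-or-clarification": "patch",
  "additive-section": "minor",
  "breaking-or-tier-change": "major",
  "vendor-swap": "major",
};

// Computes the next version string for a MAJOR.MINOR.PATCH triple.
function bumpVersion(version: string, bump: Bump): string {
  const match = /^(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (!match) throw new Error(`not a semver triple: ${version}`);
  const major = Number(match[1]);
  const minor = Number(match[2]);
  const patch = Number(match[3]);
  if (bump === "major") return `${major + 1}.0.0`;
  if (bump === "minor") return `${major}.${minor + 1}.0`;
  return `${major}.${minor}.${patch + 1}`;
}
```

For example, `bumpVersion("1.1.1", BUMP_FOR["additive-section"])` yields `1.2.0`, matching the minor row.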

Regression discipline — evals against golden cases

Every prompt with ≥ 10 production runs has at least five golden cases in evals/<agent>/cases/. A golden case is an input + an expected shape of output (not a string match — a judge-prompt or a structural check). The eval runner replays every golden case against CURRENT.md and reports pass/fail.

Rules: major bumps require every golden to pass. Minor bumps require the goldens in the affected section to pass. Patch bumps require no regressions on any golden, meaning nothing that passed on the previous version may fail. Any golden marked load-bearing in its front matter must pass at every bump.
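One way to encode these rules as a promotion gate is sketched below. It assumes each golden carries a `section` tag and a recorded result against the previous version; both are illustrative fields, and "no regressions" is read as "nothing that passed before may fail now".

```typescript
interface GoldenResult {
  id: string;
  section: string;       // assumed tag linking a golden to a prompt section
  loadBearing: boolean;  // from the golden's front matter
  passedBefore: boolean; // result against the previous version
  passesNow: boolean;    // result against the candidate version
}

function gatePasses(
  bump: "patch" | "minor" | "major",
  results: GoldenResult[],
  affectedSections: string[] = []
): boolean {
  // Load-bearing goldens must pass at every bump.
  if (results.some(r => r.loadBearing && !r.passesNow)) return false;
  if (bump === "major") return results.every(r => r.passesNow);
  if (bump === "minor") {
    return results
      .filter(r => affectedSections.indexOf(r.section) !== -1)
      .every(r => r.passesNow);
  }
  // Patch: anything that passed before must still pass.
  return results.every(r => !r.passedBefore || r.passesNow);
}
```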

§ The hard rule

A promoted prompt version file is immutable. If v1.1.0 has a bug, create v1.1.1. Never edit v1.1.0 in place — rollback depends on the previous version still behaving like it did yesterday.

§07

Observability.

What lands where · log schema · why the state sheet is still the dashboard

Observability is P-04 in action: governance is a column, not a tool. Every agent run produces signals that land in one of three places — the state sheet, the cloud logs, and an alert channel. We instrument every agent against this three-target discipline; nothing gets promoted to production without it.

The three observability targets

Target 1 · Primary

State sheet.

Every run appends a row to the audit tab. Every rollup refreshes the metrics tab. Every alert writes to alerts. The client opens one URL and sees the system.

Target 2 · Secondary

Cloud logs.

Google Cloud Logging (or Vercel logs) captures raw request/response, stack traces, and latencies that do not belong in a sheet. Queryable by correlation ID. For engineers, not for clients.

Target 3 · Push

Alert channel.

Email (primary) + Slack DM (secondary) for SEV-1/2 conditions. Silent SEV-3 rows append to the alerts tab for the next review. Nothing SEV-1 is silent.

Canonical audit row schema

Field	Type	Required	Example
run_id	string	yes	2026-04-21T14:03:11.r7qk
agent	string	yes	album-reviewer
version	string	yes	1.1.1
model	string	yes	claude-sonnet-4-6
input_tokens	int	yes	4218
output_tokens	int	yes	1903
usd	decimal	yes	0.0382
duration_ms	int	yes	5441
outcome	enum	yes	ok · retry · fail · skip
sev	enum	if not ok	sev-1 · sev-2 · sev-3
error	string	if fail	rate_limit_429
notes	string	optional	Free text.
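The schema translates directly into a type plus a check for the two conditional fields. A sketch; the validator name and the message wording are illustrative.

```typescript
type Outcome = "ok" | "retry" | "fail" | "skip";
type Sev = "sev-1" | "sev-2" | "sev-3";

interface AuditRow {
  run_id: string;
  agent: string;
  version: string;
  model: string;
  input_tokens: number;
  output_tokens: number;
  usd: number;
  duration_ms: number;
  outcome: Outcome;
  sev?: Sev;      // required when outcome is not ok
  error?: string; // required when outcome is fail
  notes?: string;
}

// Returns a list of problems; an empty list means the row can be appended.
function auditRowProblems(row: AuditRow): string[] {
  const problems: string[] = [];
  if (row.outcome !== "ok" && !row.sev) problems.push("sev is required when outcome is not ok");
  if (row.outcome === "fail" && !row.error) problems.push("error is required when outcome is fail");
  if (row.input_tokens < 0 || row.output_tokens < 0 || row.usd < 0 || row.duration_ms < 0) {
    problems.push("numeric fields must not be negative");
  }
  return problems;
}
```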

§ The rule

No audit row, no production run. Our harness refuses to complete an invocation without writing its audit row. We would rather fail loud than run silent.

§08

Deployment targets.

Where agent code runs · when each target is right

Three runtime targets handle every workload we build. Each has a narrow niche. Putting a workload on the wrong target is the second most common cause of cost drift, after model-tier choice.

Target A

Vercel Functions.

Default for agents & webhooks

Short-lived, stateless, HTTP-triggered. Co-located with web properties. Trivially cheap. The Studio’s first reach.

Use when

Agent invocations complete in under ~60 seconds.
Triggered by HTTP, cron, or webhook.
Stateless between runs (state lives on the sheet / DB).

Alternatives

Cloudflare Workers — lower latency edge, tighter resource ceilings.
Deno Deploy — if a workload is strictly Deno-native.

Target B

Google Cloud Run.

For long-running or heavy

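Container-based, scales to zero, request timeouts configurable well past a minute, multi-GiB memory per instance, HTTP or Pub/Sub triggers. Local disk writes land on an in-memory filesystem and count against instance memory. Exact ceilings change; check the current Cloud Run limits before sizing. The Studio’s pick when Vercel’s 60-second ceiling is not enough.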

Use when

Agent invocations may run several minutes (Opus on long contexts; fan-out chains).
Custom MCP servers running as HTTP services.
Workloads needing more than ~1 GB memory.

Alternatives

Fly.io — similar niche; better for persistent-connection workloads.
AWS Lambda + Fargate — if the client already lives in AWS.
Render / Railway — simpler DX at the cost of some observability depth.

Target C

Apps Script.

Only for sheet-bound triggers

Google-owned, runs inside the sheet, perfect for cheap triggers. Never put reasoning here — Apps Script is glue, not compute. A trigger calls out to a Vercel Function or Cloud Run service that calls Claude.

Use when

You need an onEdit / onFormSubmit / time-based trigger to bootstrap an agent run.
A thin web-app endpoint is needed to accept writes back into the sheet.
Cheap cron inside the sheet’s own quota envelope.

Alternatives

Zapier / Make — if a non-developer client owns the trigger fabric.
Vercel cron — when the trigger is not sheet-anchored.

§ Target discipline

Default to Vercel Functions. Escalate to Cloud Run only when a specific constraint bites. Use Apps Script only for the last-inch trigger. Two targets is the norm; three is a warning sign.
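The target discipline can be written down as a small function for spec review. A sketch; the `Workload` fields are hypothetical and the thresholds are the approximate ones from the "use when" lists above.

```typescript
type DeployTarget = "vercel-functions" | "cloud-run" | "apps-script";

// Hypothetical workload description.
interface Workload {
  sheetBoundTrigger: boolean;  // onEdit / onFormSubmit / in-sheet cron that only bootstraps a run
  maxRuntimeSeconds: number;
  memoryGb: number;
  hostsHttpMcpServer: boolean;
}

function targetFor(w: Workload): DeployTarget {
  if (w.sheetBoundTrigger) return "apps-script"; // glue only; reasoning runs elsewhere
  if (w.maxRuntimeSeconds > 60 || w.memoryGb > 1 || w.hostsHttpMcpServer) return "cloud-run";
  return "vercel-functions";                     // the default
}
```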

§09

Cost caps & rate limits.

Per-run · per-day · per-sprint · the circuit breakers that keep us safe

Runaway agent loops are the single most expensive accidental behavior in the Studio’s runtime. Every agent is wrapped in three levels of cap: per-run, per-day, per-sprint. When any cap fires, the harness halts or pauses the affected scope and records the breach at the severity listed for that cap below. Caps are non-negotiable.

Cap	Scope	Default	What happens on breach
Tokens per run	Single invocation	max 200k in + 32k out	Harness aborts; audit row with outcome=fail, error=token_cap.
Dollars per run	Single invocation	$3.00 (Sonnet) · $12.00 (Opus)	Harness aborts; audit row + SEV-3 alert.
Tool calls per run	Single invocation	40	Harness aborts; usually indicates agent is stuck in a loop.
Dollars per agent per day	Agent-scoped rolling 24 h	$50 default, per-spec override	Pause new invocations; existing runs complete; SEV-2 alert.
Dollars per sprint total	All agents in a client repo	Variable envelope from Master (USO-ST-01) §04 × 1.25	Pause all agents; SEV-1 alert to Sprint Lead; manual release required.
Concurrent runs per agent	Agent-scoped	3	Queue incoming; reject after queue depth 10.
API rate limit (Anthropic)	Account-wide	Per tier	Honor 429; exponential backoff with jitter; SEV-3 if sustained > 5 min.
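For the 429 row, "exponential backoff with jitter" can be sketched as follows. This uses the full-jitter variant (a uniform delay below an exponentially growing ceiling); the one-second base and sixty-second cap are assumed defaults, not values the Studio has fixed.

```typescript
// Full-jitter backoff: delay is uniform in [0, min(capMs, baseMs * 2^attempt)).
// `random` is injectable so the schedule can be tested deterministically.
function backoffDelayMs(
  attempt: number,
  random: () => number = Math.random,
  baseMs: number = 1000,
  capMs: number = 60000
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}
```

With `random` fixed at 0.5, attempts 0 through 3 wait 500, 1000, 2000, and 4000 ms. The cap keeps late retries from drifting past a minute, and sustained failure past five minutes is what raises the SEV-3.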

§ Why caps fire

A cap that never fires is probably too loose. A cap that fires weekly on the same agent is a spec problem, not a cap problem. Caps are diagnostic tools, not just safety rails — read breaches as signal.

Circuit-breaker pattern — the harness enforces; the sheet records

The harness checks caps before each API call and after each tool call. On breach, it halts, writes the audit row with outcome=fail and an error that names the cap, appends a row to alerts, and sends the email for SEV-1/2 breaches (SEV-3 rows stay silent on the alerts tab). A human clears the alert only after the root cause is understood; caps do not auto-reset.
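The per-run half of that check might look like the sketch below. The cap values are the Sonnet defaults from the table; `token_cap` is the error name used there, while `dollar_cap` and `tool_call_cap` are hypothetical names chosen to match it.

```typescript
interface RunCaps {
  maxInputTokens: number;
  maxOutputTokens: number;
  maxUsd: number;
  maxToolCalls: number;
}

interface RunUsage {
  inputTokens: number;
  outputTokens: number;
  usd: number;
  toolCalls: number;
}

// Per-run defaults from the caps table for a Sonnet agent.
const SONNET_RUN_CAPS: RunCaps = {
  maxInputTokens: 200000,
  maxOutputTokens: 32000,
  maxUsd: 3.0,
  maxToolCalls: 40,
};

// Returns the name of the first breached cap, or null when the run may continue.
function breachedCap(usage: RunUsage, caps: RunCaps): string | null {
  if (usage.inputTokens > caps.maxInputTokens || usage.outputTokens > caps.maxOutputTokens) {
    return "token_cap";
  }
  if (usage.usd > caps.maxUsd) return "dollar_cap";
  if (usage.toolCalls > caps.maxToolCalls) return "tool_call_cap";
  return null;
}
```

The harness would call this with running totals before each API call and after each tool call, and put a non-null result straight into the audit row's error field.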

§10

Agent Engineer checklist.

Pre-sprint · during build · at Gate 3 · at Gate 4

Pre-sprint (before Day 1)

Claude API key provisioned for the client’s scope; billing alarm at 50% / 80% / 100% of sprint envelope.
GitHub private repo created; CI configured; CURRENT.md symlink pattern scaffolded.
State sheet template forked; six canonical tabs in place; README populated.
sheets-state MCP server connected; lock-row smoke test passes.
Deployment target chosen (Vercel vs. Cloud Run) per §08 criteria.

During Build (Weeks 5–9)

Each new agent has a spec with explicit model_tier declaration before first run.
Each agent has at least five golden eval cases before Week 7.
Cost caps configured per §09 before the first production invocation.
Audit row schema validated end-to-end for every agent (no silent runs).
Weekly review of the metrics tab — variable cost tracking within the envelope.

At Gate 3 (Production Readiness)

All agents green on their golden evals at CURRENT.md.
Cost caps firing cleanly in a drill (deliberately trigger one; confirm harness halts and alert fires).
Rollback drill: ln -sf v-prev.md CURRENT.md, deploy, verify previous behavior. Under 15 minutes.
Secrets rotation: every key used in sprint was created or rotated inside the sprint window; nothing inherited from before.
Observability signoff: three targets confirmed populating for every agent.

At Gate 4 (Independence Verified)

Client holds their own Anthropic account; Studio key revoked from production path.
Client holds credentials for every MCP server the system uses; Studio credentials revoked.
Deployment target transferred to client’s org (Vercel team, GCP project).
Repo ownership transferred or mirrored per contract.
Post-mortem filed for every SEV-2+ event that occurred during the sprint.
One candidate pattern nominated for USP library inclusion.

§ Bottom line

An Agent Engineer who keeps this checklist will not need heroics. An Agent Engineer who skips it will spend the last week of every sprint fighting fires the checklist would have prevented. The checklist is the discipline that makes the runtime boring — which is the goal.