Document: Agent Runtime
Classification: Internal
Version: v1.0 · 2026.04
Owner: Agent Engineer
Studio Ops Kit · L3 Build & Runtime · Internal

Agent Runtime.

The technical substrate on which Umbra Studio builds and runs agent systems. Claude API and model tier selection, the Agent SDK, MCP server inventory and patterns, prompt versioning, state-machine infrastructure, observability, deployment targets, and cost caps. This is Layer 3 of the stack map — the machinery between a spec and a production run.

Paired with the Lighthouse Agent Spec Sheet template (USO-LH-11) and the Governance Runbook (USO-LH-07). Owned by the Agent Engineer; read by everyone who touches agent code, specs, or deploys.

§01

How to use this doc.

Audience · reading path · what this doc does not cover

This is the L3 operating manual. It is read front-to-back once per quarter and consulted point-by-point during every sprint. It tells the Agent Engineer which model to call, where state lives, how prompts are versioned, how observability lands on the state sheet, and which deployment target to use. It is the first doc to update when the underlying tooling changes — Anthropic ships a new model, a new MCP server enters the inventory, a workflow graduates from Sheets to a real database.

This doc is not a tutorial on Claude or the API — Anthropic’s documentation is the canonical reference for that. This is the Studio’s opinionated slice: how we choose, what we pair things with, and what we refuse to do.

What this doc covers · what it does not

| Covered here | Lives elsewhere |
| --- | --- |
| Claude model selection (Opus / Sonnet / Haiku) | The actual prompt content for a given agent: lives in the client repo & agent spec |
| Agent SDK usage patterns we standardize on | API reference & SDK changelog: Anthropic docs |
| MCP server inventory & “build vs. buy” rules | Internal knowledge about a client’s specific integration: client runbook |
| Google Sheets + Apps Script as state machine | When to hand off governance to the client: Handoff Playbook (USO-LH-06) |
| Prompt versioning & regression testing | Prompt engineering craft: Anthropic’s prompt engineering guide |
| Observability (what to log, where) | The six governance components in depth: Governance Operations (USO-LH-07) |
| Deployment targets & when to use each | Vercel / GCP provider docs |
| Cost caps & rate limits we enforce | Budget envelope & per-sprint variable: Master (USO-ST-01) §04 |

§ Principle refresh
L3 decisions inherit every principle from the master. When a choice here contradicts P-01 (own primitives, rent commodities), P-04 (governance is a column), or P-05 (reversible by default), the principle wins.
§02

Runtime at a glance.

Three layers · same substrate across every sprint

Every Lighthouse Sprint deploys the same shape, regardless of the client’s domain. There are three runtime layers and they never vary: the model layer, the orchestration layer, and the state layer. Clients change; wrapper shapes change; this does not.

Model · L3 / top

Model layer.

Where reasoning happens. Claude API, three tiers. All agent decisions, extractions, and synthesis run here. No reasoning gets embedded in glue code.

Orchestration · L3 / mid

Orchestration layer.

Where the model calls tools, loops, and hands control around. Claude Agent SDK + MCP servers + a small, boring harness of our own. This layer is where we earn the right to own patterns.

State · L3 / bottom

State layer.

Where the workflow lives between runs. Google Sheet + Apps Script, until a named pressure graduates the workflow to Airtable or a real database. USP-009 governs when to move.

The default sprint runtime — what gets deployed on every engagement

| Component | Default | Purpose | When we swap |
| --- | --- | --- | --- |
| Model | Claude Opus / Sonnet / Haiku | Reasoning for every agent run | Never swap the vendor; do swap the tier per run. |
| Orchestration | Claude Agent SDK (TypeScript) | Tool calls, loops, safety rails | Raw-API harness only for throwaway scripts. |
| Tools | MCP servers (official first, custom second) | External system access (Slack, Notion, Sheets, HTTP, etc.) | Custom MCP when no official server exists. |
| State | Google Sheets + Apps Script | Workflow queue, governance columns, audit log | Airtable / Supabase when §05 pressures fire. |
| Code hosting | GitHub private repo per client | Source, review, CI | Never mixed across clients. |
| Deployment | Vercel Functions (simple) · Cloud Run (long-running) | Where agent code runs | Per §08 criteria. |
| Secrets | 1Password + Vercel / GCP secret manager | API keys, client creds | Never env vars committed to repo. |
| Observability | State sheet columns + Cloud Logging + email alerts | Monitoring, alerting, audit trail | External SaaS only if client-mandated. |

§ Opinionated default
A new sprint starts with this stack already assumed. We only deviate if a sprint produces a written exception citing a named pressure. Deviations are one-per-sprint, not one-per-agent.
§03

Model tiers.

Opus · Sonnet · Haiku · when each one · how to decide at spec-write time

Every agent spec declares a default model tier in its front matter. That choice is reviewed at the Redesign gate. Tier is the single largest cost and latency lever in the entire runtime — the wrong default blows the variable-cost line within a week.

Tier A · Opus 4.6

Planning & synthesis.

Highest reasoning ceiling. Use for workflows where the model must hold a lot in its head at once and produce a single high-stakes output. Slower. Most expensive.

Best for · spec drafting · ICPs · post-mortems · research synthesis
Tier B · Sonnet 4.6

Execution & agents.

Default for most production agent runs. Strong reasoning with usable latency and reasonable cost. Handles structured output, tool calls, long chains. The Studio’s default answer when an engineer is not sure.

Best for · end-to-end agents · orchestration · reviewers · most client work
Tier C · Haiku 4.5

Routing & guards.

Fast and cheap. Use for tight-loop, high-volume calls: classification, routing, format checks, pre-condition guards (USP-014). If you are about to call Opus 10 times in sequence, a Haiku router in front of it is usually the right move.

Best for · classifiers · routers · format-check guards · sanity filters

Decision table — which tier at spec-write time

| If the agent… | Default tier | Why |
| --- | --- | --- |
| Runs at most a few times per week | Sonnet | Cost pressure is low; reasoning headroom matters more. |
| Makes dozens of model calls per run (fan-out inside one invocation) | Haiku for the fan-out, Sonnet for the orchestrator | Cost multiplies with fan-out. |
| Writes a document a human will read | Sonnet (default) or Opus (stakes) | Quality bar matters; not cost-bound. |
| Classifies / routes / picks one-of-N | Haiku | Classification does not need Sonnet’s reasoning. |
| Produces structured JSON from messy input | Sonnet | Has the judgment for schema repair; Haiku is often too brittle. |
| Holds > 40 pages of context | Opus | Long-context reasoning is a reason to pay for Tier A. |
| Needs to be audited by a reviewer agent | Sonnet for both | Matched-tier reviewers catch each other’s drift; mismatched ones thrash. |
| Runs inside a pre-condition guard (USP-014) | Haiku | Guards run constantly; cost pressure is real. |
| Writes code a human will deploy | Opus | The one place we refuse to trade quality for cost. |
| Synthesizes interview transcripts into themes | Opus | Planning/synthesis territory. |

§ Tier escalation
If a Haiku agent fails its evals, escalate to Sonnet before spending a week tweaking prompts. If a Sonnet agent misses the quality bar in production, escalate to Opus as a diagnostic: if Opus succeeds, the task was under-tiered; if Opus also fails, it is a spec problem, not a model problem.
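
The decision table and the escalation rule above can be sketched as a small lookup. This is an illustration only: the `AgentProfile` fields and the task labels are hypothetical names invented here, not part of any spec template, and a real spec review would weigh more than these three inputs.

```typescript
// Hypothetical encoding of the decision table; field and task names are illustrative.
type Tier = "haiku" | "sonnet" | "opus";

interface AgentProfile {
  task: "classify" | "guard" | "structured-json" | "document" | "code" | "synthesis" | "general";
  contextPages?: number; // approximate pages of context the agent must hold
  highStakes?: boolean;  // only consulted for human-read documents
}

function defaultTier(p: AgentProfile): Tier {
  // Holding more than ~40 pages of context is a reason to pay for Tier A.
  if ((p.contextPages ?? 0) > 40) return "opus";
  switch (p.task) {
    case "classify":
    case "guard":
      return "haiku";
    case "code":
    case "synthesis":
      return "opus";
    case "document":
      return p.highStakes ? "opus" : "sonnet";
    default:
      return "sonnet"; // the Studio default when nothing else applies
  }
}

// Escalation moves one tier up and stops at Opus.
function escalate(t: Tier): Tier {
  return t === "haiku" ? "sonnet" : "opus";
}
```

Keeping the mapping in one function makes the tier choice reviewable at the Redesign gate as a diff rather than as a conversation.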

Cost-to-latency-to-quality intuition — rough numbers

| Tier | Relative cost | Typical latency | Quality ceiling |
| --- | --- | --- | --- |
| Haiku 4.5 | 1× (baseline) | ~1–3 s | Good-enough for structured tasks; brittle on ambiguity. |
| Sonnet 4.6 | ~5–10× | ~3–8 s | Studio default; rarely the wrong choice. |
| Opus 4.6 | ~25–50× | ~6–20 s | Highest reasoning; stop and ask whether it’s needed. |

Actual numbers depend on prompt length, output length, tool-call fan-out, and Anthropic’s current pricing — check the docs before rebuilding a variable-cost model. The shape of the tradeoff is what matters for spec-write decisions.
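
For rebuilding a variable-cost model, the arithmetic is simple enough to sketch. The prices are deliberately parameters here, not facts: `inputPerMTok` and `outputPerMTok` are hypothetical field names for the per-million-token rates, which should be read from Anthropic's current pricing before use.

```typescript
// Prices are inputs, not facts: supply current per-million-token rates yourself.
interface Price {
  inputPerMTok: number;  // USD per million input tokens (assumed unit)
  outputPerMTok: number; // USD per million output tokens (assumed unit)
}

// Cost of a single model call.
function runCostUsd(inputTokens: number, outputTokens: number, price: Price): number {
  return (inputTokens * price.inputPerMTok + outputTokens * price.outputPerMTok) / 1000000;
}

// Cost of one invocation that fans out: N worker calls plus one orchestrator call.
function fanOutCostUsd(workerCalls: number, perWorkerUsd: number, orchestratorUsd: number): number {
  return workerCalls * perWorkerUsd + orchestratorUsd;
}
```

The fan-out helper is the reason the decision table pushes fan-out work to the cheapest tier: the per-worker cost is multiplied, the orchestrator cost is not.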

§04

MCP servers.

Inventory · build-vs-buy rules · scope · where each server runs

MCP is how agents reach the world. A Studio agent almost never calls an external API directly — it calls a Model Context Protocol server that wraps the API with a tool interface, an auth scope, and observable boundaries. This section names the servers we rely on, the rules for when to build a custom one, and where they run.

Standard inventory — servers we reach for first

| Server | Purpose | Source | Auth |
| --- | --- | --- | --- |
| slack | Post messages, read threads, react, scheduled sends | Official | OAuth, per workspace |
| gmail | Read, search, draft, label | Official | OAuth, per account |
| google-calendar | Read / create / move events | Official | OAuth, per account |
| google-drive | Read doc/sheet/slide metadata & content | Official | OAuth, per account |
| sheets-state | Our own wrapper over the Sheets API: state-machine primitives (lock row, write status, append audit) | Custom (Studio) | Service account |
| notion | Page create/edit, database query | Official | Integration token |
| linear | Issue create/update, cycle read | Official | API key |
| github | PR, issue, branch, file | Official | PAT or app |
| fetch | Constrained HTTP GET with allow-list of hosts | Official | None (host allow-list only) |
| wp-publish | WordPress draft / publish; used by Indietheka pipelines | Custom (Studio) | App password |
| album-art | Cover-art fetch (Album of the Year lookup) | Custom (Studio) | None |
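
The fetch row relies on a host allow-list as its only boundary, so the check itself is worth sketching. This is a minimal illustration under stated assumptions (exact hostname match, HTTPS only, lowercase entries in the list); `isAllowedUrl` is a hypothetical helper, not the fetch server's actual implementation.

```typescript
// Assumption: exact hostname match over HTTPS only; allow-list entries are lowercase.
function isAllowedUrl(raw: string, allowedHosts: readonly string[]): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // unparseable input is rejected rather than guessed at
  }
  if (url.protocol !== "https:") return false;
  // Compare the parsed hostname, never a substring of the raw string.
  return allowedHosts.includes(url.hostname);
}
```

Matching on the parsed hostname rather than on a prefix or substring is the design choice that matters: it keeps a URL such as `https://api.example.com.attacker.net/` from passing a check for `api.example.com`.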

Build vs. buy — when we write a custom MCP server

We default to an official or community-published MCP server whenever one exists and fits. We write a custom server only when the task is load-bearing enough to own or when no public server covers it.

  1. No server exists. The target system has no official or community MCP server, and wrapping the API is cheaper than having a human do the task every day.
  2. The existing server is too broad. Official servers often expose hundreds of tools; an agent works best with 5–15 scoped tools. A custom wrapper pares down.
  3. We need Studio-specific primitives on top. Example: sheets-state exists because raw Sheets access does not have “lock row + update status + append audit” as a single atomic operation.
  4. Audit / observability requirements are not met. If we need to log every call with a correlation ID and the existing server will not, we wrap it.
  5. Three sprints have reused the custom wrapper we built ad hoc. Promote to a shared internal MCP server in the Studio catalog.
§ Anti-pattern
Do not build a custom MCP server for a one-off. If an agent will use a tool twice in its life, put the HTTP call in a single TypeScript helper and move on. Custom MCP servers have a maintenance tax — only pay it for load-bearing integrations.
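
Rules 2 and 4 above (pare a broad server down, log every call with a correlation ID) can be sketched as a thin wrapper. The `Tool` and `CallLog` shapes are hypothetical simplifications made for this example; a real MCP tool also carries an input schema and a description.

```typescript
// Hypothetical tool shape; real MCP tools carry more metadata than this.
interface Tool {
  name: string;
  call: (args: Record<string, unknown>) => Promise<unknown>;
}

interface CallLog {
  correlationId: string;
  tool: string;
  ok: boolean;
  ms: number;
}

// Keep only allow-listed tools and log every call against one correlation ID.
function scopeTools(
  all: readonly Tool[],
  allow: readonly string[],
  correlationId: string,
  log: (entry: CallLog) => void,
): Tool[] {
  return all
    .filter((t) => allow.includes(t.name))
    .map((t): Tool => ({
      name: t.name,
      call: async (args: Record<string, unknown>) => {
        const start = Date.now();
        try {
          const out = await t.call(args);
          log({ correlationId, tool: t.name, ok: true, ms: Date.now() - start });
          return out;
        } catch (err) {
          log({ correlationId, tool: t.name, ok: false, ms: Date.now() - start });
          throw err; // the caller still sees the original failure
        }
      },
    }));
}
```

A wrapper like this is cheap enough to live in the harness; it only becomes a custom MCP server once the reuse rule in item 5 applies.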

Where MCP servers run

Pattern A
Co-located with the agent.
Default

The MCP server ships in the same repo and runs alongside the agent harness on the same host, typically as a stdio-transport server that the harness launches as a child process. Simple; no network hop; easy to debug.

Use when
  • The server is lightweight and stateless.
  • Only one agent uses it.
  • The agent runs on a single deployment target.
Pattern B
Remote HTTP MCP.
For shared or multi-agent

The MCP server runs as its own HTTP service — Cloud Run, Vercel Function, or a long-running container. Multiple agents across multiple repos connect over HTTP(S).

Use when
  • Two or more agents need the same tool surface.
  • The server has meaningful state or caches.
  • Credentials are scoped to the server, not the agents.
Pattern C
Client-side MCP.
For human-in-the-loop only

The MCP server runs on a human operator’s machine — e.g., Claude Desktop with a local MCP server for file system access or Apple Notes. Production agents do not use this pattern.

Use when
  • A human is driving the session interactively.
  • The tool requires access to local-only resources (clipboard, native app state).
  • The risk envelope makes remote execution undesirable.
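
The three patterns above reduce to a short decision. The sketch below is one possible reading of the "Use when" lists; `ServerContext` and its field names are hypothetical, introduced only for this example.

```typescript
type McpPattern = "co-located-stdio" | "remote-http" | "client-side";

// Hypothetical inputs mirroring the "Use when" bullets.
interface ServerContext {
  humanDriven: boolean;             // a person is driving the session interactively
  agentCount: number;               // agents that need this tool surface
  statefulOrCached: boolean;        // server holds meaningful state or caches
  serverScopedCredentials: boolean; // credentials belong to the server, not the agents
}

function choosePattern(c: ServerContext): McpPattern {
  // Pattern C is for interactive sessions only; production agents never use it.
  if (c.humanDriven) return "client-side";
  if (c.agentCount >= 2 || c.statefulOrCached || c.serverScopedCredentials) {
    return "remote-http";
  }
  return "co-located-stdio"; // Pattern A, the default
}
```
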
§05

State-machine infrastructure.

USP-009 in practice · Sheets + Apps Script as the default substrate · graduation path

The spreadsheet-as-state-machine pattern (USP-009) is not a nostalgia play. Google Sheets + Apps Script is the Studio’s default workflow substrate because it gets four things right that bespoke databases get wrong on sprint timelines: the client can see the state live, atomic-enough row locking exists, audit history is free, and rollback is a copy-paste.

The canonical sheet shape — six tabs, every time

| Tab | Purpose | Who writes |
| --- | --- | --- |
| queue | Live work items. Each row is one workflow instance; the status column drives agent selection. | Humans + agents |
| config | Flags, thresholds, cost caps, allow-lists. Agents read; humans edit. | Humans only |
| audit | Append-only. Every agent run appends a row with correlation ID, agent name, model, tokens, dollars, outcome. | Agents only |
| alerts | Active alerts (SEV, severity, first-seen, last-ack). Humans clear rows. | Agents write, humans clear |
| metrics | Rolling counters: runs today, errors today, dollars today, dollars MTD. | Agents only |
| readme | One-page README explaining the sheet. For the client. Never empty. | Humans only |

Row-locking & concurrency discipline

Apps Script has no row-level lock, so two executions can claim the same row unless claims are serialized. The Studio’s sheets-state MCP server wraps a lock-row primitive that uses LockService.getDocumentLock() plus a lock_token column. Every agent that claims a row writes its token; every release clears it. Tokens older than a timeout are treated as stale and reaped by a janitor trigger.

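```javascript
// Canonical row-claim pattern (Apps Script).
// TOKEN_COL and STALE_MS are illustrative values; set them to match the queue tab.
const TOKEN_COL = 8;             // 1-based index of the lock_token column (assumed)
const STALE_MS = 10 * 60 * 1000; // age after which a token counts as stale (assumed)

// A token is "<agentId>:<epoch ms>"; unparseable tokens are treated as stale.
function isStale(token) {
  const ts = Number(String(token).split(':').pop());
  return !Number.isFinite(ts) || Date.now() - ts > STALE_MS;
}

function claimRow(sheet, rowIdx, agentId) {
  const lock = LockService.getDocumentLock();
  lock.waitLock(5000); // throws if the lock is not acquired within 5 s
  try {
    const cell = sheet.getRange(rowIdx, TOKEN_COL);
    const token = cell.getValue();
    if (token && !isStale(token)) return null; // row is held by a live claim
    const newToken = `${agentId}:${Date.now()}`;
    cell.setValue(newToken);
    SpreadsheetApp.flush(); // commit the write before the lock is released
    return newToken;
  } finally {
    lock.releaseLock();
  }
}
```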
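
The claim, release, and janitor steps described above can be modeled without a spreadsheet, which is useful for unit-testing the token logic. The sketch below is an in-memory model, not the sheets-state implementation: `Row`, `release`, and `reap` are hypothetical names, and time is passed in explicitly so the behavior is deterministic.

```typescript
// In-memory model of the lock_token column; not the sheets-state implementation.
interface Row {
  id: number;
  lockToken: string | null;
}

// Tokens have the form "<agentId>:<epoch ms>"; unparseable tokens count as stale.
function isStale(token: string, now: number, staleMs: number): boolean {
  const ts = Number(token.slice(token.lastIndexOf(":") + 1));
  return !Number.isFinite(ts) || now - ts > staleMs;
}

function claim(row: Row, agentId: string, now: number, staleMs: number): string | null {
  if (row.lockToken !== null && !isStale(row.lockToken, now, staleMs)) return null;
  row.lockToken = `${agentId}:${now}`;
  return row.lockToken;
}

// Release succeeds only for the current holder, so a slow agent cannot clear a newer claim.
function release(row: Row, token: string): boolean {
  if (row.lockToken !== token) return false;
  row.lockToken = null;
  return true;
}

// Janitor pass: clear stale tokens and report how many rows were reaped.
function reap(rows: Row[], now: number, staleMs: number): number {
  let reaped = 0;
  for (const row of rows) {
    if (row.lockToken !== null && isStale(row.lockToken, now, staleMs)) {
      row.lockToken = null;
      reaped++;
    }
  }
  return reaped;
}
```

Checking the token on release is an addition beyond what the text specifies; it is included because clearing a lock by row alone would let an agent whose claim was already reaped free a row that someone else now holds.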

Graduation path — when to leave Sheets

Graduation is a move to Airtable (hosted, richer API, still visible to humans) or Supabase / Postgres (real DB, structured schemas, SQL, RLS). One of the five §07 pressures from the master must apply. We do not graduate because a spreadsheet “feels unprofessional.”

Step 1
Airtable as intermediate.
Most common first graduation

Airtable keeps the “humans can see the state” property while giving better concurrency and an API that does not suffer from Apps Script’s execution-time limits. Most workflows stay here.

Graduate to Airtable when
  • Queue tab exceeds ~5,000 rows and interactive editing becomes slow.
  • More than 3 agents need to write concurrently.
  • A trigger-driven flow has hit Apps Script quota limits.
Alternatives to Airtable
  • Notion databases — if the client already lives in Notion, acceptable substitute.
  • Baserow / NocoDB — open-source alt if data residency is a requirement.
Step 2
Supabase / Postgres.
Terminal graduation

When the workflow has outgrown what a human-editable table can hold — strict schemas, constraints, transactional semantics, row-level security, real query needs. The state machine loses its “client can see it live” property unless we pair it with a thin dashboard (a Lovable or Retool page pointed at the DB).

Graduate to Supabase when
  • You need real foreign-key constraints and multi-table transactions.
  • Query latency must be sub-second.
  • Compliance mandates data residency, row-level security, or audit at the DB level.
  • Volume is past ~100k rows and growing.
Alternatives to Supabase
  • Neon — Postgres with branching; nicer for dev / preview environments.
  • PlanetScale — MySQL-compatible if the client is in that ecosystem.
  • Turso — edge SQLite if the workload is read-heavy and distributed.
§ Reversal rule
Every graduation must have a documented reversal path. If Supabase vanishes tomorrow, the queue table exports back to a Sheet in under 30 minutes. Reversibility is cheap to design up front, expensive to retrofit.
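
The graduation triggers listed in this section can be sketched as a check that returns both a target and the named pressures behind it, so that an empty reasons list means "stay on Sheets". The `Pressures` fields are hypothetical names for the bullets above; the thresholds are the approximate figures from the text.

```typescript
type Substrate = "sheets" | "airtable" | "postgres";

// Hypothetical inputs mirroring the "Graduate to ... when" bullets.
interface Pressures {
  queueRows: number;
  concurrentWriters: number;
  hitAppsScriptQuota: boolean;
  needsTransactions: boolean;     // foreign keys or multi-table transactions
  needsSubSecondQueries: boolean;
  dbLevelCompliance: boolean;     // residency, RLS, or DB-level audit mandated
}

function recommendSubstrate(p: Pressures): { target: Substrate; reasons: string[] } {
  const postgres: string[] = [];
  if (p.needsTransactions) postgres.push("foreign keys / multi-table transactions");
  if (p.needsSubSecondQueries) postgres.push("sub-second query latency");
  if (p.dbLevelCompliance) postgres.push("DB-level compliance");
  if (p.queueRows > 100000) postgres.push("volume past ~100k rows");
  if (postgres.length > 0) return { target: "postgres", reasons: postgres };

  const airtable: string[] = [];
  if (p.queueRows > 5000) airtable.push("queue past ~5,000 rows");
  if (p.concurrentWriters > 3) airtable.push("more than 3 concurrent writers");
  if (p.hitAppsScriptQuota) airtable.push("Apps Script quota hit");
  if (airtable.length > 0) return { target: "airtable", reasons: airtable };

  // No named pressure: the default substrate stands.
  return { target: "sheets", reasons: [] };
}
```

Returning the reasons alongside the target gives the written exception its content: the list is what gets cited when a sprint deviates from the default.
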
§06

Prompt versioning.

Where prompts live · how we version them · regression discipline

Prompts are code. We treat them like code — they live in the repo, they version with semver, they have tests, they land through PRs. The Studio’s refusal to let prompts live in a prompt-store SaaS is deliberate: vendor lock-in on the most valuable artifact in an agent system is unacceptable (P-01).

File layout · one prompt per file, one file per version

```
// repo layout
prompts/
  album-reviewer/
    v1.0.0.md        // frozen once promoted
    v1.1.0.md
    v1.1.1.md
    CURRENT.md       // symlink -> active version
  seo-brief/
    v2.0.0.md
    CURRENT.md
evals/
  album-reviewer/
    cases/
      golden-001.md
      golden-002.md
    runner.ts
```

Each versioned file has front matter: model_tier, temperature, max_tokens, tools, created_at, author, changelog. The body is the prompt itself, written in Markdown, with instructions and examples organized under headings.

The CURRENT.md symlink is what the agent runtime loads. Rolling back is ln -sf v1.1.0.md CURRENT.md and a deploy. This is the cheapest rollback we have.
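
Loading a prompt then comes down to reading CURRENT.md and splitting front matter from body. The sketch below assumes `---`-delimited front matter with one `key: value` pair per line, which the text does not specify; `parsePrompt` is a hypothetical helper, and list-valued keys such as tools would need a real YAML parser.

```typescript
interface LoadedPrompt {
  meta: Record<string, string>;
  body: string;
}

// Assumes "---" delimiters and flat "key: value" lines; nested YAML is out of scope.
function parsePrompt(text: string): LoadedPrompt {
  const m = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?([\s\S]*)$/.exec(text);
  if (!m) throw new Error("prompt file has no front matter");
  const meta: Record<string, string> = {};
  for (const line of m[1].split(/\r?\n/)) {
    const i = line.indexOf(":");
    // Split on the first colon only, so timestamps in values stay intact.
    if (i > 0) meta[line.slice(0, i).trim()] = line.slice(i + 1).trim();
  }
  if (!meta["model_tier"]) throw new Error("front matter must declare model_tier");
  return { meta, body: m[2] };
}
```

In the harness, the input text would come from reading `prompts/<agent>/CURRENT.md`; Node's `fs.readFileSync` follows the symlink, so the loader never needs to know which version is active.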

Semver discipline for prompts

| Change | Bump | Example |
| --- | --- | --- |
| Fix a typo, clarify a sentence | patch | 1.1.0 → 1.1.1 |
| Add a new section or capability without removing any | minor | 1.1.1 → 1.2.0 |
| Remove or restructure instructions, change output schema, change model tier | major | 1.2.0 → 2.0.0 |
| Swap model vendor (we don’t, but hypothetically) | major | always major |
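
The bump arithmetic in the table is mechanical and can be scripted so that new version files are never named by hand. A minimal sketch, with `bumpVersion` as a hypothetical helper name:

```typescript
type Bump = "patch" | "minor" | "major";

// Compute the next version string; lower components reset to zero on a higher bump.
function bumpVersion(version: string, kind: Bump): string {
  const m = /^(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (!m) throw new Error(`not a semver string: ${version}`);
  const major = Number(m[1]);
  const minor = Number(m[2]);
  const patch = Number(m[3]);
  if (kind === "major") return `${major + 1}.0.0`;
  if (kind === "minor") return `${major}.${minor + 1}.0`;
  return `${major}.${minor}.${patch + 1}`;
}
```

Applied to the table's own examples, `bumpVersion("1.1.0", "patch")` gives `1.1.1`, `bumpVersion("1.1.1", "minor")` gives `1.2.0`, and `bumpVersion("1.2.0", "major")` gives `2.0.0`.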

Regression discipline — evals against golden cases

Every prompt with ≥ 10 production runs has at least five golden cases in evals/<agent>/cases/. A golden case is an input + an expected shape of output (not a string match — a judge-prompt or a structural check). The eval runner replays every golden case against CURRENT.md and reports pass/fail.

Rules: major bumps require all goldens to pass. Minor bumps require the goldens in the affected section to pass. Patch bumps require no regressions on any golden. Any golden marked load-bearing in its front matter must pass at every bump, without exception.
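
The promotion rules can be expressed as a single gate over eval results. This is a sketch under assumptions: the `section` tag that ties a golden to a prompt section and the `passedBefore` flag (the result against the version being replaced) are hypothetical fields, needed here to make "affected section" and "no regressions" checkable.

```typescript
type Bump = "patch" | "minor" | "major";

interface GoldenResult {
  id: string;
  section: string;       // hypothetical tag linking a golden to a prompt section
  loadBearing: boolean;
  passed: boolean;       // result against the candidate version
  passedBefore: boolean; // result against the version being replaced
}

function mayPromote(
  kind: Bump,
  results: readonly GoldenResult[],
  affectedSections: readonly string[] = [],
): boolean {
  // Load-bearing goldens must pass at every bump.
  if (results.some((r) => r.loadBearing && !r.passed)) return false;
  if (kind === "major") return results.every((r) => r.passed);
  if (kind === "minor") {
    return results
      .filter((r) => affectedSections.includes(r.section))
      .every((r) => r.passed);
  }
  // Patch: nothing that passed before may fail now.
  return results.every((r) => r.passed || !r.passedBefore);
}
```
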

§ The hard rule
A promoted prompt version file is immutable. If v1.1.0 has a bug, create v1.1.1. Never edit v1.1.0 in place — rollback depends on the previous version still behaving like it did yesterday.
§07

Observability.

What lands where · log schema · why the state sheet is still the dashboard

Observability is P-04 in action: governance is a column, not a tool. Every agent run produces signals that land in one of three places — the state sheet, the cloud logs, and an alert channel. We instrument every agent against this three-target discipline; nothing gets promoted to production without it.

The three observability targets

Target 1 · Primary

State sheet.

Every run appends a row to the audit tab. Every rollup refreshes the metrics tab. Every alert writes to alerts. The client opens one URL and sees the system.

Target 2 · Secondary

Cloud logs.

Google Cloud Logging (or Vercel logs) captures raw request/response, stack traces, and latencies that do not belong in a sheet. Queryable by correlation ID. For engineers, not for clients.

Target 3 · Push

Alert channel.

Email (primary) + Slack DM (secondary) for SEV-1/2 conditions. Silent SEV-3 rows append to the alerts tab for the next review. Nothing SEV-1 is silent.
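
The routing rule for the alert channel fits in a few lines. The channel names below are hypothetical labels chosen for this sketch; the mapping itself follows the text: every alert lands on the alerts tab, and only SEV-1 and SEV-2 push to a person.

```typescript
type Sev = "sev-1" | "sev-2" | "sev-3";
type Channel = "alerts-tab" | "email" | "slack-dm";

// SEV-3 is silent (sheet only); SEV-1/2 also push by email and Slack DM.
function channelsFor(sev: Sev): Channel[] {
  return sev === "sev-3" ? ["alerts-tab"] : ["alerts-tab", "email", "slack-dm"];
}
```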

Canonical audit row schema

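| Field | Type | Required | Example |
| --- | --- | --- | --- |
| run_id | string | yes | 2026-04-21T14:03:11.r7qk |
| agent | string | yes | album-reviewer |
| version | string | yes | 1.1.1 |
| model | string | yes | claude-sonnet-4-6 |
| input_tokens | int | yes | 4,218 |
| output_tokens | int | yes | 1,903 |
| usd | decimal | yes | 0.0382 |
| duration_ms | int | yes | 5,441 |
| outcome | enum | yes | ok · retry · fail · skip |
| sev | enum | if not ok | sev-1 · sev-2 · sev-3 |
| error | string | if fail | rate_limit_429 |
| notes | string | optional | Free text. |
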
§ The rule
No audit row, no production run. Our harness refuses to complete an invocation without writing its audit row. We would rather fail loud than run silent.
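
The schema and the rule can be sketched together: a typed row, a validator for the conditional fields, and a wrapper that refuses to return a result until the row is written. `validateAuditRow` and `runWithAudit` are hypothetical names, and `append` stands in for whatever writes to the audit tab.

```typescript
type Outcome = "ok" | "retry" | "fail" | "skip";
type Sev = "sev-1" | "sev-2" | "sev-3";

interface AuditRow {
  run_id: string; agent: string; version: string; model: string;
  input_tokens: number; output_tokens: number; usd: number; duration_ms: number;
  outcome: Outcome; sev?: Sev; error?: string; notes?: string;
}

// Returns a list of problems; an empty list means the row is writable.
function validateAuditRow(r: AuditRow): string[] {
  const problems: string[] = [];
  for (const k of ["run_id", "agent", "version", "model"] as const) {
    if (!r[k]) problems.push(`${k} is required`);
  }
  for (const k of ["input_tokens", "output_tokens", "usd", "duration_ms"] as const) {
    if (!Number.isFinite(r[k]) || r[k] < 0) problems.push(`${k} must be a non-negative number`);
  }
  if (r.outcome !== "ok" && !r.sev) problems.push("sev is required when outcome is not ok");
  if (r.outcome === "fail" && !r.error) problems.push("error is required when outcome is fail");
  return problems;
}

// The invocation completes only after its audit row is validated and appended.
async function runWithAudit<T>(
  run: () => Promise<{ result: T; row: AuditRow }>,
  append: (row: AuditRow) => Promise<void>,
): Promise<T> {
  const { result, row } = await run();
  const problems = validateAuditRow(row);
  if (problems.length > 0) throw new Error(`invalid audit row: ${problems.join("; ")}`);
  await append(row); // if this throws, the invocation fails loud
  return result;
}
```
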
§08

Deployment targets.

Where agent code runs · when each target is right

Three runtime targets handle every workload we build. Each has a narrow niche. Misplacing a workload between them is the second most common cost-drift cause after model-tier choice.

Target A
Vercel Functions.
Default for agents & webhooks

Short-lived, stateless, HTTP-triggered. Co-located with web properties. Trivially cheap. The Studio’s first reach.

Use when
  • Agent invocations complete in under ~60 seconds.
  • Triggered by HTTP, cron, or webhook.
  • Stateless between runs (state lives on the sheet / DB).
Alternatives
  • Cloudflare Workers — lower latency edge, tighter resource ceilings.
  • Deno Deploy — if a workload is strictly Deno-native.
Target B
Google Cloud Run.
For long-running or heavy

Container-based, scales to zero, minute-plus execution, up to 8 GB memory, 2 GiB temp disk, HTTP or Pub/Sub triggers. The Studio’s pick when Vercel’s 60-second ceiling is not enough.

Use when
  • Agent invocations may run several minutes (Opus on long contexts; fan-out chains).
  • Custom MCP servers running as HTTP services.
  • Workloads needing more than ~1 GB memory.
Alternatives
  • Fly.io — similar niche; better for persistent-connection workloads.
  • AWS Lambda + Fargate — if the client already lives in AWS.
  • Render / Railway — simpler DX at the cost of some observability depth.
Target C
Apps Script.
Only for sheet-bound triggers

Google-owned, runs inside the sheet, perfect for cheap triggers. Never put reasoning here — Apps Script is glue, not compute. A trigger calls out to a Vercel Function or Cloud Run service that calls Claude.

Use when
  • You need an onEdit / onFormSubmit / time-based trigger to bootstrap an agent run.
  • A thin web-app endpoint is needed to accept writes back into the sheet.
  • Cheap cron inside the sheet’s own quota envelope.
Alternatives
  • Zapier / Make — if a non-developer client owns the trigger fabric.
  • Vercel cron — when the trigger is not sheet-anchored.
§ Target discipline
Default to Vercel Functions. Escalate to Cloud Run only when a specific constraint bites. Use Apps Script only for the last-inch trigger. Two targets is the norm; three is a warning sign.
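
The target discipline above can be sketched as a chooser. The `Workload` fields are hypothetical names for the "Use when" bullets, and the 60-second and 1 GB thresholds are the approximate figures from this section.

```typescript
type Target = "vercel-functions" | "cloud-run" | "apps-script";

// Hypothetical inputs mirroring the "Use when" bullets.
interface Workload {
  sheetBoundTrigger: boolean; // onEdit / onFormSubmit / time-based trigger in the sheet
  callsModel: boolean;        // whether this piece of code does any reasoning itself
  expectedSeconds: number;
  memoryGb: number;
  hostsHttpMcp: boolean;      // runs a custom MCP server as an HTTP service
}

function chooseTarget(w: Workload): Target {
  // Apps Script is glue only: it may start a run but never hosts reasoning.
  if (w.sheetBoundTrigger && !w.callsModel) return "apps-script";
  // Escalate only when a specific constraint bites.
  if (w.expectedSeconds >= 60 || w.memoryGb > 1 || w.hostsHttpMcp) return "cloud-run";
  return "vercel-functions"; // the default
}
```
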
§09

Cost caps & rate limits.

Per-run · per-day · per-sprint · the circuit breakers that keep us safe

Runaway agent loops are the single most expensive accidental behavior in the Studio’s runtime. Every agent is wrapped in three levels of cap: per-run, per-day, per-sprint. When any cap fires, the harness halts or pauses the agent as the table specifies and raises an alert at the listed severity (SEV-2 where none is listed). Caps are non-negotiable.

| Cap | Scope | Default | What happens on breach |
| --- | --- | --- | --- |
| Tokens per run | Single invocation | max 200k in + 32k out | Harness aborts; audit row with outcome=fail, error=token_cap. |
| Dollars per run | Single invocation | $3.00 (Sonnet) · $12 (Opus) | Harness aborts; audit row + SEV-3 alert. |
| Tool calls per run | Single invocation | 40 | Harness aborts; usually indicates agent is stuck in a loop. |
| Dollars per agent per day | Agent-scoped rolling 24 h | $50 default, per-spec override | Pause new invocations; existing runs complete; SEV-2 alert. |
| Dollars per sprint total | All agents in a client repo | Master §04 variable envelope × 1.25 | Pause all agents; SEV-1 alert to Sprint Lead; manual release required. |
| Concurrent runs per agent | Agent-scoped | 3 | Queue incoming; reject after queue depth 10. |
| API rate limit (Anthropic) | Account-wide | Per tier | Honor 429; exponential backoff with jitter; SEV-3 if sustained > 5 min. |

§ Why caps fire
A cap that never fires is probably too loose. A cap that fires weekly on the same agent is a spec problem, not a cap problem. Caps are diagnostic tools, not just safety rails — read breaches as signal.
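
The rate-limit row calls for exponential backoff with jitter. A minimal sketch of the delay calculation follows, using the "full jitter" variant (a uniform draw between zero and the capped exponential ceiling); the 500 ms base and 30 s cap are placeholder defaults, not Studio settings, and `rand` is injectable so the function can be tested deterministically.

```typescript
// Full-jitter backoff: delay is uniform in [0, min(capMs, baseMs * 2^attempt)).
function backoffMs(
  attempt: number, // 0 for the first retry
  baseMs: number = 500,
  capMs: number = 30000,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(rand() * ceiling);
}

// The table's SEV-3 condition: 429s sustained for more than five minutes.
function rateLimitSustained(first429AtMs: number, nowMs: number): boolean {
  return nowMs - first429AtMs > 5 * 60 * 1000;
}
```

Jitter matters most when several agents share one account-wide limit: without it, every agent that received the same 429 retries at the same instant and trips the limit again.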

Circuit-breaker pattern — the harness enforces; the sheet records

The harness checks caps before each API call and after each tool call. On breach, it halts, writes the audit row with outcome=fail and an error that names the cap, appends a row to alerts, and sends the email when the severity is SEV-1 or SEV-2. A human clears the alert only after the root cause is understood; caps do not auto-reset.
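
The per-run half of the circuit breaker can be sketched as a pure check the harness calls at both checkpoints. The cap values are the Sonnet defaults from the table; `token_cap` is the error name the table uses, while `dollar_cap` and `tool_call_cap` are placeholder names invented for this example.

```typescript
interface RunCaps {
  maxInputTokens: number;
  maxOutputTokens: number;
  maxUsd: number;
  maxToolCalls: number;
}

interface RunUsage {
  inputTokens: number;
  outputTokens: number;
  usd: number;
  toolCalls: number;
}

// Per-run defaults from the table for a Sonnet agent.
const SONNET_RUN_CAPS: RunCaps = {
  maxInputTokens: 200000,
  maxOutputTokens: 32000,
  maxUsd: 3.0,
  maxToolCalls: 40,
};

// Returns the name of the first breached cap, or null when the run may continue.
function breachedCap(u: RunUsage, c: RunCaps): string | null {
  if (u.inputTokens > c.maxInputTokens || u.outputTokens > c.maxOutputTokens) return "token_cap";
  if (u.usd > c.maxUsd) return "dollar_cap";          // placeholder name
  if (u.toolCalls > c.maxToolCalls) return "tool_call_cap"; // placeholder name
  return null;
}

// What the harness records on breach: a failed outcome whose error names the cap.
function breachRecord(cap: string): { outcome: "fail"; error: string } {
  return { outcome: "fail", error: cap };
}
```

Because the check is a pure function of usage and caps, the Gate 3 drill can exercise it directly with synthetic usage before triggering a breach against a live agent.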

§10

Agent Engineer checklist.

Pre-sprint · during build · at Gate 3 · at Gate 4

Pre-sprint (before Day 1)

  1. Claude API key provisioned for the client’s scope; billing alarm at 50% / 80% / 100% of sprint envelope.
  2. GitHub private repo created; CI configured; CURRENT.md symlink pattern scaffolded.
  3. State sheet template forked; six canonical tabs in place; README populated.
  4. sheets-state MCP server connected; lock-row smoke test passes.
  5. Deployment target chosen (Vercel vs. Cloud Run) per §08 criteria.

During Build (Weeks 5–9)

  1. Each new agent has a spec with explicit model_tier declaration before first run.
  2. Each agent has at least five golden eval cases before Week 7.
  3. Cost caps configured per §09 before the first production invocation.
  4. Audit row schema validated end-to-end for every agent (no silent runs).
  5. Weekly review of the metrics tab — variable cost tracking within the envelope.

At Gate 3 (Production Readiness)

  1. All agents green on their golden evals at CURRENT.md.
  2. Cost caps firing cleanly in a drill (deliberately trigger one; confirm harness halts and alert fires).
  3. Rollback drill: ln -sf v-prev.md CURRENT.md, deploy, verify previous behavior. Under 15 minutes.
  4. Secrets rotation: every key used in sprint was created or rotated inside the sprint window; nothing inherited from before.
  5. Observability signoff: three targets confirmed populating for every agent.

At Gate 4 (Independence Verified)

  1. Client holds their own Anthropic account; Studio key revoked from production path.
  2. Client holds credentials for every MCP server the system uses; Studio credentials revoked.
  3. Deployment target transferred to client’s org (Vercel team, GCP project).
  4. Repo ownership transferred or mirrored per contract.
  5. Post-mortem filed for every SEV-2+ event that occurred during the sprint.
  6. One candidate pattern nominated for USP library inclusion.
§ Bottom line
An Agent Engineer who keeps this checklist will not need heroics. An Agent Engineer who skips it will spend the last week of every sprint fighting fires the checklist would have prevented. The checklist is the discipline that makes the runtime boring — which is the goal.