AI agents as explicit state machines

AI agents should be built as explicit state machines rather than monolithic prompts, according to a new analysis of agent architecture failures. A single prompt that handles routing, extraction, tool selection, formatting, and recovery creates entanglement, untestability, cost amplification, and opacity — problems that a state machine solves by separating work into explicit states with typed transitions, validators, and recovery paths. The approach allows per-state unit testing, deterministic error handling with bounded retries and dead-letter paths, and per-state model routing to reduce costs.

Why AI Agents Should Be State Machines

A monolithic agent prompt hides routing, extraction, tool selection, formatting, and recovery inside one probabilistic call. The antidote is the formal state machine: explicit states, typed transitions, validators, and recovery paths.

A monolithic agent prompt is convenient until one prompt starts doing five jobs at once: routing, extraction, tool selection, formatting, and recovery. When the agent fails, the prompt does not tell you which job failed. It only returns another opaque output.

I think this is where many agent systems need less prompting and more ordinary software architecture. A state machine separates the runtime into explicit states, typed transitions, validators, and recovery paths. The LLM still handles local ambiguity inside a state; the surrounding system controls what state can happen next.

That boundary matters because LLM calls are stochastic. If a state emits malformed output, the next state should not have to guess what went wrong. The transition should fail loudly, attach a useful error, retry only when the failure is recoverable, and route to a human or dead-letter path when it is not.

TL;DR (Key Takeaways):

- A monolithic agent prompt hides routing, extraction, tool selection, formatting, and recovery inside one probabilistic call.
- A state machine makes those concerns explicit: states do work, transitions choose the next step, and contracts validate handoffs.
- Each state can be tested and observed separately; malformed output fails at the boundary instead of contaminating later steps.
- Model routing becomes possible once work is split by state, but economics should remain secondary to reliability.
- Error handling becomes architecture: deterministic validators, bounded retries, fallbacks, and dead-letter paths.

The Monolithic Prompt Failure Taxonomy

Entanglement.
When a single prompt encodes routing logic, domain knowledge, and output formatting simultaneously, changing one dimension requires re-testing all the others. I have seen agent prompts grow until nobody could change one instruction without revalidating the whole behavior surface.

Untestability. You cannot write a unit test for an LLM prompt that does five things at once. You can only run end-to-end evaluations and observe whether the emergent behavior stays stable. As I wrote in The End of Determinism (https://arizenai.com/end-of-determinism/), the stochastic nature of LLMs makes this problem structural, but a state machine gives you isolation boundaries for your evaluation suites.

Cost amplification. A monolithic agent sends the same context, tool schemas, and domain rules to the same model for every step. Classification, extraction, validation, and synthesis may need different levels of intelligence, but the monolith pays for the whole bundle every time.

Opacity. When an agent misbehaves, a monolithic prompt conflates routing, extraction, tool selection, and formatting failures into a single black box. The state machine emits structured logs at every transition, so the failing state is identifiable.

| Failure mode | Monolithic prompt | State machine |
|---|---|---|
| Entanglement | All logic in one prompt: change anything, risk everything | Each state owns one concern, so changes are surgical |
| Untestability | Only end-to-end evals possible | Per-state unit tests + contract validation |
| Cost | Frontier model for every call | Per-state model routing: budget models where possible |
| Debuggability | Failure location unknown | Structured logs at every transition |
| Error handling | Ask the same prompt to recover | Retry, fallback, dead-letter per transition |

Anatomy of an Agent State Machine

A production agent state machine has three primitives: states, transitions, and contracts. States do work. Transitions encode control flow.
Contracts enforce the shape of data at every boundary.

```python
from __future__ import annotations

from enum import Enum
from typing import Any

from pydantic import BaseModel, Field


class TicketIntent(str, Enum):
    BILLING = "billing"
    TECHNICAL = "technical"
    GENERAL = "general"
    UNKNOWN = "unknown"


class ClassifyOutput(BaseModel):
    intent: TicketIntent
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str


class ExtractOutput(BaseModel):
    entities: dict[str, Any]
    ticket_id: str | None = None
    customer_tier: str = "standard"


class StateResult(BaseModel):
    next_state: str
    payload: dict[str, Any]


# --- State definitions ---
# llm_client is a placeholder for any async client whose complete(...)
# returns an instance of the requested response_model.


async def classify_intent(user_message: str, llm_client: Any) -> ClassifyOutput:
    """State 1: Classify. Uses a small classification model."""
    response = await llm_client.complete(
        model="small-classifier-model",
        system=(
            "Classify the user message into one of: billing, technical, general. "
            "Return JSON with intent, confidence (0-1), and reasoning."
        ),
        user=user_message,
        response_model=ClassifyOutput,
    )
    return response


async def extract_entities(
    user_message: str, intent: TicketIntent, llm_client: Any
) -> ExtractOutput:
    """State 2: Extract. Uses a constrained extraction model."""
    response = await llm_client.complete(
        model="structured-extraction-model",
        system=(
            f"Extract structured entities from a {intent.value} support ticket. "
            "Return JSON with entities dict, optional ticket_id, customer_tier."
        ),
        user=user_message,
        response_model=ExtractOutput,
    )
    return response


async def route_to_handler(
    classification: ClassifyOutput, entities: ExtractOutput
) -> StateResult:
    """Transition logic: pure code, no LLM needed."""
    # Low-confidence classifications go to a human instead of a specialist.
    if classification.confidence < 0.7:
        return StateResult(
            next_state="human_escalation",
            payload={"reason": "low_confidence", "score": classification.confidence},
        )
    handler_map = {
        TicketIntent.BILLING: "billing_specialist",
        TicketIntent.TECHNICAL: "technical_specialist",
        TicketIntent.GENERAL: "general_responder",
        TicketIntent.UNKNOWN: "human_escalation",
    }
    return StateResult(
        next_state=handler_map[classification.intent],
        payload=entities.model_dump(),
    )
```

Every state function has a typed return. Every transition consumes a typed input. Pydantic enforces these contracts at runtime whenever a model is constructed or validated: if a state produces malformed output, the error surfaces immediately at the boundary, not three states downstream.

Pro Tip: Start with five states, not fifty. A small pipeline (`CLASSIFY`, `EXTRACT`, `ROUTE`, `EXECUTE`, `SYNTHESIZE`) is often enough to expose the pattern. Add states only when a production failure reveals that two concerns are entangled in a single state.

Model Routing Becomes Possible

Once your agent is a state machine, each state can use the smallest model or deterministic function that can handle that state. The name matters less than the boundary: simple states should not pay for the model capacity needed by complex states.

- `CLASSIFY_INTENT`: small model or deterministic classifier.
- `EXTRACT_ENTITIES`: constrained extraction model or parser.
- `VALIDATE_SCHEMA`: schema validation, no LLM.
- `EXECUTE_TOOL`: API call or database query, no LLM.
- `SYNTHESIZE_RESPONSE`: stronger model when nuance is actually needed.

This can reduce blended inference cost when state complexity differs, but that is a consequence rather than the main point.
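
One way to picture the per-state assignments listed above is a small routing table in plain Python. This is only a sketch: the `MODEL_ROUTES` mapping, the `model_for` and `llm_states` helpers, and the model name `stronger-synthesis-model` are hypothetical names introduced here for illustration, while the other two model names are reused from the earlier code. A value of `None` marks a state handled by deterministic code with no LLM call.

```python
from __future__ import annotations

# Hypothetical routing table: state name -> model name, or None for
# states handled by deterministic code (no LLM call).
MODEL_ROUTES: dict[str, str | None] = {
    "CLASSIFY_INTENT": "small-classifier-model",
    "EXTRACT_ENTITIES": "structured-extraction-model",
    "VALIDATE_SCHEMA": None,
    "EXECUTE_TOOL": None,
    "SYNTHESIZE_RESPONSE": "stronger-synthesis-model",  # assumed name
}


def model_for(state: str) -> str | None:
    """Return the model assigned to a state, failing loudly on unknown states."""
    if state not in MODEL_ROUTES:
        raise KeyError(f"no route defined for state {state!r}")
    return MODEL_ROUTES[state]


def llm_states() -> list[str]:
    """List the states that actually need a model call."""
    return [name for name, model in MODEL_ROUTES.items() if model is not None]
```

Keeping the table as data rather than burying model names inside each state function means a routing change is a one-line edit that can be reviewed and tested on its own.
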
The more important change is that every state gets an explicit owner, contract, trace, and failure mode.

Per-state model routing is only possible after the architecture has separated tasks into explicit states. Without that boundary, every call tends to inherit the cost and context of the most complex step.

Error Handling as Architecture

Linear chains make every model call another chance to produce an invalid intermediate result. If you assume independent 90% step reliability, three steps produce 72.9% end-to-end success (0.9 × 0.9 × 0.9 = 0.729). That number is illustrative, not a benchmark, but the multiplication is the point: reliability compounds downward when every step must succeed on the first attempt.

The architectural response is not "retry forever." It is deterministic validation, bounded retries, and a circuit breaker. If an output violates a schema, the validator returns a precise error and the generator gets another attempt. If the retry budget is exhausted, the state machine routes to a human queue or dead-letter path instead of silently returning bad output.

Error handling in an agent state machine is architectural, not linguistic. Retry policies, model fallbacks, and dead-letter queues are encoded as transition logic between states. These patterns are impossible to express in a monolithic prompt.

Implementation: Frameworks and Primitives

LangGraph encodes states as nodes and transitions as edges in a directed graph. Burr takes a similar approach with explicit state annotations. Both are built around the same core discipline: one function per state, typed inputs and outputs, deterministic transition logic. But the pattern does not require a framework: the code above is plain Python with Pydantic and async functions that run under `asyncio`.

The critical implementation detail is the contract between states. Each state must declare: what it receives (a Pydantic model), what it returns (a Pydantic model), and what transitions are valid from its output.
This is the Agentic Contract IDL (https://arizenai.com/agentic-contract-idl/). Without typed contracts, you have a state machine in name only.

Pro Tip: Use Pydantic `model_validator` for cross-field invariants. Production agents fail on semantic invariants: a high confidence score paired with an UNKNOWN intent, or a negative billing amount. Use `@model_validator(mode="after")` to encode domain constraints at every state boundary.

This Is Not a New Idea

Finite-state machines, workflow engines, circuit breakers, and durable execution are old ideas. So are orchestration frameworks such as LangGraph, Burr, Temporal, Step Functions, and XState. The point is not that state machines are new. The point is that stochastic model calls make old boundaries useful again.

When a program calls an LLM, the transition is no longer an ordinary function call. It is a probabilistic attempt to produce a value that must be checked before the rest of the system treats it as fact. That is where the classic vocabulary becomes practical: state, transition, guard, validator, retry budget, and escalation path.

Where State Machines Are Overkill

Not every LLM feature needs this structure. A single-shot summarizer, an exploratory chat surface, or a prototype with unclear requirements may be worse if forced into a graph too early. State machines are strongest when the system has tool calls, retries, partial failure, handoffs, permissions, or outputs that must satisfy a contract.

They also do not remove model uncertainty. They contain it. The model can still be fuzzy inside a state, but the boundary around that state should be explicit enough for the rest of the system to decide what happens next. If the boundary cannot be specified, adding boxes and arrows will only create the appearance of rigor.

The Design Question

The practical question is not "How can I improve my giant prompt?" It is "What is the smallest state this prompt should be responsible for?"
Applied recursively, that question produces a system where failures are loud and localized instead of silent and systemic. A monolithic prompt is an act of faith in a single model's ability to juggle unbounded complexity. A state machine is the older engineering discipline of giving each component a boundary, a contract, and a recovery path. The pattern is not new. The urgency is.
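
As a closing illustration, the validation, bounded-retry, and dead-letter pattern from the error-handling section can be sketched in a few lines. This is a minimal, dependency-free sketch rather than a reference implementation: it swaps Pydantic for a plain dataclass with a `__post_init__` check, and `ContractError`, `Classification`, `Outcome`, `run_with_retries`, the state names `route` and `dead_letter`, and the 0.9 threshold are all hypothetical names and values chosen for the example. The `generate` callable stands in for an LLM-backed state.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


class ContractError(ValueError):
    """Raised when a state's output violates its contract."""


@dataclass
class Classification:
    intent: str
    confidence: float

    def __post_init__(self) -> None:
        # Field-level check: confidence must be a probability.
        if not 0.0 <= self.confidence <= 1.0:
            raise ContractError(f"confidence out of range: {self.confidence}")
        # Cross-field invariant (assumed threshold): an unknown intent
        # should not come with a very high confidence score.
        if self.intent == "unknown" and self.confidence > 0.9:
            raise ContractError("high confidence paired with unknown intent")


@dataclass
class Outcome:
    next_state: str
    value: Classification | None = None
    errors: list[str] = field(default_factory=list)


def run_with_retries(
    generate: Callable[[list[str]], dict],
    max_attempts: int = 3,
) -> Outcome:
    """Call generate(), validate its output, and retry a bounded number of times."""
    errors: list[str] = []
    for _ in range(max_attempts):
        raw = generate(errors)  # earlier errors are passed back to the generator
        try:
            return Outcome(next_state="route", value=Classification(**raw))
        except (ContractError, TypeError) as exc:
            # TypeError covers missing, extra, or wrongly typed fields.
            errors.append(str(exc))
    # Retry budget exhausted: hand off instead of returning bad output.
    return Outcome(next_state="dead_letter", errors=errors)
```

Calling `run_with_retries` with a generator that keeps returning `{"intent": "unknown", "confidence": 0.99}` ends in the `dead_letter` state after `max_attempts` tries, with one error string recorded per attempt. The retry policy lives in ordinary code around the model call, which is the sense in which error handling becomes architecture.
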