Inside An AI Agent: Planning, Tool Use, Memory, Constraints, And Verification

A developer breaks down the five essential components of production-grade AI agents: planning, tool use, memory, constraints, and verification. The post argues that agent failures stem from workflow design around the model, not the model itself, and provides detailed code examples for each pillar.

Have you noticed how every demo of "an AI agent" looks impressive in the video and falls apart the moment you ask a sharper question?

The agent confidently does the wrong thing. It forgets what it just decided. It tries to call a tool that doesn't exist. It loops forever rewriting the same file. It calmly tells you the deployment succeeded when it didn't.

These aren't failures of the model. They're failures of the workflow around the model. Because that's all an agent really is: a software workflow where a language model can pick the next step and call tools. The "intelligence" sits in the prompt and the orchestration around it, not in some secret agent-flavoured fairy dust.

Strip the word "agent" away and you've got five pieces of plumbing: planning, tool use, memory, constraints, verification. Every production-grade agent stands or falls on those five.

This is a long walk through each one. Not the marketing version. The kind of detail you actually need before you ship something that talks to your database.

Before we touch any pillar individually, hold the whole loop in your head. A useful agent does roughly this on every turn: it asks the model to decide the next step; if that step is a final answer, it verifies the answer before returning it; if that step is a tool call, it checks permissions, runs the tool, and records the result.

That's it. Every framework (LangGraph, OpenAI Agents SDK, Claude Agent SDK, smolagents, whatever ships next month) is a different shape of the same loop with different defaults.
```ts
// agent-loop.ts
async function runAgent(goal: string, ctx: AgentContext) {
  const state = ctx.startState(goal);

  for (let step = 0; step < ctx.maxSteps; step++) {
    const decision = await ctx.model.decide(state);

    if (decision.kind === "final") {
      // Never return a final answer unverified; feed failures back as observations.
      const verified = await ctx.verifier.check(decision.output, state);
      if (verified.ok) return verified.output;
      state.observations.push(verified.feedback);
      continue;
    }

    if (decision.kind === "tool") {
      // The guard throws if the policy forbids this call.
      ctx.guard.assertAllowed(decision.toolName, decision.args);
      const result = await ctx.tools.run(decision.toolName, decision.args);
      state.observations.push(result);
      state.memory.maybeStore(decision, result);
    }
  }

  throw new Error("agent: max steps exceeded");
}
```

Look at that loop carefully. Every interesting bug in agent systems lives in one of those five method calls: `decide`, `check`, `assertAllowed`, `run`, `maybeStore`. The rest is bookkeeping.

Now let's pull each one apart.

The single biggest difference between a one-shot prompt and an agent is that an agent thinks about what to do before it does it.

A naive setup looks like this:

```ts
const reply = await model.complete(`User wants: ${goal}. Do it.`);
```

The model sees the goal, jumps straight to action, and you're trusting its first instinct on a task that might need five steps. For trivial tasks this is fine. For anything multi-step it falls apart: the model picks a tool, gets a confusing result, panics, and starts hallucinating progress.

A planning step changes the game:

```ts
// planning.ts
const plan = await model.complete({
  system: PLANNER_SYSTEM,
  user: `Goal: ${goal}\n\nProduce a short numbered plan. Each step must be either a tool call (name + args) or a direct answer. Do not execute anything yet.`,
});

// Nothing runs until the whole plan has been produced and parsed.
for (const step of parsePlan(plan)) {
  if (step.kind === "tool") {
    const result = await tools.run(step.name, step.args);
    state.observations.push(result);
  }
}
```

You're asking the model to commit to a plan before it touches anything. The plan becomes auditable.
You can show it to the user, log it, even let a different model review it. When something goes wrong, you have a record of what the agent intended versus what it did.

There are two dominant planning styles, and they have very different ergonomics.

Plan-then-execute is what we just wrote: the model produces a full plan up front, then a runner steps through it. Clean to debug, easy to log, hard to recover from when reality differs from the plan. The model didn't know the file would be 500MB. It didn't know the API would return a different schema. The plan is now wrong and the runner doesn't know how to adapt.

ReAct (reason + act) interleaves thinking and acting. On every turn the model writes a short rationale, picks one tool call, observes the result, then writes the next rationale. The model can adjust as it learns. You pay for it in tokens and latency (every turn pays the full context cost), but the agent stays honest about reality.

```python
# react_loop.py
def react_step(state):
    response = model.complete(
        system=REACT_SYSTEM,
        messages=state.messages,
    )
    thought, action = parse_react(response)
    state.log(thought)

    if action.kind == "final":
        return action.value

    # Run exactly one tool call, then hand the observation back to the model.
    observation = tools.run(action.name, action.args)
    state.messages.append({"role": "assistant", "content": response})
    state.messages.append({"role": "user", "content": f"Observation: {observation}"})
    return None
```

You're not picking one style forever. A lot of useful agents do plan-then-execute with a re-plan trigger: the model writes a plan, the runner executes until it hits a surprise, then the runner asks the model for a new plan from the current state. Cheaper than pure ReAct, more adaptive than pure plan-then-execute.

A common failure here is letting the model write plans that are too abstract.

- Understand the user's request.
- Gather relevant information.
- Provide a helpful response.

That plan is useless. It's the agent equivalent of a meeting agenda that says "discuss things".
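Plans like that can also be caught mechanically, before the runner touches anything. What follows is a minimal sketch, not a definitive implementation. The `PlanStep` shape, the `checkPlan` name, and the rule that a direct answer may only appear as the last step are all assumptions made for illustration; the post itself only shows `parsePlan` yielding steps with a `kind`, a `name`, and `args`. A tool that legitimately takes no arguments would need an exception to the empty-args rule.

```typescript
// Hypothetical step shape, loosely modelled on what parsePlan() appears to return.
type PlanStep =
  | { kind: "tool"; name: string; args: Record<string, unknown> }
  | { kind: "answer"; text: string };

type PlanCheck = { ok: true } | { ok: false; feedback: string };

// Reject plans that reference unknown tools, omit arguments,
// or give a "direct answer" anywhere except the final step.
function checkPlan(steps: PlanStep[], knownTools: Set<string>): PlanCheck {
  const problems: string[] = [];
  if (steps.length === 0) problems.push("plan has no steps");

  steps.forEach((step, i) => {
    const label = `step ${i + 1}`;
    if (step.kind === "answer") {
      if (i < steps.length - 1) {
        problems.push(`${label}: direct answer before the final step`);
      }
      return;
    }
    if (!knownTools.has(step.name)) {
      problems.push(`${label}: unknown tool "${step.name}"`);
    } else if (Object.keys(step.args).length === 0) {
      problems.push(`${label}: "${step.name}" has no arguments`);
    }
  });

  if (problems.length === 0) return { ok: true };
  return {
    ok: false,
    feedback: `Plan rejected: ${problems.join("; ")}. Rewrite it using only listed tools with concrete arguments.`,
  };
}
```

The `feedback` string is meant to go straight back to the planner as an observation, so the model gets a chance to rewrite the plan instead of the run failing outright.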
A useful plan names tools and arguments:

- Call `list_files` on `./src/api`.
- For each file matching `handler.go`, call `read_file`.
- Search for the string `"github.com/old/dep"` across results.
- If matches found, call `propose_patch` per file.
- Run `go test ./...` to verify nothing broke.
- Return summary with file count and test status.

You enforce this shape with the system prompt and with examples. A line like "Each step must reference a tool from the tool list and include concrete arguments. Steps that say 'understand' or 'analyze' will be rejected." does more work than people expect.

Without tools, an agent is a chatbot. With tools, it can do real things: read files, hit APIs, query databases, send messages, run commands. This is where every interesting capability comes from, and where the most dangerous failures happen.

A tool, mechanically, is three things: a name, a JSON schema for its arguments, and a function the runtime calls when the model picks it.

```ts
// tool-definition.ts
import { promises as fs } from "node:fs";
import { join } from "node:path";

// projectRoot is assumed to be defined elsewhere in the module.
const readFile = {
  name: "read_file",
  description:
    "Read a file from the project. Use to inspect code or config. " +
    "Do not use for binary files or anything larger than 256KB.",
  parameters: {
    type: "object",
    properties: {
      path: {
        type: "string",
        description: "Path relative to the project root. No leading slash.",
      },
    },
    required: ["path"],
  },
  handler: async ({ path }: { path: string }) => {
    // Only absolute paths are rejected here; "../" traversal is not checked.
    if (path.startsWith("/")) throw new Error("absolute paths not allowed");
    return await fs.readFile(join(projectRoot, path), "utf8");
  },
};
```

Four things are doing the heavy lifting here, and three of them are not code.

Models pick tools based on the description, not the name. A tool called `read_file` with a vague description gets called for "find the user's email" because the model thinks "well, the email is probably in a file somewhere." A description that says "Read a single file when you already know its path. Do not use this for searching. Use `grep_repo` for that." will save you a hundred wrong tool calls.
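To make the contrast concrete, here is a sketch of how the two descriptions might sit side by side. Everything beyond the tool names is an assumption for illustration: the `ToolSpec` type, the `grep_repo` parameters, and the exact wording of both descriptions are hypothetical, not taken from any particular framework.

```typescript
// Hypothetical spec shape; mirrors the name/description/parameters layout used above.
type ToolSpec = {
  name: string;
  description: string;
  parameters: {
    type: "object";
    properties: Record<string, { type: string; description: string }>;
    required: string[];
  };
};

const readFileSpec: ToolSpec = {
  name: "read_file",
  description:
    "Read a single file when you already know its path. " +
    "Do not use this for searching. Use grep_repo for that. " +
    "Returns the file contents as UTF-8 text.",
  parameters: {
    type: "object",
    properties: {
      path: { type: "string", description: "Path relative to the project root. No leading slash." },
    },
    required: ["path"],
  },
};

// grep_repo's parameters are invented here purely to show the pairing.
const grepRepoSpec: ToolSpec = {
  name: "grep_repo",
  description:
    "Search the project for a literal string when you do not know which file contains it. " +
    "Do not use this to read a whole file. Use read_file for that. " +
    "Returns matching lines with their file paths.",
  parameters: {
    type: "object",
    properties: {
      pattern: { type: "string", description: "Literal text to search for." },
    },
    required: ["pattern"],
  },
};
```

Each description states what the tool is for, what it is not for, and which sibling tool to use instead, so the model has a reason to choose one over the other.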
Treat tool descriptions like little spec sheets. List what the tool is for, what it isn't for, the shape of valid input, the shape of valid output, and any edge cases the model needs to know.

The JSON schema is your only contract. If the model invents an argument that isn't in the schema, your validator should reject the call before it reaches the handler. If a required field is missing, same. If a string is supposed to be one of `"read"`, `"write"`, `"delete"` and the model sends `"REMOVE"`, reject. Models are good but they freelance.

Rejecting bad tool calls and feeding the error back to the model is better than accepting them: the model learns mid-loop and adjusts.

```ts
import Ajv from "ajv";

const ajv = new Ajv();

function validateToolCall(call: ToolCall, schema: JSONSchema) {
  // ajv.validate returns false when the arguments do not match the schema.
  const result = ajv.validate(schema, call.args);
  if (!result) {
    return {
      ok: false,
      feedback: `Tool call rejected. Errors: ${ajv.errorsText()}. Retry with valid arguments.`,
    };
  }
  return { ok: true };
}
```

There's a category boundary that frameworks often blur but you shouldn't: read tools and write tools are different animals.

Read tools are cheap to retry. If `list_files` returns nothing, you call it again with different args. No harm done.

Write tools (`apply_patch`, `send_email`, `deploy_service`, `run_sql`) are expensive to undo, sometimes impossible. These deserve their own permission tier, their own logging, often their own approval step. We'll come back to this under constraints, but design tools knowing which side they sit on.

When you have three tools, a switch statement is fine. When you have thirty, you want a tool registry that does schema validation, logging, timeout enforcement, and side-effect classification in one place.
```python
# tool_bus.py
import asyncio


class ToolBus:
    def __init__(self):
        self.tools: dict[str, Tool] = {}
        # self.metrics and self.log are assumed to be wired up elsewhere.

    def register(self, tool: Tool):
        self.tools[tool.name] = tool

    async def run(self, name: str, args: dict, *, caller: AgentId):
        tool = self.tools.get(name)
        if tool is None:
            return ToolResult.error(f"unknown tool: {name}")

        valid, err = tool.validate(args)
        if not valid:
            return ToolResult.error(f"invalid args: {err}")

        async with self.metrics.time(tool.name):
            try:
                # Enforce the per-tool timeout so one slow call cannot stall the run.
                value = await asyncio.wait_for(tool.handler(args), timeout=tool.timeout_s)
                self.log(caller, tool.name, args, value)
                return ToolResult.ok(value)
            except asyncio.TimeoutError:
                return ToolResult.error(f"tool timed out after {tool.timeout_s}s")
            except Exception as exc:
                # Failures become results the model can read, not crashes.
                return ToolResult.error(f"tool failed: {exc}")
```

This single class is where you'll later add rate limits, audit trails, dry-run mode, and cost tracking. Build it on day one, even if it feels overkill: the alternative is bolting these concerns onto a sprawl of one-off tool handlers later, which is much worse.

"Memory" is the most overloaded word in the agent vocabulary. It usually means at least three different mechanisms stitched together, and conflating them is a leading cause of "why did the agent forget what I just told it?" bugs.

Working memory is the conversation so far, plus tool results, plus the system prompt. It lives in the model's context window and disappears the moment the request returns. It's bounded by the model's context length and your wallet.

Most "the agent forgot" complaints are about working memory. You ran two separate requests, the second one didn't include the relevant history, and the model genuinely has no idea what you're talking about. The fix isn't a vector database. The fix is including the history.

Inside a single agent run, the model often benefits from a place to "write notes to itself." This is just structured working memory: a list of observations, intermediate results, decisions and their reasoning.
```ts
// scratchpad.ts
type ScratchpadEntry =
  | { kind: "observation"; toolName: string; result: unknown }
  | { kind: "decision"; rationale: string; choice: string }
  | { kind: "note"; text: string };

class Scratchpad {
  entries: ScratchpadEntry[] = [];

  add(entry: ScratchpadEntry) {
    this.entries.push(entry);
  }

  // Render only as much recent history as fits in the token budget.
  render(maxTokens: number): string {
    return formatRecent(this.entries, maxTokens);
  }
}
```

The scratchpad is what you feed back into the model on the next turn. It's not magic. It's a structured replay of the agent's own work. The trick is keeping it short enough to fit. A scratchpad that just appends forever is how agents lose their minds on long tasks.

The third mechanism is what people usually mean when they say "memory": a store of facts the agent can recall in future conversations. User preferences, project context, the result of expensive computations, lessons from past failures.

There are three popular shapes:

| Shape | Looks like | Good for | Bad at |
|---|---|---|---|
| Key/value | Redis or a flat file | Stable facts (user role, preferred language) | Anything fuzzy or semantic |
| Vector store | Pinecone, pgvector, Chroma | Semantic recall over notes/docs | Exact matches, freshness, contradictions |
| File-based | A `memory/` directory of markdown files | Auditable, editable, structured | Scale beyond a few thousand entries |

File-based memory is underrated. Claude Code and a few other agent tools use exactly this: a directory of markdown files, indexed by a small `MEMORY.md`. The agent reads, writes, and edits files. There's nothing to migrate, you can `git diff` it, and the user can delete a memory by deleting a file. It scales worse than vectors but it's vastly easier to reason about, and the failure mode is "we couldn't find the right file" rather than "we semantically retrieved the wrong fact and the agent confidently used it."

The cleanest design decision in this whole space: memory is just two more tools. `recall(query)` and `remember(fact)`.
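As a rough sketch of what that could look like, the block below exposes `recall` and `remember` as ordinary tool definitions over a small store. The `MemoryStore` class, the `memoryTools` helper, the in-memory array, and the keyword matching are all assumptions chosen to keep the example self-contained; a real store might be files, key/value, or vectors. Each fact is stamped with the date it was stored.

```typescript
// Hypothetical entry shape: the fact plus the ISO date it was stored.
type MemoryEntry = { fact: string; storedOn: string };

class MemoryStore {
  private entries: MemoryEntry[] = [];

  remember(fact: string, storedOn: string): MemoryEntry {
    const entry: MemoryEntry = { fact, storedOn };
    this.entries.push(entry);
    return entry;
  }

  // Naive recall: every word of the query must appear in the fact.
  recall(query: string): MemoryEntry[] {
    const words = query.toLowerCase().split(/\s+/).filter(Boolean);
    if (words.length === 0) return [];
    return this.entries.filter((e) => {
      const text = e.fact.toLowerCase();
      return words.every((w) => text.includes(w));
    });
  }
}

// Memory exposed as two plain tools, registered like any other tool.
function memoryTools(store: MemoryStore, today: () => string) {
  return [
    {
      name: "recall",
      description:
        "Look up stored facts matching a query. Returns each fact with the date it was stored.",
      parameters: {
        type: "object",
        properties: { query: { type: "string" } },
        required: ["query"],
      },
      handler: async ({ query }: { query: string }) => store.recall(query),
    },
    {
      name: "remember",
      description: "Store one durable fact for future conversations.",
      parameters: {
        type: "object",
        properties: { fact: { type: "string" } },
        required: ["fact"],
      },
      handler: async ({ fact }: { fact: string }) => store.remember(fact, today()),
    },
  ];
}
```

Because both operations go through the same tool path as everything else, every recall and every write shows up in the tool log with its arguments and result.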
The model decides when to recall and when to remember, the same way it decides when to read a file or send a message.

The alternative, a background process that magically injects "relevant memories" into every prompt, sounds convenient and is actually a nightmare to debug. You'll spend more time explaining why the agent randomly mentioned the user's old API key than you saved by automating retrieval.

When memory is a tool, you can ask the agent to show its work. You ask "Why did you think the user wanted the Go example?" and the agent says "I called `recall("user language preference")` and got back: 'Prefers Go for backend examples (2026-04-02).'" That's an answerable question, in a way "the embeddings retrieved it" never is.

Stale memory is worse than no memory. An agent that remembers your team uses Postgres when you switched to MySQL six months ago is going to produce confidently wrong advice forever.

A few rules that have aged well:

`npm install` here: pnpm is the package manager (legacy `package-lock.json` exists but is unused)

Agents in demos are unconstrained. They have full filesystem access, can call any tool, run any command, spend any number of tokens, take any number of turns. The demo works because the demonstrator is watching every step.

Production agents are not watched. The constraints are what let you sleep.

The pattern I've seen survive contact with reality is treating permissions as a separate first-class object. The agent core calls `guard.assertAllowed(toolName, args)` before every tool call, and the guard says yes or no based on a policy that you can read in one place.

permissions.ts type Policy = { allowedTools: string ; pathAllowlist: RegExp ; pathDenylist: RegExp ; requireApprovalFor: string ; maxToolCallsPerRun: number; maxTokensPerRun: number; }; class Guard { constructor private policy: Policy, private state: RunState {} assertAllowed name: string, args: Record