Inside An AI Agent: Planning, Tool Use, Memory, Constraints, And Verification

A developer breaks down the five essential components of production-grade AI agents: planning, tool use, memory, constraints, and verification. The post argues that agent failures stem from workflow design around the model, not the model itself, and provides detailed code examples for each pillar.

Have you noticed how every demo of "an AI agent" looks impressive in the video and falls apart the moment you ask a sharper question? The agent confidently does the wrong thing. It forgets what it just decided. It tries to call a tool that doesn't exist. It loops forever rewriting the same file. It calmly tells you the deployment succeeded when it didn't.

These aren't failures of the model. They're failures of the workflow around the model. Because that's all an agent really is: a software workflow where a language model can pick the next step and call tools. The "intelligence" sits in the prompt and the orchestration around it, not in some secret agent-flavoured fairy dust.

Strip the word "agent" away and you've got five pieces of plumbing: planning, tool use, memory, constraints, verification. Every production-grade agent stands or falls on those five.

This is a long walk through each one. Not the marketing version. The kind of detail you actually need before you ship something that talks to your database.

Before we touch any pillar individually, hold the whole loop in your head. A useful agent does roughly this on every turn: the model decides the next step; if that step is a tool call, a guard checks it, the tool runs, and the result is recorded as an observation and possibly stored in memory; if it is a final answer, a verifier checks it before it is returned. That's it. Every framework (LangGraph, OpenAI Agents SDK, Claude Agent SDK, smolagents, whatever ships next month) is a different shape of the same loop with different defaults.

```typescript
// agent-loop.ts
async function runAgent(goal: string, ctx: AgentContext) {
  const state = ctx.startState(goal);

  for (let step = 0; step < ctx.maxSteps; step++) {
    // Planning: the model picks the next step.
    const decision = await ctx.model.decide(state);

    if (decision.kind === "final") {
      // Verification: never return an unchecked answer.
      const verified = await ctx.verifier.check(decision.output, state);
      if (verified.ok) return verified.output;
      state.observations.push(verified.feedback);
      continue;
    }

    if (decision.kind === "tool") {
      // Constraints, then tool use, then memory.
      ctx.guard.assertAllowed(decision.toolName, decision.args);
      const result = await ctx.tools.run(decision.toolName, decision.args);
      state.observations.push(result);
      state.memory.maybeStore(decision, result);
    }
  }

  throw new Error("agent: max steps exceeded");
}
```

Look at that loop carefully. Every interesting bug in agent systems lives in one of those five method calls: `decide`, `check`, `assertAllowed`, `run`, `maybeStore`. The rest is bookkeeping. Now let's pull each one apart.

The single biggest difference between a one-shot prompt and an agent is that an agent thinks about what to do before it does it. A naive setup looks like this:

```typescript
const reply = await model.complete(`User wants: ${goal}. Do it.`);
```

The model sees the goal, jumps straight to action, and you're trusting its first instinct on a task that might need five steps. For trivial tasks this is fine. For anything multi-step it falls apart: the model picks a tool, gets a confusing result, panics, and starts hallucinating progress.

A planning step changes the game:

```typescript
// planning.ts
const plan = await model.complete({
  system: PLANNER_SYSTEM,
  user: `Goal: ${goal}\n\nProduce a short numbered plan. Each step must be either a tool call (name + args) or a direct answer. Do not execute anything yet.`,
});

// Only after the plan exists does the runner touch any tool.
for (const step of parsePlan(plan)) {
  if (step.kind === "tool") {
    const result = await tools.run(step.name, step.args);
    state.observations.push(result);
  }
}
```

You're asking the model to commit to a plan before it touches anything. The plan becomes auditable.
You can show it to the user, log it, even let a different model review it. When something goes wrong, you have a record of what the agent intended versus what it did.

There are two dominant planning styles, and they have very different ergonomics.

Plan-then-execute is what we just wrote: the model produces a full plan up front, then a runner steps through it. Clean to debug, easy to log, hard to recover from when reality differs from the plan. The model didn't know the file would be 500MB. It didn't know the API would return a different schema. The plan is now wrong and the runner doesn't know how to adapt.

ReAct (reason + act) interleaves thinking and acting. On every turn the model writes a short rationale, picks one tool call, observes the result, then writes the next rationale. The model can adjust as it learns. You pay for it in tokens and latency (every turn pays the full context cost), but the agent stays honest about reality.

```python
# react_loop.py
def react_step(state):
    response = model.complete(
        system=REACT_SYSTEM,
        messages=state.messages,
    )
    thought, action = parse_react(response)
    state.log(thought)

    if action.kind == "final":
        return action.value

    # Run the chosen tool, then feed the observation back as the next message.
    observation = tools.run(action.name, action.args)
    state.messages.append({"role": "assistant", "content": response})
    state.messages.append({"role": "user", "content": f"Observation: {observation}"})
    return None
```

You're not picking one style forever. A lot of useful agents do plan-then-execute with a re-plan trigger: the model writes a plan, the runner executes until it hits a surprise, then the runner asks the model for a new plan from the current state. Cheaper than pure ReAct, more adaptive than pure plan-then-execute.

A common failure here is letting the model write plans that are too abstract.

- Understand the user's request.
- Gather relevant information.
- Provide a helpful response.

That plan is useless. It's the agent equivalent of a meeting agenda that says "discuss things".
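
One way to catch plans like this mechanically, before anything runs, is to reject any step that does not name a registered tool. The sketch below is an illustration, not part of any framework: `PlanStep`, `validatePlan`, and the tool names are hypothetical, it presumes the plan has already been parsed into structured steps, and it assumes that every tool takes at least one argument and that a direct answer only makes sense as the last step.

```typescript
// Hypothetical plan-step shape: either a tool call or a direct answer.
type PlanStep =
  | { kind: "tool"; name: string; args: Record<string, unknown> }
  | { kind: "answer"; text: string };

// Returns one readable problem per bad step; an empty array means the plan is usable.
function validatePlan(steps: PlanStep[], knownTools: Set<string>): string[] {
  const problems: string[] = [];
  if (steps.length === 0) problems.push("plan is empty");

  steps.forEach((step, i) => {
    const label = `step ${i + 1}`;
    if (step.kind === "tool") {
      if (!knownTools.has(step.name)) {
        problems.push(`${label}: unknown tool "${step.name}"`);
      } else if (Object.keys(step.args).length === 0) {
        // Assumption: every registered tool needs at least one argument.
        problems.push(`${label}: tool "${step.name}" has no concrete arguments`);
      }
    } else if (i !== steps.length - 1) {
      // Assumption: a direct answer is only meaningful as the final step.
      problems.push(`${label}: direct answer must be the final step`);
    }
  });

  return problems;
}

// Illustrative tool list and a structured version of the vague plan above.
const tools = new Set(["list_files", "read_file", "propose_patch"]);
const vaguePlan: PlanStep[] = [
  { kind: "answer", text: "Understand the user's request." },
  { kind: "tool", name: "gather_information", args: {} },
  { kind: "answer", text: "Provide a helpful response." },
];

console.log(validatePlan(vaguePlan, tools));
```

The returned problems can be fed straight back to the planner as feedback, so a rejected plan becomes a prompt for a better one rather than a dead end.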
A useful plan names tools and arguments:

- Call `list_files` on `./src/api`.
- For each file matching `handler.go`, call `read_file`.
- Search for the string "github.com/old/dep" across results.
- If matches found, call `propose_patch` per file.
- Run `go test ./...` to verify nothing broke.
- Return summary with file count and test status.

You enforce this shape with the system prompt and with examples. A line like "Each step must reference a tool from the tool list and include concrete arguments. Steps that say 'understand' or 'analyze' will be rejected." does more work than people expect.

Without tools, an agent is a chatbot. With tools, it can do real things: read files, hit APIs, query databases, send messages, run commands. This is where every interesting capability comes from, and where the most dangerous failures happen.

A tool, mechanically, is three things: a name, a JSON schema for its arguments, and a function the runtime calls when the model picks it.

```typescript
// tool-definition.ts
import { promises as fs } from "node:fs";
import { join } from "node:path";

// projectRoot is assumed to be defined elsewhere in the runtime.
const readFile = {
  name: "read_file",
  description:
    "Read a file from the project. Use to inspect code or config. " +
    "Do not use for binary files or anything larger than 256KB.",
  parameters: {
    type: "object",
    properties: {
      path: {
        type: "string",
        description: "Path relative to the project root. No leading slash.",
      },
    },
    required: ["path"],
  },
  handler: async ({ path }: { path: string }) => {
    if (path.startsWith("/")) throw new Error("absolute paths not allowed");
    return await fs.readFile(join(projectRoot, path), "utf8");
  },
};
```

Four things are doing the heavy lifting here, and three of them are not code.

Models pick tools based on the description, not the name. A tool called `read_file` with a vague description gets called for "find the user's email" because the model thinks "well, the email is probably in a file somewhere." A description that says "Read a single file when you already know its path. Do not use this for searching. Use `grep_repo` for that." will save you a hundred wrong tool calls.
Treat tool descriptions like little spec sheets. List what the tool is for, what it isn't for, the shape of valid input, the shape of valid output, and any edge cases the model needs to know.

The JSON schema is your only contract. If the model invents an argument that isn't in the schema, your validator should reject the call before it reaches the handler. If a required field is missing, same. If a string is supposed to be one of `"read"`, `"write"`, `"delete"` and the model sends `"REMOVE"`, reject. Models are good but they freelance.

Rejecting bad tool calls and feeding the error back to the model is better than accepting them: the model learns mid-loop and adjusts.

```typescript
// ajv is an Ajv instance created elsewhere.
function validateToolCall(call: ToolCall, schema: JSONSchema) {
  const result = ajv.validate(schema, call.args);
  if (!result) {
    return {
      ok: false,
      feedback: `Tool call rejected. Errors: ${ajv.errorsText()}. Retry with valid arguments.`,
    };
  }
  return { ok: true };
}
```

There's a category boundary that frameworks often blur but you shouldn't: read tools and write tools are different animals.

Read tools are cheap to retry. If `list_files` returns nothing, you call it again with different args. No harm done.

Write tools (`apply_patch`, `send_email`, `deploy_service`, `run_sql`) are expensive to undo, sometimes impossible. These deserve their own permission tier, their own logging, often their own approval step. We'll come back to this under constraints, but design tools knowing which side they sit on.

When you have three tools, a switch statement is fine. When you have thirty, you want a tool registry that does schema validation, logging, timeout enforcement, and side-effect classification in one place.

```python
# tool_bus.py
import asyncio


class ToolBus:
    # self.metrics and self.log are assumed to be provided elsewhere.
    def __init__(self):
        self.tools: dict[str, Tool] = {}

    def register(self, tool: Tool):
        self.tools[tool.name] = tool

    async def run(self, name: str, args: dict, *, caller: AgentId):
        tool = self.tools.get(name)
        if tool is None:
            return ToolResult.error(f"unknown tool: {name}")

        valid, err = tool.validate(args)
        if not valid:
            return ToolResult.error(f"invalid args: {err}")

        async with self.metrics.time(tool.name):
            try:
                # Enforce the per-tool timeout around the handler call.
                value = await asyncio.wait_for(tool.handler(args), timeout=tool.timeout_s)
                self.log(caller, tool.name, args, value)
                return ToolResult.ok(value)
            except asyncio.TimeoutError:
                return ToolResult.error(f"tool timed out after {tool.timeout_s}s")
            except Exception as exc:
                return ToolResult.error(f"tool failed: {exc}")
```

This single class is where you'll later add rate limits, audit trails, dry-run mode, and cost tracking. Build it on day one, even if it feels overkill: the alternative is bolting these concerns onto a sprawl of one-off tool handlers later, which is much worse.

"Memory" is the most overloaded word in the agent vocabulary. It usually means at least three different mechanisms stitched together, and conflating them is a leading cause of "why did the agent forget what I just told it?" bugs.

Working memory is the conversation so far, plus tool results, plus the system prompt. It lives in the model's context window and disappears the moment the request returns. It's bounded by the model's context length and your wallet.

Most "the agent forgot" complaints are about working memory. You ran two separate requests, the second one didn't include the relevant history, and the model genuinely has no idea what you're talking about. The fix isn't a vector database. The fix is including the history.

Inside a single agent run, the model often benefits from a place to "write notes to itself." This scratchpad is just structured working memory: a list of observations, intermediate results, decisions and their reasoning.

```typescript
// scratchpad.ts
type ScratchpadEntry =
  | { kind: "observation"; toolName: string; result: unknown }
  | { kind: "decision"; rationale: string; choice: string }
  | { kind: "note"; text: string };

class Scratchpad {
  entries: ScratchpadEntry[] = [];

  add(entry: ScratchpadEntry) {
    this.entries.push(entry);
  }

  // formatRecent is assumed to keep the newest entries that fit in maxTokens.
  render(maxTokens: number): string {
    return formatRecent(this.entries, maxTokens);
  }
}
```

The scratchpad is what you feed back into the model on the next turn. It's not magic. It's a structured replay of the agent's own work. The trick is keeping it short enough to fit. A scratchpad that just appends forever is how agents lose their minds on long tasks.

The third mechanism is what people usually mean when they say "memory": a store of facts the agent can recall in future conversations. User preferences, project context, the result of expensive computations, lessons from past failures.

There are three popular shapes:

| Shape | Looks like | Good for | Bad at |
|---|---|---|---|
| Key/value | Redis or a flat file | Stable facts (user role, preferred language) | Anything fuzzy or semantic |
| Vector store | Pinecone, pgvector, Chroma | Semantic recall over notes/docs | Exact matches, freshness, contradictions |
| File-based | A `memory/` directory of markdown files | Auditable, editable, structured | Scale beyond a few thousand entries |

File-based memory is underrated. Claude Code and a few other agent tools use exactly this: a directory of markdown files, indexed by a small `MEMORY.md`. The agent reads, writes, and edits files. There's nothing to migrate, you can `git diff` it, and the user can delete a memory by deleting a file. It scales worse than vectors but it's vastly easier to reason about, and the failure mode is "we couldn't find the right file" rather than "we semantically retrieved the wrong fact and the agent confidently used it."

The cleanest design decision in this whole space: memory is just two more tools, `recall(query)` and `remember(fact)`.
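
To make the "two more tools" idea concrete, here is a minimal sketch that reuses the name/description/parameters/handler shape of the tool definition shown earlier. Everything specific in it is an assumption for illustration: the `MemoryEntry` type, the in-memory array, and the naive keyword matching stand in for whichever of the three storage shapes you actually pick.

```typescript
// Hypothetical memory record; the date lets the agent (and you) judge staleness.
type MemoryEntry = { fact: string; savedAt: string };

// Illustrative in-memory store; swap for key/value, vector, or file-based storage.
const memoryStore: MemoryEntry[] = [];

const remember = {
  name: "remember",
  description:
    "Save one durable fact for future conversations. " +
    "Do not use for secrets or for notes that only matter in this run.",
  parameters: {
    type: "object",
    properties: {
      fact: { type: "string", description: "One self-contained sentence." },
    },
    required: ["fact"],
  },
  handler: ({ fact }: { fact: string }): string => {
    // Store the fact with today's date (YYYY-MM-DD).
    memoryStore.push({ fact, savedAt: new Date().toISOString().slice(0, 10) });
    return `stored (${memoryStore.length} total)`;
  },
};

const recall = {
  name: "recall",
  description:
    "Look up previously saved facts by keyword. " +
    "Returns matching facts with the date each was saved.",
  parameters: {
    type: "object",
    properties: {
      query: { type: "string", description: "A few keywords to search for." },
    },
    required: ["query"],
  },
  handler: ({ query }: { query: string }): MemoryEntry[] => {
    // Naive case-insensitive match: every query word must appear in the fact.
    const words = query.toLowerCase().split(/\s+/).filter(Boolean);
    if (words.length === 0) return [];
    return memoryStore.filter((m) =>
      words.every((w) => m.fact.toLowerCase().includes(w)),
    );
  },
};

remember.handler({ fact: "Prefers Go for backend examples" });
console.log(recall.handler({ query: "go backend" }));
```

Because both operations are ordinary tool calls, they show up in the same logs and pass through the same validation as every other tool.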
The model decides when to recall and when to remember, the same way it decides when to read a file or send a message.

The alternative, a background process that magically injects "relevant memories" into every prompt, sounds convenient and is actually a nightmare to debug. You'll spend more time explaining why the agent randomly mentioned the user's old API key than you saved by automating retrieval.

When memory is a tool, you can ask the agent to show its work. You ask "Why did you think the user wanted the Go example?" and the agent says "I called `recall("user language preference")` and got back: 'Prefers Go for backend examples (2026-04-02).'" That's an answerable question, in a way "the embeddings retrieved it" never is.

Stale memory is worse than no memory. An agent that remembers your team uses Postgres when you switched to MySQL six months ago is going to produce confidently wrong advice forever.

A few rules that have aged well:

npm install here: pnpm is the package manager (legacy package-lock.json exists but is unused)

Agents in demos are unconstrained. They have full filesystem access, can call any tool, run any command, spend any number of tokens, take any number of turns. The demo works because the demonstrator is watching every step.

Production agents are not watched. The constraints are what let you sleep.

The pattern I've seen survive contact with reality is treating permissions as a separate first-class object. The agent core calls `guard.assertAllowed(toolName, args)` before every tool call, and the guard says yes or no based on a policy that you can read in one place.

```typescript
// permissions.ts
type Policy = {
  allowedTools: string[];
  pathAllowlist: RegExp[];
  pathDenylist: RegExp[];
  requireApprovalFor: string[];
  maxToolCallsPerRun: number;
  maxTokensPerRun: number;
};

class Guard {
  constructor(private policy: Policy, private state: RunState) {}

  assertAllowed(name: string, args: Record<string, unknown>) {
    if (!this.policy.allowedTools.includes(name)) {
      throw new GuardError(`tool not in allowlist: ${name}`);
    }
    if (this.state.toolCalls >= this.policy.maxToolCallsPerRun) {
      throw new GuardError("max tool calls exceeded");
    }
    if (typeof args.path === "string") {
      const path = args.path;
      // The denylist wins over the allowlist.
      if (this.policy.pathDenylist.some((re) => re.test(path))) {
        throw new GuardError(`path on denylist: ${path}`);
      }
      if (!this.policy.pathAllowlist.some((re) => re.test(path))) {
        throw new GuardError(`path not in allowlist: ${path}`);
      }
    }
    if (this.policy.requireApprovalFor.includes(name)) {
      throw new ApprovalRequired(name, args);
    }
  }
}
```

This is unglamorous code and it does more for your safety story than any clever prompt.

Almost every long-lived agent system converges on the same four constraints:

Tool allowlist. The agent can call only these named tools. Anything else is rejected before the handler runs. This stops "I'll just write a `delete_everything` tool real quick" patterns and tightens the surface area enormously.

Iteration budget. A hard cap on tool calls per run. Agents will absolutely loop forever if you let them: re-reading the same file, retrying a failing API, "thinking more about it." Pick a number based on your task complexity and bail when you hit it. Better to fail loudly than to silently rack up an API bill.

Token/cost budget. Independent of iterations, count tokens. Long tool outputs eat budget faster than you'd expect. When you hit the cap, the agent stops and reports.

Approval gate for side effects. Any tool that changes the world outside the agent's sandbox (sends an email, hits prod, files a PR, charges a card) goes through a separate `ApprovalRequired` path.
The agent proposes; a human or a stricter automated check disposes.

The pattern for the fourth one is worth lingering on, because it's where teams over-engineer the most. You don't need a fancy approval workflow. You need a way to pause the agent, surface what it wants to do, and let a human respond.

```python
# approval.py
class Approval:
    async def request(self, action: str, args: dict, *, justification: str):
        # Record the pending request so the decision can be looked up later.
        ticket = await self.store.create(
            action=action,
            args=args,
            justification=justification,
            status="pending",
        )
        await self.notifier.send(
            channel="approvals",
            text=f"Agent wants to {action} with {args}. Why: {justification}",
            ticket_id=ticket.id,
        )
        return ticket
```

That's the whole approval system. Ticket in a database, message in Slack (or email, or wherever your humans live), a way to look up the result. The agent calls it, the run pauses (or returns a "waiting for approval" status), a human clicks yes or no. You can add SLAs, escalation, batching later, but the simple shape ships in a day and covers 90% of what you need.

The model is confident by default. It will say the code works. It will say the deployment succeeded. It will say the test passed. None of those statements should be trusted on their own.

Verification is the difference between an agent that claims to have done the work and an agent that can prove it did. Every serious production agent has a verification step somewhere: sometimes it's external (run the tests), sometimes it's another agent (an independent reviewer), sometimes it's the same agent checking its own work against an explicit rubric.

For coding agents, this is almost always running a real tool against the real artifact.

```typescript
// verify-code-change.ts
async function verifyCodeChange(change: ProposedChange) {
  await applyToWorktree(change);

  // Cheapest check first; stop at the first failure.
  const typeCheck = await run("tsc", ["--noEmit"]);
  if (typeCheck.exitCode !== 0) return failure("type check failed", typeCheck);

  const lint = await run("npm", ["run", "lint"]);
  if (lint.exitCode !== 0) return failure("lint failed", lint);

  const tests = await run("npm", ["test", "--", "--run"]);
  if (tests.exitCode !== 0) return failure("tests failed", tests);

  return success();
}
```

This is unglamorous and it's the single best lever you have for agent reliability. If the agent claims it fixed a bug, run the test. If the agent claims it refactored something safely, run the typechecker. The model is allowed to lie. The compiler isn't.

For non-coding agents, the analog is whatever your domain has: a schema check on the JSON output, a regex on a date format, an actual API call to confirm the resource exists, a SQL query to verify the row was inserted.

The simplest self-critique is one extra model call: "Here is what you produced. Here is the rubric. List every place the output fails the rubric. If none, say 'OK'."

```python
# self_critique.py
async def critique(output: str, rubric: str) -> CritiqueResult:
    response = await model.complete(
        system=CRITIC_SYSTEM,
        user=f"Output:\n{output}\n\nRubric:\n{rubric}\n\nList violations or say OK.",
    )
    if response.strip() == "OK":
        return CritiqueResult.ok()
    return CritiqueResult.violations(parse(response))
```

This works better than you'd expect, and worse than people pretend. It catches the obvious stuff: missing fields, factual contradictions, style violations the rubric explicitly names. It misses subtle reasoning errors because the same model that made the error is now grading it. Useful, not sufficient.

When the cost is justified, the pattern is to have a different agent (or at least a different prompt with a different role) review the output. The reviewer doesn't see the original chain of thought, only the final artifact and the original goal.
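
A minimal sketch of that separation, under stated assumptions: `AgentRun`, `Verdict`, `REVIEWER_SYSTEM`, and the JSON verdict format are all hypothetical names chosen for illustration, and the actual model call is left out so the two pure pieces (building the reviewer's input and parsing its reply) can be read on their own.

```typescript
// Hypothetical record of a finished run; field names are illustrative.
type AgentRun = { goal: string; chainOfThought: string[]; finalArtifact: string };
type Verdict = { approved: boolean; issues: string[] };

const REVIEWER_SYSTEM =
  "You are a strict reviewer. Judge the artifact against the goal. " +
  'Reply with JSON only: {"approved": boolean, "issues": string[]}.';

// Build the reviewer's input from the goal and the artifact only.
// The producer's chain of thought is deliberately left out.
function buildReviewRequest(run: AgentRun): { system: string; user: string } {
  return {
    system: REVIEWER_SYSTEM,
    user: `Goal:\n${run.goal}\n\nArtifact:\n${run.finalArtifact}`,
  };
}

// Parse the reviewer's reply; anything malformed counts as a rejection.
function parseVerdict(raw: string): Verdict {
  try {
    const data: unknown = JSON.parse(raw);
    if (typeof data === "object" && data !== null) {
      const { approved, issues } = data as { approved?: unknown; issues?: unknown };
      if (typeof approved === "boolean" && Array.isArray(issues)) {
        return { approved, issues: issues.map(String) };
      }
    }
  } catch {
    // Not JSON at all: fall through to the rejection below.
  }
  return { approved: false, issues: ["reviewer reply was not valid verdict JSON"] };
}

const request = buildReviewRequest({
  goal: "Fix the login bug",
  chainOfThought: ["I think the session cookie is the problem"],
  finalArtifact: "diff --git a/login.ts b/login.ts",
});
console.log(request.user);
console.log(parseVerdict('{"approved": false, "issues": ["no test added"]}'));
```

Treating an unparseable reply as a rejection is a design choice in this sketch: a reviewer that cannot produce a structured verdict should not be able to wave output through by accident.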
This kind of independent review is much closer to a human code review and catches a different class of mistake.

This is also where the "judge agent" pattern lives. The judge has a strict rubric, refuses to be polite, and returns a structured verdict. You don't ship the output until the judge approves.

The most underrated move is making verification part of the loop, not a final gate. If `verifyCodeChange` fails with "test X failed: expected 200, got 500", you feed that observation back to the model and let it try again. Same with critique violations. Same with judge rejections.

```typescript
// verify-and-retry.ts (fragment: runs inside an async function)
for (let attempt = 0; attempt < ctx.maxVerifyAttempts; attempt++) {
  const output = await ctx.agent.produce();
  const verdict = await ctx.verifier.check(output);
  if (verdict.ok) return output;

  // Feed the failure back so the next attempt can see what went wrong.
  ctx.state.observations.push({
    kind: "verification_failed",
    detail: verdict.feedback,
  });
}
throw new Error("verification failed after retries");
```

The model that ignored the test on attempt 1 sees the actual error on attempt 2 and usually fixes it. That's not magic. It's just letting the model see what went wrong, which the unconstrained version of itself never bothered to check.

If you skim the pillars individually they look like five separate features. They're not. They're five views of the same loop, and the interesting design choices are about how they interact.

A plan that ignores constraints is a plan the agent can't execute. A tool registry without verification produces actions you can't audit. Memory without hygiene corrupts every future plan. Verification without retries is a wall; verification inside a loop is a teacher. Constraints without observability are a black box that fails silently in production.

The teams whose agents work in production have all stopped chasing "smarter prompts" and started shipping plumbing. Better tool descriptions. Tighter schemas. A real permissions object. An honest budget. A verifier that actually runs the tests. A memory tier that the user can grep. None of that is sexy.
It's all just software engineering. Which is exactly the point: once you stop expecting magic, the work becomes legible, the failures become diagnosable, and the agent stops being a mysterious black box and starts being a system you maintain like any other. The model gets to be the clever part. Everything around it is your job, and that's where the difference between a demo and a product really lives.