{"slug": "inside-an-ai-agent-planning-tool-use-memory-constraints-and-verification", "title": "Inside An AI Agent: Planning, Tool Use, Memory, Constraints, And Verification", "summary": "A developer breaks down the five essential components of production-grade AI agents: planning, tool use, memory, constraints, and verification. The post argues that agent failures stem from workflow design around the model, not the model itself, and provides detailed code examples for each pillar.", "body_md": "Have you noticed how every demo of \"an AI agent\" looks impressive in the video and falls apart the moment you ask a sharper question?\n\nThe agent confidently does the wrong thing. It forgets what it just decided. It tries to call a tool that doesn't exist. It loops forever rewriting the same file. It calmly tells you the deployment succeeded when it didn't.\n\nThese aren't failures of the model. They're failures of the workflow around the model.\n\nBecause that's all an agent really is: a software workflow where a language model can pick the next step and call tools. The \"intelligence\" sits in the prompt and the orchestration around it, not in some secret agent-flavoured fairy dust. Strip the word \"agent\" away and you've got five pieces of plumbing: planning, tool use, memory, constraints, verification. Every production-grade agent stands or falls on those five.\n\nThis is a long walk through each one. Not the marketing version. The kind of detail you actually need before you ship something that talks to your database.\n\nBefore we touch any pillar individually, hold the whole loop in your head.\n\nA useful agent does roughly this on every turn:\n\nThat's it. 
Every framework (LangGraph, OpenAI Agents SDK, Claude Agent SDK, smolagents, whatever ships next month) is a different shape of the same loop with different defaults.\n\n`agent-loop.ts`\n\n``` js\nasync function runAgent(goal: string, ctx: AgentContext) {\n  const state = ctx.startState(goal);\n\n  for (let step = 0; step < ctx.maxSteps; step++) {\n    const decision = await ctx.model.decide(state);\n\n    if (decision.kind === \"final\") {\n      const verified = await ctx.verifier.check(decision.output, state);\n      if (verified.ok) return verified.output;\n      state.observations.push(verified.feedback);\n      continue;\n    }\n\n    if (decision.kind === \"tool\") {\n      ctx.guard.assertAllowed(decision.toolName, decision.args);\n      const result = await ctx.tools.run(decision.toolName, decision.args);\n      state.observations.push(result);\n      state.memory.maybeStore(decision, result);\n    }\n  }\n\n  throw new Error(\"agent: max steps exceeded\");\n}\n```\n\nLook at that loop carefully. Every interesting bug in agent systems lives in one of those five method calls: `decide`, `check`, `assertAllowed`, `run`, `maybeStore`. The rest is bookkeeping.\n\nNow let's pull each one apart.\n\nThe single biggest difference between a one-shot prompt and an agent is that an agent thinks about *what to do* before it does it.\n\nA naive setup looks like this:\n\n``` js\nconst reply = await model.complete(`User wants: ${goal}. Do it.`);\n```\n\nThe model sees the goal, jumps straight to action, and you're trusting its first instinct on a task that might need five steps. For trivial tasks this is fine. For anything multi-step it falls apart: the model picks a tool, gets a confusing result, panics, and starts hallucinating progress.\n\nA planning step changes the game:\n\n`planning.ts`\n\n``` js\nconst plan = await model.complete({\n  system: PLANNER_SYSTEM,\n  user: `Goal: ${goal}\\n\\nProduce a short numbered plan. 
Each step must be either a tool call (name + args) or a direct answer. Do not execute anything yet.`,\n});\n\nfor (const step of parsePlan(plan)) {\n  if (step.kind === \"tool\") {\n    const result = await tools.run(step.name, step.args);\n    state.observations.push(result);\n  }\n}\n```\n\nYou're asking the model to commit to a plan before it touches anything. The plan becomes auditable. You can show it to the user, log it, even let a different model review it. When something goes wrong, you have a record of what the agent *intended* versus what it *did*.\n\nThere are two dominant planning styles, and they have very different ergonomics.\n\n**Plan-then-execute** is what we just wrote: the model produces a full plan up front, then a runner steps through it. Clean to debug, easy to log, hard to recover from when reality differs from the plan. The model didn't know the file would be 500MB. It didn't know the API would return a different schema. The plan is now wrong and the runner doesn't know how to adapt.\n\n**ReAct** (reason + act) interleaves thinking and acting. On every turn the model writes a short rationale, picks one tool call, observes the result, then writes the next rationale. The model can adjust as it learns. You pay for it in tokens and latency (every turn pays the full context cost), but the agent stays honest about reality.\n\n`react_loop.py`\n\n``` python\ndef react_step(state):\n    response = model.complete(\n        system=REACT_SYSTEM,\n        messages=state.messages,\n    )\n    thought, action = parse_react(response)\n    state.log(thought)\n\n    if action.kind == \"final\":\n        return action.value\n\n    observation = tools.run(action.name, action.args)\n    state.messages.append({\"role\": \"assistant\", \"content\": response})\n    state.messages.append({\"role\": \"user\", \"content\": f\"Observation: {observation}\"})\n    return None\n```\n\nYou're not picking one style forever. 
A lot of useful agents do **plan-then-execute with a re-plan trigger**: the model writes a plan, the runner executes until it hits a surprise, then the runner asks the model for a new plan from the current state. Cheaper than pure ReAct, more adaptive than pure plan-then-execute.\n\nA common failure here is letting the model write plans that are too abstract.\n\n- Understand the user's request.\n- Gather relevant information.\n- Provide a helpful response.\n\nThat plan is useless. It's the agent equivalent of a meeting agenda that says \"discuss things\". A useful plan names tools and arguments:\n\n- Call `list_files` on `./src/api`.\n- For each file matching `*_handler.go`, call `read_file`.\n- Search for the string `\"github.com/old/dep\"` across results.\n- If matches found, call `propose_patch` per file.\n- Run `go test ./...` to verify nothing broke.\n- Return summary with file count and test status.\n\nYou enforce this shape with the system prompt and with examples. A line like *\"Each step must reference a tool from the tool list and include concrete arguments. Steps that say 'understand' or 'analyze' will be rejected.\"* does more work than people expect.\n\nWithout tools, an agent is a chatbot. With tools, it can do real things: read files, hit APIs, query databases, send messages, run commands. This is where every interesting capability comes from, and where the most dangerous failures happen.\n\nA tool, mechanically, is three things: a name, a JSON schema for its arguments, and a function the runtime calls when the model picks it.\n\n`tool-definition.ts`\n\n``` js\nconst readFile = {\n  name: \"read_file\",\n  description:\n    \"Read a file from the project. Use to inspect code or config. \" +\n    \"Do not use for binary files or anything larger than 256KB.\",\n  parameters: {\n    type: \"object\",\n    properties: {\n      path: {\n        type: \"string\",\n        description: \"Path relative to the project root. 
No leading slash.\",\n      },\n    },\n    required: [\"path\"],\n  },\n  handler: async ({ path }: { path: string }) => {\n    if (path.startsWith(\"/\")) throw new Error(\"absolute paths not allowed\");\n    // Reject \"..\" segments so a relative path cannot escape the project root.\n    if (path.split(\"/\").includes(\"..\")) throw new Error(\"path traversal not allowed\");\n    return await fs.readFile(join(projectRoot, path), \"utf8\");\n  },\n};\n```\n\nFour things are doing the heavy lifting here, and three of them are not code.\n\nModels pick tools based on the description, not the name. A tool called `read_file` with a vague description gets called for \"find the user's email\" because the model thinks \"well, the email is probably in a file somewhere.\" A description that says *\"Read a single file when you already know its path. Do not use this for searching. Use grep_repo for that.\"* will save you a hundred wrong tool calls.\n\nTreat tool descriptions like little spec sheets. List what the tool is for, what it isn't for, the shape of valid input, the shape of valid output, and any edge cases the model needs to know.\n\nThe JSON schema is your only contract. If the model invents an argument that isn't in the schema, your validator should reject the call before it reaches the handler. If a required field is missing, same. If a string is supposed to be one of `[\"read\", \"write\", \"delete\"]` and the model sends `\"REMOVE\"`, reject.\n\nModels are good but they freelance. Rejecting bad tool calls and feeding the error back to the model is *better* than accepting them: the model learns mid-loop and adjusts.\n\n``` js\nfunction validateToolCall(call: ToolCall, schema: JSONSchema) {\n  const result = ajv.validate(schema, call.args);\n  if (!result) {\n    return {\n      ok: false,\n      feedback: `Tool call rejected. Errors: ${ajv.errorsText()}. Retry with valid arguments.`,\n    };\n  }\n  return { ok: true };\n}\n```\n\nThere's a category boundary that frameworks often blur but you shouldn't: **read tools** and **write tools** are different animals.\n\nRead tools are cheap to retry. 
If `list_files` returns nothing, you call it again with different args. No harm done.\n\nWrite tools (`apply_patch`, `send_email`, `deploy_service`, `run_sql`) are expensive to undo, sometimes impossible. These deserve their own permission tier, their own logging, often their own approval step. We'll come back to this under constraints, but design tools knowing which side they sit on.\n\nWhen you have three tools, a switch statement is fine. When you have thirty, you want a tool registry that does schema validation, logging, timeout enforcement, and side-effect classification in one place.\n\n`tool_bus.py`\n\n``` python\nimport asyncio\n\n\nclass ToolBus:\n    def __init__(self):\n        self.tools: dict[str, Tool] = {}\n\n    def register(self, tool: Tool):\n        self.tools[tool.name] = tool\n\n    async def run(self, name: str, args: dict, *, caller: AgentId):\n        tool = self.tools.get(name)\n        if tool is None:\n            return ToolResult.error(f\"unknown tool: {name}\")\n\n        valid, err = tool.validate(args)\n        if not valid:\n            return ToolResult.error(f\"invalid args: {err}\")\n\n        async with self.metrics.time(tool.name):\n            try:\n                value = await asyncio.wait_for(\n                    tool.handler(args), timeout=tool.timeout_s\n                )\n                self.log(caller, tool.name, args, value)\n                return ToolResult.ok(value)\n            except asyncio.TimeoutError:\n                return ToolResult.error(f\"tool timed out after {tool.timeout_s}s\")\n            except Exception as exc:\n                return ToolResult.error(f\"tool failed: {exc}\")\n```\n\nThis single class is where you'll later add rate limits, audit trails, dry-run mode, and cost tracking. 
Build it on day one, even if it feels overkill: the alternative is bolting these concerns onto a sprawl of one-off tool handlers later, which is much worse.\n\n\"Memory\" is the most overloaded word in the agent vocabulary. It usually means at least three different mechanisms stitched together, and conflating them is a leading cause of \"why did the agent forget what I just told it?\" bugs.\n\nThis is the conversation so far, plus tool results, plus the system prompt. It lives in the model's context window and disappears the moment the request returns. It's bounded by the model's context length and your wallet.\n\nMost \"the agent forgot\" complaints are about working memory. You ran two separate requests, the second one didn't include the relevant history, the model genuinely has no idea what you're talking about. The fix isn't a vector database. The fix is including the history.\n\nInside a single agent run, the model often benefits from a place to \"write notes to itself.\" This is just structured working memory: a list of observations, intermediate results, decisions and their reasoning.\n\n`scratchpad.ts`\n\n```\ntype ScratchpadEntry =\n  | { kind: \"observation\"; toolName: string; result: unknown }\n  | { kind: \"decision\"; rationale: string; choice: string }\n  | { kind: \"note\"; text: string };\n\nclass Scratchpad {\n  entries: ScratchpadEntry[] = [];\n\n  add(entry: ScratchpadEntry) {\n    this.entries.push(entry);\n  }\n\n  render(maxTokens: number): string {\n    return formatRecent(this.entries, maxTokens);\n  }\n}\n```\n\nThe scratchpad is what you feed back into the model on the next turn. It's not magic. It's a structured replay of the agent's own work. The trick is keeping it short enough to fit. A scratchpad that just appends forever is how agents lose their minds on long tasks.\n\nThis is what people usually mean when they say \"memory\": a store of facts the agent can recall in future conversations. 
User preferences, project context, the result of expensive computations, lessons from past failures.\n\nThere are three popular shapes:\n\n| Shape | Looks like | Good for | Bad at |\n|---|---|---|---|\n| Key/value | Redis or a flat file | Stable facts (user role, preferred language) | Anything fuzzy or semantic |\n| Vector store | Pinecone, pgvector, Chroma | Semantic recall over notes/docs | Exact matches, freshness, contradictions |\n| File-based | A `memory/` directory of markdown files | Auditable, editable, structured | Scale beyond a few thousand entries |\n\n**File-based memory is underrated.** Claude Code and a few other agent tools use exactly this: a directory of markdown files, indexed by a small `MEMORY.md`. The agent reads, writes, and edits files. There's nothing to migrate, you can `git diff` it, and the user can delete a memory by deleting a file. It scales worse than vectors but it's vastly easier to reason about, and the failure mode is \"we couldn't find the right file\" rather than \"we semantically retrieved the wrong fact and the agent confidently used it.\"\n\nThe cleanest design decision in this whole space: **memory is just two more tools.** `recall(query)` and `remember(fact)`. The model decides when to recall and when to remember, the same way it decides when to read a file or send a message.\n\nThe alternative, a background process that magically injects \"relevant memories\" into every prompt, sounds convenient and is actually a nightmare to debug. You'll spend more time explaining why the agent randomly mentioned the user's old API key than you saved by automating retrieval.\n\nWhen memory is a tool, you can ask the agent to show its work. 
You ask *\"Why did you think the user wanted the Go example?\"* and the agent says *\"I called recall('user language preference') and got back: 'Prefers Go for backend examples (2026-04-02).'\"* That's an answerable question, in a way \"the embeddings retrieved it\" never is.\n\nStale memory is worse than no memory. An agent that remembers your team uses Postgres when you switched to MySQL six months ago is going to produce confidently wrong advice forever.\n\nA few rules that have aged well:\n\n`npm install` here: `pnpm` is the package manager (legacy `package-lock.json` exists but is unused)\"\n\nAgents in demos are unconstrained. They have full filesystem access, can call any tool, run any command, spend any number of tokens, take any number of turns. The demo works because the demonstrator is watching every step.\n\nProduction agents are not watched. The constraints are what let you sleep.\n\nThe pattern I've seen survive contact with reality is treating permissions as a separate first-class object. 
The agent core calls `guard.assertAllowed(toolName, args)` before every tool call, and the guard says yes or no based on a policy that you can read in one place.\n\n`permissions.ts`\n\n```\ntype Policy = {\n  allowedTools: string[];\n  pathAllowlist: RegExp[];\n  pathDenylist: RegExp[];\n  requireApprovalFor: string[];\n  maxToolCallsPerRun: number;\n  maxTokensPerRun: number;\n};\n\nclass Guard {\n  constructor(private policy: Policy, private state: RunState) {}\n\n  assertAllowed(name: string, args: Record<string, unknown>) {\n    if (!this.policy.allowedTools.includes(name)) {\n      throw new GuardError(`tool not in allowlist: ${name}`);\n    }\n    if (this.state.toolCalls >= this.policy.maxToolCallsPerRun) {\n      throw new GuardError(\"max tool calls exceeded\");\n    }\n    if (typeof args.path === \"string\") {\n      const path = args.path;\n      if (this.policy.pathDenylist.some((re) => re.test(path))) {\n        throw new GuardError(`path on denylist: ${path}`);\n      }\n      if (!this.policy.pathAllowlist.some((re) => re.test(path))) {\n        throw new GuardError(`path not in allowlist: ${path}`);\n      }\n    }\n    if (this.policy.requireApprovalFor.includes(name)) {\n      throw new ApprovalRequired(name, args);\n    }\n  }\n}\n```\n\nThis is unglamorous code and it does more for your safety story than any clever prompt.\n\nAlmost every long-lived agent system converges on the same four:\n\n**Tool allowlist.** The agent can call only these named tools. Anything else is rejected before the handler runs. This stops \"I'll just write a `delete_everything` tool real quick\" patterns and tightens the surface area enormously.\n\n**Iteration budget.** A hard cap on tool calls per run. Agents will absolutely loop forever if you let them: re-reading the same file, retrying a failing API, \"thinking more about it.\" Pick a number based on your task complexity and bail when you hit it. 
Better to fail loudly than to silently rack up an API bill.\n\n**Token/cost budget.** Independent of iterations, count tokens. Long tool outputs eat budget faster than you'd expect. When you hit the cap, the agent stops and reports.\n\n**Approval gate for side effects.** Any tool that changes the world outside the agent's sandbox (sends an email, hits prod, files a PR, charges a card) goes through a separate `ApprovalRequired` path. The agent proposes; a human (or a stricter automated check) disposes.\n\nThe pattern for the fourth one is worth lingering on, because it's where teams over-engineer the most.\n\nYou don't need a fancy approval workflow. You need a way to pause the agent, surface what it wants to do, and let a human respond.\n\n`approval.py`\n\n``` python\nclass Approval:\n    async def request(self, action: str, args: dict, *, justification: str):\n        ticket = await self.store.create(\n            action=action, args=args, justification=justification, status=\"pending\"\n        )\n        await self.notifier.send(\n            channel=\"approvals\",\n            text=f\"Agent wants to {action} with {args}. Why: {justification}\",\n            ticket_id=ticket.id,\n        )\n        return ticket\n```\n\nThat's the whole approval system. Ticket in a database, message in Slack (or email, or wherever your humans live), a way to look up the result. The agent calls it, the run pauses (or returns a \"waiting for approval\" status), a human clicks yes or no. You can add SLAs, escalation, batching later, but the simple shape ships in a day and covers 90% of what you need.\n\nThe model is confident by default. It will say the code works. It will say the deployment succeeded. It will say the test passed. None of those statements should be trusted on their own.\n\nVerification is the difference between an agent that *claims* to have done the work and an agent that *can prove* it did. 
Every serious production agent has a verification step somewhere: sometimes it's external (run the tests), sometimes it's another agent (an independent reviewer), sometimes it's the same agent checking its own work against an explicit rubric.\n\nFor coding agents, this is almost always running a real tool against the real artifact.\n\n`verify-code-change.ts`\n\n```\nasync function verifyCodeChange(change: ProposedChange) {\n  await applyToWorktree(change);\n\n  const typeCheck = await run(\"tsc\", [\"--noEmit\"]);\n  if (typeCheck.exitCode !== 0) return failure(\"type check failed\", typeCheck);\n\n  const lint = await run(\"npm\", [\"run\", \"lint\"]);\n  if (lint.exitCode !== 0) return failure(\"lint failed\", lint);\n\n  const tests = await run(\"npm\", [\"test\", \"--\", \"--run\"]);\n  if (tests.exitCode !== 0) return failure(\"tests failed\", tests);\n\n  return success();\n}\n```\n\nThis is unglamorous and it's the single best lever you have for agent reliability. If the agent claims it fixed a bug, *run the test*. If the agent claims it refactored something safely, *run the typechecker*. The model is allowed to lie. The compiler isn't.\n\nFor non-coding agents, the analog is whatever your domain has: a schema check on the JSON output, a regex on a date format, an actual API call to confirm the resource exists, a SQL query to verify the row was inserted.\n\nThe simplest self-critique is one extra model call: \"Here is what you produced. Here is the rubric. List every place the output fails the rubric. 
If none, say 'OK'.\"\n\n`self_critique.py`\n\n``` python\nasync def critique(output: str, rubric: str) -> CritiqueResult:\n    response = await model.complete(\n        system=CRITIC_SYSTEM,\n        user=f\"Output:\\n{output}\\n\\nRubric:\\n{rubric}\\n\\nList violations or say OK.\",\n    )\n    if response.strip() == \"OK\":\n        return CritiqueResult.ok()\n    return CritiqueResult.violations(parse(response))\n```\n\nThis works better than you'd expect, and worse than people pretend. It catches the obvious stuff: missing fields, factual contradictions, style violations the rubric explicitly names. It misses subtle reasoning errors because the same model that made the error is now grading it. Useful, not sufficient.\n\nWhen the cost is justified, the pattern is to have a *different* agent (or at least a different prompt with a different role) review the output. The reviewer doesn't see the original chain of thought, only the final artifact and the original goal. It's much closer to a human code review and catches a different class of mistake.\n\nThis is also where the \"judge agent\" pattern lives. The judge has a strict rubric, refuses to be polite, and returns a structured verdict. You don't ship the output until the judge approves.\n\nThe most underrated move is making verification part of the loop, not a final gate. If `verifyCodeChange` fails with \"test X failed: expected 200, got 500\", you feed that observation back to the model and let it try again. Same with critique violations. 
Same with judge rejections.\n\n`verify-and-retry.ts`\n\n``` js\nfor (let attempt = 0; attempt < ctx.maxVerifyAttempts; attempt++) {\n  const output = await ctx.agent.produce();\n  const verdict = await ctx.verifier.check(output);\n  if (verdict.ok) return output;\n  ctx.state.observations.push({\n    kind: \"verification_failed\",\n    detail: verdict.feedback,\n  });\n}\nthrow new Error(\"verification failed after retries\");\n```\n\nThe model that ignored the test on attempt 1 sees the actual error on attempt 2 and usually fixes it. That's not magic. It's just letting the model see what went wrong, which the unconstrained version of itself never bothered to check.\n\nIf you skim the pillars individually they look like five separate features. They're not. They're five views of the same loop, and the interesting design choices are about how they interact.\n\nA plan that ignores constraints is a plan the agent can't execute. A tool registry without verification produces actions you can't audit. Memory without hygiene corrupts every future plan. Verification without retries is a wall; verification inside a loop is a teacher. Constraints without observability are a black box that fails silently in production.\n\nThe teams whose agents work in production have all stopped chasing \"smarter prompts\" and started shipping plumbing. Better tool descriptions. Tighter schemas. A real permissions object. An honest budget. A verifier that actually runs the tests. A memory tier that the user can grep.\n\nNone of that is sexy. It's all just software engineering. Which is exactly the point: once you stop expecting magic, the work becomes legible, the failures become diagnosable, and the agent stops being a mysterious black box and starts being a system you maintain like any other.\n\nThe model gets to be the clever part. 
Everything around it is your job, and that's where the difference between a demo and a product really lives.", "url": "https://wpnews.pro/news/inside-an-ai-agent-planning-tool-use-memory-constraints-and-verification", "canonical_source": "https://dev.to/nazar_boyko/inside-an-ai-agent-planning-tool-use-memory-constraints-and-verification-2fcc", "published_at": "2026-06-27 21:50:51+00:00", "updated_at": "2026-06-27 22:03:36.713967+00:00", "lang": "en", "topics": ["artificial-intelligence", "ai-agents", "large-language-models", "developer-tools", "ai-research"], "entities": ["LangGraph", "OpenAI Agents SDK", "Claude Agent SDK", "smolagents"], "alternates": {"html": "https://wpnews.pro/news/inside-an-ai-agent-planning-tool-use-memory-constraints-and-verification", "markdown": "https://wpnews.pro/news/inside-an-ai-agent-planning-tool-use-memory-constraints-and-verification.md", "text": "https://wpnews.pro/news/inside-an-ai-agent-planning-tool-use-memory-constraints-and-verification.txt", "jsonld": "https://wpnews.pro/news/inside-an-ai-agent-planning-tool-use-memory-constraints-and-verification.jsonld"}}