How to Create an AI Agent: A Production Walkthrough

A developer at BizFlowAI details a production-tested pattern for building reliable AI agents, emphasizing the importance of a clear job spec, robust tool design, and a split between stable system prompts and dynamic runtime context. The approach, refined after a costly failure, includes idempotency keys and budget awareness to prevent runaway costs and duplicate actions.

The first agent I shipped to production failed at 3am on a Sunday. It looped on a tool call, burned through $40 in tokens before my budget alarm fired, and left a half-written draft in the database with no way to resume. That night taught me more about agent design than any framework tutorial. Since then I have built a pattern I trust enough to leave running unattended for weeks at BizFlowAI, where agents research, write, optimize and publish content without me touching them. This is that pattern, stripped down to what actually matters.

Before you pick LangGraph, CrewAI, or roll your own, write the agent's job spec like you would for a junior engineer. One paragraph. What it owns, what it must never do, what "done" looks like, and which signals tell you it failed.

Here is the spec for one of my production agents:

> The Topic Researcher owns generating a ranked list of 20 content topics per site per week. It reads from `keyword_pool` and `search_console_perf`, writes to `topic_queue`. It must never publish, never call paid APIs more than 8 times per run, and must finish in under 6 minutes. Done = 20 topics with score >= 0.6 and zero duplicates against the last 90 days. Failure signal = empty queue after a run, or any topic flagged by the dedupe check.

If you cannot write this paragraph, do not build the agent. You will end up with a "do everything" prompt that hallucinates its way through ambiguous tasks. The job spec becomes your evaluation rubric later, so write it carefully.
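
Because the spec doubles as an evaluation rubric, its "done" and "never" clauses can be turned into deterministic checks. The following is a minimal sketch of that idea for the Topic Researcher spec. The names `Topic`, `RunReport`, and `check_topic_researcher` are hypothetical, and exact-match comparison on normalized titles stands in for whatever dedupe check the real system uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topic:
    title: str
    score: float


@dataclass(frozen=True)
class RunReport:
    """Hypothetical summary of one agent run."""
    topics: list[Topic]
    paid_api_calls: int
    wall_seconds: float


def check_topic_researcher(report: RunReport, recent_titles: set[str]) -> list[str]:
    """Return spec violations; an empty list means the run counts as done.

    recent_titles is assumed to hold lowercased titles from the last 90 days.
    """
    failures = []
    if len(report.topics) != 20:
        failures.append(f"expected 20 topics, got {len(report.topics)}")
    if any(t.score < 0.6 for t in report.topics):
        failures.append("topic below score 0.6")
    titles = [t.title.strip().lower() for t in report.topics]
    # Duplicates inside the batch or against recent history both count.
    if len(set(titles)) != len(titles) or set(titles) & recent_titles:
        failures.append("duplicate topic")
    if report.paid_api_calls > 8:
        failures.append("more than 8 paid API calls")
    if report.wall_seconds >= 360:
        failures.append("run took 6 minutes or longer")
    return failures
```

A check like this can run after every production run and inside the eval suite, so the spec and the rubric cannot drift apart.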
> Rule of thumb I use: if the spec needs more than 5 tools or more than 3 decision branches, it is two agents, not one.

Most agent failures I have debugged were not prompt failures. They were tool failures. The model called a tool with wrong arguments, the tool returned a 4MB JSON blob, or two tools had overlapping responsibilities and the model picked the wrong one.

Treat tools like a public API you are shipping to a difficult customer. The customer is the LLM. Three rules I follow:

- If `search_database` returns 200 rows, the model will choke or pick poorly. Return 10 with a `has_more` flag and a `next_cursor`.
- `fetch_recent_topics(days: int, min_score: float)` is self-explanatory. A tool called `get_data(query: str)` is a coin flip.

Here is the actual signature I use for a publishing tool:

```python
from datetime import datetime

# PublishResult is the tool's result type, defined elsewhere.
def publish_post(
    site_id: str,
    draft_id: str,
    idempotency_key: str,  # hash of draft_id + content_hash
    scheduled_at: datetime | None = None,
    dry_run: bool = False,
) -> PublishResult:
    """Publishes a draft to the target CMS.

    Returns PublishResult with url, published_at, and cms_post_id.
    If idempotency_key was used in the last 24h, returns the original result.
    """
```

The idempotency key has saved me at least four times. EventBridge retries, Lambda cold-start timeouts, network blips: all of them caused duplicate execution attempts in production. Without the key I would have shipped duplicate content.

I no longer write monolithic system prompts. I write a system prompt that is mostly constraints and a runtime context block that gets rebuilt every turn. The split matters because the system prompt is the contract and the context is the working memory.
My template:

```
SYSTEM PROMPT (stable, ~600 tokens):
- Role and goal (3 sentences max)
- Hard constraints ("never call publish_post without dry_run first on first attempt")
- Tool inventory with one-line guidance per tool
- Output format for the final answer (JSON schema)
- Stop conditions ("when topic_queue has 20 entries, call finalize and stop")

RUNTIME CONTEXT (rebuilt per turn):
- Current task ID and attempt number
- Tool call history (compressed: last 3 calls in full, older ones summarized)
- Relevant memory entries (pulled from pgvector, top 5 by relevance)
- Budget left: tokens, tool calls, seconds
```

Two specific things I have learned the hard way:

- Tell the agent its budget. When I added "you have 8 tool calls remaining and 4 minutes" to the runtime context, my average run cost dropped roughly 30%. Models are surprisingly good at rationing when they know the limit.
- Make stop conditions explicit and machine-checkable. "Stop when the task is complete" is not a stop condition. "Stop when the `topic_queue` count returned by `count_queue` is >= 20" is.

Agents need three kinds of memory and most tutorials only cover one.

| Memory type | What it stores | Where I put it | TTL |
|---|---|---|---|
| Scratchpad | Current turn's reasoning, tool results | In-context, compressed each turn | Single run |
| Episodic | What happened in past runs (decisions, outcomes) | Postgres table, summarized | 30-90 days |
| Semantic | Facts the agent should "know" (brand voice, prior topics) | pgvector + BM25 hybrid (RRF) | Indefinite |

The part everyone gets wrong is episodic memory. Without it, your agent makes the same mistake every Tuesday. With it, you can write rules like "before generating a topic, check if a similar topic failed evaluation in the last 60 days, and if so, vary the angle."

For semantic memory I use Postgres with pgvector and a BM25 index, then combine results with Reciprocal Rank Fusion. Pure vector search consistently missed exact-match keywords ("Q3 pricing" returned posts about Q1).
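
Reciprocal Rank Fusion scores each item by summing 1 / (k + rank) over every ranked list it appears in, so an item that ranks well in either list surfaces near the top. Here is a minimal Python sketch of the fusion step, assuming two id lists that are already ranked best-first and the commonly used constant k = 60. The function name `rrf` and its parameters are illustrative and not part of the production system.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60, top_n: int = 10) -> list[tuple[str, float]]:
    """Fuse several best-first rankings into one list of (id, score) pairs."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the top item of each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


vector_hits = ["a", "b", "c"]   # hypothetical ids from the vector search
keyword_hits = ["c", "a", "d"]  # hypothetical ids from the keyword search
fused = rrf([vector_hits, keyword_hits])
```

Only ranks enter the formula, which is why the two retrievers' raw scores never need to be normalized against each other.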
RRF is 30 lines of SQL and fixes the exact-match problem.

```sql
-- Simplified RRF combining vector and BM25
WITH vec AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $1) AS rnk
  FROM memory
  WHERE site_id = $2
  ORDER BY rnk  -- without this, LIMIT may keep an arbitrary 50 rows
  LIMIT 50
),
bm AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY ts_rank(tsv, plainto_tsquery($3)) DESC) AS rnk
  FROM memory
  WHERE site_id = $2
  ORDER BY rnk
  LIMIT 50
)
SELECT id, SUM(1.0 / (60 + rnk)) AS score
FROM (SELECT * FROM vec UNION ALL SELECT * FROM bm) x
GROUP BY id
ORDER BY score DESC
LIMIT 10;
```

A naive agent loop is `while not done: llm_call(); execute_tool()`. That is how you get a 3am token explosion. Mine looks like this:

```python
def run_agent(task, max_tool_calls=10, max_tokens=80_000, max_seconds=300):
    state = load_or_init_state(task.id)
    budget = Budget(max_tool_calls, max_tokens, max_seconds)

    while not state.done:
        if budget.exhausted():
            return handoff(state, reason="budget")

        context = build_context(task, state, budget.remaining())
        response = llm.call(SYSTEM_PROMPT, context, tools=TOOLS)
        budget.charge_tokens(response.usage)

        if response.stop:
            state.done = True
            break

        for call in response.tool_calls:
            if not policy.allows(call, state):
                state.append_tool_result(call, error="policy_denied")
                continue
            result = execute_tool(call, idempotency=state.run_id)
            state.append_tool_result(call, result)
            budget.charge_call()

        persist_state(state)  # so we can resume on crash

    return state.final_output
```

Five things in there that matter more than they look:

- `persist_state` every turn, so a crashed run can resume.
- `policy.allows` checks each tool call before it executes.
- `handoff` returns control with a reason when the budget runs out.
- `run_id` doubles as the idempotency key for tool execution.
- `build_context` charges into the budget too.

If you cannot tell me your agent's success rate on a fixed eval set, you do not have a production agent. You have a demo that has not failed yet. I run three evaluation layers:

Layer 1: Unit-level. Each tool has tests with golden inputs and outputs. Boring, fast, runs on every commit.

Layer 2: Trajectory eval. A frozen set of 30-50 tasks with expected outcomes. Run the full agent, score the trajectory and the final output.
For scoring I use a mix: deterministic checks where possible (did the topic queue have 20 entries? was the schema valid?) and an LLM judge for subjective parts (was the topic on brand?). The LLM judge runs with a calibrated rubric and I spot-check 10% of its scores against my own judgment monthly.

Layer 3: Production telemetry. Every run logs: tool calls made, tokens used, wall time, budget exhausted, handoff reason, and a sample of final outputs. I look at the dashboard every Monday. Drift shows up in tool call counts before it shows up in output quality.

A real number from my own systems: when I added trajectory eval to the content agent and started gating deploys on it, my "weird output" rate dropped from roughly 1 in 25 runs to under 1 in 200. Not zero. Never zero. But low enough to leave running unattended.

I deploy almost every agent as a Lambda triggered by EventBridge on a schedule, with state in Postgres (Supabase), secrets in AWS Secrets Manager, and observability through CloudWatch + a thin custom dashboard. Nothing exotic. A few opinions:

The full production stack for a typical agent in my BizFlowAI ContentStudio:

```
EventBridge (cron)
  -> Lambda (agent runner, max 15 min)
  -> Postgres (state, memory, queues)
  -> Claude API or local Ollama (LLM)
  -> Tool Lambdas (publish, fetch, analyze)
  -> CloudWatch (logs, metrics, alarms)
  -> SQS DLQ (failed runs)
  -> Dashboard (Next.js, reads Postgres)
```

If you are building your first production agent in 2026, my opinionated shortlist:

The agents I trust to run while I sleep are not the smartest ones. They are the ones with the tightest tool contracts, the most boring control loop, and the eval set I actually run.

If you are working on an agent that needs to leave the demo stage and survive in production, or you want a second pair of eyes on an architecture before you commit to it, I am happy to talk.
You can reach me at [lazar-milicevic.com/contact](https://lazar-milicevic.com/contact), or browse more posts on the [blog](https://lazar-milicevic.com/blog), where I write about RAG, evaluation, and the unglamorous parts of shipping AI systems.