{"slug": "how-to-create-an-ai-agent-a-production-walkthrough", "title": "How to Create an AI Agent: A Production Walkthrough", "summary": "A developer at BizFlowAI details a production-tested pattern for building reliable AI agents, emphasizing the importance of a clear job spec, robust tool design, and a split between stable system prompts and dynamic runtime context. The approach, refined after a costly failure, includes idempotency keys and budget awareness to prevent runaway costs and duplicate actions.", "body_md": "The first agent I shipped to production failed at 3am on a Sunday. It looped on a tool call, burned through $40 in tokens before my budget alarm fired, and left a half-written draft in the database with no way to resume. That night taught me more about agent design than any framework tutorial. Since then I have built a pattern I trust enough to leave running unattended for weeks at BizFlowAI, where agents research, write, optimize and publish content without me touching them.\n\nThis is that pattern, stripped down to what actually matters.\n\nBefore you pick LangGraph, CrewAI, or roll your own, write the agent's job spec like you would for a junior engineer. One paragraph. What it owns, what it must never do, what \"done\" looks like, and which signals tell you it failed.\n\nHere is the spec for one of my production agents:\n\nThe Topic Researcher owns generating a ranked list of 20 content topics per site per week. It reads from `keyword_pool` and `search_console_perf`, writes to `topic_queue`. It must never publish, never call paid APIs more than 8 times per run, and must finish in under 6 minutes. Done = 20 topics with score >= 0.6 and zero duplicates against the last 90 days. Failure signal = empty queue after a run, or any topic flagged by the dedupe check.\n\nIf you cannot write this paragraph, do not build the agent. You will end up with a \"do everything\" prompt that hallucinates its way through ambiguous tasks. 
The job spec becomes your evaluation rubric later, so write it carefully.\n\n**Rule of thumb I use**: if the spec needs more than 5 tools or more than 3 decision branches, it is two agents, not one.\n\nMost agent failures I have debugged were not prompt failures. They were tool failures. The model called a tool with wrong arguments, the tool returned a 4MB JSON blob, or two tools had overlapping responsibilities and the model picked the wrong one.\n\nTreat tools like a public API you are shipping to a difficult customer. The customer is the LLM.\n\nThree rules I follow:\n\nIf `search_database` returns 200 rows, the model will choke or pick poorly. Return 10 with a `has_more` flag and a `next_cursor`.\n\n`fetch_recent_topics(days: int, min_score: float)` is self-explanatory. A tool called `get_data(query: str)` is a coin flip.\n\nHere is the actual signature I use for a publishing tool:\n\n```python\ndef publish_post(\n    site_id: str,\n    draft_id: str,\n    idempotency_key: str,  # hash of draft_id + content_hash\n    scheduled_at: datetime | None = None,\n    dry_run: bool = False,\n) -> PublishResult:\n    \"\"\"Publishes a draft to the target CMS.\n    Returns PublishResult with url, published_at, and cms_post_id.\n    If idempotency_key was used in the last 24h, returns the original result.\n    \"\"\"\n```\n\nThe idempotency key has saved me at least four times. EventBridge retries, Lambda cold-start timeouts, network blips: all of them caused duplicate execution attempts in production. Without the key I would have shipped duplicate content.\n\nI no longer write monolithic system prompts. I write a system prompt that is mostly constraints and a runtime context block that gets rebuilt every turn. 
The split matters because the system prompt is the contract and the context is the working memory.\n\nMy template:\n\n```\nSYSTEM PROMPT (stable, ~600 tokens):\n- Role and goal (3 sentences max)\n- Hard constraints (\"never call publish_post without dry_run first on first attempt\")\n- Tool inventory with one-line guidance per tool\n- Output format for the final answer (JSON schema)\n- Stop conditions (\"when topic_queue has 20 entries, call finalize and stop\")\n\nRUNTIME CONTEXT (rebuilt per turn):\n- Current task ID and attempt number\n- Tool call history compressed: last 3 calls in full, older ones summarized\n- Relevant memory entries pulled from pgvector (top 5 by relevance)\n- Budget left: tokens, tool calls, seconds\n```\n\nTwo specific things I have learned the hard way:\n\n**Tell the agent its budget.** When I added \"you have 8 tool calls remaining and 4 minutes\" to the runtime context, my average run cost dropped roughly 30%. Models are surprisingly good at rationing when they know the limit.\n\n**Make stop conditions explicit and machine-checkable.** \"Stop when the task is complete\" is not a stop condition. \"Stop when `topic_queue` count returned by `count_queue()` is >= 20\" is.\n\nAgents need three kinds of memory and most tutorials only cover one.\n\n| Memory type | What it stores | Where I put it | TTL |\n|---|---|---|---|\n| Scratchpad | Current turn's reasoning, tool results | In-context, compressed each turn | Single run |\n| Episodic | What happened in past runs (decisions, outcomes) | Postgres table, summarized | 30-90 days |\n| Semantic | Facts the agent should \"know\" (brand voice, prior topics) | pgvector + BM25 hybrid (RRF) | Indefinite |\n\nThe part everyone gets wrong is episodic memory. Without it, your agent makes the same mistake every Tuesday. 
With it, you can write rules like \"before generating a topic, check if a similar topic failed evaluation in the last 60 days, and if so, vary the angle.\"\n\nFor semantic memory I use Postgres with pgvector and a BM25 index, then combine results with Reciprocal Rank Fusion. Pure vector search consistently missed exact-match keywords (\"Q3 pricing\" returned posts about Q1). RRF is 30 lines of SQL and fixes it.\n\n```sql\n-- Simplified RRF combining vector and BM25\n-- ORDER BY rnk makes LIMIT keep the 50 best-ranked rows, not an arbitrary 50\nWITH vec AS (\n  SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $1) AS rnk\n  FROM memory WHERE site_id = $2 ORDER BY rnk LIMIT 50\n),\nbm AS (\n  SELECT id, ROW_NUMBER() OVER (ORDER BY ts_rank(tsv, plainto_tsquery($3)) DESC) AS rnk\n  FROM memory WHERE site_id = $2 ORDER BY rnk LIMIT 50\n)\nSELECT id, SUM(1.0 / (60 + rnk)) AS score\nFROM (SELECT * FROM vec UNION ALL SELECT * FROM bm) x\nGROUP BY id ORDER BY score DESC LIMIT 10;\n```\n\nA naive agent loop is `while not done: llm_call(); execute_tool()`. That is how you get a 3am token explosion. 
Mine looks like this:\n\n```python\ndef run_agent(task, max_tool_calls=10, max_tokens=80_000, max_seconds=300):\n    state = load_or_init_state(task.id)\n    budget = Budget(max_tool_calls, max_tokens, max_seconds)\n\n    while not state.done:\n        if budget.exhausted():\n            return handoff(state, reason=\"budget\")\n\n        context = build_context(task, state, budget.remaining())\n        response = llm.call(SYSTEM_PROMPT, context, tools=TOOLS)\n        budget.charge_tokens(response.usage)\n\n        if response.stop:\n            state.done = True\n            persist_state(state)  # record completion so a retry does not redo the run\n            break\n\n        for call in response.tool_calls:\n            if not policy.allows(call, state):\n                state.append_tool_result(call, error=\"policy_denied\")\n                continue\n            result = execute_tool(call, idempotency=state.run_id)\n            state.append_tool_result(call, result)\n            budget.charge_call()\n\n        persist_state(state)  # so we can resume on crash\n\n    return state.final_output\n```\n\nFive things in there that matter more than they look:\n\n- `persist_state` every turn.\n- `policy.allows`\n- `handoff`\n- `run_id`\n- `build_context` charges into the budget too.\n\nIf you cannot tell me your agent's success rate on a fixed eval set, you do not have a production agent. You have a demo that has not failed yet.\n\nI run three evaluation layers:\n\n**Layer 1: Unit-level.** Each tool has tests with golden inputs and outputs. Boring, fast, runs on every commit.\n\n**Layer 2: Trajectory eval.** A frozen set of 30-50 tasks with expected outcomes. Run the full agent, score the trajectory and the final output. For scoring I use a mix: deterministic checks where possible (did the topic queue have 20 entries? was the schema valid?) and an LLM judge for subjective parts (was the topic on brand?). 
The LLM judge runs with a calibrated rubric and I spot-check 10% of its scores against my own judgment monthly.\n\n**Layer 3: Production telemetry.** Every run logs: tool calls made, tokens used, wall time, budget exhausted, handoff reason, and a sample of final outputs. I look at the dashboard every Monday. Drift shows up in tool call counts before it shows up in output quality.\n\nA real number from my own systems: when I added trajectory eval to the content agent and started gating deploys on it, my \"weird output\" rate dropped from roughly 1 in 25 runs to under 1 in 200. Not zero. Never zero. But low enough to leave running unattended.\n\nI deploy almost every agent as a Lambda triggered by EventBridge on a schedule, with state in Postgres (Supabase), secrets in AWS Secrets Manager, and observability through CloudWatch + a thin custom dashboard. Nothing exotic.\n\nA few opinions:\n\nThe full production stack for a typical agent in my BizFlowAI ContentStudio:\n\n```\nEventBridge (cron)\n  -> Lambda (agent runner, max 15 min)\n     -> Postgres (state, memory, queues)\n     -> Claude API or local Ollama (LLM)\n     -> Tool Lambdas (publish, fetch, analyze)\n  -> CloudWatch (logs, metrics, alarms)\n  -> SQS DLQ (failed runs)\n  -> Dashboard (Next.js, reads Postgres)\n```\n\nIf you are building your first production agent in 2026, my opinionated shortlist:\n\nThe agents I trust to run while I sleep are not the smartest ones. They are the ones with the tightest tool contracts, the most boring control loop, and the eval set I actually run.\n\nIf you are working on an agent that needs to leave the demo stage and survive in production, or you want a second pair of eyes on an architecture before you commit to it, I am happy to talk. 
You can reach me at [lazar-milicevic.com/#contact](https://lazar-milicevic.com/#contact), or browse more posts on the [blog](https://lazar-milicevic.com/blog) where I write about RAG, evaluation, and the unglamorous parts of shipping AI systems.", "url": "https://wpnews.pro/news/how-to-create-an-ai-agent-a-production-walkthrough", "canonical_source": "https://dev.to/lamingsrb/how-to-create-an-ai-agent-a-production-walkthrough-41ga", "published_at": "2026-06-29 06:24:35+00:00", "updated_at": "2026-06-29 06:57:17.744817+00:00", "lang": "en", "topics": ["ai-agents", "large-language-models", "ai-tools", "developer-tools", "ai-infrastructure"], "entities": ["BizFlowAI", "LangGraph", "CrewAI", "EventBridge", "pgvector"], "alternates": {"html": "https://wpnews.pro/news/how-to-create-an-ai-agent-a-production-walkthrough", "markdown": "https://wpnews.pro/news/how-to-create-an-ai-agent-a-production-walkthrough.md", "text": "https://wpnews.pro/news/how-to-create-an-ai-agent-a-production-walkthrough.txt", "jsonld": "https://wpnews.pro/news/how-to-create-an-ai-agent-a-production-walkthrough.jsonld"}}