# I Built 9 Production AI Agents With Claude Code — Here Is the Complete Workflow

> Source: <https://dev.to/akaranjkar08/i-built-9-production-ai-agents-with-claude-code-here-is-the-complete-workflow-2j3d>
> Published: 2026-05-30 21:11:33+00:00

**TL;DR:** Claude Code is a complete agent runtime, not just a coding assistant. Over 14 weeks, I built and shipped 9 production agents — an SEO research pipeline, a daily analytics oracle, a content syndication system, a deploy watchdog, and five specialist growth agents — using nothing but CLAUDE.md files, MCP servers, hooks, and subagents. No Hermes. No LangChain. No external orchestrator. Total infrastructure cost: under $180/month. This guide walks through the exact architecture, every configuration file, the failure modes I hit, and the patterns that actually survive production.

Claude Code in 2026 is not the terminal autocomplete tool it was twelve months ago. Anthropic has shipped five distinct extension layers that, composed together, turn it into a full agent orchestration framework. Understanding which layer handles which responsibility is the difference between an agent that works in a demo and one that runs unsupervised for weeks.

The first layer is **CLAUDE.md** — a Markdown file at your project root that is always loaded into the model's context window. Every instruction, every constraint, every architectural decision you write here shapes every action the agent takes. This is not documentation. It is the agent's operating system. A well-written CLAUDE.md eliminates entire categories of failure by making the right behavior the default behavior. A poorly written one — or worse, an empty one — produces an agent that makes reasonable-sounding decisions that silently break your system.

The second layer is **MCP servers** — external tool access over the Model Context Protocol. MCP servers let Claude Code interact with databases, APIs, browsers, cloud services, and any system that exposes a JSON-RPC interface. As of May 2026, over 2,300 public MCP servers exist, and any team can build custom ones. The critical design decision: limit yourself to 3-5 servers per agent. Each server adds tool definitions to the context window, and tool-selection quality degrades measurably above that threshold.

The third layer is **skills**: reusable Markdown workflow files stored in `.claude/skills/` and invoked as slash commands. Skills encode multi-step procedures that would otherwise require the agent to figure out the process from scratch each time. A blog-writing skill, for example, encodes the SEO checklist, content structure, data format, build verification, and commit conventions, turning a 45-minute manual process into a single command.

The fourth layer is **hooks**, deterministic shell scripts that execute at specific lifecycle points: `PreToolUse`, `PostToolUse`, `Stop`, `SessionStart`, and `UserPromptSubmit`. Hooks are not AI. They are plain shell scripts that run before or after the model acts. Exit code 2 blocks the tool call entirely. This is how you build hard guardrails: not by asking the model to police itself, but by making dangerous actions physically impossible.

The fifth layer is **subagents**: isolated Claude sessions launched from a parent session with their own context window. Subagents are defined as Markdown files in `.claude/agents/` and can be spawned in parallel, in the background, or in isolated git worktrees. They communicate results back to the parent but cannot see the parent's conversation history. This is the primitive that makes multi-agent coordination possible without an external orchestrator.

Every production agent I have built starts with the same pattern: a CLAUDE.md file that functions as the agent's constitution. Not a loose collection of tips — a structured document with decision trees, forbidden actions, verification protocols, and explicit failure modes.

Here is the skeleton that has survived 14 weeks of production use across all 9 agents:

```
# Agent Name: Purpose Statement (one line)

## Decision Engine
Multi-file change (3+ files)? → Use persistent-planner skill
Research + build? → Research FIRST, THEN build
200+ lines new code? → Split into subagents by module
Bug with unclear cause? → Investigate before fixing
Same fix attempted 3+ times? → STOP. Surface root cause to user

## Hard Rules (each from a real incident)
- RULE 1: [What happened] → [What to never do again]
- RULE 2: [What happened] → [What to never do again]

## Verification Protocol — MANDATORY
After ANY non-trivial implementation:
1. Spawn verification-agent — read-only, adversarial
2. Wait for VERDICT — only report "done" after PASS
3. Never claim "fixed" based on reading code — run actual commands

## Model Tiering
| Task | Model | Why |
|------|-------|-----|
| Trust-boundary code | Opus | Payment, auth, webhooks |
| Feature implementation | Sonnet | Routine edits, CRUD |
| Batch text work | Haiku | SEO descriptions, formatting |
```

The decision engine section is not aspirational. It is load-bearing. Without it, I watched agents attempt 200-line rewrites in a single pass, fail, retry the same approach, fail again, and burn through $15 in tokens producing nothing useful. With the decision engine, the agent reads the instruction, routes to the correct approach, and succeeds on the first or second attempt.

The hard rules section grows organically from production incidents. My CLAUDE.md started with zero rules. It now has 24. Each one exists because ignoring it caused a real outage, a corrupted deploy, or a silent data loss. Rule 22, for example: `headers() catch-all MUST come BEFORE private route rules`, discovered when checkout pages were publicly cached at Cloudflare's edge for 30 seconds because the header ordering was reversed. Rule 14: `Do NOT add slug checks to proxy.ts`, added after three consecutive production crashes from the same attempted fix.

MCP servers are the agent's hands. Without them, Claude Code can read files and run shell commands. With them, it can query Google Search Console, pull GA4 analytics, manage Cloudflare Workers, interact with browsers, search documentation, and call any API with a published MCP server.

Here is the MCP configuration that powers my analytics oracle agent — the one that runs every morning at 9:03 AM IST and produces a daily situation report:

```
// ~/.claude/settings.json (user scope — available to all projects)
{
  "mcpServers": {
    "gsc": {
      "command": "/Users/me/.local/bin/mcp-gsc",
      "env": { "GSC_SKIP_OAUTH": "true" }
    },
    "ga4": {
      "command": "/Users/me/.local/bin/ga4-mcp-server",
      "env": {
        "GA4_PROPERTY_ID": "529733024",
        "GOOGLE_APPLICATION_CREDENTIALS": "/Users/me/.config/google-seo-mcp/service-account.json"
      }
    },
    "cloudflare": {
      "command": "npx",
      "args": ["-y", "@anthropic-ai/mcp-cloudflare"],
      "env": {
        "CLOUDFLARE_ACCOUNT_ID": "a319e...",
        "CLOUDFLARE_API_TOKEN": "cfut_..."
      }
    }
  }
}
```

Three servers. Not twelve. I experimented with adding Slack, GitHub, Notion, and Playwright MCP servers simultaneously. The result: tool selection accuracy dropped from roughly 95% to below 80%. The model would choose a Slack tool when it meant to use GitHub, or attempt a Playwright screenshot when a simple curl would suffice. The 3-5 server sweet spot is not a suggestion — it is a measured threshold.

For project-scoped servers that the team shares, put the configuration in `.claude/settings.json` at the project root and commit it to git. For user-scoped servers with personal credentials, use `~/.claude/settings.json`. Never commit API tokens to project-scoped config.

The single most important lesson from 14 weeks of production agents: **do not rely on the model to enforce constraints.** Models are probabilistic. Hooks are deterministic. If an action must never happen — a force push to main, a database flush without confirmation, a deploy during an active CI run — encode that constraint in a hook, not in a prompt.

Here is a real hook from my production setup that prevents accidental destructive git operations:

```
// .claude/settings.json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "python3 -c \"import sys,json; cmd=json.load(sys.stdin).get('tool_input',{}).get('command',''); bad=['git push --force','git reset --hard','FLUSHDB','DROP TABLE']; sys.exit(2 if any(b in cmd for b in bad) else 0)\""
          }
        ]
      }
    ]
  }
}
```

Exit code 2 blocks the tool call. The model receives a rejection message and must find an alternative approach. This is not a suggestion to the model; it is a physical wall. The model cannot force push, cannot hard reset, and cannot flush the database, no matter how convincing its reasoning.
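An inline one-liner gets hard to maintain as the blocklist grows. Here is a minimal sketch of the same guard as a standalone function, assuming the hook receives the tool call as JSON on stdin with the shell command under `tool_input.command`. The function name `exit_code_for` is hypothetical, and the choice to let unparseable input through is an assumption you may want to reverse.

```python
import json
import sys

# Same patterns as the inline hook; extend the tuple as incidents accumulate.
BLOCKED = ("git push --force", "git reset --hard", "FLUSHDB", "DROP TABLE")


def exit_code_for(raw_json: str) -> int:
    """Return 2 (block) if the Bash command contains a blocked pattern, else 0."""
    try:
        payload = json.loads(raw_json)
    except json.JSONDecodeError:
        return 0  # Assumption: fail open on unparseable input.
    command = payload.get("tool_input", {}).get("command", "")
    for pattern in BLOCKED:
        if pattern in command:
            # stderr is where a blocking hook explains itself.
            print(f"Blocked: command contains '{pattern}'", file=sys.stderr)
            return 2
    return 0
```

To use it as a hook script, end the file with `sys.exit(exit_code_for(sys.stdin.read()))`. Note that plain substring matching misses spelling variants such as `git push -f`, so each variant needs its own entry.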

The `SessionStart` hook is equally powerful for agent initialization. My production setup runs a health check script at the start of every session that verifies container status, checks for uncommitted changes, validates environment variables, and confirms MCP server connectivity. If any check fails, the agent starts with full context about what is broken, instead of discovering it 10 minutes into a task after modifying files that should not have been touched.

```
{
  "hooks": {
    "SessionStart": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "cd storefront && npx tsx scripts/health-check.ts 2>&1 | head -50"
          }
        ]
      }
    ]
  }
}
```
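The author's `health-check.ts` is not shown, so the following is only a rough Python sketch of two of the four checks described above: environment variables and uncommitted changes. The function names are hypothetical, and the variable names in `REQUIRED_ENV` are borrowed from the MCP config purely as examples.

```python
import os
import subprocess

# Example names only; list whatever your agent actually depends on.
REQUIRED_ENV = ("CLOUDFLARE_ACCOUNT_ID", "CLOUDFLARE_API_TOKEN")


def missing_env(required, environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not environ.get(name)]


def dirty_paths(porcelain: str):
    """Parse `git status --porcelain` output into a list of changed paths."""
    # Each line is two status characters, a space, then the path.
    return [line[3:] for line in porcelain.splitlines() if line.strip()]


def report() -> str:
    """Build a short plain-text report to print at session start."""
    lines = []
    missing = missing_env(REQUIRED_ENV, os.environ)
    lines.append("env: OK" if not missing else f"env: MISSING {', '.join(missing)}")
    result = subprocess.run(
        ["git", "status", "--porcelain"], capture_output=True, text=True, check=False
    )
    dirty = dirty_paths(result.stdout)
    lines.append("git: clean" if not dirty else f"git: {len(dirty)} uncommitted file(s)")
    return "\n".join(lines)
```

Printing `report()` from the hook command puts the findings in front of the agent before it touches anything.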

The pattern that changed everything was realizing that Claude Code's built-in Agent tool — the ability to spawn subagents — eliminates the need for external orchestration frameworks. A parent session can launch multiple subagents in parallel, each with a specific brief, and aggregate their results.

Here is how my growth coordinator agent works. It is a single CLAUDE.md-defined agent that spawns 5 specialist subagents every Monday morning:

```
# Growth Coordinator — Weekly Multi-Agent Run

## Workflow
1. Spawn analytics-oracle agent → daily metrics + anomalies
2. Spawn seo-dominator agent → keyword gaps + ranking changes
3. Spawn content-architect agent → content calendar + topic gaps
4. Spawn cro-assassin agent → conversion funnel analysis
5. Spawn competitive-intel agent → competitor price/feature delta

## Coordination Rules
- Launch agents 1-3 in parallel (independent data)
- Wait for analytics-oracle results before launching CRO agent (needs baseline)
- Aggregate all results into weekly synthesis report
- Post synthesis to Telegram channel
```

Each specialist agent is defined as a Markdown file in `.claude/agents/` with its own system prompt, tool access, and output format. The parent agent reads their results and synthesizes. No LangGraph. No CrewAI. No custom Python orchestration code. The orchestration is declarative Markdown, and the execution is Claude Code's native subagent primitive.

The critical constraint I learned the hard way: **subagents must never push to git independently.** Early in my setup, I had parallel build agents each committing and pushing their changes. Three pushes in rapid succession triggered three simultaneous deploys, all racing on Docker Compose, resulting in a 502 outage. The fix: all subagents write their changes to files. The parent agent reviews, commits once, and pushes once.

Running 9 production agents without cost discipline would be financially irresponsible. Anthropic's current pricing — Opus at $5/$25 per million input/output tokens, Sonnet at $3/$15, Haiku at $1/$5 — means model selection directly determines whether your agent pipeline costs $50/month or $500/month for the same work.
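To see how tier choice moves the bill, here is a minimal sketch using the per-million-token prices quoted above. The function names are hypothetical, and the workload in the usage note is an illustrative assumption, not a measurement from these agents.

```python
# USD per million tokens as (input, output), taken from the prices quoted above.
PRICES = {"opus": (5.0, 25.0), "sonnet": (3.0, 15.0), "haiku": (1.0, 5.0)}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single run, ignoring caching and batch discounts."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def monthly_cost(model: str, input_tokens: int, output_tokens: int, runs: int) -> float:
    """Cost in USD of `runs` identical runs in a month."""
    return run_cost(model, input_tokens, output_tokens) * runs
```

For an assumed job that consumes 200k input and 20k output tokens per run, 30 times a month, this works out to about $45 on Opus, $27 on Sonnet, and $9 on Haiku.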

The tiering system I use after extensive experimentation:

| Task Class | Model | Monthly Cost (est.) | Why This Tier |
|---|---|---|---|
| Trust-boundary code (payment, auth, webhooks) | **Opus** | ~$30 | Security errors are expensive; Opus catches edge cases Sonnet misses |
| Feature implementation, routine edits | Sonnet | ~$60 | 80% of work; Sonnet is fast and accurate for known patterns |
| Batch text (SEO meta, descriptions, formatting) | Haiku | ~$8 | Mechanical work; 10x cheaper, same quality for substitution tasks |
| Cross-provider audit | Codex (GPT-5.4) | ~$5 | Different model catches different bugs; found 8 issues Claude missed |

The cross-provider audit is the most counterintuitive line item. I run OpenAI's Codex on trust-boundary code after Claude reviews it. In May 2026, a Codex audit found that my cache header ordering was exposing checkout pages at Cloudflare's edge — a bug that had been live for weeks and that Claude had not flagged across multiple reviews. Different models have different blind spots. For code that handles money or authentication, spending an extra $5/month on a second opinion is trivially worth it.

Prompt caching reduces input costs by 90% for repeated context. If your agent loads the same CLAUDE.md, the same tool definitions, and the same project context on every run, the cache hit rate is extremely high after the first invocation. My analytics oracle agent — which runs daily with the same system prompt — costs roughly $0.40 per run after caching, compared to $2.80 without it.
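The caching arithmetic can be sketched as follows, assuming cached input tokens are billed at a 90% discount as stated above. The function name and the token counts in the usage note are hypothetical, and the sketch ignores any surcharge on cache writes.

```python
def cost_with_cache(
    input_tokens: int,
    output_tokens: int,
    cached_fraction: float,
    in_price: float,
    out_price: float,
    cache_discount: float = 0.90,
) -> float:
    """Per-run cost in USD when a fraction of the input is served from cache.

    Prices are USD per million tokens. Output tokens are never discounted.
    """
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    input_cost = (fresh + cached * (1 - cache_discount)) * in_price
    return (input_cost + output_tokens * out_price) / 1_000_000
```

With Sonnet prices ($3/$15) and an assumed 900k input tokens plus 6k output tokens per run, this gives about $2.79 uncached and about $0.41 when 98% of the input is a cache hit. Those token counts are back-solved to land near the figures above; they are not the author's measurements.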

This is not a which-is-best comparison. Each tool has a genuine sweet spot, and using the wrong one for a task wastes time and money.

**Claude Code wins at:** codebase-wide analysis, multi-file refactors, agent orchestration, CI/CD integration, and any task where terminal-native execution matters. One benchmark showed Claude Code completing a task in 33,000 tokens that consumed 188,000 tokens in Cursor's agent mode — a 5.7x efficiency advantage for complex, cross-file operations. Claude Code also has the deepest extension system (CLAUDE.md + MCP + hooks + skills + subagents) of any AI coding tool.

**Cursor wins at:** in-editor work. If you are editing a single file, navigating code visually, or doing rapid inline iterations, Cursor's VS Code integration is faster than switching to a terminal. Cursor 3.3's Bugbot — which monitors CI and proposes fixes automatically — is a genuine time-saver for teams with extensive test suites.

**Codex wins at:** long-running autonomous tasks. OpenAI's cloud-based architecture lets Codex work on a problem for hours without maintaining a local session. For tasks like "migrate this 500-file codebase from JavaScript to TypeScript" or "write comprehensive tests for every untested module," Codex's patience and autonomy are unmatched.

My production workflow uses all three. Claude Code is the primary agent runtime — it runs the daily pipelines, handles deploys, and manages the codebase. Cursor is open alongside it for visual editing sessions. Codex runs periodic deep audits that benefit from its multi-hour attention span.

Here is the actual agent inventory running in production, with real monthly costs after 14 weeks of operation:

| Agent | Schedule | Model | Monthly Cost | What It Does |
|---|---|---|---|---|
| Analytics Oracle | Daily 9:03 AM | Sonnet | $12 | Pulls GSC + GA4 via MCP, identifies anomalies, produces 3 ship-now actions |
| SEO Research Pipeline | Every 4 hours | Haiku + Sonnet | $25 | Monitors keyword opportunities, competitor content, SERP changes |
| Deploy Watchdog | Every 60 seconds | Bash only | $0 | Checks container health, auto-rolls back on 3 consecutive failures |
| Content Syndication | Every 6 hours | Haiku | $8 | Cross-posts to Hashnode, Dev.to, Blogger with canonical URLs |
| Growth Coordinator | Weekly Monday | Opus | $15 | Spawns 5 specialist subagents, synthesizes weekly report |
| Verification Agent | On-demand | Sonnet | $10 | Read-only adversarial review of every 3+ file change |
| Blog Writer | On-demand | Sonnet | $20 | Researches topic, writes SEO-optimized post, builds, commits, pushes |
| Tool Builder | On-demand | Sonnet | $15 | Builds browser-based tools with UI, registry, sitemap integration |
| Product QA | Weekly | Haiku | $5 | Sweeps 2,000+ products for metadata completeness, dead links, schema |

Total: approximately $110/month. The deploy watchdog costs nothing — it is a pure bash script with no AI component, checking HTTP status codes and triggering Docker rollbacks. The most expensive agent is the SEO research pipeline at $25/month, driven by its 6x daily frequency and the Sonnet calls required to analyze SERP data meaningfully.
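The watchdog's rollback rule is simple enough to state as a counter. The author's watchdog is a bash script that is not shown here, so this Python sketch only illustrates the "three consecutive failures" logic. The class and method names are hypothetical, and the HTTP probe and Docker rollback are left out.

```python
class Watchdog:
    """Tracks consecutive failed health checks and signals when to roll back."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    def observe(self, healthy: bool) -> bool:
        """Record one health check; return True when a rollback should fire."""
        if healthy:
            self.failures = 0  # Any success breaks the streak.
            return False
        self.failures += 1
        if self.failures >= self.threshold:
            self.failures = 0  # Reset so one streak triggers one rollback.
            return True
        return False
```

Resetting on any healthy check is what makes the failures "consecutive": two failures, a success, then two more failures never trigger a rollback.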

No honest guide about production agents can skip the failures. Here are the five most expensive lessons from 14 weeks:

**Failure 1: The split-brain deploy.** Two automation systems (GitHub Actions and a legacy VPS cron) both believed they owned the deploy process. They raced on `docker compose down/up`, producing intermittent 502 errors that looked random but were actually deterministic conflicts. Fix: killed the legacy cron, added a concurrency group to GitHub Actions, added a filesystem lock (`/tmp/wowhow-deploy.lock`) as a secondary gate. Three layers of protection because one layer was not enough.
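A filesystem lock of this kind can be sketched in a few lines. This is a minimal Python illustration, not the author's deploy script: it relies on `O_CREAT | O_EXCL`, which makes file creation fail atomically if the file already exists. The names `deploy_lock` and `DeployLocked` are hypothetical.

```python
import os
from contextlib import contextmanager


class DeployLocked(RuntimeError):
    """Raised when another process already holds the deploy lock."""


@contextmanager
def deploy_lock(path: str = "/tmp/wowhow-deploy.lock"):
    """Hold an exclusive lock file for the duration of the with-block."""
    try:
        # O_EXCL makes creation fail if the file exists, so only one caller wins.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise DeployLocked(f"another deploy holds {path}") from None
    try:
        os.write(fd, str(os.getpid()).encode())  # Record the owner for debugging.
        yield
    finally:
        os.close(fd)
        os.unlink(path)
```

Wrap the deploy steps in `with deploy_lock(): ...` and treat `DeployLocked` as "skip this run". A process that is killed outright leaves the file behind, so a real setup also needs a stale-lock policy, for example checking whether the recorded PID is still alive.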

**Failure 2: The noindex massacre.** An agent applied `robots: { index: false }` to 2,600 pages (every product, topic hub, GST reference, and collection page) based on a reasonable-sounding interpretation of "hide thin content from Google." Impressions crashed from 7,500/day to near zero within a week. The fix took 8 days to fully reverse. Lesson: agents must never make bulk SEO changes without explicit human approval, regardless of how logical the reasoning sounds. This is now CLAUDE.md Rule 20.

**Failure 3: The social media suspension.** A content syndication agent posted 190 Mastodon toots in 15 minutes. The API allowed it — rate limits were not exceeded. But the instance moderators flagged it as spam and suspended the account permanently. API rate limits and platform moderation policies are different things. The agent now has a hard cap: 1 post per 30-60 minutes, maximum 20-30 per day, on any social platform.
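A cap like this is easiest to enforce in code rather than in a prompt. Below is a minimal sketch of a throttle with a minimum gap between posts and a rolling 24-hour ceiling. The class name is hypothetical, and the defaults (30 minutes, 20 per day) are the conservative end of the ranges above.

```python
class PostThrottle:
    """Allows a post only if both the spacing and the daily cap permit it."""

    def __init__(self, min_interval_s: float = 30 * 60, daily_cap: int = 20):
        self.min_interval_s = min_interval_s
        self.daily_cap = daily_cap
        self.sent = []  # Timestamps (in seconds) of accepted posts, oldest first.

    def allow(self, now: float) -> bool:
        """Return True and record the post if it may go out at time `now`."""
        day_ago = now - 86_400
        self.sent = [t for t in self.sent if t > day_ago]  # Keep the last 24 hours.
        if len(self.sent) >= self.daily_cap:
            return False
        if self.sent and now - self.sent[-1] < self.min_interval_s:
            return False
        self.sent.append(now)
        return True
```

The syndication job would call `allow(time.time())` before each post and simply defer anything that is refused. In a cron-driven setup the timestamps must be persisted between runs, which this sketch does not do.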

**Failure 4: The OAuth cascade.** Deleting an old Google Cloud OAuth client — which seemed like a cleanup task — invalidated the refresh tokens used by three different scripts on the VPS. The daily analytics report, the GSC sitemap submission, and the GA4 data pipeline all failed silently. None of them had alerting configured for authentication failures. Fix: every API-dependent script now checks its authentication status before executing and sends a Telegram alert on failure.
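The "check auth first, alert on failure" fix can be sketched as a small wrapper. All names here are hypothetical: `check_auth` stands in for whatever cheap authenticated call your API offers, and `alert` stands in for the Telegram sender.

```python
from typing import Callable


def run_with_auth_check(
    name: str,
    check_auth: Callable[[], bool],
    job: Callable[[], None],
    alert: Callable[[str], None],
) -> bool:
    """Run `job` only if `check_auth` passes; otherwise alert and skip.

    Returns True if the job ran, False if it was skipped.
    """
    try:
        ok = check_auth()
    except Exception as exc:  # A revoked token often surfaces as an exception.
        alert(f"{name}: auth check raised {exc!r}")
        return False
    if not ok:
        alert(f"{name}: authentication failed, job skipped")
        return False
    job()
    return True
```

The point of the wrapper is that a dead credential produces a message within one scheduled run, instead of days of silently stale data.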

**Failure 5: The parallel push disaster.** Three subagents ran in parallel, each making changes to different files. Each one committed and pushed independently. Three pushes triggered three GitHub Actions deploys. All three SSHed into the VPS simultaneously and raced on Docker Compose. Result: containers in an inconsistent state, Redis health checks failing, 502 for 12 minutes. Fix: subagents write files but never commit. The parent agent handles all git operations as a single atomic batch.

The fastest path from zero to a running production agent:

**Step 1: Install Claude Code.** If you have not already: `npm install -g @anthropic-ai/claude-code`. Verify with `claude --version`. You need a Pro ($20/month) or Max ($100-200/month) subscription, or an API key.

**Step 2: Create your CLAUDE.md.** Start minimal. Write three things: what the project is, what the agent should never do, and how to verify its work. You will add rules as you discover failure modes — this is expected and healthy.

```
# My Project

## What This Is
Node.js API server with PostgreSQL. Deployed via Docker on a VPS.

## Hard Rules
- Never run DROP TABLE or TRUNCATE without explicit user confirmation
- Never push to main without running tests first
- Never modify .env files

## Verification
After changes, run: npm test && npm run build
Both must pass before committing.
```

**Step 3: Add one MCP server.** Start with something useful and low-risk. The GitHub MCP server is a good first choice:

```
claude mcp add --scope project --transport http github https://api.githubcopilot.com/mcp/
```

**Step 4: Create your first hook.** A SessionStart hook that shows git status gives the agent immediate context about what state the project is in:

```
// .claude/settings.json
{
  "hooks": {
    "SessionStart": [
      {
        "hooks": [
          { "type": "command", "command": "git status && git log --oneline -5" }
        ]
      }
    ]
  }
}
```

**Step 5: Create your first subagent.** A code review agent that runs read-only and checks your work:

```
// .claude/agents/reviewer.md
---
name: reviewer
description: Read-only review of recent changes. Use after any non-trivial edit.
---
# Code Reviewer

Review the most recent changes for:
- Security vulnerabilities (OWASP Top 10)
- Performance issues
- Missing error handling at system boundaries
- Adherence to project conventions in CLAUDE.md

Report findings as: file:line, issue, severity (HIGH/MEDIUM/LOW)
Do NOT modify any files. Read-only analysis only.
```

**Step 6: Run it.** Open Claude Code, type `/agents` to see your available agents, and invoke the reviewer after making some changes. Watch what it catches. Refine the agent's instructions based on what it misses or flags incorrectly.

That is a functional agent setup in under 30 minutes. From here, the path is incremental: add rules to CLAUDE.md when things break, add MCP servers when you need external tool access, add hooks when you need hard guardrails, and add subagents when tasks become complex enough to benefit from specialization.

If I were starting over with everything I know now, three changes would save weeks of debugging:

First, I would write the verification protocol into CLAUDE.md on day one — not after the third silent production break. The pattern is simple: after any change touching more than two files, spawn a read-only verification agent before claiming the work is done. This catches roughly 40% of the bugs that would otherwise reach production.

Second, I would set up Telegram alerting for every API-dependent automation from the start. Silent failures are the most expensive kind. An agent that fails loudly costs you 5 minutes. An agent that fails silently costs you days of stale data and missed opportunities before you notice.

Third, I would resist the temptation to add MCP servers aggressively. My initial setup had 8 servers connected. Tool selection accuracy dropped. Response times increased. Context windows filled with tool definitions instead of project context. I cut back to 3-5 per agent and quality improved immediately.

The production agent landscape in 2026 is still early. Claude Code, Cursor, Codex, and the dozens of agent frameworks competing for adoption are all improving rapidly. But the fundamentals — clear constraints, hard guardrails, cost discipline, and verification before deployment — will outlast any specific tool. Build those habits into your agent architecture from the start, and the specific tools become interchangeable.

Every tool and template mentioned in this guide is available at [wowhow.cloud](https://dev.to/products). The [Claude Code Routines Recipe Pack](https://dev.to/product/claude-code-routines-recipe-pack-v1) includes production-tested CLAUDE.md templates, hook configurations, and agent definitions you can adapt to your own projects. The [Token Counter](https://dev.to/tools/token-counter) and [AI API Cost Calculator](https://dev.to/tools/ai-api-cost-calculator) help estimate costs before committing to an agent architecture.

*Originally published at wowhow.cloud*
