You built an AI agent. In the demo it was magic. In the wild it loops, hallucinates a tool call, "forgets" the format you asked for twice, and occasionally does something mildly alarming with your filesystem.
Here's the uncomfortable truth after shipping a lot of these: a flaky agent is almost never a model problem. It's a specification problem. The model is doing exactly what your prompt, your tools, and your control loop told it to do — which turns out to be far less than you thought you said.
Below are seven rules that consistently move an agent from "cool demo" to "I trust this on real work." None of them require a bigger model. One of them comes with three paste-able guardrails for Claude Code specifically.
"Summarize this well" is unfalsifiable. The agent can't tell when it's done, and neither can you. Replace every vibe with something a script could check:
The reliability win isn't the format; it's that failure becomes detectable. If you can write an assert for the output, you can build a retry around it. If you can't, you have no idea how often it's wrong.
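Here's a minimal sketch of "assert, then retry." It assumes a made-up spec (a JSON object with exactly three bullets, each at most 20 words). The names checkSummary and generateChecked, and the injected generate function that stands in for your model call, are illustrative assumptions rather than part of any SDK.

```javascript
// Hypothetical spec: JSON object with a "bullets" array of exactly 3 strings,
// each at most 20 words. Returns a list of readable failures (empty list = pass).
function checkSummary(raw) {
  let data;
  try {
    data = JSON.parse(raw);
  } catch {
    return ["output is not valid JSON"];
  }
  if (data === null || typeof data !== "object" || !Array.isArray(data.bullets)) {
    return ['missing "bullets" array'];
  }
  const failures = [];
  if (data.bullets.length !== 3) {
    failures.push(`expected 3 bullets, got ${data.bullets.length}`);
  }
  data.bullets.forEach((bullet, i) => {
    if (typeof bullet !== "string") {
      failures.push(`bullet ${i + 1} is not a string`);
      return;
    }
    const words = bullet.trim().split(/\s+/).filter(Boolean).length;
    if (words > 20) failures.push(`bullet ${i + 1} has ${words} words, limit is 20`);
  });
  return failures;
}

// Retry wrapper. `generate(failures)` is an assumed async function that calls the
// model; the previous failures are passed in so the next attempt can address them.
async function generateChecked(generate, maxAttempts = 3) {
  let failures = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await generate(failures);
    failures = checkSummary(raw);
    if (failures.length === 0) return { ok: true, attempts: attempt, raw };
  }
  return { ok: false, attempts: maxAttempts, failures };
}
```

The useful side effect is the number you get for free: log how often attempts exceeds 1 and you know your real failure rate.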
Most "the agent called the wrong tool" bugs are actually "the tool description was ambiguous" bugs. A tool named search that sometimes means web search and sometimes means database lookup will confuse the model. Two rules:
- Make names unambiguous (search_web, not search).
- Make errors loud and readable. "error: 'date' must be YYYY-MM-DD, got '6/27'" beats a stack trace or a silent null. The model can recover from a sentence; it cannot recover from undefined.
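As a rough sketch of the second rule, assume a hypothetical search_web tool that takes a query and an optional date. The handler name, its parameters, the injected doSearch function, and the result shape are all made up for illustration.

```javascript
// Hypothetical tool handler. It never throws and never returns null:
// every failure comes back as one sentence the model can act on.
function searchWebTool(args, doSearch) {
  const { query, date } = args ?? {};
  if (typeof query !== "string" || query.trim() === "") {
    return { error: "error: 'query' must be a non-empty string" };
  }
  if (date !== undefined && !/^\d{4}-\d{2}-\d{2}$/.test(String(date))) {
    return { error: `error: 'date' must be YYYY-MM-DD, got '${date}'` };
  }
  try {
    return { results: doSearch(query, date) };
  } catch (err) {
    // Collapse the exception into a readable line instead of leaking a stack trace.
    const detail = err instanceof Error ? err.message : String(err);
    return { error: `error: search_web failed: ${detail}` };
  }
}

// Example: a bad date produces a sentence, not an exception.
console.log(searchWebTool({ query: "agent evals", date: "6/27" }, () => []));
```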
Every autonomous agent needs three hard limits or it will eventually run forever: a max step count, a wall-clock timeout, and a no-progress detector (if the last two actions are identical, stop). The model will not reliably stop itself. This is your job, in code, outside the prompt.
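One way to wire up those three limits is a small guard object that the control loop consults before running each action. This is a sketch only: the LoopGuard name, its default values, and the shape of an "action" are assumptions, and the clock is injectable purely to make the timeout testable.

```javascript
// Three hard limits enforced in code, outside the prompt.
// Default values are arbitrary placeholders; tune them per task.
class LoopGuard {
  constructor({ maxSteps = 20, timeoutMs = 60000, now = Date.now } = {}) {
    this.maxSteps = maxSteps;
    this.timeoutMs = timeoutMs;
    this.now = now;
    this.startedAt = now();
    this.steps = 0;
    this.lastAction = null;
  }

  // Call once per proposed action. Returns a stop reason, or null to continue.
  check(action) {
    this.steps += 1;
    if (this.steps > this.maxSteps) return `max steps (${this.maxSteps}) exceeded`;
    if (this.now() - this.startedAt > this.timeoutMs) return `timeout after ${this.timeoutMs} ms`;
    const key = JSON.stringify(action);
    if (key === this.lastAction) return "no progress: same action twice in a row";
    this.lastAction = key;
    return null;
  }
}

// Demo with a stub agent that keeps proposing the same action.
const guard = new LoopGuard({ maxSteps: 5 });
let stopReason = null;
while (stopReason === null) {
  const action = { tool: "search_web", args: { query: "same query" } }; // stub proposal
  stopReason = guard.check(action);
}
console.log(stopReason); // prints: no progress: same action twice in a row
```

Comparing only the last two actions misses longer cycles (A, B, A, B). If you see those in practice, keeping a short window of recent action keys is a natural extension.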
Anything you put in the prompt is a request. The model usually honors it. "Usually" is not a security model. If an action is dangerous or irreversible, gate it in code, not in English.
Concretely, in Claude Code you can do this with hooks — deterministic scripts that run before a tool call and can block it. Here are the three I put in almost every project.
Guardrail A: block edits to sensitive paths. A PreToolUse hook that refuses any write to .env, .git/, or anything with "secrets" in its path. Claude Code hands the pending tool call to the hook as JSON on stdin, so the script reads tool_input.file_path from there:
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "node -e \"const i=JSON.parse(require('fs').readFileSync(0,'utf8')); const p=(i.tool_input||{}).file_path||''; if(/(^|\\/)\\.env|(^|\\/)\\.git\\/|secrets/i.test(p)){console.error('Blocked: protected path '+p);process.exit(2)}\""
          }
        ]
      }
    ]
  }
}
Exit code 2 tells Claude Code the action was blocked and feeds the reason back to the model, so it adjusts instead of silently failing.
Guardrail B: never let "rm -rf" through. A PreToolUse matcher on Bash that hard-fails on obviously destructive commands, read from tool_input.command in the JSON the hook receives on stdin. This entry goes inside the PreToolUse array:
{
  "matcher": "Bash",
  "hooks": [{
    "type": "command",
    "command": "node -e \"const c=(JSON.parse(require('fs').readFileSync(0,'utf8')).tool_input||{}).command||''; if(/rm\\s+-rf|git\\s+reset\\s+--hard|:\\(\\)\\{/.test(c)){console.error('Blocked destructive command');process.exit(2)}\""
  }]
}
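One-liners inside JSON are hard to read and harder to test, so here is a sketch of the same two checks as plain functions. It assumes the hook input arrives as JSON on stdin with tool_name and tool_input fields (the shape described in the Claude Code hooks documentation). The decide function and the guard.js file name are my own inventions.

```javascript
// Same patterns as the one-liners, pulled out so they can be unit tested.
const PROTECTED_PATH = /(^|\/)\.env|(^|\/)\.git\/|secrets/i;
const DESTRUCTIVE_CMD = /rm\s+-rf|git\s+reset\s+--hard|:\(\)\{/;

// Takes the raw JSON a PreToolUse hook receives on stdin and returns
// the exit code and stderr message the hook should produce.
function decide(rawJson) {
  const call = JSON.parse(rawJson);
  const tool = call.tool_name;
  const input = call.tool_input || {};
  if ((tool === "Edit" || tool === "Write") && PROTECTED_PATH.test(input.file_path || "")) {
    return { exitCode: 2, message: `Blocked: protected path ${input.file_path}` };
  }
  if (tool === "Bash" && DESTRUCTIVE_CMD.test(input.command || "")) {
    return { exitCode: 2, message: "Blocked destructive command" };
  }
  return { exitCode: 0, message: "" };
}

// Wiring sketch for a script file (for example guard.js, a hypothetical name):
//   const out = decide(require("fs").readFileSync(0, "utf8"));
//   if (out.message) console.error(out.message);
//   process.exit(out.exitCode);

// Quick check with a sample payload.
const sample = JSON.stringify({ tool_name: "Bash", tool_input: { command: "rm -rf /tmp/x" } });
console.log(decide(sample));
```

Keep in mind this is a denylist, not a sandbox: "rm -fr" or "rm -r -f" would slip past the pattern as written. Treat it as a seatbelt for the obvious cases and extend the patterns as you find new ones.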
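Guardrail C: auto-format after every edit, so the agent's output is consistent without you asking each time. Because it has to run after the file is written, this entry belongs under PostToolUse, and like the others it reads the edited path from tool_input.file_path on stdin:
{
  "matcher": "Edit|Write",
  "hooks": [{ "type": "command", "command": "node -e \"const f=(JSON.parse(require('fs').readFileSync(0,'utf8')).tool_input||{}).file_path; if(f){try{require('child_process').execFileSync('prettier',['--write',f],{stdio:'ignore'})}catch{}}\"" }]
}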
The pattern that matters here: the model proposes, deterministic code disposes. Hooks run whether or not the model "felt like" honoring an instruction.
If your agent's "memory" is just the growing chat transcript, it will drift — early instructions get diluted, and cost climbs every turn. Keep the durable facts (the task, the constraints, what's been done) in a small structured object you re-inject each step, and let the transcript be disposable. An agent that re-reads its own goal every loop stays on task far longer than one trusting a 40-message context to hold.
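A minimal sketch of what "externalize state" can look like. The field names, the role/content message shape, and the keepLast cutoff are all assumptions; adapt them to whatever your model API expects.

```javascript
// Durable facts live in a small object, not in the transcript.
function createState(task, constraints) {
  return { task, constraints, done: [] };
}

// Record a finished step without mutating the previous state.
function recordStep(state, summary) {
  return { ...state, done: [...state.done, summary] };
}

// Build the messages for the next model call: the full state every time,
// plus only the last few transcript messages. Older messages are dropped.
function buildMessages(state, transcript, keepLast = 4) {
  const bullets = (items) => items.map((item) => `- ${item}`).join("\n");
  const stateBlock = [
    `TASK: ${state.task}`,
    `CONSTRAINTS:\n${bullets(state.constraints)}`,
    `DONE SO FAR:\n${bullets(state.done) || "- nothing yet"}`,
  ].join("\n\n");
  const recent = keepLast > 0 ? transcript.slice(-keepLast) : [];
  return [{ role: "system", content: stateBlock }, ...recent];
}

// Example: ten transcript messages go in, the state block plus the last four come out.
let state = createState("Migrate the billing tests to the new runner", ["Do not touch src/"]);
state = recordStep(state, "Converted tests/invoice.test.js");
const transcript = Array.from({ length: 10 }, (_, i) => ({ role: "user", content: `message ${i}` }));
console.log(buildMessages(state, transcript).length); // prints: 5
```

The cost of each call now depends on the size of the state object and keepLast, not on how long the run has been going.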
You wouldn't ship a function with zero tests; don't ship an agent with zero either. Build a tiny eval set — even 10–15 representative inputs with checkable expected properties (rule #1 makes this possible). Run it on every prompt change. The first time a "harmless" prompt tweak silently breaks 3 of 12 cases, you'll understand why this is the highest-leverage 30 minutes in the whole project.
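A tiny eval harness can be this small. The three cases and the stub agent below are invented for illustration; in a real project the agent argument would be your actual agent call and the cases would come from inputs you've seen in the wild.

```javascript
// Each case pairs an input with a property check on the output, not an exact string.
const evalCases = [
  {
    name: "reply is valid JSON",
    input: "Where is order 1234?",
    check: (out) => typeof JSON.parse(out) === "object",
  },
  {
    name: "reply mentions the order id",
    input: "Where is order 1234?",
    check: (out) => out.includes("1234"),
  },
  {
    name: "reply stays under 300 characters",
    input: "Tell me everything about your refund policy.",
    check: (out) => out.length < 300,
  },
];

// Runs every case and reports which ones failed. A thrown error counts as a failure.
async function runEvals(agent, cases) {
  const failures = [];
  for (const c of cases) {
    let passed = false;
    try {
      passed = c.check(await agent(c.input)) === true;
    } catch {
      passed = false;
    }
    if (!passed) failures.push(c.name);
  }
  return { total: cases.length, passed: cases.length - failures.length, failures };
}

// Stand-in for the real agent call.
const stubAgent = async (input) => JSON.stringify({ reply: `Looking into: ${input}` });

runEvals(stubAgent, evalCases).then((report) => {
  console.log(`${report.passed}/${report.total} passed`, report.failures);
  if (report.failures.length > 0) process.exitCode = 1; // make CI fail on a regression
});
```

Setting a non-zero exit code is the part that matters: it turns "I should re-check the prompt" into something your CI does whether you remember or not.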
When the agent is uncertain or a guardrail trips, the worst outcome is confidently shipping a wrong answer. Design an explicit "I'm not sure, here's why" path. A reliable agent that occasionally says "I couldn't verify X, stopping" earns more trust than a confident one that's wrong 5% of the time on work that matters.
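In code, the "I'm not sure" path can be as simple as making the final result a tagged value that callers have to inspect. This is a sketch under my own assumptions: the finalize name, the status strings, and the shape of a check ({ claim, passed }) are hypothetical.

```javascript
// The final result is one of two shapes. There is no silent third option.
// `checks` is a list of { claim, passed } produced by whatever verification you run.
function finalize(answer, checks) {
  const unverified = checks.filter((c) => !c.passed).map((c) => c.claim);
  if (unverified.length > 0) {
    return {
      status: "needs_human",
      reason: `I couldn't verify: ${unverified.join("; ")}. Stopping.`,
      draft: answer, // kept for the reviewer, never shipped automatically
    };
  }
  return { status: "ok", answer };
}

// Example: one failed check is enough to route the result to a person.
const result = finalize("Refund of $40 issued for order 1234", [
  { claim: "order 1234 exists", passed: true },
  { claim: "refund amount matches invoice", passed: false },
]);
console.log(result.status, "|", result.reason);
```

Downstream code then branches on status, so shipping an unverified answer takes a deliberate decision instead of happening by default.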
Reliability is not a model you buy; it's a discipline you impose. Make success checkable (#1, #6), make tools unambiguous and loud (#2), bound the loop (#3), enforce the dangerous stuff in code rather than in prose (#4), externalize state (#5), and fail loud to a human (#7). Do those and an ordinary model behaves like a dependable one.
If you want this as a printable checklist plus three more ready-to-paste Claude Code guardrails, I wrote a free field guide (no email required): Ship a Reliable AI Agent — the Free Field Guide. It's the 10-minute version of everything above.
What's the failure mode that bit you hardest building agents? I collect these — the patterns repeat more than you'd think.