Your Agent Just Called the Same Tool 47 Times. Here's the 20-Line Detector.

The article describes a common failure in AI agent systems where an agent repeatedly calls the same tool with identical arguments, wasting money (citing a case where a user lost $47,000 in one weekend). It argues that the standard fix of setting a maximum iteration limit is flawed because it penalizes legitimate long-running tasks while still allowing repetitive loops. Instead, the author provides a 20-line Python detector that uses a sliding window to track repeated (tool_name, args_hash) pairs and raises an alert if the same call appears too many times within a short window.

- Book: AI Agents Pocket Guide: Patterns for Building Autonomous Systems with LLMs https://www.amazon.com/dp/B0GYJZ2XJD
- Also by me: Thinking in Go 2-book series: Complete Guide to Go Programming https://xgabriel.com/go-book + Hexagonal Architecture in Go https://xgabriel.com/hexagonal-go
- My project: Hermes IDE https://hermes-ide.com | GitHub https://github.com/hermes-hq/hermes-ide (an IDE for developers who ship with Claude Code and other AI coding tools)
- Me: xgabriel.com https://xgabriel.com | GitHub https://github.com/gabrielanhaia

The $47K loop

A LangChain user burned roughly $47,000 in a single weekend because their agent looped on one tool call. The story made the rounds on Twitter and HN in 2023, and the shape of the failure has not changed. The agent called the same retrieval tool, with the same arguments, over and over, while the framework happily fed every result back into the next prompt and billed each round.

Ten seconds with the trace and you'd see it. Forty-seven spans in a row, same `tool_name`, same args payload, different timestamps. No human writes that. No model wants to write that. But put a tool-using agent in front of a fuzzy question with a slightly-broken tool and it'll grind on the same call until something kills it.

The thing that should have killed it is twenty lines of Python. It doesn't live in the agent. It lives in the trace pipeline, so it survives framework swaps, model upgrades, and the next refactor your team does at 4pm on a Friday.

Why max iterations is the wrong knob

The advice you get on the first page of Google is "set `max_iterations=10`". This is wrong for the same reason a speed limit on a residential street is wrong for a highway. It punishes legitimate work.

Consider two agents running in the same product.

Agent A is a deep research assistant. It pulls a PDF, runs a search, summarizes, follows three citations, runs three more searches, dedupes the findings, and writes a memo.
Eighty tool calls, all different, all useful. The user paid for that depth.

Agent B is a question-answerer with a flaky vector index. On query 1 it calls `search_docs(query="refund policy")`. The result is empty because of a stale embedding. The agent reasons "I should try again" and calls `search_docs(query="refund policy")` a second time. Then a third. By step 7 it has called the exact same tool with the exact same arguments seven times in a row.

A depth limit at 10 cuts off Agent A before it finishes and lets Agent B burn all ten iterations, six more than it should have, before it trips. You want the opposite: Agent A running as long as it's making progress, Agent B dying at iteration 4.

Repetition is the signal, not depth.

The detector in 20 lines

Here it is. A sliding-window counter keyed on `(tool_name, args_hash)`. Push every tool invocation. If any key shows up `threshold` times in the last `window` calls, raise.

```python
from collections import deque, Counter
from dataclasses import dataclass, field
import hashlib
import json


class LoopDetected(Exception):
    pass


@dataclass
class LoopDetector:
    window: int = 10
    threshold: int = 4
    _calls: deque = field(default_factory=deque)

    def observe(self, tool_name: str, args: dict) -> None:
        key = (tool_name, args_hash(args))
        self._calls.append(key)
        # keep only the most recent `window` calls
        if len(self._calls) > self.window:
            self._calls.popleft()
        counts = Counter(self._calls)
        most_common_key, hits = counts.most_common(1)[0]
        if hits >= self.threshold:
            raise LoopDetected(
                f"{most_common_key[0]} called {hits}x "
                f"in last {len(self._calls)} steps"
            )


def args_hash(args: dict) -> str:
    canonical = json.dumps(canonicalize(args), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


VOLATILE_KEYS = {
    "timestamp", "request_id", "trace_id", "span_id",
    "nonce", "now", "_ts", "correlation_id",
}


def canonicalize(value):
    # strip keys that change every call but don't change intent
    if isinstance(value, dict):
        return {
            k: canonicalize(v)
            for k, v in value.items()
            if k not in VOLATILE_KEYS
        }
    if isinstance(value, list):
        return [canonicalize(v) for v in value]
    return value
```

That's the whole detector. Twenty-ish lines depending on how you count the imports. Drop it in, call `observe` after every tool invocation, catch `LoopDetected`, do something useful.

The hash is truncated to 16 hex chars (64 bits). Collisions don't matter here. A collision can only produce a false positive (two distinct calls hashing the same), and at 64 bits over a 10-call window that is statistically irrelevant. It cannot produce a false negative, because identical canonical arguments always hash to the same value.

Where to put it

You have three options, ranked from worst to best.

Inside the agent loop. You import `LoopDetector` into your agent runner and call `observe` after each tool call. Easy. Also brittle. The day you swap LangChain for LangGraph, or move from one framework to another mid-quarter, the detector goes with the old code. You also have to remember to instrument every new agent. The third agent your team ships in a hurry won't have it.

Framework callback. LangChain has `BaseCallbackHandler`, LangGraph has node hooks, OpenAI's Agents SDK has lifecycle events. You write one callback that calls `observe`. Better than inline. Still framework-specific. Still dies when you swap.

OTel span exporter. This is where it belongs. Your traces already flow through an exporter. Add a `SpanProcessor` that watches for tool-call spans and runs the detector on them. Framework-agnostic. Cannot be forgotten.
Catches every agent in your fleet whether it was shipped today or last quarter.

The placement looks like this:

```python
import json
import logging
from collections import defaultdict

from opentelemetry.sdk.trace import SpanProcessor
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# LoopDetector and LoopDetected come from the detector block above.
logger = logging.getLogger(__name__)


class LoopDetectingProcessor(SpanProcessor):
    def __init__(self, inner: SpanProcessor):
        self.inner = inner
        # one detector per trace_id (evict finished traces in production)
        self._detectors = defaultdict(LoopDetector)

    def on_start(self, span, parent_context=None):
        self.inner.on_start(span, parent_context)

    def on_end(self, span) -> None:
        # GenAI semconv tool spans are named "execute_tool {tool name}"
        if span.name.startswith("execute_tool"):
            attrs = span.attributes or {}
            tool_name = attrs.get("gen_ai.tool.name", "unknown")
            # tool args often live under gen_ai.tool.call.arguments
            raw_args = attrs.get("gen_ai.tool.call.arguments", "{}")
            try:
                args = json.loads(raw_args)
            except (TypeError, ValueError):
                args = {"_raw": str(raw_args)}

            trace_id = format(span.context.trace_id, "032x")
            detector = self._detectors[trace_id]
            try:
                detector.observe(tool_name, args)
            except LoopDetected as exc:
                # The span has already ended, so its attributes can no
                # longer be changed. Log it, then signal the agent runtime
                # via your own channel: Redis pub/sub, a kill flag in DB, etc.
                logger.warning("loop detected in trace %s: %s", trace_id, exc)
        self.inner.on_end(span)

    def shutdown(self):
        self.inner.shutdown()

    def force_flush(self, timeout_millis=30000):
        return self.inner.force_flush(timeout_millis)


# Wiring, given your existing `provider` and `exporter`:
# provider.add_span_processor(
#     LoopDetectingProcessor(BatchSpanProcessor(exporter)))
```

You wrap your existing processor (typically a `BatchSpanProcessor` around your exporter) and register the wrapper on the tracer provider. The detector now sees every tool span from every agent your platform runs. The attribute names follow the OpenTelemetry GenAI semantic conventions (`gen_ai.tool.name`, `gen_ai.tool.call.arguments`), so this code works with anything that emits those spans.

Tuning window and threshold

Defaults that hold up in practice: `window=10`, `threshold=4`.

The reasoning. A well-behaved ReAct agent revisiting a tool because the first result was unclear will hit it twice, maybe three times with slightly different arguments.
Four identical calls in ten steps means it's not exploring. It's stuck. Pushing `threshold` to 3 catches loops one step earlier but flags some legitimate retries. Pushing it to 5 lets one extra wasted call through per loop, which at GPT-4-class token rates is real money.

If your agents have exponential backoff baked in (call, wait, call again, wait longer), widen the window to 15-20 and keep `threshold` at 4. The backoff stretches the repetition over more steps, so a wider window catches it without being trigger-happy on legitimate retries.

If your tool catalog is small (3-5 tools) and the agent legitimately revisits one tool a lot, like `read_file` in a coding agent or `search_web` in a research agent, key on `(tool_name, args_hash)`, not just `tool_name`. The args hash is what separates "called `search_web` 8 times with 8 different queries" (fine) from "called `search_web` 8 times with the same query" (broken).

What to do on detection

Three options, in increasing order of how much you trust your agent.

Killswitch. Default. Raise an exception, log the loop, return a structured error to the caller. Cheap and safe. The user retries.

Downgrade with a prompt. Inject a system message: "You have called `search_docs` four times with the same arguments. The tool is returning the same result. Try a different approach or stop and report what you've found." The model usually breaks out. Sometimes it doesn't, and then the killswitch fires on the next observation.

Page on-call. For agents where loops mean a real outage (say, an internal autonomous tool with no user retry), wire `LoopDetected` to PagerDuty. Rare, but for the agents that should never loop, the page is the right shape.

Start with the killswitch. Move to downgrade-with-prompt only after you have data on which loops are recoverable.

Two edge cases that bite

Non-deterministic args. The hash will diverge on every call if your tool args include a timestamp, a request ID, or a nonce.
The canonicalizer above strips a known set of volatile keys (`timestamp`, `request_id`, `trace_id`, `span_id`, `nonce`, `now`, `_ts`, `correlation_id`) before hashing. Add to that set when you hit a new volatile field in your own tool schemas. The agent that smuggles `created_at: <now>` into its args is the agent whose loop you'll never catch otherwise.

Streaming tool calls. Some frameworks emit partial spans while a tool call is still running. Filter to spans with a `gen_ai.tool.call.id` and ignore any where the call is still streaming. Otherwise you'll count one slow tool call as multiple observations and false-positive yourself.

Where this fits in your stack

The detector is one of three runtime guards every production agent should have.

Token budget. A hard cap on cumulative input + output tokens per agent invocation. Catches the "the prompt grew to 200K tokens" failure mode that loop detection misses.

Loop detector. The thing in this post. Catches stuck repetition.

Goal-completion verifier. A separate small LLM call at the end that checks "did this agent actually do what the user asked, or did it produce confident-sounding output that misses the point?" Catches the "ran for 30 steps, produced garbage" failure that the first two miss.

Run all three in the trace pipeline, not inside the agent. The agent is the unreliable part. The pipeline is where the guards go.

What's the worst agent loop you've shipped to production? Drop the trace in the comments. I want to know if anyone has beaten 47.

If this was useful

The runtime guard triad (token budget, loop detector, goal verifier) is one of the patterns in AI Agents Pocket Guide: Patterns for Building Autonomous Systems with LLMs https://www.amazon.com/dp/B0GYJZ2XJD. The book covers the rest of the production checklist: tool catalog discipline, sub-agent boundaries, replay and drift detection, and the trace-layer instrumentation that makes all of it observable.
If you're shipping agents and want the patterns laid out in one place, that's the book.
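
To see the Agent A versus Agent B contrast from the post play out, here is a minimal, self-contained sketch. It is a condensed stand-in for the detector, not the exact code from the post: `MiniLoopDetector`, `steps_until_trip`, and the two synthetic call lists are hypothetical names made up for this illustration, and the volatile-key stripping only looks at top-level keys.

```python
import hashlib
import json
from collections import Counter, deque

# Hypothetical, trimmed-down volatile key set (illustration only).
VOLATILE_KEYS = {"timestamp", "request_id", "nonce"}


class LoopDetected(Exception):
    pass


def args_hash(args: dict) -> str:
    # Drop top-level volatile keys, then hash a canonical JSON encoding.
    stable = {k: v for k, v in args.items() if k not in VOLATILE_KEYS}
    canonical = json.dumps(stable, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


class MiniLoopDetector:
    def __init__(self, window: int = 10, threshold: int = 4):
        self.threshold = threshold
        # deque(maxlen=...) evicts the oldest call automatically
        self.calls = deque(maxlen=window)

    def observe(self, tool_name: str, args: dict) -> None:
        self.calls.append((tool_name, args_hash(args)))
        (name, _), hits = Counter(self.calls).most_common(1)[0]
        if hits >= self.threshold:
            raise LoopDetected(
                f"{name} called {hits}x in last {len(self.calls)} steps"
            )


def steps_until_trip(calls, **kwargs):
    """Return the 1-based step at which the detector fires, or None."""
    detector = MiniLoopDetector(**kwargs)
    for step, (tool_name, args) in enumerate(calls, start=1):
        try:
            detector.observe(tool_name, args)
        except LoopDetected:
            return step
    return None


# Agent A: 80 calls to one tool, every query different.
agent_a = [("search_web", {"query": f"citation {i}"}) for i in range(80)]

# Agent B: the same query every time; only the timestamp changes.
agent_b = [
    ("search_docs", {"query": "refund policy", "timestamp": i})
    for i in range(47)
]

print(steps_until_trip(agent_a))  # None
print(steps_until_trip(agent_b))  # 4
```

Agent A runs all 80 steps untouched because no `(tool_name, args_hash)` pair repeats, while Agent B is stopped at step 4 even though its timestamp differs on every call. A `max_iterations=10` cap would have done the reverse.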