{"slug": "your-agent-just-called-the-same-tool-47-times-here-s-the-20-line-detector", "title": "Your Agent Just Called the Same Tool 47 Times. Here's the 20-Line Detector.", "summary": "The article describes a common failure in AI agent systems where an agent repeatedly calls the same tool with identical arguments, wasting money (citing a case where a user lost $47,000 in one weekend). It argues that the standard fix of setting a maximum iteration limit is flawed because it penalizes legitimate long-running tasks while still allowing repetitive loops. Instead, the author provides a 20-line Python detector that uses a sliding window to track repeated (tool_name, args_hash) pairs and raises an alert if the same call appears too many times within a short window.", "body_md": "- **Book:** [AI Agents Pocket Guide: Patterns for Building Autonomous Systems with LLMs](https://www.amazon.com/dp/B0GYJZ2XJD)\n- **Also by me:** *Thinking in Go* (2-book series): [Complete Guide to Go Programming](https://xgabriel.com/go-book) + [Hexagonal Architecture in Go](https://xgabriel.com/hexagonal-go)\n- **My project:** [Hermes IDE](https://hermes-ide.com) | [GitHub](https://github.com/hermes-hq/hermes-ide), an IDE for developers who ship with Claude Code and other AI coding tools\n- **Me:** [xgabriel.com](https://xgabriel.com) | [GitHub](https://github.com/gabrielanhaia)\n\n## The $47K loop\n\nA LangChain user burned roughly $47,000 in a single weekend because their agent looped on one tool call. The story made the rounds on Twitter and HN in 2023, and the shape of the failure has not changed. The agent called the same retrieval tool, with the same arguments, over and over, while the framework happily fed every result back into the next prompt and billed each round.\n\nTen seconds with the trace and you'd see it. Forty-seven spans in a row, same `tool_name`, same `args` payload, different timestamps. No human writes that. 
No model wants to write that. But put a tool-using agent in front of a fuzzy question with a slightly-broken tool and it'll grind on the same call until something kills it.\n\nThe thing that should have killed it is twenty lines of Python. It doesn't live in the agent. It lives in the trace pipeline, so it survives framework swaps, model upgrades, and the next refactor your team does at 4pm on a Friday.\n\n## Why `max_iterations` is the wrong knob\n\nThe advice you get on the first page of Google is \"set `max_iterations=10`\". This is wrong for the same reason a speed limit on a residential street is wrong for a highway. It punishes legitimate work.\n\nConsider two agents running in the same product.\n\nAgent A is a deep research assistant. It pulls a PDF, runs a search, summarizes, follows three citations, runs three more searches, dedupes the findings, and writes a memo. Eighty tool calls, all different, all useful. The user paid for that depth.\n\nAgent B is a question-answerer with a flaky vector index. On query #1 it calls `search_docs(query=\"refund policy\")`. The result is empty because of a stale embedding. The agent reasons \"I should try again\" and calls `search_docs(query=\"refund policy\")` a second time. Then a third. By step 7 it has called the exact same tool with the exact same arguments seven times in a row.\n\nA depth limit at 10 cuts off Agent A long before it finishes and lets Agent B burn all ten iterations, six more than it takes to prove it is stuck. You want the opposite: Agent A running as long as it's making progress, Agent B dying at iteration 4. Repetition is the signal, not depth.\n\n## The detector in 20 lines\n\nHere it is. A sliding-window counter keyed on `(tool_name, args_hash)`. Push every tool invocation. 
If any key shows up `threshold` times in the last `window` calls, raise.\n\n``` python\nfrom collections import deque, Counter\nfrom dataclasses import dataclass, field\nimport hashlib\nimport json\n\nclass LoopDetected(Exception):\n    pass\n\n@dataclass\nclass LoopDetector:\n    window: int = 10\n    threshold: int = 4\n    _calls: deque = field(default_factory=deque)\n\n    def observe(self, tool_name: str, args: dict) -> None:\n        key = (tool_name, _args_hash(args))\n        self._calls.append(key)\n        # keep only the most recent `window` calls\n        if len(self._calls) > self.window:\n            self._calls.popleft()\n        counts = Counter(self._calls)\n        most_common_key, hits = counts.most_common(1)[0]\n        if hits >= self.threshold:\n            raise LoopDetected(\n                f\"{most_common_key[0]} called {hits}x \"\n                f\"in last {len(self._calls)} steps\"\n            )\n\ndef _args_hash(args: dict) -> str:\n    # default=str keeps non-JSON values (datetime, Path, ...)\n    # from raising TypeError during hashing\n    canonical = json.dumps(\n        _canonicalize(args), sort_keys=True, default=str\n    )\n    return hashlib.sha256(canonical.encode()).hexdigest()[:16]\n\n_VOLATILE_KEYS = {\n    \"timestamp\", \"request_id\", \"trace_id\", \"span_id\",\n    \"nonce\", \"now\", \"_ts\", \"correlation_id\",\n}\n\ndef _canonicalize(value):\n    # strip keys that change every call but don't change intent\n    if isinstance(value, dict):\n        return {\n            k: _canonicalize(v)\n            for k, v in value.items()\n            if k not in _VOLATILE_KEYS\n        }\n    if isinstance(value, list):\n        return [_canonicalize(v) for v in value]\n    return value\n```\n\nThat's the whole detector. Twenty-ish lines for the class itself; the hashing helpers and imports add a few more. Drop it in, call `observe()` after every tool invocation, catch `LoopDetected`, do something useful.\n\nThe hash is truncated to 16 hex chars (64 bits). Collisions don't matter here. A false positive (two distinct calls hashing the same) needs a collision among at most `window` keys, which at 64 bits you will not see in practice, and even then it only inflates one count by one. 
A false negative can't come from a collision at all: identical calls always hash identically, so a real loop is always counted.\n\n## Where to put it\n\nYou have three options, ranked from worst to best.\n\n**Inside the agent loop.** You import `LoopDetector` into your agent runner and call `observe()` after each tool call. Easy. Also brittle. The day you swap LangChain for LangGraph, or move from one framework to another mid-quarter, the detector goes with the old code. You also have to remember to instrument every new agent. The third agent your team ships in a hurry won't have it.\n\n**Framework callback.** LangChain has `BaseCallbackHandler`, LangGraph has node hooks, OpenAI's Agents SDK has lifecycle events. You write one callback that calls `observe()`. Better than inline. Still framework-specific. Still dies when you swap.\n\n**OTel span processor.** This is where it belongs. Your traces already flow through a span processor on their way to an exporter. Add a `SpanProcessor` that watches for tool-call spans and runs the detector on them. Framework-agnostic. Cannot be forgotten. 
Catches every agent in your fleet whether it was shipped today or last quarter.\n\nThe placement looks like this:\n\n``` python\nimport json\nimport logging\nfrom collections import defaultdict\n\nfrom opentelemetry.sdk.trace import SpanProcessor\nfrom opentelemetry.sdk.trace.export import (\n    BatchSpanProcessor,\n)\n\n# LoopDetector and LoopDetected come from the block above\nlogger = logging.getLogger(__name__)\n\nclass LoopDetectingProcessor(SpanProcessor):\n    def __init__(self, inner: SpanProcessor):\n        self.inner = inner\n        # one detector per trace_id\n        self._detectors = defaultdict(LoopDetector)\n\n    def on_start(self, span, parent_context=None):\n        self.inner.on_start(span, parent_context)\n\n    def on_end(self, span) -> None:\n        attrs = span.attributes or {}\n        trace_id = format(span.context.trace_id, \"032x\")\n        # GenAI semconv: tool spans set gen_ai.operation.name\n        # to \"execute_tool\" and are named \"execute_tool {tool}\"\n        if (\n            attrs.get(\"gen_ai.operation.name\") == \"execute_tool\"\n            or span.name.startswith(\"execute_tool\")\n        ):\n            tool_name = attrs.get(\n                \"gen_ai.tool.name\", \"unknown\"\n            )\n            # tool args often live under gen_ai.tool.call.arguments\n            raw_args = attrs.get(\n                \"gen_ai.tool.call.arguments\", \"{}\"\n            )\n            try:\n                args = json.loads(raw_args)\n            except (TypeError, ValueError):\n                args = {\"_raw\": str(raw_args)}\n\n            try:\n                self._detectors[trace_id].observe(tool_name, args)\n            except LoopDetected as exc:\n                # on_end receives a read-only span, so attributes\n                # can no longer be set on it; log instead\n                logger.warning(\n                    \"loop detected in trace %s: %s\",\n                    trace_id, exc,\n                )\n                # signal the agent runtime via your own channel:\n                # Redis pub/sub, a kill flag in DB, etc.\n        if span.parent is None:\n            # root span ended: drop this trace's detector so the\n            # map does not grow without bound\n            self._detectors.pop(trace_id, None)\n        self.inner.on_end(span)\n\n    def shutdown(self):\n        self.inner.shutdown()\n\n    def force_flush(self, timeout_millis=30000):\n        return self.inner.force_flush(timeout_millis)\n\n# usage, with `provider` and `exporter` being your own objects:\n# provider.add_span_processor(\n#     LoopDetectingProcessor(BatchSpanProcessor(exporter))\n# )\n```\n\nYou wrap your existing span processor (typically a `BatchSpanProcessor` around your exporter) and register the wrapper on the tracer provider. 
The detector now sees every tool span from every agent your platform runs. The attribute names follow the OpenTelemetry GenAI semantic conventions (`gen_ai.tool.name`, `gen_ai.tool.call.arguments`), so this code works with anything that emits those spans.\n\n## Tuning window and threshold\n\nDefaults that hold up in practice: `window=10`, `threshold=4`.\n\nThe reasoning. A well-behaved ReAct agent revisiting a tool because the first result was unclear will hit it twice, maybe three times with slightly different arguments. Four identical calls in ten steps means it's not exploring. It's stuck. Pushing threshold to 3 catches loops one step earlier but flags some legitimate retries. Pushing it to 5 lets one extra wasted call through per loop, which at GPT-4-class token rates is real money.\n\nIf your agents have exponential backoff baked in (call, wait, call again, wait longer), widen the window to 15-20 and keep threshold at 4. The backoff stretches the repetition over more steps, so a wider window catches it without being trigger-happy on legitimate retries.\n\nIf your tool catalog is small (3-5 tools) and the agent legitimately revisits one tool a lot, like `read_file` in a coding agent or `search_web` in a research agent, key on `(tool_name, args_hash)`, not just `tool_name`. The args hash is what separates \"called search_web 8 times with 8 different queries\" (fine) from \"called search_web 8 times with the same query\" (broken).\n\n## What to do on detection\n\nThree options, in increasing order of how much you trust your agent.\n\n**Killswitch.** Default. Raise an exception, log the loop, return a structured error to the caller. Cheap and safe. The user retries.\n\n**Downgrade with a prompt.** Inject a system message: *\"You have called search_docs four times with the same arguments. The tool is returning the same result. Try a different approach or stop and report what you've found.\"* The model usually breaks out. 
Sometimes it doesn't, and then the killswitch fires on the next observation.\n\n**Page on-call.** For agents where loops mean a real outage (say, an internal autonomous tool with no user retry), wire `LoopDetected` to PagerDuty. Rare, but for the agents that should never loop, the page is the right shape.\n\nStart with the killswitch. Move to downgrade-with-prompt only after you have data on which loops are recoverable.\n\n## Two edge cases that bite\n\n**Non-deterministic args.** The hash will diverge on every call if your tool args include a timestamp, a request ID, or a nonce. The canonicalizer above strips a known set of volatile keys (`timestamp`, `request_id`, `trace_id`, `span_id`, `nonce`, `now`, `_ts`, `correlation_id`) before hashing. Add to that set when you hit a new volatile field in your own tool schemas. The agent that smuggles `created_at: <now>` into its args is the agent whose loop you'll never catch otherwise.\n\n**Streaming tool calls.** Some frameworks emit partial spans while a tool call is still running. Filter to spans with a `gen_ai.tool.call.id` and ignore any where the call is still streaming. Otherwise you'll count one slow tool call as multiple observations and false-positive yourself.\n\n## Where this fits in your stack\n\nThe detector is one of three runtime guards every production agent should have.\n\n**Token budget.** A hard cap on cumulative input + output tokens per agent invocation. Catches the \"the prompt grew to 200K tokens\" failure mode that loop detection misses.\n\n**Loop detector.** The thing in this post. Catches stuck repetition.\n\n**Goal-completion verifier.** A separate small LLM call at the end that checks \"did this agent actually do what the user asked, or did it produce confident-sounding output that misses the point?\" Catches the \"ran for 30 steps, produced garbage\" failure that the first two miss.\n\nRun all three in the trace pipeline, not inside the agent. 
The agent is the unreliable part. The pipeline is where the guards go.\n\nWhat's the worst agent loop you've shipped to production? Drop the trace in the comments. I want to know if anyone has beaten 47.\n\n## If this was useful\n\nThe runtime guard triad (token budget, loop detector, goal verifier) is one of the patterns in [AI Agents Pocket Guide: Patterns for Building Autonomous Systems with LLMs](https://www.amazon.com/dp/B0GYJZ2XJD). The book covers the rest of the production checklist: tool catalog discipline, sub-agent boundaries, replay and drift detection, and the trace-layer instrumentation that makes all of it observable. If you're shipping agents and want the patterns laid out in one place, that's the book.", "url": "https://wpnews.pro/news/your-agent-just-called-the-same-tool-47-times-here-s-the-20-line-detector", "canonical_source": "https://dev.to/gabrielanhaia/your-agent-just-called-the-same-tool-47-times-heres-the-20-line-detector-59f1", "published_at": "2026-05-23 16:52:09+00:00", "updated_at": "2026-05-23 17:02:16.710915+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "developer-tools"], "entities": ["LangChain"], "alternates": {"html": "https://wpnews.pro/news/your-agent-just-called-the-same-tool-47-times-here-s-the-20-line-detector", "markdown": "https://wpnews.pro/news/your-agent-just-called-the-same-tool-47-times-here-s-the-20-line-detector.md", "text": "https://wpnews.pro/news/your-agent-just-called-the-same-tool-47-times-here-s-the-20-line-detector.txt", "jsonld": "https://wpnews.pro/news/your-agent-just-called-the-same-tool-47-times-here-s-the-20-line-detector.jsonld"}}