Why Rate Limits Kill Your AI Agents in Production (And the Patterns That Actually Work)

A developer explains that LLM API calls fail 1-5% of the time in production due to unhandled 429 errors, not hallucinations. Rate limits, especially tokens per minute (TPM), cause retry storms that spike costs and degrade agent performance. Proper token-aware rate limiting can cut redundant API costs by 40%.

LLM API calls fail between 1% and 5% of the time in production. Not from hallucinations. From 429 errors nobody handled.

You've probably seen this: you ship an agent, everything works in staging, prod hits a burst of traffic, the provider throttles you, and suddenly your agent is retrying forever, burning tokens on every attempt, and the cost graph spikes. The incident isn't model quality. It's the retry loop you forgot to fence.

I've written about production architecture (https://mudassirkhan.me/blog/agentic-ai-production-architecture) for agentic systems before. Rate limiting is the piece that bites teams hardest, so let me go deep on it here.

Most developers come to LLMs from REST APIs, where rate limiting is mostly a nuisance you handle with one retry. With LLMs, the shape of the problem is different. A single agent request isn't one API call. It's potentially dozens: planner calls, tool calls, summarizer calls, verifier calls. They all share the same rate window. One agent serving 10 simultaneous users can hit 200 to 300 API calls per minute before you realize what's happening.

The other thing that surprises teams: LLM rate limits often count tokens, not requests. A request at 50 tokens and a request at 10,000 tokens are not equal, but a requests-per-minute counter treats them the same. You can stay under RPM and blow right past TPM.

The most common production incident isn't a model giving the wrong answer. It's an agent that decides to retry, then retries again, with each retry being a full provider call and no delay logic to protect the window.
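To put rough numbers on the fan-out and on the gap between counting requests and counting tokens, here is a minimal sketch. Everything in it is an assumption for illustration: the `Limits` values (500 RPM, 30,000 TPM), the figure of 25 internal calls per agent request, and the helper name `windowUsage` are hypothetical, not any provider's real quotas or API.

```typescript
// Hypothetical per-minute limits; real values depend on your provider and tier.
interface Limits {
  rpm: number; // requests per minute
  tpm: number; // tokens per minute
}

interface WindowUsage {
  requests: number;
  tokens: number;
  overRpm: boolean;
  overTpm: boolean;
}

// Sum up one minute of calls and compare the totals against both limits.
function windowUsage(tokensPerCall: number[], limits: Limits): WindowUsage {
  const requests = tokensPerCall.length;
  const tokens = tokensPerCall.reduce((sum, t) => sum + t, 0);
  return {
    requests,
    tokens,
    overRpm: requests > limits.rpm,
    overTpm: tokens > limits.tpm,
  };
}

// Fan-out: 10 simultaneous users, each agent request making an assumed
// 25 internal calls (planner, tools, summarizer, verifier).
const callsPerMinute = 10 * 25;

// Assumed limits, chosen only to make the arithmetic easy to follow.
const limits: Limits = { rpm: 500, tpm: 30_000 };

// Three heavy prompts and two tiny ones in the same minute:
// 5 requests, 30,100 tokens.
const usage = windowUsage([10_000, 10_000, 10_000, 50, 50], limits);

console.log({ callsPerMinute, usage });
```

With these assumed numbers, the window holds only 5 requests against a 500 RPM limit, yet the 30,100 tokens already exceed the 30,000 TPM limit. A counter that watches requests alone would report the window as almost empty.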
The retry storm works like this: you hit a rate limit, your agent retries immediately (or with a fixed 1 second delay), the retried call also hits the limit, all the retries queue up at the rate window boundary and fire at once, and now you've turned a temporary throttle into a sustained overload.

In multi-agent systems it compounds. One orchestrator spawning five subagents, each doing its own uncoordinated retries, can turn a single 429 into 50 retry attempts within the same second. This is one of the core failure patterns I covered in why AI agents fail in production (https://mudassirkhan.me/blog/why-ai-agents-fail-production).

Proper rate limit handling can cut redundant API costs by 40%. That's not a small rounding error. That's architectural discipline that pays for itself.

Your LLM provider gives you two limits: requests per minute (RPM) and tokens per minute (TPM). Most teams watch RPM. TPM is usually what breaks you.

Here's why: one request can be 50 tokens or 10,000 tokens. If you only count requests and stay under RPM, a single heavy prompt (large context, long output) can exhaust your TPM budget while you're still well under RPM. The next 30 requests all get 429s even though you've only made 5 calls that minute.

The fix is to count tokens on the way out, not after the call fails:

interface RateLimiter { requestTokens(estimatedTokens: number): Promise