Why Rate Limits Kill Your AI Agents in Production (And the Patterns That Actually Work)

A developer explains that LLM API calls fail 1-5% of the time in production due to unhandled 429 errors, not hallucinations. Rate limits, especially tokens per minute (TPM), cause retry storms that spike costs and degrade agent performance. Proper token-aware rate limiting can cut redundant API costs by 40%.

LLM API calls fail between 1% and 5% of the time in production. Not from hallucinations. From 429 errors nobody handled.

You've probably seen this: you ship an agent, everything works in staging, prod hits a burst of traffic, the provider throttles you, and suddenly your agent is retrying forever, burning tokens on every attempt, and the cost graph spikes. The incident isn't model quality. It's the retry loop you forgot to fence.

I've written about production architecture for agentic systems before (https://mudassirkhan.me/blog/agentic-ai-production-architecture). Rate limiting is the piece that bites teams hardest, so let me go deep on it here.

Most developers come to LLMs from REST APIs where rate limiting is mostly a nuisance you handle with one retry. With LLMs, the shape of the problem is different. A single agent request isn't one API call. It's potentially dozens: planner calls, tool calls, summarizer calls, verifier calls. They all share the same rate window. One agent serving 10 simultaneous users can hit 200 to 300 API calls per minute before you realize what's happening.

The other thing that surprises teams: LLM rate limits often count tokens, not requests. A request at 50 tokens and a request at 10,000 tokens are not equal, but a requests-per-minute counter treats them the same. You can stay under RPM and blow right past TPM.

The most common production incident isn't a model giving the wrong answer. It's an agent that decides to retry, then retries again, each retry being a full provider call with no delay logic to protect the window.

This is the retry storm. You hit a rate limit, your agent retries immediately (or with a fixed 1 second delay), the retried call also hits the limit, all the retries queue up at the rate window boundary and fire at once, and now you've turned a temporary throttle into a sustained overload. In multiagent systems it compounds.
One orchestrator spawning five subagents, each doing their own uncoordinated retries, can turn a single 429 into 50 retry attempts within the same second. This is one of the core failure patterns I covered in why AI agents fail in production (https://mudassirkhan.me/blog/why-ai-agents-fail-production).

Proper rate limit handling can cut redundant API costs by 40%. That's not a small rounding error. That's architectural discipline that pays for itself.

Your LLM provider gives you two limits: requests per minute (RPM) and tokens per minute (TPM). Most teams watch RPM. TPM is usually what breaks you.

Here's why: one request can be 50 tokens or 10,000 tokens. If you only count requests and stay under RPM, a single heavy prompt (large context, long output) can exhaust your TPM budget while you're still well under RPM. The next 30 requests all get 429s even though you've only made 5 calls that minute.

The fix is to count tokens on the way out, not after the call fails:

    interface RateLimiter {
      requestTokens(estimatedTokens: number): Promise<void>;
      recordUsage(actualTokens: number): void;
    }

    class TokenBucketLimiter implements RateLimiter {
      private bucketTokens: number;
      private lastRefill: number;

      constructor(
        private tpmLimit: number,
        private refillIntervalMs = 60_000,
      ) {
        this.bucketTokens = tpmLimit;
        this.lastRefill = Date.now();
      }

      async requestTokens(estimatedTokens: number): Promise<void> {
        if (estimatedTokens > this.tpmLimit) {
          // A request bigger than the whole budget can never fit; fail instead of waiting forever.
          throw new Error('Request exceeds the per-window token budget');
        }
        this.refillIfNeeded();
        // Loop instead of checking once: other callers may drain the bucket while we sleep.
        while (this.bucketTokens < estimatedTokens) {
          const waitMs = this.msUntilRefill();
          await new Promise((resolve) => setTimeout(resolve, waitMs));
          this.refillIfNeeded();
        }
        this.bucketTokens -= estimatedTokens;
      }

      recordUsage(actualTokens: number): void {
        // Adjust the bucket if the actual usage differed from the estimate.
        // Track pre-reserved vs actual in a real implementation.
        void actualTokens;
      }

      // Refills the whole budget once per window (a fixed-window simplification of a token bucket).
      private refillIfNeeded(): void {
        const now = Date.now();
        if (now - this.lastRefill >= this.refillIntervalMs) {
          this.bucketTokens = this.tpmLimit;
          this.lastRefill = now;
        }
      }

      private msUntilRefill(): number {
        return Math.max(0, this.refillIntervalMs - (Date.now() - this.lastRefill));
      }
    }

Estimate tokens before the call using tiktoken or a rough char/4 heuristic, consume from the bucket, wait if you're over budget. This moves the rate limit behavior from reactive (catch the 429) to proactive (don't send the call that would fail).

The teams that handle rate limits cleanly aren't just retrying smarter. They're operating at three layers simultaneously.

Layer 1: token bucket per (user, model). Limit each user's consumption independently. A single heavy user doesn't starve everyone else. Scope the bucket to both the user and the model so a cheap model and an expensive one don't compete for the same budget.

Layer 2: circuit breakers. Three signals should trip a circuit breaker:

Layer 3: declarative fallback chain. Primary model → cheaper model (e.g. GPT-4o → GPT-4o mini) → semantic cache (return a stored response for similar queries) → 503. The chain is declarative, not imperative. You configure the fallback in one place and every agent inherits it.

The reason a fixed retry delay creates storm conditions is that all retried requests fire at the same moment. Jitter desynchronizes them.

    interface RetryConfig {
      maxAttempts: number;
      baseDelayMs: number;
      maxDelayMs: number;
      jitterFactor: number; // 0 to 1, how much randomness to add
    }

    async function withExponentialBackoff<T>(
      fn: () => Promise<T>,
      config: RetryConfig = {
        maxAttempts: 3,
        baseDelayMs: 1_000,
        maxDelayMs: 30_000,
        jitterFactor: 0.3,
      },
    ): Promise<T> {
      let lastError: Error | undefined;
      for (let attempt = 0; attempt < config.maxAttempts; attempt++) {
        try {
          return await fn();
        } catch (error) {
          lastError = error as Error;
          // Only rate limit errors are retried; rethrow anything else, or the last attempt.
          if (!isRateLimitError(error) || attempt === config.maxAttempts - 1) {
            throw error;
          }
          const exponentialDelay = config.baseDelayMs * Math.pow(2, attempt);
          const cappedDelay = Math.min(exponentialDelay, config.maxDelayMs);
          const jitter = cappedDelay * config.jitterFactor * Math.random();
          const finalDelay = cappedDelay + jitter;
          await new Promise((resolve) => setTimeout(resolve, finalDelay));
        }
      }
      throw lastError;
    }

    // String matching is brittle; if your SDK exposes an HTTP status on the error, check that instead.
    function isRateLimitError(error: unknown): boolean {
      if (error instanceof Error) {
        return (
          error.message.includes('429') ||
          error.message.toLowerCase().includes('rate limit')
        );
      }
      return false;
    }

The key is Math.random() in the jitter calculation. Two simultaneous retries sleep for different durations and arrive at the provider at different moments. At scale this turns a synchronized wave into a spread distribution.

Also worth checking: a 429 response from OpenAI can include a Retry-After header telling you how many seconds to wait. When it's present, parse it and honor it directly instead of running your own backoff math.

A circuit breaker wraps your LLM client and opens (stops sending) when the error rate crosses a threshold.
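Picking up the Retry-After point for a moment, here's a minimal sketch of what "parse it and honor it" could look like. The helper names parseRetryAfterMs and chooseDelayMs are my own (hypothetical), and the sketch assumes you can read the raw header value from your HTTP client or SDK error. Per the HTTP spec the value is either a number of seconds or an HTTP date, so both forms are handled, and a missing or unparseable header falls back to whatever delay your backoff computed.

```typescript
// Hypothetical helper: converts a Retry-After header value into milliseconds.
// Returns null when the header is absent or cannot be parsed.
function parseRetryAfterMs(
  headerValue: string | null | undefined,
  nowMs: number = Date.now(),
): number | null {
  if (headerValue == null) return null;
  const trimmed = headerValue.trim();
  if (trimmed === '') return null;

  // Form 1: a delay in seconds, e.g. "20".
  if (/^\d+(\.\d+)?$/.test(trimmed)) {
    return Math.round(Number(trimmed) * 1000);
  }

  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, dateMs - nowMs);
}

// Prefer the provider's hint; otherwise use the locally computed backoff delay.
// The cap (an assumption, 60 seconds here) guards against absurdly long hints.
function chooseDelayMs(
  headerValue: string | null | undefined,
  backoffDelayMs: number,
  maxDelayMs = 60_000,
): number {
  const hinted = parseRetryAfterMs(headerValue);
  return Math.min(hinted ?? backoffDelayMs, maxDelayMs);
}
```

Inside a retry loop you would call chooseDelayMs(headerValue, finalDelay) and sleep for the result, so the provider's hint wins whenever it exists.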
Here's a minimal circuit breaker implementation:

    import OpenAI from 'openai';

    type CircuitState = 'closed' | 'open' | 'half-open';

    class LLMCircuitBreaker {
      private state: CircuitState = 'closed';
      private failureCount = 0;
      private lastFailureTime = 0;
      private readonly failureThreshold = 5;
      private readonly recoveryTimeMs = 60_000;

      async call<T>(fn: () => Promise<T>): Promise<T> {
        if (this.state === 'open') {
          if (Date.now() - this.lastFailureTime > this.recoveryTimeMs) {
            // Recovery window elapsed: let this call through as a probe.
            this.state = 'half-open';
          } else {
            throw new Error('Circuit open: LLM provider is throttling. Try again shortly.');
          }
        }
        try {
          const result = await fn();
          this.onSuccess();
          return result;
        } catch (error) {
          this.onFailure();
          throw error;
        }
      }

      private onSuccess(): void {
        this.failureCount = 0;
        this.state = 'closed';
      }

      private onFailure(): void {
        this.failureCount++;
        this.lastFailureTime = Date.now();
        if (this.failureCount >= this.failureThreshold) {
          this.state = 'open';
        }
      }
    }

    // Wire the two layers together:
    const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
    const breaker = new LLMCircuitBreaker();
    const limiter = new TokenBucketLimiter(100_000);

    async function safeLLMCall(prompt: string, estimatedTokens: number) {
      await limiter.requestTokens(estimatedTokens);
      return breaker.call(() =>
        openai.chat.completions.create({
          model: 'gpt-4o',
          messages: [{ role: 'user', content: prompt }],
        }),
      );
    }

The circuit opens after 5 consecutive failures and stays open for 60 seconds. During that window, requests fail fast instead of piling up waiting for a timeout. After 60 seconds it shifts to half-open: the next call goes through as a test, and if it succeeds the circuit closes again. If it fails, the clock resets.

Put this at the call site of every provider interaction and you've got a fence around every agent that uses it.

What causes LLM rate limit errors?
Usually one of two things. You've hit the provider's RPM or TPM ceiling for your account tier, or a single request is larger than your TPM budget allows. Exceeding the model's context window is a different failure and usually surfaces as a 400, not a 429. The error response usually tells you which limit you hit.

How do I handle 429 errors from OpenAI?
The OpenAI 429 response can include a Retry-After header with how many seconds to wait. When it's there, parse it and sleep for that duration before retrying; when it's missing, fall back to exponential backoff. The header value is more reliable than any backoff formula you'll calculate yourself.

What is exponential backoff for APIs?
Instead of retrying immediately or on a fixed interval, each retry waits longer than the previous one: 1 second, then 2 seconds, then 4 seconds, then 8 seconds. Adding random jitter to each wait time prevents all retriers from firing at the same moment and overloading the provider again.

How do I prevent agent retry storms?
Two controls working together. A circuit breaker that opens after N consecutive failures stops new calls from going out while the provider is saturated. A token bucket that estimates usage before sending catches bursts before they hit the API. The combination prevents the feedback loop where retries cause more rate limits, which cause more retries.

If you're building out the rest of the agent reliability layer, I go deeper into the architectural patterns on my blog. If you want this wired up on your own stack end to end, agentic AI consulting is exactly the kind of work I take on.

Drop a comment if your rate limit setup looks different. Curious whether people are managing this at the SDK layer or at an API gateway.
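One last sketch, since the Layer 3 fallback chain described earlier has no code attached. Treat it as an illustration under assumptions, not a reference implementation: FallbackStep, runFallbackChain, and ServiceUnavailableError are names I made up, and each step is just an async function you supply (primary model, cheaper model, semantic cache lookup). The point it tries to show is that the chain is plain data, defined once and walked in order.

```typescript
// Hypothetical types for illustration; none of these names come from a library.
interface FallbackStep<T> {
  name: string;
  // Returning null means "nothing to offer, try the next step".
  run: (prompt: string) => Promise<T | null>;
}

class ServiceUnavailableError extends Error {
  readonly status = 503;
  constructor(public readonly attempted: string[]) {
    super(`All fallback steps failed: ${attempted.join(' -> ')}`);
    this.name = 'ServiceUnavailableError';
  }
}

// Walks the chain in order and returns the first non-null result.
// A step that throws or returns null hands over to the next one.
async function runFallbackChain<T>(
  steps: FallbackStep<T>[],
  prompt: string,
): Promise<{ result: T; servedBy: string }> {
  const attempted: string[] = [];
  for (const step of steps) {
    attempted.push(step.name);
    try {
      const result = await step.run(prompt);
      if (result !== null) {
        return { result, servedBy: step.name };
      }
    } catch {
      // Fall through to the next step; a real system would log the error here.
    }
  }
  throw new ServiceUnavailableError(attempted);
}

// Example chain with stubbed steps standing in for real provider calls.
const exampleChain: FallbackStep<string>[] = [
  { name: 'primary', run: async () => { throw new Error('429'); } },
  { name: 'cheaper', run: async () => { throw new Error('429'); } },
  {
    name: 'semantic-cache',
    run: async (prompt) => (prompt === 'hello' ? 'cached hello' : null),
  },
];
```

With the stubs above, a prompt of 'hello' is served by the semantic-cache step, and any other prompt exhausts the chain and surfaces as the 503 error. In a real setup the first two steps would wrap provider calls behind the limiter and circuit breaker.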