{"slug": "how-i-tamed-ai-api-rate-limits-with-a-simple-queue", "title": "How I Tamed AI API Rate Limits with a Simple Queue", "summary": "A developer built a content generation tool using the OpenAI API and encountered rate limit errors when scaling from 5 to 200 topics. They implemented a solution combining exponential backoff with jitter, a rate limiter using a token bucket approach, and a worker thread pool with a task queue to manage request pacing and eliminate 429 errors.", "body_md": "A few months back, I was building a content generation tool. The idea was simple: take a list of topics, hit the OpenAI API, and get SEO-optimized articles. My prototype worked great with 5 topics. Then I scaled to 50. Then 200.\n\nThat’s when the 429s started flooding my logs. Rate limited. Overloaded. Blocked.\n\nI was frustrated. Not because the API was unstable (it’s actually very reliable), but because I hadn’t thought about the *pace* of my requests. Every failed call meant lost time, wasted retries, and eventually a complete stall while I waited for the cooldown to end.\n\nMy first attempt was naïve: just wrap the call in a `try/except` and retry after a fixed 5 seconds.\n\n``` python\nimport time\n\nimport openai  # legacy (pre-1.0) SDK, as used throughout this post\n\ndef call_api(prompt):\n    # Retry forever, sleeping a fixed 5 seconds after every rate limit error\n    while True:\n        try:\n            response = openai.Completion.create(...)\n            return response\n        except openai.error.RateLimitError:\n            time.sleep(5)\n```\n\nThis worked… until I had 10 concurrent threads all sleeping at the same time, then waking up together and slamming the API again. The 429s came back in waves. Plus, the fixed delay was either too short (still getting rate limited) or too long (wasting time).\n\nI tried increasing the sleep to 30 seconds. That helped, but now my throughput was abysmal. One request every 30 seconds? For 200 topics, that’s nearly two hours. I needed something smarter.\n\nI knew the theory (exponential backoff with jitter), but I’d never implemented it properly. 
Here’s what I built step-by-step.\n\nInstead of firing uncontrolled parallel requests, I put all tasks into a `queue.Queue` and spawned a fixed number of worker threads. Each worker would pull a task, call the API, and, if the call was rate limited, retry it after a delay.\n\n``` python\nimport queue\nimport threading\nimport time\nimport random\nfrom functools import wraps\n\nimport openai  # legacy (pre-1.0) SDK, which exposes openai.error.RateLimitError\n\ndef retry_with_exponential_backoff(max_retries=5, base_delay=1, max_delay=60):\n    def decorator(func):\n        @wraps(func)\n        def wrapper(*args, **kwargs):\n            retries = 0\n            while True:\n                try:\n                    return func(*args, **kwargs)\n                except openai.error.RateLimitError:\n                    if retries >= max_retries:\n                        raise\n                    # Double the delay on each retry, add up to 1s of jitter, cap at max_delay\n                    delay = min(base_delay * (2 ** retries) + random.uniform(0, 1), max_delay)\n                    time.sleep(delay)\n                    retries += 1\n        return wrapper\n    return decorator\n\n@retry_with_exponential_backoff(max_retries=5, base_delay=2)\ndef call_openai(prompt):\n    # ... actual API call\n    ...\n```\n\nThe decorator handles individual retries, but I still needed to prevent all workers from retrying at the same time. 
I added a global rate limiter guarded by a lock. It is a simplified take on the token bucket idea: it just enforces a minimum interval between calls.\n\n``` python\nfrom threading import Lock\nimport time\n\nclass RateLimiter:\n    def __init__(self, calls_per_minute=10):\n        self.calls_per_minute = calls_per_minute\n        self.interval = 60.0 / calls_per_minute  # minimum seconds between calls\n        self.last_call = time.time()\n        self.lock = Lock()\n\n    def wait(self):\n        # Sleeping while holding the lock serializes the workers,\n        # so calls start at least one interval apart.\n        with self.lock:\n            elapsed = time.time() - self.last_call\n            if elapsed < self.interval:\n                time.sleep(self.interval - elapsed)\n            self.last_call = time.time()\n```\n\nThen I used this inside each worker:\n\n``` python\nrate_limiter = RateLimiter(calls_per_minute=10)\n\n# call_openai already retries on its own; this outer decorator only kicks in\n# (and re-runs rate_limiter.wait()) once those inner retries are exhausted.\n@retry_with_exponential_backoff()\ndef safe_call(prompt):\n    rate_limiter.wait()\n    return call_openai(prompt)\n```\n\nThe worker loop and thread pool look like this:\n\n``` python\ndef worker():\n    while True:\n        try:\n            prompt = task_queue.get(timeout=5)\n        except queue.Empty:\n            break  # nothing left to do\n        try:\n            result = safe_call(prompt)\n            # store result\n        except Exception as e:\n            # log and potentially re-queue\n            pass\n        finally:\n            task_queue.task_done()\n\nnum_workers = 4\ntask_queue = queue.Queue()\n\n# Add all prompts to the queue before starting the workers,\n# so no worker times out on an empty queue and exits early\nfor prompt in prompts:\n    task_queue.put(prompt)\n\nthreads = []\nfor _ in range(num_workers):\n    t = threading.Thread(target=worker)\n    t.start()\n    threads.append(t)\n\ntask_queue.join()  # wait for all tasks to complete\n```\n\nWith this setup, my 200 topics finished in about 20 minutes, a 6x improvement over the naïve approach, and I never hit a 429 past the first retry.\n\nThis pattern works great for batch jobs where latency isn’t critical. 
For real-time user-facing applications (like chat), you’d want a different approach: maybe pre-allocate a pool of connections or use a managed gateway that handles retries and throttling for you.\n\nIf you don’t want to build the queue and rate limiter yourself, there are services that wrap all this (like the one at `https://ai.interwestinfo.com/`). They handle concurrency, retries, and even load balancing across multiple API keys. But for most personal projects, the 50 lines above are enough.\n\nIf I were doing this again, I’d start with the queue from day one. I’d also add proper structured logging so I can trace each request’s retry history. And I’d use `asyncio` instead of threading to keep the code simpler, since `asyncio`’s `wait_for` and `sleep` make the rate limiter cleaner.\n\nAlso, I’d benchmark the optimal number of workers for my rate limit. Too many workers and they’re all sleeping; too few and you’re underutilizing the quota.\n\nHave you hit similar walls with API rate limits? What’s your go-to pattern for handling them? I’m curious if anyone’s used a different backoff formula or a distributed queue like Redis for cross-process rate limiting. Let me know in the comments!", "url": "https://wpnews.pro/news/how-i-tamed-ai-api-rate-limits-with-a-simple-queue", "canonical_source": "https://dev.to/__c1b9e06dc90a7e0a676b/how-i-tamed-ai-api-rate-limits-with-a-simple-queue-198k", "published_at": "2026-06-17 02:00:33+00:00", "updated_at": "2026-06-17 02:51:39.456664+00:00", "lang": "en", "topics": ["developer-tools", "large-language-models", "ai-tools"], "entities": ["OpenAI"], "alternates": {"html": "https://wpnews.pro/news/how-i-tamed-ai-api-rate-limits-with-a-simple-queue", "markdown": "https://wpnews.pro/news/how-i-tamed-ai-api-rate-limits-with-a-simple-queue.md", "text": "https://wpnews.pro/news/how-i-tamed-ai-api-rate-limits-with-a-simple-queue.txt", "jsonld": "https://wpnews.pro/news/how-i-tamed-ai-api-rate-limits-with-a-simple-queue.jsonld"}}