I burned my Anthropic org cap and waited 3 days. Then I built llmfleet.

The author hit Anthropic's daily token rate limit while running a large batch job and then waited 72 hours for support to reset the cap. In response, the author built **llmfleet**, a Python library that acts as a pooled dispatcher for Anthropic's `messages.create` API and applies backpressure based on the API's reported remaining tokens to avoid 429 errors. Key features include configurable soft and hard token floors, a shared retry budget, a hard USD spending cap, and automatic concurrency calculation using Little's Law.

Tuesday afternoon I kicked off a re-grading job. About 18,000 prompts against `claude-opus-4-7`, eight workers, each one looping `messages.create` as fast as it could. Forty minutes in, every call started coming back with a 429 and a header that said `anthropic-ratelimit-tokens-remaining: 0`.

Fine, I thought. Back off. I cut workers to four and waited. Still 429. Cut to two. Still 429. Then I noticed the cap-clear timestamp was not minutes away. It was rolling. I had pushed past the daily token budget for the whole org, and a daily window does not reset in five minutes.

I emailed support. They acknowledged Wednesday morning. They cleared the cap Friday afternoon. 72 hours.

I am not going to claim the engineering was elegant after that. I sat there refreshing the dashboard for three days. When the cap finally cleared, I built llmfleet so I would never sit there again.

## What it does

llmfleet is a pooled dispatcher for `messages.create`. You hand it a list of message payloads and a concurrency cap, and it runs them with backpressure that respects two things at once: the in-flight request count and the most recent `anthropic-ratelimit-tokens-remaining` header.

The Sandler-inspired piece is the negotiation. Instead of a hard semaphore, the pool watches what the API tells it. If the remaining-tokens header drops under a threshold, in-flight slots get held until the window ticks. No frantic 429 retries.

```python
import asyncio
import os

from llmfleet import Fleet

fleet = Fleet(
    api_key=os.environ["ANTHROPIC_API_KEY"],
    max_in_flight=8,
    soft_token_floor=20_000,  # pause new dispatches under this
    hard_token_floor=2_000,   # full stop until next window
)

# `prompts` is your list of prompt strings; `store` is your own persistence function.
payloads = [
    {
        "model": "claude-opus-4-7",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": prompt}],
    }
    for prompt in prompts
]

async def run():
    async for result in fleet.dispatch(payloads):
        store(result.payload_id, result.response, result.cost_usd)

asyncio.run(run())
```

`dispatch` is an async iterator that yields results in completion order, not submission order.
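
Before each dispatch, the pool has to decide whether to send, hold, or stop. Here is a minimal sketch of that floor check, not llmfleet's actual code. The function name `floor_decision` and its return shape are hypothetical. It assumes the response headers arrive as a plain dict with lowercase keys, and that `anthropic-ratelimit-tokens-reset` carries an RFC 3339 timestamp.

```python
from datetime import datetime, timezone
from typing import Mapping, Optional, Tuple

REMAINING_HEADER = "anthropic-ratelimit-tokens-remaining"
RESET_HEADER = "anthropic-ratelimit-tokens-reset"  # assumed RFC 3339 timestamp


def floor_decision(
    headers: Mapping[str, str],
    soft_floor: int,
    hard_floor: int,
    now: Optional[datetime] = None,
) -> Tuple[str, float]:
    """Return (action, seconds_until_reset) for the next dispatch.

    action is "go", "pause" (under the soft floor), or "stop" (under the hard floor).
    Hypothetical helper for illustration only.
    """
    raw = headers.get(REMAINING_HEADER)
    if raw is None:
        # No response seen yet, so there is nothing to back off from.
        return "go", 0.0

    remaining = int(raw)
    if remaining >= soft_floor:
        return "go", 0.0

    action = "stop" if remaining < hard_floor else "pause"

    reset_raw = headers.get(RESET_HEADER)
    if reset_raw is None:
        # Reset time unknown; the caller has to pick its own wait.
        return action, 0.0

    # datetime.fromisoformat on older Pythons does not accept a trailing "Z".
    reset_at = datetime.fromisoformat(reset_raw.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return action, max(0.0, (reset_at - now).total_seconds())
```

One way to use it, again as an assumption about the design: on "pause" the pool holds new dispatches while in-flight calls finish and refresh the header, and on "stop" it sleeps for the returned number of seconds before sending anything.
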
Each result has the original payload id, the response, latency in ms, and a cost estimate.

## Real numbers I cite when people ask

On a single Anthropic key with no special quotas:

- Messages/sec ceiling I see in practice for short prompts (about 400 input tokens, 200 output): around 6.2 req/s sustained before the soft floor kicks in.
- Time spent waiting at the soft floor over a 10-minute window: about 11% of wall clock.
- Time spent paused at the hard floor: zero, if you set `soft_token_floor` to about 10% of your tokens-per-minute quota. That is the whole point of the soft floor.

If you have higher tier quotas the numbers shift, but the shape is the same.

## Queue depth math

The naive question is: how big should `max_in_flight` be? Sandler's answer is a Little's Law calculation. If your average latency is L seconds and you want throughput R req/s, you need at least R × L concurrent calls in flight to saturate. For Claude Opus with a 200-token output and typical 4-second responses at 6 req/s, that is 24 in-flight.

But the Anthropic per-minute limit on most accounts will choke you before then. So the real `max_in_flight` is `min(R * L, per_minute_quota / 60 * L)`. llmfleet does this math for you if you pass `tier="default"` or whatever your tier is. It logs the chosen ceiling at startup.

## A small detail that mattered

The 429 retry that originally got me into this mess was not malicious. It was the SDK doing its default exponential backoff. Every worker was independently backing off and re-firing, which kept the cap pinned at zero for hours after the actual job was idle.

llmfleet disables the SDK's internal retry. The pool owns the retry budget. One shared count. When a single request fails with a non-retryable error, the pool can decide whether to surface it or move on, and the dispatcher logs the cost of the failed attempt so it does not disappear from your budget tracking.
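
To make "one shared count" concrete, here is a minimal sketch of a pool-wide retry budget, not llmfleet's implementation. The class `SharedRetryBudget`, the helper `backoff_delay`, and the rolling 60-second window are all assumptions made for illustration. The default delays simply reuse the numbers from this post.

```python
import time
from collections import deque
from typing import Callable, Deque


class SharedRetryBudget:
    """One retry allowance shared by every worker, over a rolling 60-second window.

    Hypothetical class for illustration only.
    """

    def __init__(self, per_minute: int, clock: Callable[[], float] = time.monotonic) -> None:
        self.per_minute = per_minute
        self._clock = clock
        self._spent: Deque[float] = deque()  # timestamps of retries already granted

    def try_acquire(self) -> bool:
        """Grant a retry if the shared budget allows it, otherwise refuse."""
        now = self._clock()
        # Forget retries that have aged out of the window.
        while self._spent and now - self._spent[0] >= 60.0:
            self._spent.popleft()
        if len(self._spent) >= self.per_minute:
            return False
        self._spent.append(now)
        return True


def backoff_delay(attempt: int, base_delay: float = 2.0, max_delay: float = 30.0) -> float:
    """Exponential backoff without jitter: base, 2x base, 4x base, capped at max_delay."""
    return min(max_delay, base_delay * 2 ** (attempt - 1))
```

The point of the design is that a worker asks the budget before it retries. When the budget is spent, the failure surfaces right away and no extra request goes out, so eight workers cannot each run their own backoff loop against a cap that is already at zero.
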

```python
fleet = Fleet(
    api_key=...,
    retry_policy=dict(max_attempts=3, base_delay=2.0, max_delay=30.0),
    shared_retry_budget_per_min=20,
)
```

## Cost guard

I also added a hard USD cap because I do not trust myself at 2 AM.

```python
fleet = Fleet(api_key=..., max_spend_usd=15.00)
```

When the running total crosses the cap, no new dispatches go out. In-flight ones still complete. The iterator yields a final `BudgetExceeded` marker and stops.

## What this does not solve

- It does not raise your account quota. Three days of waiting was a quota issue, not a code issue. llmfleet keeps you under the line, not over it.
- It only talks to Anthropic right now. The interface mirrors `messages.create` exactly. I could generalize to OpenAI, but I have not yet.
- It does not do prompt caching for you. If you want that, look at `cachebench`. The two compose: caching reduces the tokens you count against the floor.
- It does not implement priority lanes. Every payload is FIFO. If you want one job to jump the queue, run two fleets.

The whole library is about 700 lines. The interesting part is the floor logic, not the queue.

Repo: https://github.com/MukundaKatta/llmfleet

PyPI: `pip install llmfleet`

Part of a small stack of agent-plumbing libs I keep building from real incidents. The unglamorous ones.