Source: dev.to

I burned my Anthropic org cap and waited 3 days. Then I built llmfleet.

The author describes hitting Anthropic's daily token rate limit while running a large batch job, which led to a 72-hour wait for support to reset the cap. In response, the author built **llmfleet**, a Python library that acts as a pooled dispatcher for Anthropic's `messages.create` API, using backpressure based on the API's reported remaining tokens to avoid 429 errors. Key features include configurable soft and hard token floors, a shared retry budget, a hard USD spending cap, and automatic concurrency calculation using Little's Law.

Published May 21, 2026 · 4 min read

Tuesday afternoon I kicked off a re-grading job. About 18,000 prompts against claude-opus-4-7, eight workers, each one looping messages.create as fast as it could.

Forty minutes in, every call started coming back with a 429 and a header that said anthropic-ratelimit-tokens-remaining: 0. Fine, I thought. Back off. I cut workers to four and waited. Still 429. Cut to two. Still 429.

Then I noticed the cap-clear timestamp was not minutes away. It was rolling. I had pushed past the daily token budget for the whole org, and a daily window does not reset in five minutes.

I emailed support. They acknowledged Wednesday morning. They cleared the cap Friday afternoon. 72 hours.

I am not going to claim the engineering was elegant after that. I sat there refreshing the dashboard for three days. When the cap finally cleared, I built llmfleet so I would never sit there again.

What it does #

llmfleet is a pooled dispatcher for messages.create. You hand it a list of message payloads and a concurrency cap, and it runs them with backpressure that respects two things at once: in-flight request count, and the most recent anthropic-ratelimit-tokens-remaining header.

The Sandler-inspired piece is the negotiation. Instead of a hard semaphore, the pool watches what the API tells it. If the remaining-tokens header drops under a threshold, in-flight slots get held until the window ticks. No frantic 429 retries.

import asyncio
import os

from llmfleet import Fleet

fleet = Fleet(
    api_key=os.environ["ANTHROPIC_API_KEY"],
    max_in_flight=8,
    soft_token_floor=20_000,   # hold new dispatches under this
    hard_token_floor=2_000,    # full stop until next window
)

# prompts is your own list of prompt strings.
payloads = [
    {"model": "claude-opus-4-7", "max_tokens": 256,
     "messages": [{"role": "user", "content": prompt}]}
    for prompt in prompts
]

async def run():
    # Results arrive in completion order; store() is your own persistence function.
    async for result in fleet.dispatch(payloads):
        store(result.payload_id, result.response, result.cost_usd)

asyncio.run(run())

dispatch is an async iterator that yields results in completion order, not submission order. Each result has the original payload id, the response, latency in ms, and a cost estimate.
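Completion order versus submission order is easy to see in a toy version. This is a minimal sketch using only the standard library, not llmfleet's code: fake_call stands in for a real API request, and the dispatch and collect functions here are hypothetical names for illustration.

```python
import asyncio


async def fake_call(payload_id: int, delay_s: float) -> tuple[int, float]:
    # Stand-in for a real request; it just sleeps for delay_s seconds.
    await asyncio.sleep(delay_s)
    return payload_id, delay_s


async def dispatch(delays: list[float], max_in_flight: int):
    """Async generator that yields (payload_id, delay) as each call finishes."""
    sem = asyncio.Semaphore(max_in_flight)  # caps concurrent calls

    async def guarded(payload_id: int, delay_s: float):
        async with sem:
            return await fake_call(payload_id, delay_s)

    tasks = [asyncio.create_task(guarded(i, d)) for i, d in enumerate(delays)]
    # as_completed hands back awaitables in the order they finish.
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def collect(delays: list[float], max_in_flight: int) -> list[int]:
    # Gather payload ids in the order the results arrive.
    return [payload_id async for payload_id, _ in dispatch(delays, max_in_flight)]
```

With delays of 0.15, 0.01, and 0.08 seconds and three slots, the ids come back as 1, 2, 0, which is why each result carries its payload id.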

Real numbers I cite when people ask #

On a single Anthropic key with no special quotas:

  • Messages/sec ceiling I see in practice for short prompts (about 400 input tokens, 200 output): around 6.2 req/s sustained before the soft floor kicks in.
  • Time spent waiting at the soft floor over a 10-minute window: about 11% of wall clock.
  • Time spent blocked at the hard floor: zero, if you set soft_token_floor to about 10% of your tokens-per-minute quota. That is the whole point of the soft floor.

If you have higher tier quotas the numbers shift, but the shape is the same.
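The floor decision itself can be sketched in a few lines. This is my reading of the two floors, not code lifted from the library: the Gate enum and the helper names are hypothetical, and the only thing taken from the API is the anthropic-ratelimit-tokens-remaining header name.

```python
from enum import Enum
from typing import Mapping


class Gate(Enum):
    OPEN = "open"  # dispatch freely
    SOFT = "soft"  # hold new dispatches, let in-flight calls finish
    HARD = "hard"  # full stop until the next window


def remaining_tokens(headers: Mapping[str, str]) -> int:
    # Header name as reported by the API on each response.
    return int(headers["anthropic-ratelimit-tokens-remaining"])


def gate_for(remaining: int, soft_floor: int, hard_floor: int) -> Gate:
    # Assumed semantics: "under" a floor means strictly below it.
    if remaining < hard_floor:
        return Gate.HARD
    if remaining < soft_floor:
        return Gate.SOFT
    return Gate.OPEN


def soft_floor_for(tokens_per_minute: int, percent: int = 10) -> int:
    # The roughly-10%-of-TPM rule of thumb, in integer math.
    return tokens_per_minute * percent // 100
```

For a hypothetical 200,000 tokens-per-minute quota, soft_floor_for gives 20,000, which matches the soft_token_floor used in the earlier example.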

Queue depth math #

The naive question is: how big should max_in_flight be?

Sandler's answer is a Little's Law calculation. If your average latency is L seconds and you want throughput R req/s, you need at least R*L concurrent calls in flight to saturate.

For Claude Opus with a 200-token output and typical 4-second responses at 6 req/s, that is 24 in-flight. But the Anthropic per-minute limit on most accounts will choke you before then. So the real max_in_flight is min(R*L, perminute_quota / 60 * L).

llmfleet does this math for you if you pass tier="default" or whatever your tier is. It logs the chosen ceiling at startup.
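The arithmetic above, written out. This is a sketch under one loud assumption: perminute_quota is treated as requests per minute so the units work out. If your quota is in tokens per minute, divide it by your average tokens per request first. The function name is hypothetical and is not llmfleet's API.

```python
import math


def max_in_flight(target_rps: float, avg_latency_s: float,
                  requests_per_minute: float) -> int:
    """Little's Law ceiling, clamped by a per-minute request quota."""
    # Concurrency needed to sustain target_rps at this latency (R * L).
    wanted = math.ceil(target_rps * avg_latency_s)
    # Concurrency the quota can sustain (quota / 60 * L), rounded down.
    allowed = math.floor(requests_per_minute / 60 * avg_latency_s)
    # Never return less than one slot, or nothing would run.
    return max(1, min(wanted, allowed))
```

At 6 req/s and 4-second responses, a generous quota leaves you at the 24 from the text, while a hypothetical 120 requests per minute clamps the ceiling to 8.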

A small detail that mattered #

The 429 retry that originally got me into this mess was not malicious. It was the SDK doing its default exponential backoff. Every worker was independently backing off and re-firing, which kept the cap pinned at zero for hours after the actual job was idle.

llmfleet disables the SDK's internal retry. The pool owns the retry budget. One shared count. When a single request fails non-retriably, the pool can decide whether to surface or move on, and the dispatcher logs the cost of the failed attempt so it does not disappear from your budget tracking.

fleet = Fleet(api_key=...,
              retry_policy=dict(max_attempts=3, base_delay=2.0, max_delay=30.0),
              shared_retry_budget_per_min=20)
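One way a shared budget like this could work is a sliding-window counter that every worker asks before retrying. This is a hypothetical sketch, not the library's implementation: SharedRetryBudget and backoff_delay are names I made up, and the exponential schedule is an assumption about how base_delay and max_delay combine.

```python
import time
from collections import deque
from typing import Callable


class SharedRetryBudget:
    """Allows at most per_minute retries across all workers in any 60 s window."""

    def __init__(self, per_minute: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.per_minute = per_minute
        self._clock = clock
        self._stamps: deque[float] = deque()  # times of recent retries

    def try_acquire(self) -> bool:
        now = self._clock()
        # Forget retries that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= 60.0:
            self._stamps.popleft()
        if len(self._stamps) >= self.per_minute:
            return False  # budget spent: surface the failure instead
        self._stamps.append(now)
        return True


def backoff_delay(attempt: int, base_delay: float = 2.0,
                  max_delay: float = 30.0) -> float:
    # Assumed schedule: base_delay doubles per attempt, capped at max_delay.
    return min(max_delay, base_delay * 2 ** (attempt - 1))
```

The point of the single counter is that eight workers cannot each decide to retry three times on their own; they draw from the same pool and stop together.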

Cost guard #

I also added a hard USD cap because I do not trust myself at 2 AM.

fleet = Fleet(api_key=..., max_spend_usd=15.00)

When the running total crosses the cap, no new dispatches go out. In-flight ones still complete. The iterator yields a final BudgetExceeded marker and stops.
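The behaviour described here can be mimicked with a plain generator. This is a simplified, synchronous sketch with hypothetical names: BUDGET_EXCEEDED stands in for the BudgetExceeded marker, and each item in costs_usd stands in for the cost of one completed call.

```python
from typing import Iterable, Iterator, Union

# Hypothetical sentinel standing in for the BudgetExceeded marker.
BUDGET_EXCEEDED = "BudgetExceeded"


def dispatch_with_cap(costs_usd: Iterable[float],
                      max_spend_usd: float) -> Iterator[Union[float, str]]:
    """Yield each call's cost until the running total crosses the cap."""
    spent = 0.0
    for cost in costs_usd:
        if spent >= max_spend_usd:
            # Cap crossed: nothing new goes out, emit the marker and stop.
            yield BUDGET_EXCEEDED
            return
        # The check happens before dispatch, so a call that was already
        # allowed out can still push the total past the cap.
        spent += cost
        yield cost
```

With a 15.00 cap and calls costing 6.00 each, three calls go out (the third takes the total to 18.00) and the fourth is replaced by the marker. The cap bounds when dispatching stops, not the exact final spend.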

What this does not solve #

  • It does not raise your account quota. Three days of waiting was a quota issue, not a code issue. llmfleet keeps you under the line, not over it.
  • It only talks to Anthropic right now. The interface mirrors messages.create exactly. I could generalize to OpenAI, but I have not yet.
  • It does not do prompt caching for you. If you want that, look at cachebench. The two compose: caching reduces the tokens you count against the floor.
  • It does not implement priority lanes. Every payload is FIFO. If you want one job to jump the queue, run two fleets.

The whole library is about 700 lines. The interesting part is the floor logic, not the queue.

Repo: https://github.com/MukundaKatta/llmfleet

PyPI: pip install llmfleet

Part of a small stack of agent-plumbing libs I keep building from real incidents. The unglamorous ones.
