The Cheaper API Was 2.5x Cheaper. It Cost 1.6x More.

A developer's analysis shows that choosing an API solely on per-call pricing can mislead: an API that was cheaper per call ended up costing 1.63x more per successful task because of a lower success rate and heavier retrying. The formula cost per successful task = cost per attempt ÷ success rate shows that the true cost depends on success rates, not just sticker prices. The developer provides a deterministic Python script to illustrate how a cheap tier with a 35% failure rate can burn its discount on retries, making a more expensive but reliable tier cheaper per completed task.

AI-disclosure: AI-assisted draft, human-reviewed. The demo numbers are the verbatim stdout of a deterministic, stdlib-only Python script included in full below; re-run it and you get the same bytes. The attempt counts in that script are a SYNTHETIC fixture I chose to exercise the accounting mechanism, calibrated to the retry skew I see in my own scraper logs (run counts from my Apify history). It is NOT a benchmark of any named vendor's API or prices. The one external claim (the cost-per-successful-task formula) is attributed and linked.

The cheaper API was 2.5x cheaper per call. The monthly bill came in higher anyway.

Not by a rounding error. The "cheap" option cost 1.63x more per successful task than the one with the bigger sticker price. Same workload. The price page never showed me that number, because the price page doesn't know your success rate. You do, after you've already paid.

This is the arithmetic the per-call price hides. And it's a decision you make before you spend, not a cap you bolt on after.

The real number is price per attempt × attempts ÷ successes. A cheap tier with a low success rate burns its discount on retries. In the demo below, the cheap tier is $0.0020/call but $0.0096/success; the robust tier is $0.0050/call but $0.0059/success.

The sticker winner

Here's the trap, stated plainly. The number on the pricing page is per call.
The number on your invoice is per call too, but the value you got is per successful task. Those are different denominators, and the gap between them is exactly the work that failed.

Every attempt is billed. The one that timed out and got retried: billed. The one that came back malformed and you re-prompted: billed. The one that succeeded on the fourth try: billed four times. If a tier fails 35% of its tasks and burns three to six attempts chasing each hard one, you are paying for a lot of calls that produced nothing you can use.

So the real question isn't "which tier is cheaper per call." It's "which tier is cheaper per task I actually completed." Those can point at different tiers. When they do, ranking by the sticker picks the loser.

The formula is small enough to fit in a sentence:

cost per successful task = cost per attempt ÷ success rate

Here "success rate" means successful tasks per billed attempt, so the formula is the same as price per attempt × attempts ÷ successes.

Codebridge put it the same way in a February 2026 write-up titled, literally, Real Cost per Successful Task: "a model that costs $0.01 per attempt but succeeds only 50% of the time effectively costs $0.02 per success," and "the gap between attempted tasks and completed outcomes contains the bulk of real-world cost." (codebridge.tech: https://www.codebridge.tech/articles/ai-agent-development-cost-real-cost-per-successful-task)

Same mechanism. My contribution here isn't the formula. It's showing the ranking flip with a number you can reproduce, and where the realistic retry shape comes from.

I wrote a small script to make the flip concrete. It's deterministic: stdlib only, no network, no random, no clock. Two tiers, 40 tasks each. For every task it records how many billed attempts it took and whether it ultimately succeeded. Then it computes three numbers per tier: per call, per task (spend spread over all tasks), and per successful task.

One honest caveat up front, because it matters: the attempt counts are a synthetic fixture I wrote by hand, numbers I chose to exercise the mechanism.
They are not a measurement of any named vendor. What makes them realistic rather than arbitrary is that I shaped the skew to mirror what I see in my own scraper production logs across 2,190 lifetime runs: the cheap, flaky source eats far more retries per success than the stable one. The mechanism is real. The specific cells are illustrative. Swap in your own and the script does the same arithmetic.

```python
#!/usr/bin/env python3
"""
cost_per_successful_task.py

Deterministic, stdlib-only, no network. Fixture is inlined below.

Question this answers:
  You pick the option with the cheaper per-call price. Is it actually
  cheaper PER SUCCESSFUL TASK once you pay for the failed attempts and
  retries?

Mechanism (the whole point):
  true cost-per-success = price_per_attempt * attempts_spent / successes
  A cheap-per-attempt option with a low success rate makes you pay for
  the wasted attempts on every retry. The headline price lies. The
  denominator that matters is successful tasks, not calls.

This is NOT an LLM benchmark. It is a stdlib simulation of the
accounting mechanism. The attempt counts are a fixed, hand-written
fixture (no RNG), chosen to mirror the retry skew we see in our own
scraper production logs (2,190 lifetime runs): the "cheap" tier eats
far more retries per success.
"""

PRICE = {
    # price charged PER ATTEMPT (every attempt is billed, success or fail)
    "cheap_tier": 0.0020,   # looks 2.5x cheaper per call
    "robust_tier": 0.0050,
}

# Fixture: for each task we record how many BILLED attempts it took, and
# whether it ultimately SUCCEEDED. Deterministic, written out by hand so the
# run is fully reproducible. 40 tasks per tier.
#   - cheap tier:  low success rate, heavy retrying (mirrors our flaky-source
#     logs: the cheap option fails ~35% of tasks and burns 3-6 billed attempts
#     chasing each one before giving up or limping to a success)
#   - robust tier: high success rate, almost always first-try
# Each entry = (attempts_billed, succeeded)
TASKS = {
    "cheap_tier": [
        (6, False), (1, True), (5, False), (2, True), (1, True),
        (6, False), (5, False), (2, True), (1, True), (4, True),
        (6, False), (3, True), (1, True), (2, True), (5, False),
        (1, True), (6, False), (2, True), (5, False), (1, True),
        (2, True), (3, True), (6, False), (1, True), (2, True),
        (6, False), (1, True), (4, True), (5, False), (1, True),
        (6, False), (2, True), (1, True), (5, False), (2, True),
        (1, True), (3, True), (6, False), (2, True), (1, True),
    ],
    "robust_tier": [
        (1, True), (1, True), (1, True), (2, True), (1, True),
        (1, True), (1, True), (1, True), (2, True), (1, True),
        (1, True), (1, False), (1, True), (1, True), (1, True),
        (2, True), (1, True), (1, True), (1, True), (1, True),
        (1, True), (1, True), (2, True), (1, True), (1, True),
        (1, True), (1, True), (1, True), (1, True), (2, True),
        (1, True), (1, True), (1, True), (1, True), (1, True),
        (1, True), (2, True), (1, True), (1, True), (1, True),
    ],
}


def summarize(tier):
    rows = TASKS[tier]
    price = PRICE[tier]
    n_tasks = len(rows)
    attempts = sum(a for a, _ in rows)
    successes = sum(1 for _, ok in rows if ok)
    spend = attempts * price
    success_rate = successes / n_tasks
    naive_per_call = price                # the sticker price you compare
    naive_per_task = spend / n_tasks      # spend spread over ALL tasks
    true_per_success = spend / successes  # the number that pays the bills
    return {
        "tier": tier,
        "n_tasks": n_tasks,
        "attempts": attempts,
        "successes": successes,
        "success_rate": success_rate,
        "spend": spend,
        "naive_per_call": naive_per_call,
        "naive_per_task": naive_per_task,
        "true_per_success": true_per_success,
    }


def main():
    cheap = summarize("cheap_tier")
    robust = summarize("robust_tier")

    print("=" * 64)
    print("COST PER SUCCESSFUL TASK — sticker price vs the real bill")
    print("(stdlib simulation of the accounting mechanism; not an LLM bench)")
    print("=" * 64)
    print()
    hdr = "{:<12} {:>8} {:>9} {:>9} {:>12} {:>16}"
    print(hdr.format("tier", "per-call", "tasks", "success%",
                     "per-task", "per-SUCCESS-task"))
    for r in (cheap, robust):
        print(hdr.format(
            r["tier"].replace("_tier", ""),
            f"${r['naive_per_call']:.4f}",
            r["n_tasks"],
            f"{r['success_rate'] * 100:.0f}%",
            f"${r['naive_per_task']:.4f}",
            f"${r['true_per_success']:.4f}",
        ))
    print()

    # Who wins on the sticker (per-call) price?
    sticker_winner = min((cheap, robust), key=lambda r: r["naive_per_call"])
    # Who wins on the number that actually pays the bills?
    real_winner = min((cheap, robust), key=lambda r: r["true_per_success"])
    ratio = cheap["true_per_success"] / robust["true_per_success"]

    print(f"Sticker price says cheapest: {sticker_winner['tier'].replace('_tier', '')} "
          f"(${sticker_winner['naive_per_call']:.4f}/call)")
    print(f"Cost-per-SUCCESS says cheapest: {real_winner['tier'].replace('_tier', '')} "
          f"(${real_winner['true_per_success']:.4f}/success)")
    print()
    print(f"The 'cheap' tier is {robust['naive_per_call'] / cheap['naive_per_call']:.1f}x "
          f"cheaper per call,")
    print(f"but {ratio:.2f}x MORE EXPENSIVE per successful task.")
    print()
    print(f"Why: cheap tier burned {cheap['attempts']} attempts for "
          f"{cheap['successes']} successes "
          f"({cheap['attempts'] / cheap['successes']:.2f} attempts/success);")
    print(f"     robust tier burned {robust['attempts']} attempts for "
          f"{robust['successes']} successes "
          f"({robust['attempts'] / robust['successes']:.2f} attempts/success).")
    print()
    print("VERDICT: the per-call price flipped the winner. The decision is made")
    print("BEFORE you spend — on cost-per-success, not on the sticker.")

    # ---- asserts: lock the invariants that make the article true ----
    # 1) cheap really is cheaper per call
    assert cheap["naive_per_call"] < robust["naive_per_call"]
    # 2) ...but the winner FLIPS on cost-per-success
    assert cheap["true_per_success"] > robust["true_per_success"]
    assert sticker_winner["tier"] == "cheap_tier"
    assert real_winner["tier"] == "robust_tier"
    # 3) the flip is material (cheap is >1.5x worse per success)
    assert ratio > 1.5
    print()
    print("All asserts passed.")


if __name__ == "__main__":
    main()
```

Run it with python3 -I cost_per_successful_task.py. Here is the exact output:

```
================================================================
COST PER SUCCESSFUL TASK — sticker price vs the real bill
(stdlib simulation of the accounting mechanism; not an LLM bench)
================================================================

tier         per-call     tasks  success%     per-task per-SUCCESS-task
cheap         $0.0020        40       65%      $0.0063          $0.0096
robust        $0.0050        40       98%      $0.0057          $0.0059

Sticker price says cheapest: cheap ($0.0020/call)
Cost-per-SUCCESS says cheapest: robust ($0.0059/success)

The 'cheap' tier is 2.5x cheaper per call,
but 1.63x MORE EXPENSIVE per successful task.

Why: cheap tier burned 125 attempts for 26 successes (4.81 attempts/success);
     robust tier burned 46 attempts for 39 successes (1.18 attempts/success).

VERDICT: the per-call price flipped the winner. The decision is made
BEFORE you spend — on cost-per-success, not on the sticker.

All asserts passed.
```

Read the table once. Per call, cheap is $0.0020 and robust is $0.0050, exactly the 2.5x discount the sticker promises. Per successful task, cheap is $0.0096 and robust is $0.0059. The ranking flips. The discount didn't disappear; it got spent on the 14 tasks that never succeeded and the retries chasing them.

The line that explains everything is the "Why" line in the output: cheap burned 125 attempts for 26 successes, 4.81 attempts per success. Robust burned 46 attempts for 39 successes, 1.18 attempts per success.
That's a 4x difference in how many billed calls it takes to get one usable result. A 2.5x price discount cannot survive a 4x attempt penalty. The math isn't close. 0.0020 × 4.81 ≈ 0.0096. 0.0050 × 1.18 ≈ 0.0059. The cheap tier is cheaper at the unit you don't ship and more expensive at the unit you do.

Notice the middle column too: per task, spreading spend over all 40 tasks, the two tiers look almost tied, $0.0063 vs $0.0057. That column is a trap of its own. It counts the failed tasks in the denominator as if they were worth something. They weren't. Divide only by what succeeded and the real gap shows up.

You might be thinking: my two options aren't 65% vs 98%, they're more like 92% vs 95%, so this doesn't apply to me. Maybe. That's exactly the point, though: you don't know until you count, and you can't eyeball a 4x attempt ratio from a pricing table.

A small gap in success rate matters more than it looks when one tier also retries harder. Two things compound: the fraction that never succeeds (pure waste) and the attempts-per-success on the ones that do. A tier can have a "fine" 90% success rate and still burn three attempts on every hard task, and that second factor never shows up as a failure in your dashboard. It shows up as a bigger bill.

So don't guess the gap. Log it.

Here's the honest limit of my own claim, since I'm asking you to log yours: I haven't run this exact A/B across two named LLM APIs in production. The retry skew is real and comes from my scraper logs, where flaky sources have always cost multiples more per usable record than stable ones. The two-tier flip in the script is a clean illustration of that pattern, not a vendor benchmark. If you run it for real and the gap is small, great: you just bought certainty for the price of one week of logging.

The change is procedural, not technical, and it happens before you commit to a tier, not as a spending cap you add after the bill scares you.
Rank the tiers by total spend ÷ successes, not by the price page.

This is upstream of every budget guardrail. A spending cap stops you after you've chosen wrong and started bleeding. Choosing on cost-per-success means there's less to cap, because you picked the tier that wastes fewer attempts in the first place. If you do want the downstream guardrail too, the HTTP 402 budget piece (https://blog.spinov.online/blog/http-402-your-ai-agent-pay-per-crawl/) is the other half of this: that one's about capping spend during a run; this one's about which option you pick before the run.

The price page sells you a per-call number because it's the number that makes them look cheapest. The number that pays your invoice is per successful task. Compute the second one yourself, with your own logs, before you switch.

What's the widest gap you've seen between the sticker price and the real cost-per-success once you counted the retries? I'm collecting the worst flips. Drop yours in the comments. 👇

Follow for the next batch of cost-per-success numbers from production. I read every comment.
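
For the "log it" step, here is a minimal sketch of how the same arithmetic could run over your own records instead of a hand-written fixture. The `TaskLog` record shape and the function names `cost_per_success` and `rank_by_cost_per_success` are hypothetical, chosen for illustration; they are not part of the script above, and the sample rows are toy data, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskLog:
    """One finished task from your own logs (hypothetical record shape)."""
    tier: str
    attempts: int     # billed attempts, including retries
    succeeded: bool   # did the task end with a usable result?


def cost_per_success(logs, price_per_attempt):
    """Return {tier: total spend / successful tasks} for each tier in logs."""
    attempts, successes = {}, {}
    for rec in logs:
        attempts[rec.tier] = attempts.get(rec.tier, 0) + rec.attempts
        successes[rec.tier] = successes.get(rec.tier, 0) + int(rec.succeeded)
    result = {}
    for tier, n_attempts in attempts.items():
        spend = n_attempts * price_per_attempt[tier]
        # A tier with zero successes has no finite cost per success.
        result[tier] = spend / successes[tier] if successes[tier] else float("inf")
    return result


def rank_by_cost_per_success(logs, price_per_attempt):
    """Return (tier, cost) pairs sorted from cheapest to most expensive."""
    costs = cost_per_success(logs, price_per_attempt)
    return sorted(costs.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    # Toy rows; replace with records exported from your own system.
    logs = [
        TaskLog("cheap", 6, False), TaskLog("cheap", 1, True),
        TaskLog("cheap", 2, True), TaskLog("cheap", 5, False),
        TaskLog("robust", 1, True), TaskLog("robust", 2, True),
        TaskLog("robust", 1, True), TaskLog("robust", 1, True),
    ]
    prices = {"cheap": 0.0020, "robust": 0.0050}
    for tier, cost in rank_by_cost_per_success(logs, prices):
        print(f"{tier}: ${cost:.4f} per successful task")
```

The only inputs are billed attempts and a success flag per task, which is why a week of logging is enough to rank the tiers. Treating a tier with no successes as infinitely expensive is a design choice of this sketch, so that such a tier always sorts last.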