{"slug": "your-agent-success-rate-counts-only-the-survivors", "title": "Your Agent Success Rate Counts Only the Survivors", "summary": "An engineer who runs production scrapers on Apify warns that agent success rates are often inflated by survivorship bias, as runs that time out, are aborted, or hang without a terminal status are excluded from the denominator. The engineer argues that focusing on failed runs misses the real problem: runs that never report back, which quietly improve the metric while hiding dangerous failures.", "body_md": "Your agent dashboard says 90% success. It is wrong, and not because the math is sloppy. It is wrong because of which runs it forgot to count. Every run that timed out, got aborted, or is still stuck in `RUNNING` three hours later has quietly slipped out of the denominator. A run that `FAILED` is the honest one. It raised its hand, it sits in your error logs, it is already dragging the number down where it belongs. The run you should be scared of is the one that never came back to tell you anything.\n\nThat is survivorship bias, and it lives in almost every reliability number I have looked at.\n\n**TL;DR**\n\nA `FAILED` run is already counted. The dangerous run has no terminal verdict at all.\n\nIn 1943 the US military looked at bombers coming home from Europe and mapped where they were taking the most damage. Wings. Fuselage. Tail. The obvious move was to bolt armor onto those spots. Abraham Wald, a statistician at the Statistical Research Group, argued the opposite. Armor the engines, the one place with almost no holes on the planes that landed. The planes hit in the engines were not in his sample. They never made it home to be measured. The damage you do not see is the damage that kills.\n\nYour run ledger has the same shape. You measure the runs that came home.\n\nMost success-rate code I have seen, mine included, looks like this in spirit: take the count of `SUCCEEDED`, divide by `SUCCEEDED` plus `FAILED`, multiply by a hundred. Clean. 
It reads like a pass rate on an exam. The trouble hides in the words \"plus `FAILED`,\" because that is the entire denominator. You are dividing wins by the runs that came back with a clear yes or a clear no.\n\nPlenty of runs never come back with either.\n\nA long crawl's worker drops off the network at row 9,000 and never reports back. A run hits a wall-clock limit and the platform marks it `TIMED-OUT`. Someone kills a wedged job by hand. And the worst case of all: a run that simply hangs. No exit code, no terminal status, no log line after 14:02. It is still listed as `RUNNING` days later because nothing ever wrote the ending.\n\nNone of those are `SUCCEEDED`. None of them are `FAILED` either. So a \"succeeded over succeeded-plus-failed\" rate does not rate them low. It deletes them from the question. The denominator shrinks and the rate climbs. The more runs that vanish into that uncounted limbo, the healthier the dashboard looks. The metric rewards exactly the failure mode that should scare you most.\n\nHere is the part that took me embarrassingly long to see. I spent a couple of days hardening error handling. Tighter try/except boundaries, retries with backoff, cleaner `FAILED` records. None of it moved the real number, because `FAILED` was never the problem.\n\nA `FAILED` run is the honest citizen of your ledger. It threw an exception you could catch. It is in your error logs, it is in your alerts, and it is already inside the denominator. When you polish error handling you are improving the runs that already report themselves.\n\nThe runs that corrupt the metric are the ones with no clean verdict. Timed out. Aborted. Stuck in a transitional status that never resolves. They do not throw anything, because from your code's point of view nothing happened. The process just stopped existing. You cannot try/except a worker whose node died mid-run without ever writing a final status. 
There is no stack trace for a run that is still, technically, \"running.\" So the bug is not in your handler. The bug is in your denominator.\n\nIt helps to borrow a vocabulary that already names this. Apify, the platform our actors run on, documents every actor run as carrying one status from a small fixed set, grouped into three kinds (verified against their docs, link at the end):\n\n- Initial: `READY`, \"Started but not allocated to any worker yet.\"\n- Transitional: `RUNNING`, `TIMING-OUT`, `ABORTING`. The run is in motion.\n- Terminal: `SUCCEEDED`, `FAILED`, `TIMED-OUT`, `ABORTED`. The run is done.\n\nTheir docs put it plainly: a run begins in the initial state, progresses through one or more transitional phases, and concludes in one of the terminal states. That is the whole lifecycle. Our own run ledgers, across 2,190 production runs on 32 actors, live entirely inside this vocabulary. The Trustpilot review scraper alone holds 962 runs in that table, and the long ones, the crawls that grind for an hour, are exactly the runs that flirt with the memory ceiling and the timeout. They are the most likely to end up `TIMED-OUT` or wedged in a transitional state. So the runs a naive rate silently drops are the same runs that were hardest to keep alive. The metric goes blind precisely where the work is hardest.\n\nA naive pass rate uses two of those terminal statuses and throws away the other two terminals plus every transitional run. Three buckets flattened into a yes or no.\n\nHere is a tiny script. No imports, no network, no randomness, no clock. A dictionary of run counts and three ways to divide it. The ledger is synthetic, hand-built to isolate the mechanism, not a measurement of any single actor. 
I will come back to why that distinction matters.\n\n```\n\"\"\"\nsurvivorship_success_rate.py - your agent's success rate is measured on the\nruns that survived long enough to report a verdict.\n\nA run ledger (an Apify actor's run list, a CI job table) carries one status per\nrun. Apify documents them as initial (READY), transitional (RUNNING, TIMING-OUT,\nABORTING) and terminal (SUCCEEDED, FAILED, TIMED-OUT, ABORTED). A run \"goes\nthrough one or more transitional statuses to one of the terminal statuses\".\n\nA dashboard that divides SUCCEEDED by \"clean pass/fail\" drops every run that\ntimed out, was aborted, or is still transitional. Those runs leave the\ndenominator, so they inflate the rate by being invisible.\n\nCounter-take: the fix is NOT better error handling. A FAILED run is honest - it\nalready sits in your error logs, in the denominator. The runs that wreck the\nmetric are the ones with no clean verdict (timed out / aborted / still RUNNING).\nThe one-line fix is to change the denominator from \"runs that finished\" to\n\"runs that started\". The single most dangerous run is the one stuck in a\ntransitional status forever: it has no terminal record at all.\n\nThis ledger is SYNTHETIC, hand-built to isolate the mechanism. The 90.0 -> 72.0\ngap is illustrative of the arithmetic, not a measured rate from any one actor.\n\nRun: python3 -I survivorship_success_rate.py\nstdlib only, 0 imports, 0 network / 0 RNG / 0 clock -> identical stdout, always.\n\"\"\"\n\n# Run counts by status. 
RUNNING here = a run that began but never reached a\n# terminal status: it hung, was OOM-killed mid-stream, or infra dropped it.\nLEDGER = {\n    \"SUCCEEDED\": 36,   # terminal\n    \"FAILED\":     4,   # terminal\n    \"TIMED_OUT\":  5,   # terminal\n    \"ABORTED\":    3,   # terminal\n    \"RUNNING\":    2,   # transitional - never resolved\n}\n\nPASS_FAIL = (\"SUCCEEDED\", \"FAILED\")                       # naive \"clean verdict\" set\nTERMINAL = (\"SUCCEEDED\", \"FAILED\", \"TIMED_OUT\", \"ABORTED\")  # all terminal statuses\n\nsucceeded = LEDGER[\"SUCCEEDED\"]\nattempts = sum(LEDGER.values())                  # every run that STARTED\npassfail_denom = sum(LEDGER[s] for s in PASS_FAIL)\nterminal_denom = sum(LEDGER[s] for s in TERMINAL)\n\nnaive_rate = round(100 * succeeded / passfail_denom, 1)    # succeeded / clean pass+fail\nterminal_rate = round(100 * succeeded / terminal_denom, 1)  # succeeded / all terminals\nhonest_rate = round(100 * succeeded / attempts, 1)          # succeeded / runs that started\nhidden = attempts - passfail_denom\n\nprint(\"=== run ledger (every run that wrote a start record) ===\")\nfor status, n in LEDGER.items():\n    kind = \"transitional\" if status == \"RUNNING\" else \"terminal\"\n    print(f\"  {status:<10} {n:>3}   ({kind})\")\nprint(f\"  {'-'*10} {'-'*3}\")\nprint(f\"  {'ATTEMPTS':<10} {attempts:>3}\")\nprint()\nprint(\"=== three denominators, one numerator (succeeded = 36) ===\")\nprint(f\"  NAIVE    succeeded / pass+fail      : 36/{passfail_denom} = {naive_rate}%\")\nprint(f\"  TERMINAL succeeded / all terminals  : 36/{terminal_denom} = {terminal_rate}%\")\nprint(f\"  HONEST   succeeded / runs that began: 36/{attempts} = {honest_rate}%\")\nprint(f\"  the naive rate hides {hidden} runs (5 timed out, 3 aborted, 2 never resolved)\")\nprint(f\"  -> a dashboard reading {naive_rate}% is really running at {honest_rate}%\")\nprint()\nprint(\"=== ceiling (where this fix stops) ===\")\nprint(\"  1. 
HONEST is still an upper bound: it counts only runs that managed to\")\nprint(\"     write a start record. Runs killed before their first log line (OOM at\")\nprint(\"     spawn, infra drop) are in NO ledger. True rate <= 72.0%.\")\nprint(\"  2. SUCCEEDED is trusted as-is. A run that exits 0 but returns empty or\")\nprint(\"     partial data still counts as a win here. Fixing the denominator does\")\nprint(\"     not fix the definition of success - that is a separate gate.\")\nprint(\"  3. Synthetic ledger. The 90.0 -> 72.0 gap shows the arithmetic, not a\")\nprint(\"     measured rate. Your real gap is whatever your RUNNING column is.\")\n\nassert attempts == 50\nassert passfail_denom == 40\nassert terminal_denom == 48\nassert naive_rate == 90.0\nassert terminal_rate == 75.0\nassert honest_rate == 72.0\nassert hidden == 10\n```\n\nRun it:\n\n```\n=== run ledger (every run that wrote a start record) ===\n  SUCCEEDED   36   (terminal)\n  FAILED       4   (terminal)\n  TIMED_OUT    5   (terminal)\n  ABORTED      3   (terminal)\n  RUNNING      2   (transitional)\n  ---------- ---\n  ATTEMPTS    50\n\n=== three denominators, one numerator (succeeded = 36) ===\n  NAIVE    succeeded / pass+fail      : 36/40 = 90.0%\n  TERMINAL succeeded / all terminals  : 36/48 = 75.0%\n  HONEST   succeeded / runs that began: 36/50 = 72.0%\n  the naive rate hides 10 runs (5 timed out, 3 aborted, 2 never resolved)\n  -> a dashboard reading 90.0% is really running at 72.0%\n\n=== ceiling (where this fix stops) ===\n  1. HONEST is still an upper bound: it counts only runs that managed to\n     write a start record. Runs killed before their first log line (OOM at\n     spawn, infra drop) are in NO ledger. True rate <= 72.0%.\n  2. SUCCEEDED is trusted as-is. A run that exits 0 but returns empty or\n     partial data still counts as a win here. Fixing the denominator does\n     not fix the definition of success - that is a separate gate.\n  3. Synthetic ledger. 
The 90.0 -> 72.0 gap shows the arithmetic, not a\n     measured rate. Your real gap is whatever your RUNNING column is.\n```\n\nOne numerator, `succeeded = 36`. Three denominators.\n\nNAIVE divides by pass plus fail, 36 over 40, and reports 90.0%. This is the number most dashboards put on the big screen.\n\nTERMINAL divides by all four terminal statuses, 36 over 48, and reports 75.0%. This is what you get the moment you stop pretending the timeouts and aborts did not happen. Fifteen points, gone, just by counting every run that ended badly in any way and not only the ones that raised an error.\n\nHONEST divides by every run that started, 36 over 50, and reports 72.0%. The last three points are the two runs still stuck in `RUNNING`. They never resolved. They carry no terminal record at all, and they are the ones I would lose sleep over, because a run with no ending is a run nobody is watching.\n\nEighteen points of spread between the first number and the last. Same successes. Same ledger. The only thing that moved is what I was willing to count.\n\nI put the limits in the program's own output, because a fix that oversells itself is just a fancier kind of lying metric. Three things this does not do.\n\nFirst, HONEST is still an upper bound, not the truth. It counts runs that managed to write a start record. A run killed before its first log line, an OOM at spawn, a node that fell off the network, is in no ledger at all. It never got a row. So the real success rate is at most 72.0%, and probably under it. You cannot count what was never written down.\n\nSecond, `SUCCEEDED` is taken on faith. A run that exits zero but returns an empty array or half a dataset still scores as a win in this script. Fixing the denominator does not fix the definition of success. 
That is a separate gate, and I have written about that other half before: a run can pass and still hand you garbage, like [a clean row that was quietly wrong](https://blog.spinov.online/blog/your-scraper-returned-a-clean-row-it-was-wrong/). Counting outcomes honestly and judging whether an outcome was actually good are two different jobs.\n\nThird, the ledger is synthetic. The jump from 90.0% to 72.0% shows you the arithmetic, not a benchmark. Your real gap is whatever the size of your `RUNNING` column happens to be. If almost nothing ever hangs, your naive and honest rates sit close together, and good for you. If your transitional column is fat, your dashboard is off by a margin you have never measured.\n\nIt would be easy to file this next to the \"clean row that was wrong\" post above. They are not the same bug. That one is about the value inside a single run: a row that parsed fine and still held junk, a rating of 7 on a five-star site. This one sits a level up. It says nothing about whether any individual run's output is correct. It is about how you count runs across the whole population. A run can succeed with flawless data, and if its neighbor hung in silence, your aggregate rate is still wrong about the fleet.\n\nIt is also not the eval problem. When you [write a regression gate for an agent's final answer](https://blog.spinov.online/blog/you-cant-unit-test-an-ai-agent-regression-gate/), you are judging the quality of one response against a rubric. Useful, necessary, and orthogonal to this. A success rate counts how runs ended, not what they produced. You can own a flawless eval suite and a success rate that is still inflated by survivorship, because the eval only ever sees the runs that returned something to grade. Same blind spot, one floor up.\n\nThe actual fix is almost insultingly small. Change the denominator. Count runs that started, not runs that finished. 
If your run table gets a row the instant a run is created, then the denominator is just that row count, full stop, including everything still marked `RUNNING`.\n\nOne caveat I owe you here. A run that started ninety seconds ago and is still `RUNNING` is not a failure, it just has not finished. That run is right-censored, not lost, and counting it against you on a live snapshot biases the rate the other way, pessimistically, by lumping healthy in-flight work in with the dead. So the honest denominator is a settled one: count over a window that has already drained, or age-gate the transitional runs with the alerting rule below, three times the median run duration. Younger than that threshold, a run is still pending, not a loss. The synthetic ledger above sidesteps this by definition; its two `RUNNING` rows are the long-dead kind. On a live dashboard you draw that line yourself.\n\nTwo things I added around it turned out to be worth more than the metric itself.\n\nI started alerting on the age of transitional runs. A run that has been `RUNNING` for three times its median duration is not running. It is dead and lying about it. That alert caught more real problems than the success rate ever did, because it points straight at the runs the rate was hiding.\n\nAnd I put the denominator next to the rate on the dashboard. \"94% of 312 terminal\" and \"94% of 1,040 started\" are two very different sentences, and showing both makes the gap impossible to scroll past. When the started count and the terminal count drift apart, that drift is your survivorship tax, written in plain numbers.\n\nI am not going to quote you the percentage of our runs that hang, because the honest answer is that for a long stretch I was not measuring it, which is the entire point of this post. The number you cannot see is the number that gets you. Wald armored the engines. Count what started, not what finished.\n\n*Written by Aleksei Spinov. I run production scrapers and agents, currently 2,190 runs across 32 actors. 
The code here is stdlib-only and was run and verified (`python3 -I`, identical output, asserts green) before publishing; the ledger numbers are synthetic and labelled as such in the script. Drafted with an AI assistant, fact-checked and edited by me.*\n\n*Follow for the next teardown from the run ledger, one fix at a time. Genuine question for the comments: what is the longest a run has sat in your dashboard still marked `RUNNING` long after it was actually dead, and what finally made you notice? I read every reply.*\n\n*Source: Apify Actor run lifecycle statuses.*", "url": "https://wpnews.pro/news/your-agent-success-rate-counts-only-the-survivors", "canonical_source": "https://dev.to/0012303/your-agent-success-rate-counts-only-the-survivors-dj1", "published_at": "2026-06-29 18:07:12+00:00", "updated_at": "2026-06-29 18:18:48.075681+00:00", "lang": "en", "topics": ["ai-agents", "developer-tools", "machine-learning"], "entities": ["Apify", "Abraham Wald"], "alternates": {"html": "https://wpnews.pro/news/your-agent-success-rate-counts-only-the-survivors", "markdown": "https://wpnews.pro/news/your-agent-success-rate-counts-only-the-survivors.md", "text": "https://wpnews.pro/news/your-agent-success-rate-counts-only-the-survivors.txt", "jsonld": "https://wpnews.pro/news/your-agent-success-rate-counts-only-the-survivors.jsonld"}}