{"slug": "lessons-from-a-109-agent-code-audit-workflow", "title": "Lessons from a 109-agent code audit workflow", "summary": "A developer spent 9.3 million tokens ($46) on a 109-agent code audit pipeline that found 32 verified issues in a 5,000-line codebase. The key mistake was verifying all findings before ranking them, wasting 79% of agents on findings that were later discarded. Reversing the order—ranking first, then verifying only top findings—could cut agent count by 70%.", "body_md": "## I spent 9.3M tokens on a 109-agent code audit so you don't have to\n\nThe short version: I pointed a swarm of AI agents at a ~5k-line codebase to hunt for things worth fixing. The pipeline was parallel subsystem mappers → 8 \"finder\" lenses → dedup → adversarial verification of *every* finding → a ranking panel → synthesis.\n\nIt worked. 32 verified findings, a clean top-10, no complaints about the output. But it cost ~9.3M tokens (call it $46 of API time) and somewhere north of two-thirds of that was me lighting money on fire. Here's the autopsy.\n\n## Where the tokens actually went\n\n| Stage | Agents | What went wrong |\n| --- | --- | --- |\n| Verify | 86 of 109 (79%!) | Killed only 2 of 34 findings — I paid the full cost of re-reading the code 86 times for a 6% hit rate |\n| Map | 9 | Redundant. The finders re-read the code anyway (correctly), so the map was a tax I paid twice |\n| Find | 8 lenses | ~30% overlap between them (48 raw findings collapsed to 34 after dedup) |\n| Prompts | — | `JSON.stringify(x, null, 2)` — yes, the pretty-printing — quietly bloated every downstream prompt by 30-40% |\n\nThe kicker: cache reads were 77% of all tokens. Every agent spawns with a fresh context and re-reads the same files from scratch. 
That's just the cost of fanning out wide — which means you'd better fan out where it pays.\n\n## The one mistake that mattered\n\n**I verified everything before I ranked anything.** Adversarial verification is the single most expensive stage per unit of value, and most findings never even make the final cut. So I was paying premium prices to fact-check findings that were destined for the cutting-room floor.\n\nFlip the order: rank first, then run adversarial verification only on the findings that make the cut.\n\nSame output. ~70% fewer agents. That's the whole lesson, really — the rest is footnotes.\n\n## Footnotes (the actual rules, in order of how much they save you)\n\n- **Rank first, verify after.** Only spend adversarial verification on the findings that'll actually appear in the deliverable.\n- **Match the paranoia to the stakes.** A wrong finding in an internal audit costs someone a few minutes of reading — one refuter is plenty. Save the full 3-lens panel (code-truth / impact / approach) for finalists or claims that trigger real action.\n- **Batch verification by file.** 34 findings lived in ~10 files. One verifier checking every finding against the same file reads it once instead of ten times. Never do per-finding-per-lens fan-out — that's the expensive way to learn this lesson.\n- **Skip the mappers on small repos** (<10k LOC). One agent can swallow the whole thing, and the finders re-read it anyway, so a map phase is overhead twice over: once to build it, once to staple it onto every finder prompt.\n- **Six finder lenses, tops** — with explicit \"you don't cover X, that's the other agent's job\" boundaries. Overlap isn't free; every duplicate has to swim through dedup and verification before it dies.\n- **Compact your JSON.** `JSON.stringify(x)`, never `JSON.stringify(x, null, 2)`. 
At agent-prompt scale, pretty-printing is just padding nobody reads.\n- **Send the cheap models to do the chores.** Dedup, evidence-checking, code-truth verification — none of it needs the frontier model.\n- **Set a token budget up front** and have the orchestrator check what's left before each fan-out, scaling the fleet down instead of barreling ahead open-loop.\n\n## What I'd keep, no notes\n\n- **The code-truth veto in verification** earned its keep — it caught 2 plausible-but-wrong findings that would've embarrassed me in the final report. Adversarial checking is worth every token *on the things that survive ranking.*\n- **Structured output schemas** on every agent: zero parse failures across all 109. Boring, reliable, great.\n- **Hard caps on finder output** (max 6 each) kept the funnel from ballooning.\n- **Resumable runs with cached results** — I stopped mid-run and picked it back up later for basically nothing.\n\n## TL;DR\n\nFan out to *find*. Converge before you *verify*. Breadth is for discovery; rigor is for the survivors. Do it backwards and you'll end up — as I did — with 79% of your agents diligently auditing things no human will ever read.", "url": "https://wpnews.pro/news/lessons-from-a-109-agent-code-audit-workflow", "canonical_source": "https://dev.to/ayoubzulfiqar/lessons-from-a-109-agent-code-audit-workflow-4a5m", "published_at": "2026-06-13 04:52:15+00:00", "updated_at": "2026-06-13 05:17:20.084678+00:00", "lang": "en", "topics": ["artificial-intelligence", "ai-agents", "developer-tools", "large-language-models"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/lessons-from-a-109-agent-code-audit-workflow", "markdown": "https://wpnews.pro/news/lessons-from-a-109-agent-code-audit-workflow.md", "text": "https://wpnews.pro/news/lessons-from-a-109-agent-code-audit-workflow.txt", "jsonld": "https://wpnews.pro/news/lessons-from-a-109-agent-code-audit-workflow.jsonld"}}