cd /news/artificial-intelligence/lessons-from-a-109-agent-code-audit-… · home topics artificial-intelligence article
[ARTICLE · art-25907] src=dev.to pub= topic=artificial-intelligence verified=true sentiment=· neutral

Lessons from a 109-agent code audit workflow

A developer spent 9.3 million tokens ($46) on a 109-agent code audit pipeline that found 32 verified issues in a 5,000-line codebase. The key mistake was verifying all findings before ranking them, wasting 79% of agents on findings that were later discarded. Reversing the order—ranking first, then verifying only top findings—could cut agent count by 70%.

read4 min publishedJun 13, 2026

#

I spent 9.3M tokens on a 109-agent code audit so you don't have to

The short version: I pointed a swarm of AI agents at a ~5k-line codebase to hunt for things worth fixing. The pipeline was parallel subsystem mappers → 8 "finder" lenses → dedup → adversarial verification of every finding → a ranking panel → synthesis.

It worked. 32 verified findings, a clean top-10, no complaints about the output. But it cost ~9.3M tokens (call it $46 of API time) and somewhere north of two-thirds of that was me lighting money on fire. Here's the autopsy.

#

Where the tokens actually went

| Stage | Agents | What went wrong | | Verify | 86 of 109 (79%!) | Killed only 2 of 34 findings — I paid the full cost of re-reading the code 86 times for a 6% hit rate | | Map | 9 | Redundant. The finders re-read the code anyway (correctly), so the map was a tax I paid twice | | Find | 8 lenses | ~30% overlap between them (48 raw findings collapsed to 34 after dedup) | | Prompts | — | JSON.stringify(x, null, 2) — yes, the pretty-printing — quietly bloated every downstream prompt by 30-40% |

The kicker: cache reads were 77% of all tokens. Every agent spawns with a fresh context and re-reads the same files from scratch. That's just the cost of fanning out wide — which means you'd better fan out where it pays.

#

The one mistake that mattered

I verified everything before I ranked anything. Adversarial verification is the single most expensive stage per unit of value, and most findings never even make the final cut. So I was paying premium prices to fact-check findings that were destined for the cutting-room floor.

Flip the order:

Same output. ~70% fewer agents. That's the whole lesson, really — the rest is footnotes.

#

Footnotes (the actual rules, in order of how much they save you)

Rank first, verify after. Only spend adversarial verification on the findings that'll actually appear in the deliverable. #

Match the paranoia to the stakes. A wrong finding in an internal audit costs someone a few minutes of reading — one refuter is plenty. Save the full 3-lens panel (code-truth / impact / approach) for finalists or claims that trigger real action. #

Batch verification by file. 34 findings lived in ~10 files. One verifier checking every finding against the same file reads it once instead of ten times. Never do per-finding-per-lens fan-out — that's the expensive way to learn this lesson. #

Skip the mappers on small repos (<10k LOC). One agent can swallow the whole thing, and the finders re-read it anyway, so a map phase is overhead twice over: once to build it, once to staple it onto every finder prompt. #

Six finder lenses, tops — with explicit "you don't cover X, that's the other agent's job" boundaries. Overlap isn't free; every duplicate has to swim through dedup and verification before it dies. #

Compact your JSON. JSON.stringify(x)

, never null, 2

. At agent-prompt scale, pretty-printing is just padding nobody reads. #

Send the cheap models to do the chores. Dedup, evidence-checking, code-truth verification — none of it needs the frontier model. #

Set a token budget up front and have the orchestrator check what's left before each fan-out, scaling the fleet down instead of barreling ahead open-loop.

#

What I'd keep, no notes

The code-truth veto in verification earned its keep — it caught 2 plausible-but-wrong findings that would've embarrassed me in the final report. Adversarial checking is worth every token on the things that survive ranking. #

Structured output schemas on every agent: zero parse failures across all 109. Boring, reliable, great. #

Hard caps on finder output (max 6 each) kept the funnel from ballooning. #

Resumable runs with cached results — I stopped mid-run and picked it back up later for basically nothing.

#

TL;DR

Fan out to find. Converge before you verify. Breadth is for discovery; rigor is for the survivors. Do it backwards and you'll end up — as I did — with 79% of your agents diligently auditing things no human will ever read.

── more in #artificial-intelligence 4 stories · sorted by recency
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/lessons-from-a-109-a…] indexed:0 read:4min 2026-06-13 ·