{"slug": "ai-coding-agents-need-evidence-first-review-not-just-cheaper-routing", "title": "AI coding agents need evidence-first review, not just cheaper routing", "summary": "AI coding agents may reduce generation costs but shift work to verification, with studies showing mixed productivity effects and a potential increase in review load. The METR trial found AI-assisted tasks took 19% longer on average, while DORA data links AI adoption to higher throughput but lower stability. Experts argue that comparing tools solely by token price misses the critical cost of checking AI-generated code before merge.", "body_md": "In many AI-assisted workflows, code generation is no longer the only bottleneck. Assistants read repositories, edit files, run commands, and write tests. Agentic systems plan, call tools, retrieve more context, and assemble an answer over several steps or several models.\n\nWhat was actually checked, what did the model merely assume, and how much of this result can I rely on before merge?\n\nProducing plausible code has become cheaper. Checking its foundations has not necessarily followed. Comparing AI tools only by token price, generation speed, or agent count misses the engineering decision that matters: the path from a request to a justified merge decision.\n\nThis article asks three questions:\n\n- Does AI reduce total decision cost once calls, review, rework, and escaped-error risk are counted?\n- Which part of that cost is targeted by routing, retrieval, multi-model deliberation, and automated checks?\n- What should a verification layer produce, and how can its value be falsified rather than merely claimed?\n\n## 1. The verification tax\n\nThe productivity evidence is mixed. METR ran a randomized controlled trial with\n16 experienced open-source developers performing 246 real tasks in mature\nrepositories they knew well, using early-2025 tooling. 
With AI, tasks took\n**19% longer** on average [[1]](#ref-1).\n\nIn February 2026, METR reported that newer data probably shows a larger uplift,\nbut explicitly called the signal unreliable. The raw estimate for returning\ndevelopers was `-18%` change in completion time with a confidence\ninterval of `[-38%, +9%]`; for newly recruited developers it was\n`-4%` with `[-15%, +9%]`, where negative means speedup.\nBoth intervals include zero effect [[2]](#ref-2).\n\nThe honest conclusion is neither “AI always speeds developers up” nor “AI always slows them down.” Productivity depends on tool maturity, repository familiarity, task shape, context acquisition, and the cost of checking the result.\n\nThe 2025 DORA report provides a different, observational view of nearly 5,000\ntechnology professionals: **90%** use AI at work, more than\n**80%** perceive a productivity gain, but **30%** have\nlittle or no trust in AI-generated code. AI adoption is positively associated with\ndelivery throughput and product performance and negatively associated with\ndelivery stability [[9]](#ref-9). This is not a causal estimate. It is\nconsistent with a systems hypothesis: faster local generation may increase\ndownstream load if testing and delivery controls do not scale with change volume.\n\nA synthesis of seven Google studies found that **39%** of external\ndevelopers trust GenAI output quality only slightly or not at all. Perceived rigor\nof review and testing, and developer control over where AI is used, were positively\nassociated with trust [[7]](#ref-7).\n\nReview itself is not only defect-finding. In Bacchelli and Bird’s study of 200\nMicrosoft review threads and 570 comments, code improvements accounted for\n**29%** of comments and defects for **14%**. 
The authors\nidentify understanding the context and the change as central to review and record\nknowledge transfer as an outcome in its own right [[3]](#ref-3).\n\n### An illustrative review-load model\n\nAssume a team handles 20 PRs per week and an average review takes 30 minutes:\n\n```\n20 PR × 0.5 h = 10 reviewer-hours / week\n```\n\nIf AI doubles throughput while review cost per PR stays fixed:\n\n```\n40 PR × 0.5 h = 20 reviewer-hours / week\n```\n\nIf AI-assisted PRs become wider and review time rises by 25%:\n\n```\n40 PR × 0.625 h = 25 reviewer-hours / week\n```\n\n| Scenario | PR/wk | Review/PR | Review load |\n|---|---|---|---|\n| Pre-AI | 20 | 30 min | 10 h |\n| 2× throughput | 40 | 30 min | 20 h |\n| 2× throughput + wider PRs | 40 | 37.5 min | 25 h |\n\nThis is a sensitivity model, not a market statistic. It shows the mechanism: faster generation may move work from writing to checking rather than remove it.\n\n## 2. The total cost of an engineering decision\n\nThe token bill is not the total cost. Define the expected cost of one decision:\n\n```\nC_total = C_model + C_tools + R_hour × (T_review + T_rework) + P_escape × L_escape\n```\n\n- `C_model`: model calls;\n- `C_tools`: CI, sandbox, retrieval, and other compute;\n- `R_hour`: internal cost of one engineering hour;\n- `T_review`: time to an apply/review/reject decision;\n- `T_rework`: expected time to fix issues found before merge;\n- `P_escape`: probability that a material error passes review;\n- `L_escape`: expected loss from such an escape.\n\nTake an illustrative baseline: `C_model = $5`, review takes 60 minutes,\nand `R_hour = $80`. Set tools, rework, and risk aside temporarily:\n\n```\nC_total = $5 + $80 = $85\n```\n\n### The ceiling on pure model-bill optimization\n\nIf model calls are a fraction `f = C_model / C_total`, then optimizing\nonly the model bill while holding workload, quality, review, rework, and risk fixed\nlowers `C_total` by at most `f`. 
At the reference numbers:\n\n```\nf = 5 / 85 = 5.9%\n```\n\nThis is not a ceiling on routing’s total effect. A weaker cheap model may raise\nretries, `T_rework`, and `P_escape`; a good router may cut\nlatency and failed calls. It is an accounting observation: when the model bill is\na small part of the total, optimizing that line alone cannot solve a review-bound\nbottleneck.\n\nCutting review from 60 to 40 minutes produces a different scale of change:\n\n```\nC_total = $5 + $80 × (40/60) = $58.33\nSaving = ($85 - $58.33) / $85 = 31.4%\n```\n\n| Change | Model | Review | C_total | Saving |\n|---|---|---|---|---|\n| Baseline | $5.00 | $80.00 | $85.00 | n/a |\n| Model calls halved | $2.50 | $80.00 | $82.50 | 2.9% |\n| Review 60→40 min | $5.00 | $53.33 | $58.33 | 31.4% |\n| Both | $2.50 | $53.33 | $55.83 | 34.3% |\n\nIn autonomous agentic loops with little human oversight, `f` may be\nlarge and routing can become the main economic lever. In workflows constrained by\ncostly human review, `f` is lower. The relevant question is which term\nactually dominates the total cost.\n\n## 3. Different systems control different parts of the cost\n\nModern AI systems often look similar: agents, orchestration, retrieval, a judge, and synthesis. Similar shape does not imply the same job.\n\n### Routing: Kilo Gateway and RouteLLM\n\nKilo exposes an OpenAI-compatible endpoint, access to many models, BYOK, usage\ntracking, spend limits, and organization controls [[11]](#ref-11).\nByteByteGo describes routing on a known mode (planning, coding, debugging) with\nuser-selected tiers and a server-updated model map. 
The reported Kilo figures\n(roughly one-third lower average request cost, 80–90% of requests not requiring\nfrontier models, a greater-than-10× tier gap, and an estimated $87K quarterly\noverspend from misrouting routine traffic) are vendor-reported and not\nindependently verified [[8]](#ref-8).\n\nAn idealized model shows the potential scale. Assume 15% of requests stay on a frontier model at relative cost 1 and the other 85% move to a tier costing one-tenth as much:\n\n```\nrelative_cost = 0.15 × 1 + 0.85 × 0.10 = 0.235\nrelative reduction = 1 - 0.235 = 76.5%\n```\n\nRouteLLM provides primary research evidence for the trade-off: a 3.66× cost-saving\nratio at 95% of GPT-4’s MT-Bench score for a GPT-4/Mixtral-8×7B pair, equivalent to\n72.7% relative cost reduction [[12]](#ref-12). Its cost model uses short\nsingle-turn prompts and benchmark score as quality. It is not a coding-agent loop\nor evidence that a repository change is safe.\n\n### Agentic RAG: sufficient context\n\nGoogle describes a multi-agent RAG with a dedicated Sufficient Context Agent. It\ncompares the query, retrieved snippets, and a draft, names missing information,\nand can trigger another retrieval pass. Google reports up to 34% higher accuracy\nthan standard RAG on factuality datasets [[4]](#ref-4).\n\nThe Sufficient Context research exposes a broader failure mode: models often answer\nincorrectly rather than abstain when context is insufficient. Guided abstention\nimproved correctness among answered cases by 2–10% for Gemini, GPT, and Gemma\n[[5]](#ref-5).\n\nThis supports a sufficient-context loop, but it is not a measured reduction in\n`T_rework` or `P_escape` for software development. A codebase\nis not merely a document corpus; it contains runtime behavior, callers, invariants,\nand migrations.\n\n### Multi-model deliberation: consensus is not proof\n\nOpenRouter Fusion runs a parallel panel of 1–8 models. A judge returns a structured\ncomparison of consensus, contradictions, partial coverage, unique insights, and\nblind spots; a final model writes the answer. 
The documentation describes the\npipeline but does not provide an independent effectiveness benchmark\n[[10]](#ref-10).\n\nGoogle Research compared 180 agent configurations. Independent topology amplified\nerrors by up to **17.2×**, while centralized coordination held\namplification to **4.4×**. Multi-agent improved the parallelizable\nFinance-Agent result by **80.9%**, but every multi-agent variant\ndegraded the sequential PlanCraft result by **39–70%**. The authors’\npredictive model selected the optimal architecture for 87% of unseen configurations\n[[6]](#ref-6).\n\nThis evaluation did not contain repository code review. The narrower engineering hypothesis is that value depends on topology, task decomposability, a centralized gate, and evidence handoffs — not on agent count alone.\n\n### Tests and static analysis\n\nSAST, DAST, CodeQL, Semgrep, unit tests, and mutation tests provide repeatable checks of explicitly encoded properties under controlled inputs, configuration, and environment. Their quality is bounded by coverage, false positives, false negatives, and flakiness.\n\nThey are necessary, but do not always reveal that a model never opened the relevant file, built a conclusion on a false assumption, or tested an implementation detail instead of a system invariant. Green checks are not proof of complete intent.\n\n## 4. 
Side by side\n\n| Approach | Primary problem | Unit of decision | Main output | Does not solve by itself |\n|---|---|---|---|---|\n| Kilo / routing | Model access, cost, policy | Model request | Completion + cost data | Trust in an engineering change |\n| Agentic RAG | Incomplete context | Context sufficiency | Grounded answer | Patch safety and codebase invariants |\n| Fusion / multi-model | Fragility of one answer | Agreement/disagreement | Consensus + contradictions | Factual checking of repository claims |\n| Tests / static | Formalizable properties | Test/rule result | Pass/fail + diagnostics | Intent, assumptions, completeness |\n| Verification artifact | Hidden checking area | Merge decision | Evidence boundaries + verdict | A correctness guarantee |\n\nThese systems are not necessarily direct competitors. Routing manages model-call cost. Agentic RAG tests context sufficiency. Multi-model deliberation surfaces disagreement. Tests check formalized properties. A verification artifact should connect those signals to a decision about how far a candidate is supported.\n\n## 5. Trust debt and hidden checking work\n\nSuppose an engineering answer contains a set of material claims:\n\n```\nC = {c1, c2, ..., cn}\n```\n\nFor each claim, a reviewer needs to know whether it is supported by evidence, contradicted, or still an assumption. A rough diagnostic metric is:\n\n```\nevidence_coverage = supported_claims / total_material_claims\n```\n\nIf an answer contains 20 material claims and sufficient evidence exists for 12:\n\n```\nevidence_coverage = 12 / 20 = 60%\n```\n\nThe remaining 40% are not necessarily wrong. They are the area a reviewer still needs to inspect. If a tool does not expose that area, the engineer first has to discover it and only then verify it. That is hidden verification work.\n\nThe goal of a verification layer is not to declare an answer absolutely correct. 
It is to:\n\n- connect material claims to checkable evidence;\n- expose relevant targets that were and were not inspected;\n- separate assumptions from supported conclusions;\n- preserve critique and rejected hypotheses;\n- surface open production and PR risks;\n- narrow the manual search area without hiding uncertainty.\n\nReview remains. The search area should become smaller.\n\n## 6. When extra verification pays for itself\n\nIgnoring risk for a moment, an extra check costing `ΔC` pays for itself\nwhen it saves at least `T_break_even = ΔC / R_hour` of review time. At\n`R_hour = $80`:\n\n| Extra cost/run | Required review saving |\n|---|---|\n| $2 | 1.5 min |\n| $5 | 3.75 min |\n| $10 | 7.5 min |\n| $20 | 15 min |\n\nReducing `P_escape` by 0.1 percentage point (from 1.0% to 0.9%) at\n`L_escape = $10,000` yields:\n\n```\n(0.010 - 0.009) × $10,000 = $10 expected saving per run\n```\n\n| L_escape | Saving/run | Saving/month at 100 runs |\n|---|---|---|\n| $1,000 | $1 | $100 |\n| $10,000 | $10 | $1,000 |\n| $100,000 | $100 | $10,000 |\n| $1,000,000 | $1,000 | $100,000 |\n\nThis is an expected-loss model, not a measured product outcome and not literal insurance. Expensive verification can still be economically rational when a small reduction in failure probability protects against a large loss.\n\n## 7. One implementation used to test the hypothesis\n\nOne implementation we are building and evaluating is\n[Undes](https://undes.app). Multiple models, critique, consensus, and\nsynthesis are mechanisms. The product object being tested is a reviewable artifact\nthat aims to preserve:\n\n- the proposed solution or code candidate;\n- the evidence it rests on;\n- relevant targets that were and were not checked;\n- assumptions and claims that could not be proven;\n- critique and rejected hypotheses;\n- open production and PR risks;\n- recommended next checks;\n- a trust verdict.\n\nThe current state must be separated from the target model. 
The runtime normalizes\nverdicts to `PATCH_SAFE` or `DIAGNOSTIC` and stores a separate\n`patch-safe` boolean. Today it lands on\n`DIAGNOSTIC / patch-safe=false` more often than not. The phrases “safe to\napply,” “needs review,” and “insufficient evidence” are human-facing interpretations\nof a trust boundary, not three implemented runtime enums.\n\nRouting is not a hidden automatic cost optimizer. Operators explicitly declare providers, models, and per-stage overrides. Single-model mode is opt-in and reports the absence of cross-model assurance. The accurate description is configurable, operator-controlled routing.\n\nThis does not establish product superiority. It identifies an implementation of an architectural hypothesis that still needs a comparative benchmark.\n\n### What the internal telemetry says\n\nAcross two internal evaluation runs, we measured input tokens spent before the first\ntargeted seam-fetch (`tokensBeforeFirstSeamFetch`):\n\n| Run | Total input tokens | Before first targeted fetch | Share |\n|---|---|---|---|\n| A | 322,807 | 170,162 | 52.7% |\n| B | 352,432 | 183,876 | 52.2% |\n| Weighted | 675,239 | 354,038 | 52.4% |\n\nThis is not the first evidence of any kind: a context pack and observed files were\navailable earlier. The metric marks the first targeted probe of a specific seam.\nBoth runs ended in `DIAGNOSTIC`, not trusted output.\n\nTwo observations are not a benchmark. They do not establish token or time savings. They frame a measurable hypothesis: targeted evidence acquisition starts late, so some reasoning may happen before key premises are tested.\n\n## 8. 
A falsifiable benchmark\n\nA minimum comparative protocol could be:\n\n```\n5 public repositories across different stacks\n20 tasks per repository\n4 workflow variants\n2 independent repeats\nTotal: 5 × 20 × 4 × 2 = 800 runs\n```\n\nWorkflow variants:\n\n- Strong single-model coding assistant.\n- Multi-model deliberation without a repository trust artifact.\n- Verification workflow in single-model mode.\n- Verification workflow in multi-model mode.\n\n| Metric | What it measures |\n|---|---|\n| Evidence coverage | Material claims tied to checkable evidence |\n| Unchecked relevant targets | Missed files, callers, and seams |\n| Unsupported-claim rate | Claims without sufficient grounding |\n| Missed-risk count | Ground-truth risks absent from output |\n| False-confidence rate | Confident verdict on a wrong candidate |\n| False-patch-safe | Unsafe result that passed the gate |\n| Avoidable-DIAGNOSTIC | Correct candidate rejected by an evidence-acquisition defect |\n| Reviewer minutes | Time to an apply/review/reject decision |\n| Model cost | Actual call cost |\n| Time to first targeted fetch | When targeted seam checking started |\n\nDo not collapse these into one composite score. A cheap unsafe answer does not become better, and an expensive “insufficient evidence” can be the correct result.\n\n## 9. Limits of what is proven\n\n- METR’s 19% slowdown is a specific RCT with early-2025 tools and experienced maintainers, not a universal result [[1]](#ref-1).\n- METR’s newer intervals include zero effect and are described as unreliable by the authors [[2]](#ref-2).\n- Google’s +34% concerns Agentic RAG factuality, not patch safety [[4]](#ref-4).\n- Multi-agent topology can improve or degrade results; consensus does not prove factual correctness [[6]](#ref-6).\n- Kilo figures reported by ByteByteGo are vendor-reported [[8]](#ref-8).\n
- Two internal runs are too few for a performance claim.\n- A trust verdict is not a correctness guarantee; it requires calibration against false confidence and missed risks.\n\n## Conclusion\n\nRouting can materially reduce the model bill, especially in autonomous agentic loops. Agentic RAG checks whether retrieved context is sufficient. Multi-model deliberation surfaces consensus and contradictions, but its effect depends on topology and task shape. Tests and static analysis check formalized properties.\n\nHow far is the candidate supported by evidence, and what still needs human verification before merge?\n\nCheap inference, fast review, and a convincing artifact are worthless if they raise false confidence. The research hypothesis is that the value of a verification layer is determined not by how much code it generates, but by how much it narrows hidden checking work without increasing false confidence.\n\nUntil a comparative benchmark is run, this remains a grounded architectural hypothesis with working telemetry, not a proven productivity claim.\n\n## References\n\n1. [METR. Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity](https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study-paper.pdf)\n2. [METR. We are Changing our Developer Productivity Experiment Design](https://metr.org/blog/2026-02-24-uplift-update/)\n3. [Bacchelli, Bird. Expectations, Outcomes, and Challenges of Modern Code Review](https://www.cabird.com/pubs/bacchelli2013eoc.pdf)\n4. [Google Research. Agentic RAG and the Sufficient Context Agent](https://research.google/blog/unlocking-dependable-responses-with-gemini-enterprise-agent-platforms-agentic-rag/)\n5. [Google Research. Sufficient Context: A New Lens on RAG Systems](https://research.google/pubs/sufficient-context-a-new-lens-on-retrieval-augmented-generation-systems/)\n6. [Google Research. Towards a Science of Scaling Agent Systems](https://research.google/blog/towards-a-science-of-scaling-agent-systems-when-and-why-agent-systems-work/)\n7. [DORA / Google. Fostering Developers’ Trust in AI](https://research.google/pubs/fostering-developers-trust-in-ai/)\n8. [ByteByteGo. Token Spend Out of Control? The Case for Smarter Routing](https://blog.bytebytego.com/p/token-spend-out-of-control-the-case)\n9. [Google Cloud. 2025 DORA State of AI-assisted Software Development](https://cloud.google.com/blog/products/ai-machine-learning/announcing-the-2025-dora-report)\n10. [OpenRouter Fusion documentation](https://openrouter.ai/docs/guides/features/plugins/fusion)\n11. [Kilo AI Gateway documentation](https://docs.kilo.ai/docs/gateway)\n12. [Ong et al. RouteLLM: Learning to Route LLMs with Preference Data](https://proceedings.iclr.cc/paper_files/paper/2025/file/5503a7c69d48a2f86fc00b3dc09de686-Paper-Conference.pdf)", "url": "https://wpnews.pro/news/ai-coding-agents-need-evidence-first-review-not-just-cheaper-routing", "canonical_source": "https://undes.app/blog/cheaper-ai-code-generation-engineering-cost", "published_at": "2026-06-24 18:06:39+00:00", "updated_at": "2026-06-24 18:09:42.086588+00:00", "lang": "en", "topics": ["artificial-intelligence", "ai-agents", "developer-tools", "ai-safety", "ai-research"], "entities": ["METR", "DORA", "Google", "Microsoft", "Bacchelli", "Bird"], "alternates": {"html": "https://wpnews.pro/news/ai-coding-agents-need-evidence-first-review-not-just-cheaper-routing", "markdown": "https://wpnews.pro/news/ai-coding-agents-need-evidence-first-review-not-just-cheaper-routing.md", "text": "https://wpnews.pro/news/ai-coding-agents-need-evidence-first-review-not-just-cheaper-routing.txt", "jsonld": "https://wpnews.pro/news/ai-coding-agents-need-evidence-first-review-not-just-cheaper-routing.jsonld"}}