{"slug": "review-doesnt-scale-validation-does", "title": "Review Doesn’t Scale, Validation Does", "summary": "Mike, in collaboration with Anthropic's Claude, argues that code review does not scale to the volume of code produced by AI agents, and that validation must replace it as the primary safeguard. The engineer contends that plan review remains necessary but insufficient, and that teams must design systems where correctness can be verified without reading every line of agent-generated output.", "body_md": "**By Mike, in collaboration with Claude (Anthropic)**\n\nThe main guide’s Chapter 4 calls plan review the primary skill of the agentic era. That’s correct as far as it goes — reviewing what an agent intends before it executes is small, tractable, and underused. But it’s not sufficient, and the way most teams talk about “AI code review” suggests they haven’t noticed why.\n\nThe same agent that produces a clean 200-line plan produces 2,000 lines of code from it. The plan you can review. The code you can’t — not really, not at the volume agents produce it. What looks like review at that scale is mostly skimming, pattern-matching, and trust.\n\nThis chapter narrows Chapter 4’s claim to one the evidence supports:\n\n**Plan review is necessary but insufficient. At agent volume, code review doesn’t scale — validation does. The skill that matters is designing the system so correctness can be verified without anyone reading every line.**\n\nThat shifts where the engineer’s attention goes: less “did the agent write the right code?” (which assumes you can tell), more “is correctness defined sharply enough, and checked broadly enough, that the agent can’t drift without something noticing?” Plan review still happens. It just stops being the load-bearing safeguard.\n\nThe alternatives organize into three properties the environment needs to provide (specification, verification, and containment), mirroring the Chapter 3 deep-dive’s structural move.\n\nEach is a layer of validation that doesn’t depend on reading the code. 
Each is individually insufficient. The defense-in-depth comes from combining all three — and from keeping plan review as the upstream gate that defines what each layer checks against.\n\nCode review worked because humans wrote code at human speed. A 200-line PR was hours of thinking; reading it took minutes; the ratio favored the reviewer. And the reviewer wasn’t just checking syntax — they were reconstructing the author’s reasoning from code that carried enough of it to make reconstruction possible.\n\nAgents broke the ratio. Chapter 2’s evidence: Faros AI measured a 91% increase in review time and 154% increase in PR size after AI adoption; Jellyfish saw roughly 2x PR throughput at full adoption; GitClear’s 2026 cohort found power users authoring 4–10x more code. The volume is measured, not theoretical.\n\nAt that volume the cognitive task changes, not just the time. A reviewer facing 2,000 lines of agent output isn’t reconstructing the agent’s reasoning, because the agent didn’t reason like a human — there’s no intuition to recover. So the reviewer either reads carefully enough to verify each line (slower than writing it would have been, which defeats the premise) or skims for surface patterns and rubber-stamps the rest. At scale, the second is what happens.\n\nChapter 1’s Moltbook case is the clean example: AI-generated code passed all functional tests, the app shipped, 1.5 million API keys leaked because Row Level Security was never enabled. The code worked. The review didn’t catch what wasn’t there.\n\nKarpathy named the accumulation *comprehension debt* — what builds up when agents one-shot code nobody reads. Unmesh Joshi and Martin Fowler reached the same place from a different angle in a January 2026 conversation, calling it *cognitive debt*: LLM-generated code without shared understanding leaves the team unable to evolve what it shipped. 
Two senior practitioners, two names, one phenomenon — it grows mechanically with throughput unless something other than human reading is closing it. That’s the inversion at the heart of this chapter: review presumes someone reads carefully enough to catch errors, and at agent volume that presumption breaks. Validation replaces it with mechanisms that don’t depend on careful reading.\n\nThe plan still matters — it’s small, and it tells you what the agent intends. But once code starts flowing, the question stops being “did I review this?” and becomes “what’s catching the things I didn’t?”\n\nThe cheapest bug to catch is the one the agent never writes. Specification makes “what the agent should produce” precise enough that wrong intent surfaces during the spec phase, the agent has an unambiguous target, acceptance criteria become testable constraints rather than English-language hints, and drift becomes detectable because there’s something concrete to drift from.\n\nThe shift is from “tell the agent what to build” (a prompt) to “define what ‘built’ means, then let the agent build” (a contract). A January 2026 practitioner account of Kiro’s spec mode captured the effect: acceptance criteria stopped being guidance and became constraints the system enforced — not because the author tried harder, but because deviation became obvious.\n\nThe academic framing arrived the same month. Piskala’s “Spec-Driven Development: From Code to Contract” (arXiv, January 2026) formalizes a **specification spectrum** — code-first → spec-first → spec-anchored → spec-as-source — where moving right increases the spec’s authority over the code, and the discipline required to keep them aligned. 
Most teams sit at code-first (specs written after, if at all, and drifting); spec-first is the entry point (a spec guides the initial build, may not be maintained); spec-anchored keeps the spec as living documentation synced with code; spec-as-source treats the spec as the real artifact and regenerates code from it. The paper is a framework-and-case-studies guide, not a controlled trial, so it formalizes the practice rather than proving it improves outcomes — but its four-phase workflow (Specify, Plan, Implement, Validate) maps almost exactly onto this chapter’s argument: plan review sits upstream, validation carries the close.\n\n**Amazon Kiro** (GA August 2025) is a VS Code fork built around spec-driven development with explicit phases: requirements gathering produces user stories with acceptance criteria; technical design produces architecture and schemas; task breakdown produces a sequenced plan; only then does the agent execute. Each phase is reviewable before the next runs, in both Requirements-First and Design-First variants, with a separate Bugfix Spec mode. The insight worth naming: Kiro doesn’t try to make agent *output* reviewable. It makes the *spec* reviewable — where volume is small and stakes are clearest — then constrains the agent to implement against it.\n\n**GitHub Spec Kit** (open-sourced September 2025, MIT) is the tool-agnostic version: templates, a CLI, and prompts that center work on specification → plan → small testable tasks, with the agent (Copilot, Claude Code, Gemini CLI, Cursor, or any of 30+ integrations) doing the implementation. GitHub’s framing: teams treat coding agents like search engines when they should treat them like literal-minded pair programmers who need unambiguous instructions.\n\nThe two converging on the same workflow shape is itself informative — Kiro is one vendor’s bet, Spec Kit is GitHub’s open-source standardization of what’s becoming a category convention. 
Neither is complete: both still depend on the spec being well-written, and neither stops a bad spec from producing bad code. But both move human review upstream to where it’s cheapest.\n\nBehavior-Driven Development predates the agentic era by fifteen years, but its core artifact — executable acceptance criteria in Given-When-Then form — now does work it wasn’t designed for: a specification format humans and agents read with the same meaning. A single Gherkin scenario is simultaneously a human-readable requirement, an agent-readable spec, and (with a step-definition layer) an executable test. The same artifact carries intent through three audiences without translation loss — and translation loss is exactly where ambiguity-driven bugs are born.\n\nWhat’s new: BDD scenarios are now inputs to the agent, not just verification after the fact. The Gherkin file that runs as a test can be the contract referenced in the agent’s context. The catch is that this only works if the scenarios are written *before* implementation. Scenarios the agent generates *after*, against the code it just wrote, are descriptions, not specifications — and they carry the self-correction blind spot the main guide flags in Chapter 7.\n\n**Auto-generated specs.** Same failure mode as auto-generated AGENTS.md files in the Chapter 3 deep-dive: if the system writing the spec is the same kind that reads it, you’ve added tokens, not constraints. A spec the agent produced from a vague prompt and then implemented against is “the agent did what it wanted and documented it,” not spec-driven development.\n\n**Specs that drift from implementation.** The first time a spec and the code disagree and the team trusts the code, the spec is decorative from then on. 
The fix is the same as for any living artifact: review in PRs, treat changes as substantive, and ideally fail CI when the spec and the tests derived from it disagree.\n\n*Sources: Kiro Documentation & Specs guide (aws.amazon.com); InfoQ, “Beyond Vibe Coding: Amazon Introduces Kiro” (Aug 2025); GitHub Blog, “Spec-driven development with AI” (Sep 2025); DEV.to, “What I Learned Using SDD with Kiro” (Jan 2026); Piskala, “Spec-Driven Development: From Code to Contract,” arXiv:2602.00180 (Jan 2026).*\n\nSpecification defines what should be true. Verification continuously checks whether it is, without anyone reading the code to find out. This is the layer that has to scale with agent output, because it’s the only one that can — humans don’t read more carefully when a codebase grows, but automated checks fire regardless of size.\n\nThe Chapter 3 deep-dive treated tests as the agent’s feedback loop. The framing here is about *your* needs: when you stop reading code, tests become the only thing reliably standing between agent output and production. That changes the job. In a human-review world, tests catch regressions while reviewers catch logic and design problems. In an agent-volume world where reviewers skim, tests have to do both — define the contract the agent implements against *and* catch where the implementation deviates.\n\nSo coverage stops being a hygiene metric and becomes load-bearing — not in the gameable line-coverage sense (agents hit 100% line coverage on code that doesn’t do what the spec says) but in the behavioral sense: are the contracts the spec defines actually exercised under the conditions where they could fail?\n\nThis is where TDD’s order matters. Tests written before implementation describe intent; tests written after describe behavior. With agents that’s the whole difference between specification and post-hoc rationalization. 
Chapter 1’s Reco rewrite worked partly because 1,778 JSONata test cases existed *before* the AI was involved — the AI’s job was to make them pass, not to invent what “pass” meant. Martin Fowler’s January 2026 fragment, citing Unmesh Joshi, framed it as a forcing function: directing thousands of lines of generated code requires something that makes you understand what’s being built, and the test is that something.\n\nThe calibrated claim: TDD-style workflows (tests-as-specification, small steps, fast green/red cycles) are well-suited to agent feedback loops, and teams shipping clean agent-generated code are nearly all using something in the TDD family. Whether that’s causal or merely correlated with broader engineering discipline is empirically unsettled — there’s no RCT, only consistent practitioner experience.\n\nOne technique deserves a specific mention because it targets the agent failure mode directly: **property-based testing** (Hypothesis, fast-check, jqwik). Agents write code that works on the inputs you mentioned and fail on the ones you didn’t. Property tests — “for all inputs satisfying P, the output satisfies Q,” with the framework generating thousands of adversarial cases — close that gap mechanically. They’re one of the few approaches actively *better* at catching agent-generated bugs than human-written ones, because they exercise exactly the “didn’t think of this case” failure agents are prone to. The cost is thinking in invariants rather than cases, but those invariants then cover every future version an agent refactors the module into.\n\nChapter 1’s two cleanest successes relied on something stronger than tests — an external system as ground truth. Reco ran gnata in shadow mode for days: production traffic still served by jsonata-js, gnata evaluating every expression in parallel, mismatches logged, promotion only after three consecutive days of zero mismatches. 
Carlini used GCC as a compilation oracle: build most kernel files with GCC, a random subset with Claude’s compiler, and if the kernel broke the bug was in the subset — turning one monolithic question into many parallelizable ones.\n\nThe shared pattern is **differential verification**: comparing the agent’s output against a trusted reference instead of only against pre-written tests. It’s expensive — you need a reference — but where one exists it’s the strongest verification available, because it catches anything that diverges from known-good behavior, not just the bugs you anticipated. For most teams the equivalent is running new alongside old and diffing outputs, testing migrations against a copy of production data, or canary-comparing API responses in live traffic. It’s the mechanism that bridges from “the spec says X” to “the system does X under load, on real data” — somewhere tests can’t reach.\n\nThe techniques above only matter if they run, and the volume problem applies to CI too: 10x the code with flat CI capacity, and CI becomes the bottleneck. Carlini’s compiler made it concrete — without externally enforced CI, agents broke existing functionality faster than they improved it. CI wasn’t a quality gate; it was the only thing between agent productivity and net regression.\n\nGood CI in the agentic era is fast (slow CI means people merge on faith), multi-layered (local → PR → nightly, each catching different categories), and inclusive of non-test checks — static analysis, security scans, complexity gates, license checks. He et al. found static-analysis warnings up 30% and complexity up 41% after Cursor adoption; if CI doesn’t catch those, nothing does. And it must be independent of the generation tool: the agent that wrote the code doesn’t get to decide CI passed (the main guide’s Chapter 7 point, applied here).\n\nVerification is necessary, not magic. Three failures escape it. 
**Specification gaps** — tests verify what they were written to verify; if the spec didn’t mention CSRF, the tests won’t, and the Tenzai study found all 15 AI-built apps lacked it. That’s a specification failure verification can’t compensate for. **Emergent behaviors** — race conditions, load-dependent degradation, production-data-only bugs; property tests and shadow mode close some, nothing closes all. **Semantic properties tests can’t express** — “maintainable” has no unit test; complexity metrics are a proxy, not the thing.\n\nSo verification doesn’t replace review entirely. It handles what it can (most cases) and concentrates the residual human effort on the parts automation can’t reach — a much smaller, more focused task than reading every line, and the only version of code review that scales with agent output.\n\nThe first two layers reduce the probability of bad output reaching production. They don’t reach zero, and at agent volume “rare” happens often enough to plan for. Containment ensures that when something slips through, the damage is bounded and recoverable. It’s the layer the main guide’s Chapter 4 doesn’t address, because it’s not about reviewing intent — it’s about engineering the environment so bad intent is survivable. The SRE term is **blast radius management**: limiting impact on the assumption that failures are inevitable.\n\nChapter 1’s failure cases are containment failures, not detection failures. Replit/SaaStr: the agent deleted a production database during a code freeze — the failure wasn’t the mistake (agents make mistakes) but that “delete production database” was an autonomous action with no dev/prod separation and no gate on irreversible operations. Amazon/Kiro: an agent decided to delete-and-recreate a production environment, inheriting an engineer’s elevated permissions and bypassing a two-person approval built for humans. Moltbook: agents shipped a database with Row Level Security disabled. 
In all three the bug wasn’t the mistake — it was that the mistake reached production uncontained. Better tests wouldn’t have caught the SaaStr deletion (the code worked as instructed); better review wouldn’t have caught Moltbook (functional tests passed). What was missing was the layer saying “this action is too dangerous for an agent to take unapproved.”\n\nThe practical mechanisms are unglamorous infrastructure that pays for itself the first time it prevents an incident:\n\n`DROP TABLE`, `rm -rf`, `terraform destroy`, force-push to main. The agent proposes; a human or separate check approves. The friction is the point.\n\nTwo caveats keep containment honest. Some operations have no good reversibility — sending email, charging a card, posting publicly, side-effecting external APIs — and there the spec-and-verify layers carry more weight because there’s no take-back. And defense-in-depth degrades when a team leans on one layer: if specification is sloppy and verification patchy, containment ends up doing work it wasn’t designed for, and the team learns its limits the hard way. Containment doesn’t replace the other two; it determines whether their inevitable failures are survivable.\n\nThe main guide’s Chapter 4 isn’t wrong about plan review — it’s incomplete. Plan review still earns its keep as the upstream gate: the plan is small enough to read carefully, it’s where intent is most legible (before translation to code, before scope drifts, before the sunk-cost trap kicks in), and it’s the cheapest place to catch “we’re solving the wrong problem” and “this doesn’t fit how we build things.”\n\nWhat the chapter argues against is stopping there. The implicit model in much AI-review discourse is “review the plan, then review the code, and you’ll catch the problems.” The first half works. The second doesn’t, because code volume defeats the reviewer. 
Keep the first half; replace the second with validation that doesn’t depend on reading:\n\nThat last reallocation is the point. The reading you used to spread thinly across every line now concentrates on the small fraction where it’s high-leverage. Lower total load, higher catch rate on what matters, the rest automated. That’s review when it stops pretending to scale and starts scaling for real.\n\nThe three layers compose into defense-in-depth. The ranking — roughly by impact-per-effort — is the actionable part, because the items themselves aren’t novel; their priority order in an agent-volume world is:\n\nWhat’s specific to the agentic era isn’t the list — most of it is ordinary modern practice — it’s the ranking. Pre-AI, code review sat near the top and the rest was good hygiene. At agent volume, code review demotes itself out of the top spots (not unimportant, just non-scaling) and the hygiene items move up because they’re the things that *do* scale. Plan review survives at the top by staying small and upstream.\n\n**Specification.** Before the agent writes code, is there a concrete artifact — acceptance criteria, BDD scenarios, a spec — a different engineer could implement against? When the implementation differs from the spec, do you trust the spec or the code? (If the code, the spec is decorative.)\n\n**Verification.** If a regression were introduced today, would CI catch it before merge or would it ship and reveal itself in production? Are your tests defining the contract the agent implements against, or describing the behavior it already produced? What fraction of your verification depends on the same model that generated the code? (Anything above zero is worth a hard look.)\n\n**Containment.** What can your agent do without asking — modify files, run commands, access production? If it did the wrong thing in the worst way right now, what’s the largest blast radius, and is it survivable? 
Could you roll back the last agent-generated change in under five minutes?\n\n**The summary question.** If you removed code review tomorrow and replaced it with what’s already in your specification, verification, and containment layers — would the system catch what review currently catches? If review is doing work nothing else is, that work won’t scale with agent output. The validation infrastructure has to take it over, or it stops getting done.\n\nPlan review is the primary *intent-side* skill of the agentic era; validation is the primary *output-side* skill. Together they’re the engineering envelope around agent work. Apart, each is insufficient.\n\nThe shift is small but consequential: less time pretending to review code too voluminous to actually review, more time on the artifacts that define what the agent should produce, the systems that check whether it did, and the constraints that make failure recoverable. The work doesn’t disappear — it moves to where it compounds.\n\nWhat survives across tooling generations isn’t any spec format or test framework or containment pattern. It’s the recognition that at agent volume, human reading isn’t the safeguard you can lean on. Teams that build accordingly keep their systems trustworthy as agent capability scales. Teams that don’t keep generating more code, faster, with less of it understood — until one change reaches production with consequences nothing was watching for.\n\nThe skill is designing the envelope. 
The agent does the typing.\n\n| Source | Year | Relevance |\n|---|---|---|\n| Kiro Documentation & Specs guide (AWS) | 2025–2026 | Spec-driven workflow; Feature/Bugfix specs; Requirements-First, Design-First |\n| InfoQ, “Beyond Vibe Coding: Amazon Introduces Kiro” | 2025 | Independent reporting on Kiro’s three-phase workflow |\n| GitHub Blog, “Spec-driven development with AI”; github/spec-kit | 2025–2026 | Open-source SDD; 30+ agent integrations |\n| DEV.to, “What I Learned Using SDD with Kiro” | 2026 | “Acceptance criteria as enforced constraints” |\n| Piskala, “Spec-Driven Development: From Code to Contract” (arXiv:2602.00180) | 2026 | Formalizes spec spectrum (spec-first → spec-anchored → spec-as-source); Specify/Plan/Implement/Validate workflow |\n| Martin Fowler Fragments (Jan 8), citing Unmesh Joshi | 2026 | TDD as forcing function for understanding agent output |\n| Fowler, Joshi & Parsons, “LLMs and the what/how loop” conversation (Jan 21) | 2026 | “Cognitive debt” from LLM code without shared understanding |\n| “Coding Is Like Cooking” TDD survey; Eric Elliott TDD+AIDD | 2025–2026 | TDD workflow adaptations; test speed as feedback determinant |\n| Karpathy, “notes from claude coding” | 2026 | Tests-first pattern; comprehension debt |\n| Carlini, “Building a C compiler with parallel Claudes” | 2026 | GCC oracle; CI as regression guardrail; differential verification |\n| Barak, “We Rewrote JSONata with AI in a Day” | 2026 | Shadow mode; 1,778 test cases as pre-existing spec |\n| Lemkin X posts; Fortune; Fast Company | 2025 | Replit/SaaStr containment failure |\n| Financial Times Amazon/Kiro investigation | 2026 | Permission inheritance; outage chain |\n| Wiz Research, Moltbook breach | 2026 | RLS-disabled spec gap reaching production |\n| Faros AI telemetry; Jellyfish 20M PR analysis | 2025–2026 | 91% review-time increase, 154% PR inflation; 2x throughput |\n| GitClear 2026 cohort follow-up | 2026 | Power users authoring 4–10x more code |\n| He et al., 
“Speed at the Cost of Quality” (MSR ’26) | 2026 | 30% static-analysis warning increase; 41% complexity increase |\n| Tenzai security study | 2025 | 69 vulnerabilities across 15 AI-built apps; spec-gap pattern |\n| Altimetrik / Apono / Lumos blast-radius references | 2024–2026 | Established SRE/security containment terminology |", "url": "https://wpnews.pro/news/review-doesnt-scale-validation-does", "canonical_source": "https://dev.to/my2centsonai/review-doesnt-scale-validation-does-2e23", "published_at": "2026-06-05 08:03:57+00:00", "updated_at": "2026-06-05 08:43:19.074829+00:00", "lang": "en", "topics": ["ai-agents", "ai-safety", "ai-research", "ai-tools", "mlops"], "entities": ["Mike", "Claude", "Anthropic"], "alternates": {"html": "https://wpnews.pro/news/review-doesnt-scale-validation-does", "markdown": "https://wpnews.pro/news/review-doesnt-scale-validation-does.md", "text": "https://wpnews.pro/news/review-doesnt-scale-validation-does.txt", "jsonld": "https://wpnews.pro/news/review-doesnt-scale-validation-does.jsonld"}}