{"slug": "review-is-the-symptom-specification-is-the-fix", "title": "Review is the Symptom. Specification is the Fix.", "summary": "Addy Osmani's analysis of AI code review reveals a crisis: 861% more code churn, defect rates up to 54%, and review duration up 441%. The data shows that generating code at 4x volume yields only 10% more delivered value, indicating that the bottleneck has shifted from writing to verification. Instead of improving review processes, the solution is to shift left with machine-enforceable specifications—type signatures, contract tests, and property assertions—that run in CI and fail the build when violated, providing faster feedback than human review.", "body_md": "Addy Osmani published [Agentic Code Review](https://x.com/addyosmani/status/2066595308629594363), the most data-rich, most honest assessment of the AI code-review crisis to date. The Faros numbers alone are worth the read: 861% more code churn, defect rates from 9% to 54%, review duration up 441%, and zero-review merges up 31% — not because anyone chose to skip review, but because reviewers simply could not keep pace. The GitClear finding distills the whole problem into one ratio: **4x the code for 10% more delivered value.**\n\nThe diagnosis is correct. Writing got cheap. Understanding didn't. The bottleneck moved to verification. Every number confirms it.\n\nThe prescription — review better, review smarter, tier by risk, run heterogeneous AI reviewers, keep the human on the merge button — is sound operational advice for the world as it exists right now. Teams should follow it. It will help.\n\nIt is also a human-scale solution to a machine-scale problem. Generation runs at machine speed. Review runs at human speed. No amount of improving human review closes a gap where one side accelerates and the other is fixed. The article's own data proves this: review duration up 441% because humans can't read faster, zero-review merges up 31% because humans gave up. 
Those aren't signs that review needs improvement. They're signs that review as a primary verification mechanism has hit a ceiling that better process cannot raise. You cannot solve a machine-scale problem with human-scale approaches.\n\nThe article accepts that agents generate code at 4x volume and asks: how do we review it? Every solution flows from that question — triage, tiering, AI-assisted review, smaller PRs, circuit breakers for high-maintenance changes.\n\nThe question it never asks: **why is 4x code being generated before anyone declared what correct means?**\n\nIf 4x the code produces only 10% more delivered value, then roughly 75% of what was generated — and what reviewers must now process — produced no value at all. That is not a review problem. It is a generation problem. Review is being asked to filter a firehose that should have been a garden hose, and the article proposes a better filter instead of asking why the hose is that wide.\n\nA specification — a human-authored declaration of what the output must satisfy, checked mechanically — does two things the review-centric model cannot:\n\nIt constrains generation *before* code exists, so the agent builds toward a declared target rather than generating plausible-looking output that a human must later evaluate for intent.\n\nIt enables mechanical verification *at machine speed*, so the deducible questions (\"does this satisfy the contract?\") are answered by a deterministic gate, not by a human reading diffs.\n\nFor developers who find \"specification\" abstract: you already do this when you write a type signature, a property-based test, a schema, a contract test, or an interface definition. Formal methods, type-driven development, design-by-contract — these are the established disciplines that make specifications concrete and enforceable. The idea is not new. Applying it as the *primary* verification mechanism for agent-generated code is new.\n\nThis is *not* the 200-page requirements document from the waterfall era that was obsolete before anyone wrote a line of code. That document failed for good reasons — it was a human-readable document that no machine enforced, written once and never updated, separated from the code it described. 
What we mean by specification here is the opposite in every dimension: **small, mechanical, enforceable declarations that live in the code and run on every commit.** A type signature is a specification. A contract test is a specification. A property assertion is a specification. An interface definition is a specification.\n\nThey're in the repo, versioned and run in CI. They fail the build when violated. They don't go stale because they execute. They don't get ignored because the machine enforces them. The waterfall specification was a document humans were supposed to read. This specification is a constraint machines are required to check. Same word, opposite mechanism, opposite failure mode.\n\nAgile's core contribution was the fast feedback loop: build a small increment, get feedback immediately, adjust. A type check that fails on commit is a faster feedback loop than a human reviewer who takes 441% longer to respond. A contract test that runs in CI on every push is a faster feedback loop than a code review that is increasingly skipped altogether (zero-review merges up 31%). Machine-enforced specifications are Agile's fast-feedback principle applied at the architectural level — tighter loops, faster signals, every commit, no exceptions. The waterfall specification failed because it was slow feedback (write the document, wait months, discover it was wrong). This specification succeeds because it is the fastest feedback in the system.\n\nReview is required when you have no specification. It is the manual, expensive, fatiguing reconstruction of intent that was never written down. 
The article's own best insight — that agent code lacks intent and the reviewer must reconstruct it — is one sentence away from the fix: **write the intent down first, as a specification, and verify against it mechanically.** The article reaches for decision logs on PRs. The fix is specifications before generation.\n\nThe GitClear number — 4x output, 10% more value — is the most important figure in the article, and the article underreads it.\n\nIf you generate four times the code and delivered value rises by only a tenth, you have an accumulation problem, not a review problem. Every unnecessary line generated is: tokens burned (inference cost), review time consumed (the 441% increase), churn created (the 861% increase), and defect surface expanded (the 9%-to-54% jump). The review crisis is downstream of the generation crisis. Better review cannot fix it because the input volume is the cause.\n\nThe alternative is the subtraction discipline: **maximize the work not generated.** A specification that declares what correct means constrains the agent to produce what's needed — not four variations of plausible-looking code that a human must evaluate. Fewer lines generated means fewer lines to review, fewer defects to catch, less churn to manage, and lower inference cost. The article's own numbers prove the case for specification-first development more than they prove the case for better review.\n\nThe article treats review as the verification layer. That conflation is the heart of the problem.\n\nReview is a human reading code and forming a judgment. It is subjective, it fatigues, and it scales at human reading speed, which the article correctly notes has not changed since we started staring at screens. It produces opinions, not verdicts. Verification is a deterministic check of output against a declared property. 
It does not fatigue, it scales at machine speed, and it produces verdicts — pass or fail, the same answer every time, reproducible on demand.\n\nThe article's data proves the distinction: review duration up 441% because humans can't read faster. Zero-review merges up 31% because humans gave up trying. A verification gate — a type checker, a property-based test suite, a contract enforcer — would not have slowed down. It would have checked every PR, at machine speed, against declared properties, with no 441% increase, no fatigue, and no \"zero-verification merges\" because a machine doesn't skip checks when it's tired.\n\nThe article mentions CI and tests but treats them as support for review — the wall that holds while humans do the real work. The inversion the data demands is: **verification is the primary gate. Review is for the questions verification can't answer.** Deducible questions (does this satisfy the contract, does this match the type, does this violate an invariant) go through the deterministic gate. Only the questions that require human judgment (is this the right change to build, does this architectural choice make sense, is this the right abstraction) go to a person.\n\nThat split is required to survive the volume. Not by reviewing 4x code 441% slower. By routing the deducible questions to a machine that handles them at machine speed, and reserving human attention for the genuinely uncertain questions that require it.\n\nThe underlying design principle is a separation: the system needs both human judgment and machine speed, but not in the same step. Humans in the enforcement loop are the weakest link. They get tired, skip checks, and take 441% longer. Machines in the judgment loop are the weakest link. They lack business context, they can't declare intent, they confidently agree with their own errors. The resolution is to **separate them in space**: humans upstream, declaring intent and providing the judgment that no model has the context to make. 
Machines downstream, enforcing the declared intent at machine speed on every change, without fatigue, without skipping, without the 31% zero-review collapse. Slow and fast, each where they belong. The human's value is not reading diffs — it is deciding what correct means. The machine's value is not forming opinions — it is enforcing declared properties mechanically, on every commit, at the speed generation demands. Mixing them in the same step — a human reading machine-speed output line by line — is the contradiction the 441% number measures.\n\nWhen a human does review — when judgment is required on something the machine can't yet check — that judgment doesn't have to stay human-scale. Every insight a reviewer produces that can be expressed as a rule becomes a specification the machine enforces on every future commit. The human catches it once. The machine catches it forever after. Each review cycle permanently shrinks the pool of things requiring human judgment and permanently grows the pool of things mechanically enforced. Review stops being a recurring cost and becomes a one-time investment that compounds — each human judgment, once converted to a specification, never needs to be made again. The system gets smarter over time without getting slower.\n\nThis is not a novel design. Every domain that hit the same wall — machine speed exceeding human capacity to review and approve — arrived at the same separation. 
The pattern is a blueprint, not an opinion:\n\n| Domain | The speed problem | Human role (slow, upstream) | Machine role (fast, downstream) |\n|---|---|---|---|\n| Aviation | Flight dynamics exceed human reaction speed | Pilot declares intent (heading, altitude, approach) | Flight computer enforces envelope protection — prevents unsafe states at machine speed, regardless of pilot input |\n| Nuclear | Reactor dynamics too fast for human monitoring | Engineers declare safety limits (technical specifications) | Automated interlocks enforce limits — shut down the reactor before a human could even read the gauge |\n| Trading | Trades execute in microseconds | Risk managers declare position limits and loss thresholds | Pre-trade risk checks enforce on every order at machine speed — no order executes unchecked |\n| Manufacturing | Assembly lines too fast for human inspection of every unit | Engineers declare tolerances (specifications) | Automated quality gates measure and reject — sensors enforce what a human eye cannot keep up with |\n| Autonomous vehicles | Vehicle moves faster than human reaction time | Engineers declare safety constraints (maintain distance, stay in lane) | Control systems enforce at sensor speed — every millisecond, not every glance |\n| Software (2026) | Agents generate code faster than humans can review | Human declares intent as specification | Deterministic gate verifies every commit against declared properties at machine speed |\n\nIn every case the solution is the same: **remove the human from the enforcement step where they are the bottleneck, and move them to the declaration step where they provide irreplaceable judgment.** The machine enforces what the human declared. No aviation regulator asks a pilot to read every control-surface adjustment and approve it. No nuclear regulator asks an operator to review every sensor reading and sign off. They declare the envelope. The machine enforces it. 
The same separation that made fly-by-wire safe, reactors survivable, and trading non-catastrophic is the separation software has not yet made — and the 441% review-duration increase is the cost of not making it.\n\nThe article's best conceptual insight is that agent-generated code lacks intent. The reviewer is \"the first human being to ever lay eyes on this code,\" and they must reconstruct a rationale that was never written down. The article frames this as a tooling problem — capture the agent's reasoning as a decision log on the PR, and the reconstruction cost disappears.\n\nThat is a step in the right direction and one step short of the fix.\n\nA decision log captures the agent's reasoning about *how* it implemented the task. It does not capture the human's judgment about *what* the task should have produced. The agent's reasoning is about its choices — why this data structure, why this API call. The intent that matters is upstream: what did the human want this change to accomplish, what properties must it satisfy, what invariants must it preserve?\n\nThat intent, captured as a **specification** before generation begins, does two things a decision log cannot:\n\nIt gives the agent a declared target — not \"implement this feature\" (a prompt) but \"the output must satisfy these properties\" (a spec). The agent generates toward a known target rather than producing plausible output a reviewer must evaluate.\n\nIt gives the verification gate something to check against. A decision log helps a reviewer understand. A specification lets a machine verify. The 441% review-duration increase exists because reviewers are doing manually, per-diff, what a specification-and-verification system would do mechanically, per-commit, at machine speed. In a specification-first world, the human's job is **requirement engineering** — declaring what correct means, what properties must hold, what invariants the system must satisfy. Not reading diffs. Not checking syntax. 
Not reconstructing intent from code. Defining intent before code exists, in a form machines can enforce.\n\nThe four-reviewer experiment — 93.4% non-overlapping findings across CodeRabbit, Greptile, Sentry Seer, and Cursor BugBot — is interesting data. Different tools catch different bugs. Heterogeneity helps. The article correctly argues for running multiple reviewers with different strengths.\n\nBut heterogeneous AI review is still probabilistic review of probabilistic output. [Huang et al. (ICLR 2024)](https://arxiv.org/abs/2310.01798) established that models cannot self-correct reasoning without external feedback and that performance can degrade after self-correction. The article acknowledges the risk — \"broadly correlated blind spots, especially when they come from the same family, confidently agreeing in the same places\" — and calls it \"borrowed confidence.\"\n\nThe fix for borrowed confidence is not more diverse models. It is a **different kind of instrument** — one that is deterministic, independent of all generators, and checks output against declared properties rather than forming an opinion about it. A type checker doesn't have blind spots correlated with GPT-5. A contract enforcer doesn't fatigue. A property-based test doesn't \"confidently agree\" with the generator — it either passes or fails, based on properties a human declared.\n\nAI reviewers are useful. Run them. They catch real bugs. But they are sensors, not oracles. The oracle is the specification plus the deterministic gate. The sensors help. The oracle decides.\n\nThe article describes \"loop engineering\" — automating review into the generation loop so an agent writes, a judge agent reviews, and the loop continues. 
The article correctly identifies the risk: a closed loop of models with no human anywhere is \"borrowed confidence\" where \"the system's certainty becomes yours, and nobody actually understood anything.\"\n\nThe proposed fix: \"the human moves up a level.\"\n\nBut moving the human up doesn't fix the closed loop — it just means the human sees less of what the loop produces. The loop is still models reviewing models. The models still have correlated failure modes. The output is still probabilistic. The verdicts are still opinions.\n\nWhat fixes the loop is introducing an element that is not a model: a deterministic verification gate inside the loop, between iterations, checking each iteration's output against declared properties before the next iteration builds on it. If iteration three violates a property, the gate catches it before iterations four through twenty compound the error. That is not a model reviewing a model. It is physics checking the bridge.\n\nThe article's data proves this is needed. The Faros numbers — 861% churn, 242.7% more incidents — are what happens when the loop runs without a deterministic gate. The code churns because nothing mechanically prevents it from drifting. The incidents rise because nothing mechanically verifies correctness between iterations. Better review inside the loop doesn't fix it, because review is the thing that already failed to scale.\n\nThe article's closing line is right: \"your job is to deliver code you have proven to work.\" But proving code works is not the same as reviewing it until a person feels confident. Proof requires a declared standard (the specification), a mechanical check (the verification gate), and a reproducible verdict (pass or fail, same answer every time).\n\nReview is how you prove things when you have no specification. It is the most expensive, least scalable, most fatigue-prone way to establish that code is correct. It is the only option when nobody declared what correct means. 
The moment you declare it — as a type, a contract, a property, an invariant — the deducible portion of the review burden transfers to a machine that handles it at machine speed, without the 441% slowdown, without the zero-review merges, without the borrowed confidence of models agreeing with models.\n\nThe article's data is the best evidence published this year that the review-centric model is collapsing under the weight of machine-speed generation. The fix is not better review. It is making the intent explicit before generation, so that verification — mechanical, deterministic, tireless — can do the work that review was never built to do at this volume.\n\nWriting got cheap. Understanding didn't. But understanding doesn't have to be reconstructed from every diff by a fatigued reviewer at 441% the old cost. It can be **declared once, up front, as a specification** — and then verified mechanically, at machine speed, on every change, forever. That is how you make understanding scale with generation. Everything else is a more expensive way to not keep up.\n\n*This is a direct response to Addy Osmani's article on code review in the age of AI agents (2026). The data in that piece is the best in the field and the diagnosis is correct. The gap is in the prescription: review-centric solutions treat the downstream symptom while the upstream cause — generating code before declaring what correct means — continues unchecked. The formal basis: Huang et al., \"Large Language Models Cannot Self-Correct Reasoning Yet\" (ICLR 2024). 
If you think review scales to machine-speed generation without a specification layer underneath it, that's the specific disagreement worth having.*", "url": "https://wpnews.pro/news/review-is-the-symptom-specification-is-the-fix", "canonical_source": "https://dev.to/bala_paranj_059d338e44e7e/review-is-the-symptom-specification-is-the-fix-97h", "published_at": "2026-06-17 10:19:27+00:00", "updated_at": "2026-06-17 10:51:47.796514+00:00", "lang": "en", "topics": ["artificial-intelligence", "developer-tools", "large-language-models", "ai-agents", "ai-safety"], "entities": ["Addy Osmani", "Faros", "GitClear"], "alternates": {"html": "https://wpnews.pro/news/review-is-the-symptom-specification-is-the-fix", "markdown": "https://wpnews.pro/news/review-is-the-symptom-specification-is-the-fix.md", "text": "https://wpnews.pro/news/review-is-the-symptom-specification-is-the-fix.txt", "jsonld": "https://wpnews.pro/news/review-is-the-symptom-specification-is-the-fix.jsonld"}}