{"slug": "ai-writes-code-faster-why-hasnt-delivery", "title": "AI Writes Code Faster. Why Hasn’t Delivery?", "summary": "AI code generation has dramatically accelerated the coding step, but software delivery speeds remain bottlenecked by review, testing, CI/CD, and release processes that haven't kept pace. According to an engineer at OceanBase, when only the \"writing code\" step speeds up while validation, integration, and recovery stay the same, organizations simply accumulate more PRs waiting for review and features waiting to ship rather than achieving true 10× delivery. The real shift requires changing how teams set goals, verify code, ship features, and contain risk — otherwise AI multiplies congestion downstream rather than actual delivery velocity.", "body_md": "*The bottleneck didn’t disappear — it moved downstream.*\n\nWhen coding gets 10× faster but review, CI, and release don’t, you don’t get 10× delivery. You get a longer queue behind the keyboard.\n\n*If your stack includes a database layer, that queue often shows up twice — once in application CI, again in migrations, backups, and failover. The closing section links how we think about that at OceanBase.*\n\nI recently watched [an interview with Cat Wu](https://www.youtube.com/watch?v=PplmzlgE0kg) on how Anthropic’s product team went from shipping a feature every few months to every few weeks, sometimes days, and for small slices of work — even within a single day. Our team has been having a parallel conversation: what concrete practices actually turn AI speed into delivery speed?\n\nMy main takeaway isn’t “AI writes code scary fast.”\n\nThat’s the shallow read.\n\nLots of teams now claim delivery is **10× or 50× faster** with AI.\n\nI’m skeptical.\n\nBecause one thing gets conflated all the time:\n\n**Faster code generation is not the same thing as faster software delivery.**\n\nAn agent can draft a patch in ten minutes. Sure. 
But whether that patch can land on `main`, be validated, reach real users, and be debugged or rolled back when it breaks—that’s a different system entirely.\n\n**If only the “writing code” step speeds up while review, testing, release, monitoring, and rollback stay the same, “50× faster” is often a local illusion: one part of the pipe got hot; the org is still stuck.**\n\nMost teams, in my view, aren’t at the “quantity becomes quality” inflection yet.\n\nThe real shift isn’t “everyone runs more coding agents.” It’s that **how you set goals, verify code, ship features, and contain risk** has to change. Otherwise AI doesn’t multiply delivery — it multiplies **PRs waiting for review, features waiting for validation, and branches waiting to merge**.\n\nAs [CircleCI](https://circleci.com/landing-pages/assets/2026-state-of-software-delivery-report.pdf) puts it, success in the AI era is “no longer determined by how fast code can be written” — the decisive factor is whether you can “validate, integrate, and recover at scale.” For most teams, that’s a sharper framing than asking whether AI can generate code at all.\n\nWhat’s more interesting: Cat has said outright that **new internal models weren’t the main driver** of faster iteration. The lever was **process** — how goals are set, how docs are written, how previews ship, how cross-functional work runs, and who has authority to put something in front of users.\n\n**Shipping a feature a day isn’t about whether the model can code. It’s about whether the org removed what used to slow releases down.**\n\nHere’s the thesis up front:\n\n**When code gets cheap, the expensive thing becomes judgment.**\n\nJudgment about what’s worth building, how good is good enough to ship, where a human must decide, and where an agent can run to completion.\n\nSo “AI-native” speed isn’t “everyone uses AI to write code.” Two things have to happen together:\n\n**Less idle motion in the process. 
More clarity in the rules.**\n\nLess idle motion means fewer documents, handoffs, approvals, and waits that exist only because “that’s how we’ve always worked.” More clarity means goals, evidence, verification, permissions, release, and rollback are specified — not reinvented in every hallway conversation.\n\nOnly when both are true does “one release per day” stop being a gamble.\n\nSeeing Anthropic ship often, many people jump to: strong models, engineers on Claude Code, therefore fast coding.\n\nThat helps. It isn’t the main story.\n\nIf coding is the only thing that speeds up, what do you get?\n\nMore PRs waiting for review\n\nMore features waiting for validation\n\nMore half-finished branches\n\nMore production risk\n\nMore arguments about **whether this is shippable at all**\n\nThat’s not faster delivery. It’s **congestion moved downstream** from typing to everything after typing.\n\nA routine feature might have taken days from kickoff to first PR. Now a crisp small ask can get a first diff in minutes. But “code exists” is a short leg of the journey.\n\nAfter that comes a whole chain: Should we build this at all? Is this the right shape? Who owns blast radius? Is test evidence enough? Can we spin a preview? Should this sit behind a flag? How do we roll back?\n\nWhen coding was slow, that chain was **masked by coding time**. Code in thirty minutes, CI still twenty, review still half a day, QA still queued — and the backlog becomes visible.\n\n[Qovery’s piece on AI and DevOps in 2026](https://www.qovery.com/blog/ai-devops-2026-cicd-pipeline-bottleneck) makes the same point from the platform side: when AI coding tools explode code throughput, **CI/CD, environment provisioning, and deployment pipelines** — not typing speed — become the constraint. 
As they put it, the bottleneck has **flipped**: less time coding, more time waiting on builds, previews, and deploys.\n\nSo I’ve stopped framing R&D efficiency as **lines of code per hour**.\n\nA better line:\n\n**AI compresses the coding segment and forces you to face the real system bottleneck.**\n\nThe lesson from Anthropic’s public interviews isn’t simply “they had a better model.” Cat Wu has said internal model use raised shipping speed only “a little bit,” with “the bulk of the increase” coming from process and team expectations. When code gets cheaper to write, both Wu and Mike Krieger describe bottlenecks shifting — to deciding what to build, merge queues, CI, and the other steps that used to hide behind slow coding.\n\nIn classic product development, the PRD (product requirements document) often functions as **comfort food**.\n\nLonger feels more professional. More edge cases upfront feels more in control. Before engineering starts, you get fifteen pages of background, scope, flows, exceptions, competitors, and timeline.\n\nIn AI-native teams, that pattern ages badly.\n\nNot because docs don’t matter — because **long docs often disguise decisions as description**.\n\nWhat matters isn’t pre-specifying every button and edge case. It’s being explicit about **goal, principles, and how you’ll know it worked**.\n\nCat’s interview: routine features don’t need a novel-length PRD; large infrastructure bets still might. Many features need **one page** — goal, principles, metrics — and then people with context make local calls.\n\nThat’s a real role change.\n\nPM value used to be: “I specified it completely; engineering executes.”\n\nNow it’s closer to: **“I made the goal and tradeoffs legible so people with context can decide quickly.”**\n\nThat matters more with agents in the loop. If every micro-decision waits on PM, you’re back to pre-AI cadence. If the goal is fuzzy, agents and humans will sprint in the wrong direction together.\n\nA one-pager isn’t laziness. 
It should answer:\n\n```\n## Goal\nWhat user problem are we improving?\n\n## Non-goals\nWhat are we explicitly not solving this round?\n\n## Principles\nWhen tradeoffs appear, what wins?\n\n## Success signals\nWhat observable outcomes mean \"keep going\"?\n\n## Risks\nWhat must never be touched without a human? What needs explicit sign-off?\n```\n\nThat beats twenty pages of “looks complete” for agents and engineers alike. Agents don’t need literary prose — they need **bounds and decision rules**. Engineers don’t need a script — they need to know where they can decide alone.\n\nAnother Anthropic habit: lots of capabilities land first as **research previews**.\n\nThat can sound like “ship half-baked work.” It isn’t lowering quality — it’s **narrowing the promise**.\n\nWhy do traditional teams ship slowly?\n\nEvery release carries a hidden contract: stable, complete, friendly to all users, documented, consistent, no obvious landmines.\n\nThat contract is heavy.\n\nSo teams enter wait mode: one more edge case, one more design pass, one more test round, one more stakeholder sync.\n\nSix months later, users still haven’t touched it.\n\nResearch previews change **expectations upfront**:\n\n*This is early. We’re still finding the shape. You can try it — it isn’t the final product.*\n\nThree effects:\n\nYou don’t have to wait for perfect.\n\nReal user signal arrives earlier.\n\nBad product directions fail faster.\n\nThe trap: research preview ≠ **irresponsible preview**.\n\nYou can’t throw junk over the wall and blame “early access” for UX debt.\n\nA healthy preview still needs three guardrails: **bounded blast radius, risk can be turned off or rolled back, and users know what stage they’re in.**\n\n**The promise can be lighter. 
Safety work cannot.**\n\nThat’s why “less process idle time” only works alongside **clearer rules** — fewer approvals, not fewer checks; thinner docs, not thinner goals; earlier ship, but still flags, monitoring, and rollback.\n\nCat noted something representative: many PMs on Claude Code have engineering backgrounds; designers ship frontend code. Product, engineering, and design are **less siloed**.\n\nThat trend will accelerate.\n\nWhen code is cheap, work that existed only to **hand off** starts to feel wasteful.\n\nPMs used to validate an idea via spec → design → eng queue → wait. Now a PM who codes can spike a prototype; an engineer with product sense can fix interaction gaps; a designer who ships UI can get to something runnable.\n\nThat doesn’t mean everyone becomes full-stack or specialties vanish.\n\n**Role is default responsibility — not the ceiling on what you can do.**\n\nYou might be a PM — but can you prototype when the team needs it? You might be an engineer — but can you flag when the problem framing is wrong?\n\nThe scarce skill isn’t typing code. It’s **product taste**.\n\nWhen everything is buildable, the expensive question becomes:\n\n**Of all the things we could build, which ones should exist?**\n\nThat’s what Cat means when she says that as code gets cheaper, **choosing what to write** gets more valuable.\n\nWithout taste, AI teams do something worse than traditional teams:\n\n**They ship the wrong things faster.**\n\nBad ideas used to die on engineering cost. Not anymore — agents will diligently implement them.\n\nIn AI-native teams, PM value isn’t scheduling or chasing status. 
It’s picking the highest-leverage slice, defining a **small but real** unit of ship, and separating noise from signal in early feedback.\n\nThat’s harder than “writing good prompts.”\n\nOne of the harder points in the interview: many AI product people **build for a future super-model**, not today’s model.\n\n“Models will catch up — ship the scrappy version; AGI will fix it.”\n\nDangerous.\n\nThe hard job is **shipping for the model you have now**:\n\nKnow where it’s strong and weak, what you can delegate, what needs guardrails, what needs external memory, task lists, or human takeover.\n\nEarly Claude Code big refactors sometimes stalled mid-flight — so the team added **todo lists** so work was decomposed and tracked. As models improved, some of that scaffolding could fade.\n\nCat’s line: **the model eats your adaptation layer.**\n\n**A lot of product design that feels essential today will matter less tomorrow** as models improve.\n\nTwo buckets:\n\n**1. Model-gap scaffolding** — todos, forced continuation, verifiers when the model won’t self-check. **Let these retire** as capability catches up.\n\n**2. Task-structural constraints** — permissions, release policy, rollback, audit trails, user promises, gradual rollout, UX commitments.\n\nThose don’t vanish because the model got smarter — strong human engineers don’t eliminate permission systems either.\n\nComplexity often comes from mixing the buckets: scaffolding that should expire gets cast in concrete, while structural safety gets waved away as “the model will figure it out.” Both hurt.\n\nMapped to practice, I’d break “daily ship” into something concrete:\n\nStep 1 — Intake: No novella. 
Goal, non-goals, principles, risks, acceptance criteria.\n\nStep 2 — Build: Agent works in an isolated worktree or sandbox — read, edit, run tests, add tests.\n\nStep 3 — Evidence, not “I’m done”: Files touched, entry points affected, tests run / not run, any changes touching auth/data/money/security, whether a flag is needed, and the rollback path.\n\nStep 4 — Review by risk tier: Low risk → lean on automation + evidence completeness. Medium/high → human eyes on architecture, UX, security.\n\nStep 5 — Merge without blind exposure: Default behind a flag; internal environments first when possible.\n\nStep 6 — Release train: A fixed daily window for changes that are merged, verified, and within risk appetite.\n\nStep 7 — Observe: Error rates, latency, key conversion, user feedback, log anomalies — not “we shipped, done.”\n\nStep 8 — Incidents: Flip the flag or roll back first; don’t spend thirty minutes in a blame meeting.\n\nAgents matter. **Clear rules at each gate matter more.**\n\nI call that bundle a Release Harness — not a single tool, but constraints on how work is sliced, how evidence is submitted, how risk is tiered, what must be automated, what must stay human, and when merge / ship / rollback is allowed.\n\nA minimal release checklist is seven fields: **goal, scope, risk tier, verification evidence, ship method, rollback method, watch metrics**. The point isn’t a pretty template — it’s that every PR can answer all seven.\n\nWithout that, “one release a day” becomes **a daily merge of opaque AI diffs** with users as QA.\n\nThat isn’t leverage. That’s **risk moved downstream again**.\n\nCommon mistake.\n\nDaily ship means **small changes can enter production safely each day** — not a finished epic every sunset.\n\nBig bets should decompose: schema → API → internal entry → flag → user-visible UI. 
Each step can ship; not each step must be user-visible.\n\nIf asks stay huge and agents only accelerate implementation, you get **bigger PRs, harder review, nastier integration**.\n\nAI amplifies how well you slice work. Slice well → parallel progress. Slice badly → a fast, scary monolith.\n\n“Here’s the full ticket — do all of it” works in demos. In production, it’s risky.\n\nAgents finish local tasks well; they don’t automatically know your release bar. They may refactor the wrong layer, add a “reasonable” compatibility shim, and bundle UI, API, tests, and config into one diff.\n\nThe PR looks productive. Reviewers suffer.\n\nI prefer task types:\n\nExplore — read-only, options, no edits\n\nImplement — bounded scope, tests required\n\nVerify — reviewer mindset, hunt risks\n\nFix — confirmed issues only, no scope creep\n\nMore stable than one agent from start to finish.\n\nWorse and common: unit tests pass, the PR claims “done,” but **no one walked the real user path** — flag missing in staging, menu hidden, role matrix gap, field mismatch on the live API.\n\n[AgentField’s writeup on ~200 autonomous agents writing production code](https://agentfield.ai/blog/beyond-vibe-coding) describes the same gap: parallel agents can leave every issue green while the merged product still fails — tests passed with mocked dependencies, every acceptance criterion met while a module stayed invisible to consumers. The system optimizes for the criteria and checks you encode, not for coherence you never specified.\n\nGreen means covered paths didn’t explode. 
It does not mean the product entry point works.\n\nEvidence needs an integration slice: where the entry point is, plus routes, config, permissions, analytics, and the end-to-end path.\n\nWithout that, you ship a half-plugged-in feature.\n\n“We want high frequency” without feature flags doesn’t add up.\n\nNo flags → unfinished work can’t live on `main` → long-lived branches → merge pain → release frequency drops.\n\nWith agents touching many surfaces at once, branch merges get uglier.\n\nFlags aren’t just “gradual rollout.” They let incomplete capability exist safely on `main`.\n\nCat’s advice for individuals applies to teams: if automation isn’t reliable, it isn’t automation — it’s a new chore.\n\n95% correct release notes that omit risks → humans re-read every PR.\n\nAuto-fix CI that patches wrong → humans re-audit every patch.\n\nAuto-verify that false-passes → humans redo manual QA.\n\n**The last 5% has to be trustworthy enough to depend on.**\n\nNot every team should ship daily.\n\nCore infrastructure, regulated financial flows, heavy compliance — **release cadence isn’t the only metric.**\n\nAnthropic’s practice is useful because it surfaces a sharper question:\n\n**After AI made code fast, did your feedback loop get fast?**\n\nIf not, you mostly get more half-done work, more PRs, more verification load, more integration risk.\n\nIf yes, the shape of the team changes: clearer goals, thinner docs, looser handoffs, faster path to trunk, hidden-by-default features, evidence that accumulates automatically, rollback that isn’t improvised.\n\nThen AI becomes **organizational R&D capacity**, not a typing sidecar.\n\nWhen I gauge maturity, I don’t count agents, LOC, or open PRs.\n\nI ask three questions at end of day:\n\nWhat actually reached production today?\n\nWhy was that safe?\n\nIf we were wrong, how fast can we revert?\n\nTeams that answer those three consistently are the ones that can talk about shipping every day without gambling.\n\nApplication delivery is only half the picture when agents 
touch **schema, tenants, replication, or ops runbooks**. The same rules apply: thin goals, explicit evidence, flags or staged rollout, and rollback you have practiced — not “the agent said the cluster is fine.”\n\nAt **OceanBase**, we see the same bottleneck shift in the open-source community: AI speeds up how people generate deploy scripts and config, but production still depends on verification, integration, and recovery — especially for distributed databases where a green unit test does not prove a safe cutover.\n\nIf you are experimenting with agent-assisted database work, here are three places to start:\n\n[oceanbase-skills](https://github.com/oceanbase/oceanbase-skills) (including [oceanbase-deploy](https://github.com/oceanbase/oceanbase-skills/tree/master/skills/oceanbase-deploy)): skills that wrap deployment, tenant management, and benchmarks in governed, repeatable flows — not one-off prompts.\n\n[OceanBase documentation](https://en.oceanbase.com/docs) — canonical steps for install, upgrade, and operations so agents and humans share the same source of truth.\n\nContributions and feedback — open an issue or PR on [GitHub](https://github.com/oceanbase/oceanbase) if you are hardening a Release Harness that includes the data plane; we are interested in what breaks when coding gets 10× faster but cluster validation does not.\n\n**Your move**: Pick one change you shipped (or almost shipped) in the last month. 
Ask whether evidence covered app + data + rollback — not just “tests passed.” If the data path was hand-waved, that is your downstream bottleneck.", "url": "https://wpnews.pro/news/ai-writes-code-faster-why-hasnt-delivery", "canonical_source": "https://dev.to/seekdb/ai-writes-code-faster-why-hasnt-delivery-3fgj", "published_at": "2026-05-27 15:59:00+00:00", "updated_at": "2026-05-27 16:11:46.371635+00:00", "lang": "en", "topics": ["artificial-intelligence", "ai-products", "ai-tools", "ai-agents", "generative-ai"], "entities": ["Cat Wu", "Anthropic", "OceanBase"], "alternates": {"html": "https://wpnews.pro/news/ai-writes-code-faster-why-hasnt-delivery", "markdown": "https://wpnews.pro/news/ai-writes-code-faster-why-hasnt-delivery.md", "text": "https://wpnews.pro/news/ai-writes-code-faster-why-hasnt-delivery.txt", "jsonld": "https://wpnews.pro/news/ai-writes-code-faster-why-hasnt-delivery.jsonld"}}