AI Writes Code Faster. Why Hasn’t Delivery?

AI code generation has dramatically accelerated the coding step, but software delivery speeds remain bottlenecked by review, testing, CI/CD, and release processes that haven't kept pace. According to an engineer at OceanBase, when only the "writing code" step speeds up while validation, integration, and recovery stay the same, organizations simply accumulate more PRs waiting for review and features waiting to ship rather than achieving true 10× delivery. The real shift requires changing how teams set goals, verify code, ship features, and contain risk — otherwise AI multiplies congestion downstream rather than actual delivery velocity.

The bottleneck didn’t disappear. It moved downstream. When coding gets 10× faster but review, CI, and release don’t, you don’t get 10× delivery. You get a longer queue behind the keyboard. If your stack includes a database layer, that queue often shows up twice: once in application CI, and again in migrations, backups, and failover. The closing section links how we think about that at OceanBase.

I recently watched an interview with Cat Wu (https://www.youtube.com/watch?v=PplmzlgE0kg) on how Anthropic’s product team went from shipping a feature every few months to every few weeks, sometimes days, and for small slices of work, even within a single day. Our team has been having a parallel conversation: what concrete practices actually turn AI speed into delivery speed?

My main takeaway isn’t “AI writes code scary fast.” That’s the shallow read. Lots of teams now claim delivery is 10× or 50× faster with AI. I’m skeptical, because one thing gets conflated all the time:

Faster code generation is not the same thing as faster software delivery.

An agent can draft a patch in ten minutes. Sure. But whether that patch can land on main, be validated, reach real users, and be debugged or rolled back when it breaks is a different system entirely. If only the “writing code” step speeds up while review, testing, release, monitoring, and rollback stay the same, “50× faster” is often a local illusion: one part of the pipe got hot; the org is still stuck.

Most teams, in my view, aren’t at the “quantity becomes quality” inflection yet. The real shift isn’t “everyone runs more coding agents.” It’s that how you set goals, verify code, ship features, and contain risk has to change. Otherwise AI doesn’t multiply delivery. It multiplies PRs waiting for review, features waiting for validation, and branches waiting to merge.

As CircleCI (https://circleci.com/landing-pages/assets/2026-state-of-software-delivery-report.pdf) puts it, success in the AI era is “no longer determined by how fast code can be written”; the decisive factor is whether you can “validate, integrate, and recover at scale.” For most teams, that’s a sharper framing than asking whether AI can generate code at all.

What’s more interesting: Cat has said outright that new internal models weren’t the main driver of faster iteration. The lever was process: how goals are set, how docs are written, how previews ship, how cross-functional work runs, and who has authority to put something in front of users. Shipping a feature a day isn’t about whether the model can code. It’s about whether the org removed what used to slow releases down.

Here’s the thesis up front:

When code gets cheap, the expensive thing becomes judgment.

Judgment about what’s worth building, how good is good enough to ship, where a human must decide, and where an agent can run to completion.

So “AI-native” speed isn’t “everyone uses AI to write code.” Two things have to happen together:

- Less idle motion in the process.
- More clarity in the rules.

Less idle motion means fewer documents, handoffs, approvals, and waits that exist only because “that’s how we’ve always worked.” More clarity means goals, evidence, verification, permissions, release, and rollback are specified, not reinvented in every hallway conversation. Only when both are true does “one release per day” stop being a gamble.

Seeing Anthropic ship often, many people jump to: strong models, engineers on Claude Code, therefore fast coding. That helps. It isn’t the main story. If coding is the only thing that speeds up, what do you get?

- More PRs waiting for review
- More features waiting for validation
- More half-finished branches
- More production risk
- More arguments about whether this is shippable at all

That’s not faster delivery. It’s congestion moved downstream from typing to everything after typing.

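To make the congestion concrete, here is a minimal back-of-the-envelope sketch in Python. It assumes review capacity stays fixed while the rate of opened PRs jumps tenfold. The function names and all numbers are made up for illustration; they do not come from the interview or the reports cited above.

```python
def merged_per_day(opened_per_day: float, reviewed_per_day: float) -> float:
    """Steady-state delivery rate: the slower stage sets the pace."""
    return min(opened_per_day, reviewed_per_day)


def review_backlog(opened_per_day: float, reviewed_per_day: float, days: int) -> list[float]:
    """Number of PRs still waiting for review at the end of each day."""
    backlog = 0.0
    history = []
    for _ in range(days):
        backlog += opened_per_day                  # new PRs arrive
        backlog -= min(backlog, reviewed_per_day)  # reviewers clear what they can
        history.append(backlog)
    return history


# Hypothetical team: reviewers handle 6 PRs a day, before and after AI.
before = review_backlog(opened_per_day=5, reviewed_per_day=6, days=10)
after = review_backlog(opened_per_day=50, reviewed_per_day=6, days=10)

print("waiting after 10 days (before AI):", before[-1])
print("waiting after 10 days (with AI):  ", after[-1])
print("merged per day (with AI):         ", merged_per_day(50, 6))
```

In this toy model, PRs are opened ten times faster, yet the merge rate only rises from 5 to 6 per day (the review cap) while the waiting queue grows by 44 PRs every day. Real pipelines have more stages, but the same logic applies to each of them.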
A routine feature might have taken days from kickoff to first PR. Now a crisp small ask can get a first diff in minutes. But “code exists” is a short leg of the journey. After that comes a whole chain:

- Should we build this at all?
- Is this the right shape?
- Who owns blast radius?
- Is test evidence enough?
- Can we spin a preview?
- Should this sit behind a flag?
- How do we roll back?

When coding was slow, that chain was masked by coding time. Once code takes thirty minutes while CI still takes twenty, review still takes half a day, and QA is still queued, the backlog becomes visible.

Qovery’s piece on AI and DevOps in 2026 (https://www.qovery.com/blog/ai-devops-2026-cicd-pipeline-bottleneck) makes the same point from the platform side: when AI coding tools explode code throughput, CI/CD, environment provisioning, and deployment pipelines, not typing speed, become the constraint. As they put it, the bottleneck has flipped: less time coding, more time waiting on builds, previews, and deploys.

So I’ve stopped framing R&D efficiency as lines of code per hour. A better line:

AI compresses the coding segment and forces you to face the real system bottleneck.

The lesson from Anthropic’s public interviews isn’t simply “they had a better model.” Cat Wu has said internal model use raised shipping speed only “a little bit,” with “the bulk of the increase” coming from process and team expectations. When code gets cheaper to write, both Wu and Mike Krieger describe bottlenecks shifting to deciding what to build, merge queues, CI, and the other steps that used to hide behind slow coding.

In classic product development, the PRD (product requirements document) often functions as comfort food. Longer feels more professional. More edge cases upfront feels more in control. Before engineering starts, you get fifteen pages of background, scope, flows, exceptions, competitors, and timeline.

In AI-native teams, that pattern ages badly.
Not because docs don’t matter, but because long docs often disguise decisions as description. What matters isn’t pre-specifying every button and edge case. It’s being explicit about goal, principles, and how you’ll know it worked.

From Cat’s interview: routine features don’t need a novel-length PRD; large infrastructure bets still might. Many features need one page (goal, principles, metrics), and then people with context make local calls.

That’s a real role change. PM value used to be: “I specified it completely; engineering executes.” Now it’s closer to: “I made the goal and tradeoffs legible so people with context can decide quickly.”

That matters more with agents in the loop. If every micro-decision waits on PM, you’re back to pre-AI cadence. If the goal is fuzzy, agents and humans will sprint in the wrong direction together.

A one-pager isn’t laziness. It should answer:

- Goal: What user problem are we improving?
- Non-goals: What are we explicitly not solving this round?
- Principles: When tradeoffs appear, what wins?
- Success signals: What observable outcomes mean “keep going”?
- Risks: What must never be touched without a human? What needs explicit sign-off?

That beats twenty pages of “looks complete” for agents and engineers alike. Agents don’t need literary prose; they need bounds and decision rules. Engineers don’t need a script; they need to know where they can decide alone.

Another Anthropic habit: lots of capabilities land first as research previews. That can sound like “ship half-baked work.” It isn’t lowering quality. It’s narrowing the promise.

Why do traditional teams ship slowly? Every release carries a hidden contract: stable, complete, friendly to all users, documented, consistent, no obvious landmines. That contract is heavy. So teams enter wait mode: one more edge case, one more design pass, one more test round, one more stakeholder sync. Six months later, users still haven’t touched it.

Research previews change expectations upfront:

- This is early.
- We’re still finding the shape.
- You can try it, but it isn’t the final product.

Three effects:

- You don’t have to wait for perfect.
- Real user signal arrives earlier.
- Bad product directions fail faster.

The trap: research preview ≠ irresponsible preview. You can’t throw junk over the wall and blame “early access” for UX debt. A healthy preview still needs three guardrails: bounded blast radius, risk that can be turned off or rolled back, and users who know what stage they’re in.

The promise can be lighter. Safety work cannot. That’s why “less process idle time” only works alongside clearer rules: fewer approvals, not fewer checks; thinner docs, not thinner goals; earlier ship, but still flags, monitoring, and rollback.

Cat noted something representative: many PMs on Claude Code have engineering backgrounds; designers ship frontend code. Product, engineering, and design are less siloed. That trend will accelerate. When code is cheap, work that existed only to hand off starts to feel wasteful.

PMs used to validate an idea via spec → design → eng queue → wait. Now a PM who codes can spike a prototype; an engineer with product sense can fix interaction gaps; a designer who ships UI can get to something runnable.

That doesn’t mean everyone becomes full-stack or specialties vanish. Role is default responsibility, not the ceiling on what you can do. You might be a PM, but can you prototype when the team needs it? You might be an engineer, but can you flag when the problem framing is wrong?

The scarce skill isn’t typing code. It’s product taste. When everything is buildable, the expensive question becomes:

Of all the things we could build, which ones should exist?

That’s what Cat means when she says that as code gets cheaper, choosing what to write gets more valuable. Without taste, AI teams do something worse than traditional teams:

They ship the wrong things faster.

Bad ideas used to die on engineering cost. Not anymore: agents will diligently implement them.

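One way to keep wrong-direction work from starting at all is to make the one-pager described earlier (goal, non-goals, principles, success signals, risks) something an agent or engineer can check before picking up a task. The sketch below is only an illustration under that assumption: the class name, field names, and the idea of tagging "human-only areas" are hypothetical, not a format used by Anthropic or OceanBase.

```python
from dataclasses import dataclass, field


@dataclass
class OnePager:
    """Hypothetical structured form of the one-page brief."""
    goal: str = ""
    non_goals: list[str] = field(default_factory=list)
    principles: list[str] = field(default_factory=list)
    success_signals: list[str] = field(default_factory=list)
    # Risks section: areas that must not change without a human, e.g. "billing".
    human_only_areas: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of sections that are still empty (this sketch requires all five)."""
        sections = {
            "goal": self.goal,
            "non_goals": self.non_goals,
            "principles": self.principles,
            "success_signals": self.success_signals,
            "human_only_areas": self.human_only_areas,
        }
        return [name for name, value in sections.items() if not value]

    def needs_human(self, touched_areas: list[str]) -> list[str]:
        """Touched areas that the brief reserves for explicit human sign-off."""
        return sorted(set(touched_areas) & set(self.human_only_areas))


brief = OnePager(
    goal="Reduce failed CSV imports for new users",
    non_goals=["Supporting Excel files"],
    principles=["Prefer a clear error over a silent partial import"],
    success_signals=["Import failure rate drops week over week"],
    human_only_areas=["auth", "billing"],
)

print(brief.missing_sections())
print(brief.needs_human(["import_ui", "billing"]))
```

The value is less in the class than in the two questions it forces: is anything in the brief still blank, and does this change touch something a human must decide?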
In AI-native teams, PM value isn’t scheduling or chasing status. It’s picking the highest-leverage slice, defining a small but real unit of ship, and separating noise from signal in early feedback. That’s harder than “writing good prompts.”

One of the harder points in the interview: many AI product people build for a future super-model, not today’s model. “Models will catch up, so ship the scrappy version; AGI will fix it.” Dangerous.

The hard job is shipping for the model you have now: know where it’s strong and weak, what you can delegate, what needs guardrails, and what needs external memory, task lists, or human takeover.

Early on, big refactors in Claude Code sometimes stalled mid-flight, so the team added todo lists to decompose and track the work. As models improved, some of that scaffolding could fade. Cat’s line: the model eats your adaptation layer. A lot of product design that feels essential today will matter less tomorrow as models improve.

Two buckets:

1. Model-gap scaffolding: todos, forced continuation, verifiers when the model won’t self-check. Let these retire as capability catches up.
2. Task-structural constraints: permissions, release policy, rollback, audit trails, user promises, gradual rollout, UX commitments. Those don’t vanish because the model got smarter; strong human engineers don’t eliminate permission systems either.

Complexity often comes from mixing the buckets: scaffolding that should expire gets cast in concrete, while structural safety gets waved away as “the model will figure it out.” Both hurt.

Mapped to practice, I’d break “daily ship” into something concrete:

- Step 1: Intake. No novella. Goal, non-goals, principles, risks, acceptance.
- Step 2: Build. The agent works in an isolated worktree or sandbox: read, edit, run tests, add tests.
- Step 3: Evidence, not “I’m done.” Files touched, entry points affected, tests run and not run, any touches on auth, data, money, or security, whether a flag is needed, and the rollback path.
- Step 4: Review by risk tier. Low risk → lean on automation plus evidence completeness. Medium or high → human eyes on architecture, UX, security.
- Step 5: Merge without blind exposure. Default behind a flag; internal environments first when possible.
- Step 6: Release train. A fixed daily window for changes that are merged, verified, and within risk appetite.
- Step 7: Observe. Error rates, latency, key conversion, user feedback, log anomalies, not “we shipped, done.”
- Step 8: Incidents. Flip the flag or roll back first; don’t spend thirty minutes in a blame meeting.

Agents matter. Clear rules at each gate matter more. I call that bundle a Release Harness: not a single tool, but constraints on how work is sliced, how evidence is submitted, how risk is tiered, what must be automated, what must stay human, and when merge, ship, or rollback is allowed.

A minimal release checklist is seven fields: goal, scope, risk tier, verification evidence, ship method, rollback method, watch metrics. The point isn’t a pretty template. It’s that every PR can answer them.

Without that, “one release a day” becomes a daily merge of opaque AI diffs with users as QA. That isn’t leverage. That’s risk moved downstream again.

A common mistake is to read “daily ship” as a finished epic every sunset. It means small changes can enter production safely each day. Big bets should decompose: schema → API → internal entry → flag → user-visible UI. Each step can ship; not each step must be user-visible.

If asks stay huge and agents only accelerate implementation, you get bigger PRs, harder review, nastier integration. AI amplifies how well you slice work. Slice well → parallel progress. Slice badly → a fast, scary monolith.

“Here’s the full ticket, do all of it” works in demos. In production, it’s risky. Agents finish local tasks well; they don’t automatically know your release bar. They may refactor the wrong layer, add a “reasonable” compatibility shim, and bundle UI, API, tests, and config into one diff.
The PR looks productive. Reviewers suffer.

I prefer task types:

- Explore: read-only, options, no edits
- Implement: bounded scope, tests required
- Verify: reviewer mindset, hunt risks
- Fix: confirmed issues only, no scope creep

That’s more stable than one agent from start to finish.

Worse, and common: unit tests pass, the PR claims “done,” but no one walked the real user path. The flag is missing in staging, the menu is hidden, the role matrix has a gap, or a field mismatches the live API.

AgentField’s writeup on ~200 autonomous agents writing production code (https://agentfield.ai/blog/beyond-vibe-coding) describes the same gap: parallel agents can leave every issue green while the merged product still fails. Tests passed with mocked dependencies; every acceptance criterion was met while a module stayed invisible to consumers. The system optimizes for the criteria and checks you encode, not for coherence you never specified.

Green means the covered paths didn’t explode, not that the product entry point works. Evidence needs an integration slice: where’s the entry, routes, config, permissions, analytics, end-to-end path. Without that, you ship a half-plugged-in feature.

“We want high frequency” without feature flags doesn’t add up. No flags → unfinished work can’t live on main → long-lived branches → merge pain → release frequency drops. With agents touching many surfaces at once, branch merges get uglier. Flags aren’t just “gradual rollout.” They let incomplete capability exist safely on main.

Cat’s advice for individuals applies to teams: if automation isn’t reliable, it isn’t automation. It’s a new chore.

- Release notes that are 95% correct but omit risks → humans re-read every PR.
- Auto-fix CI that patches wrong → humans re-audit every patch.
- Auto-verify that false-passes → humans redo manual QA.

The last 5% has to be trustworthy enough to depend on.

Not every team should ship daily. For core infrastructure, regulated financial flows, and heavy compliance, release cadence isn’t the only metric.

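For teams that do want the daily cadence, the seven-field release checklist and the review-by-risk-tier rule described earlier can be turned into a simple gate. The following is a minimal sketch, assuming a PR carries its checklist as structured data. The type names, the `may_join_release_train` function, and the rule that anything above low risk needs a recorded human approval are illustrative assumptions, not an existing tool or anyone's actual policy.

```python
from dataclasses import dataclass, fields
from enum import Enum


class RiskTier(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass
class ReleaseChecklist:
    """The seven fields every PR should be able to answer."""
    goal: str
    scope: str
    risk_tier: RiskTier
    verification_evidence: list[str]  # e.g. tests run, end-to-end path walked
    ship_method: str                  # e.g. "behind flag, internal first"
    rollback_method: str              # e.g. "flip flag off"
    watch_metrics: list[str]          # e.g. error rate, latency


def unanswered(checklist: ReleaseChecklist) -> list[str]:
    """Names of checklist fields that were left empty."""
    return [f.name for f in fields(checklist) if not getattr(checklist, f.name)]


def may_join_release_train(checklist: ReleaseChecklist, human_approved: bool) -> tuple[bool, str]:
    """Decide whether a change may ship in today's window, with the reason."""
    missing = unanswered(checklist)
    if missing:
        return False, "missing: " + ", ".join(missing)
    # Assumed policy: low risk leans on automation; everything else needs a human.
    if checklist.risk_tier is not RiskTier.LOW and not human_approved:
        return False, "needs human review"
    return True, "ok"


change = ReleaseChecklist(
    goal="Show import errors inline",
    scope="import UI only",
    risk_tier=RiskTier.MEDIUM,
    verification_evidence=["unit tests", "walked import flow in staging"],
    ship_method="behind flag, internal users first",
    rollback_method="flip flag off",
    watch_metrics=["import error rate"],
)

print(may_join_release_train(change, human_approved=False))
print(may_join_release_train(change, human_approved=True))
```

The gate is deliberately boring: it refuses anything that cannot name its rollback or its evidence, and it only lets automation carry the low-risk tier on its own.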
Anthropic’s practice is useful because it surfaces a sharper question:

After AI made code fast, did your feedback loop get fast?

If not, you mostly get more half-done work, more PRs, more verification load, more integration risk. If yes, the shape of the team changes: clearer goals, thinner docs, looser handoffs, faster path to trunk, hidden-by-default features, evidence that accumulates automatically, rollback that isn’t improvised. Then AI becomes organizational R&D capacity, not a typing sidecar.

When I gauge maturity, I don’t count agents, LOC, or open PRs. I ask three questions at end of day:

- What actually reached production today?
- Why was that safe?
- If we were wrong, how fast can we revert?

Teams that answer those three consistently are the ones that can talk about shipping every day without gambling.

Application delivery is only half the picture when agents touch schema, tenants, replication, or ops runbooks. The same rules apply: thin goals, explicit evidence, flags or staged rollout, and rollback you have practiced, not “the agent said the cluster is fine.”

At OceanBase, we see the same bottleneck shift in the open-source community: AI speeds up how people generate deploy scripts and config, but production still depends on verification, integration, and recovery, especially for distributed databases where a green unit test does not prove a safe cutover.

If you are experimenting with agent-assisted database work, three places to start:

- oceanbase-skills (https://github.com/oceanbase/oceanbase-skills), including oceanbase-deploy (https://github.com/oceanbase/oceanbase-skills/tree/master/skills/oceanbase-deploy): skills that wrap deployment, tenant management, and benchmarks in governed, repeatable flows, not one-off prompts.
- OceanBase documentation (https://en.oceanbase.com/docs): canonical steps for install, upgrade, and operations so agents and humans share the same source of truth.
- Contributions and feedback: open an issue or PR on GitHub (https://github.com/oceanbase/oceanbase) if you are hardening a Release Harness that includes the data plane; we are interested in what breaks when coding gets 10× faster but cluster validation does not.

Your move: Pick one change you shipped or almost shipped in the last month. Ask whether evidence covered app + data + rollback, not just “tests passed.” If the data path was hand-waved, that is your downstream bottleneck.