Why Coding Stays in Human-AI Collaboration: A Paradox in Stanford's 51 Deployments

"We rolled out AI and saw no results" and "AI made our development dramatically faster" are being said in the same year, often inside the same company. Where does that gap come from?

Stanford Digital Economy Lab's The Enterprise AI Playbook: Lessons from 51 Successful Deployments (April 2026) goes after that question with real data. It analyzes 51 production deployments across 41 organizations and 9 industries, drawing on structured interviews and internal documents to separate what made deployments succeed from what made them fail.

Most of the coverage so far reads the report from a management angle: AI adoption as an organizational-change problem, the importance of process redesign and executive commitment. That framing is accurate. But the report also spans customer support, software engineering, marketing, and more, and there is plenty in there about software engineering that the management-focused takes barely touch.

Read it with an engineer's eye and one paradox jumps out. While customer support and IT operations move toward autonomous AI, coding alone stays in "human-AI collaboration." That runs against the prevailing mood that "AI coding is the frontier."

This post starts from that paradox. First I'll walk through the report's method and key findings, then analyze the structure that keeps coding in collaboration, and finally re-read the 51 cases from three vantage points: the individual engineer, the engineering lead, and whoever owns org-level development. I'll stay close to the report's findings and then push past them.

A quick look at the study first, so the interpretation later lands properly.

The authors are Elisa Pereira, Alvin Wang Graylin, and Erik Brynjolfsson. Brynjolfsson is one of the most-cited researchers in the economics of information, known for early work measuring the productivity effects of IT investment. The "Productivity J-Curve" that Brynjolfsson and colleagues laid out in 2021 is one of the foundations of this study.

The J-Curve goes like this. A general-purpose technology like AI doesn't raise productivity just by being deployed. It needs complementary investment in intangibles: process redesign, training, reorganization. During that investment, productivity actually dips. Only once the investment pays off does productivity jump. The curve dips into a trough before springing up, hence the "J." The report's recurring message that organization matters more than technology rests on this premise.

The study only looks at deployments that moved past the pilot stage into production and produced measurable business value. The selection criteria were:

Interviews ran from August 2025 to February 2026, with at least one 60-minute structured interview per company, supplemented by internal metrics, project plans, and financial documents. The sample skews toward manufacturing, financial services, and technology.

The conclusion is simple. Using the same technology for the same purpose, outcomes varied widely by organization. What made the difference was not the AI model. It was how prepared the organization was, what processes it had, how its leaders engaged, and whether it had a culture that tolerated failure.

The findings most relevant to an engineering organization:

Worth stressing: this is a study of successful deployments only. The report is explicit about the selection bias. Companies were asked about past failures and abandoned pilots too, but what ends up analyzed are the cases that created value.

So this study shows "what success looks like and what it takes to get there," not "how common success is." The report cites MIT's NANDA initiative study from 2025, "The GenAI Divide: State of AI in Business 2025" (which reported that 95% of generative-AI pilots produced no measurable financial impact), and positions itself as the inverse: a deep look at the side that succeeded. Read it with that asymmetry in mind.

Here's the core. Chapter 3 of the report has a table organizing human-in-the-loop (HITL) involvement by business function. Read it as an engineer and the table feels off.

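The report splits HITL into three levels: escalation (a human looks only at exceptions), approval (a human signs off on each output), and collaboration (humans and AI work through each task together).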

Autonomy is highest at escalation and lowest at collaboration. By function:

Function	HITL level	Median productivity gain
IT operations	Escalation	90%
Customer support	Escalation	71%
Claims processing	Escalation	50%
Field service	Approval	80%
Clinical documentation	Approval	66%
Coding	Collaboration	54%

(from Chapter 3, "How much human oversight is optimal?")

Coding is the only function in the collaboration tier. Clinical documentation sits in approval because medical records are legal documents a physician has to sign off on, one by one. Claims processing and customer support can move to escalation because they're high-volume, have clear success criteria, and tolerate recoverable mistakes.

So why does coding stay in collaboration? No regulation pins AI down here. And yet humans and AI keep working task by task.

The report describes the change on the coding floor like this: rather than completing a whole task themselves, engineers increasingly review AI-generated changes, make small adjustments, and merge the PR. At one Latin American fintech, AI agents migrated millions of lines of legacy code in a system serving 100M+ customers, compressing work originally estimated at 18 months and 1,000+ people into a few weeks. At an insurer, a legacy rebuild scoped at 5,000 hours, 7 people, and a 2027 finish was done in 600 hours with 3 people.
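As a back-of-envelope check on the insurer figures above, the hours alone work out to roughly an 88% cut in effort. This sketch assumes the two hour counts measure the same kind of total effort, which is not spelled out here; the function name is illustrative.

```python
def effort_reduction(baseline_hours: float, actual_hours: float) -> float:
    """Fraction of the originally scoped effort that was saved."""
    return 1 - actual_hours / baseline_hours


# Insurer legacy rebuild: scoped at 5,000 hours, delivered in 600.
saved = effort_reduction(5_000, 600)
speedup = 5_000 / 600

print(f"effort saved: {saved:.0%}")    # effort saved: 88%
print(f"speed-up: {speedup:.1f}x")     # speed-up: 8.3x
```

That is a single case, so it sits well above the 54% median for coding in the table.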

So coding isn't "not getting faster." The role moved from writing to reviewing, and productivity is up 54%. It just hasn't reached the full autonomy other functions have. There's a structural reason.

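The report lists four conditions under which agentic AI delivers: high-volume, repetitive work; clear success criteria; recoverable errors; and data access across multiple systems.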

Hold coding up against these four and the reason it stays in collaboration comes into view.

Procurement and alert triage cleanly satisfy all four. High volume, a clear right/wrong, recoverable mistakes. So they move to full autonomy.

Coding? Routine refactors, test generation, dependency bumps tend to satisfy the four. But feature work and architectural change break them. "Tests pass" isn't a sufficient success criterion when readability, maintainability, and fit with existing design are in play. And production migrations or schema changes can produce unrecoverable errors. Two of the conditions, "clear success criteria" and "recoverable errors," fail across a wide swath of coding.

Per the METR measurements the report cites (METR is a research org that measures AI autonomy), the length of software tasks frontier models can complete autonomously has been doubling roughly every 7 months, reaching about 15 human-expert-hours in early 2026. Anthropic, meanwhile, warns that around the 3.5-hour mark, API success rates drop below 50%. Coding agents that run autonomously for days and emit tens of thousands of lines are no longer rare, but production reliability falls off as tasks get longer and more complex. That is exactly why engineer review, the human part of the loop, still governs quality.
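To make the doubling claim concrete, here is a naive extrapolation. It assumes the 7-month doubling simply continues from the roughly 15-hour figure for early 2026; that is an assumption for illustration, not a forecast from METR or the report, and the function name is made up.

```python
def task_horizon_hours(months_ahead: float,
                       base_hours: float = 15.0,
                       doubling_months: float = 7.0) -> float:
    """Autonomous task length after `months_ahead`, under constant doubling."""
    return base_hours * 2 ** (months_ahead / doubling_months)


# Assumed starting point: ~15 expert-hours, doubling every 7 months.
for months in (0, 7, 14, 21):
    print(f"+{months} months: ~{task_horizon_hours(months):.0f} expert-hours")
```

Under those assumptions the horizon goes 15, 30, 60, 120 hours. The length of task a model can finish is still not the length it finishes reliably, which is the gap the paragraph above points at.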

One step further. Coding stays in collaboration not because AI is weak or engineers are behind, but because software engineering already has a deeply layered culture of verification.

Type systems, unit tests, code review, CI, static analysis, canary releases. Engineering spent decades building a culture that distrusts even human-written code and puts it through layers of verification before production. Adding review on top of AI-written code is the most natural extension of that culture. The flip side: full autonomy (escalation) collides head-on with that verification culture. "A human reviews only 20% of samples" works for alert triage, but "review only 20%" of production code runs against most engineers' instincts. Coding stays in collaboration partly because of a technical limit and partly because engineering, as a discipline, is built around verification.

Seen this way, HITL level isn't a simple matter of "as models improve, things automatically advance to escalation." The stronger the verification culture in a domain, the longer collaboration persists. Coding's path to autonomy depends not just on model performance but on how much of the verification you can hand to AI itself: specifically, how you design the layer where AI writes tests and AI reviews. Whether review itself can be handed to AI is something I'll come back to later.

What coding-stays-in-collaboration means depends on where you stand. Three vantage points: the individual engineer, the engineering lead, and whoever owns org-level development. The same study hands each a different assignment.

The change the report describes means an individual engineer's daily work is already shifting. Instead of writing from scratch, you read AI-generated changes, judge them, adjust, and merge. Time spent writing code gives way to time spent evaluating code.

This is where the nature of the collaboration model bites. Under escalation, a human only looks at exceptions. Under collaboration, human judgment governs quality on every single output. Let review get sloppy and defects in generated code flow straight to production.

The awkward part: review gets harder as AI gets better. The more "plausible" the output looks, the more humans skip the details. Automation complacency, the long-known phenomenon in aviation and process industries where over-trusting an automated system erodes attention, shows up in code review too. Obviously wrong code is easy to catch; code that's 90% right with a subtle 10% wrong slips past a skim. Coding's 54%, one of the more modest gains in the table (only claims processing, at 50%, is lower), can be read partly as that "cost of review" offsetting the productivity gain.

The question for the individual engineer is where to draw the line between what to delegate to AI and what to keep under your own judgment. The data says coding is still in the collaboration stage, meaning human judgment is directly tied to quality. Two habits become preconditions for productivity at this stage: not taking AI output at face value, and not treating review as a formality. The direction of skill shifts too. The ability to spot defects in code that others (or AI) wrote, and to read the intent behind a design, grows more important relative to the ability to write code from zero.

For a lead, the job becomes designing which development tasks run at which HITL level. The report frames HITL choice as determined by error tolerance, regulatory requirements, and task complexity. That maps directly onto a team's design guidance. The four conditions for agentic success (high-volume/repetitive, clear success criteria, recoverable errors, multi-system data access) double as a way to sort development tasks. Tied to practice, roughly:

If you leave that sorting vague and default to either "let AI do everything" or "humans review everything," the former invites incidents and the latter caps productivity. Varying the autonomy level by the nature of the task is where a lead earns their keep.
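One way to picture that sorting is a small rule that maps a task's answers on the four conditions to a HITL level. This is a sketch under my own assumptions: the `Task` fields, the `hitl_level` rule, and the example tasks are hypothetical, and the report does not prescribe this mapping.

```python
from dataclasses import dataclass
from enum import Enum


class HITL(Enum):
    COLLABORATION = "collaboration"  # human works through every output
    APPROVAL = "approval"            # human signs off before it ships
    ESCALATION = "escalation"        # human sees only exceptions


@dataclass(frozen=True)
class Task:
    name: str
    high_volume: bool      # repetitive, high-volume work
    clear_criteria: bool   # success can be checked mechanically
    recoverable: bool      # mistakes can be rolled back
    multi_system: bool     # needs data from several systems


def hitl_level(task: Task) -> HITL:
    # Assumed rule: failing either of the two conditions coding most
    # often breaks sends the task back to collaboration.
    if not (task.clear_criteria and task.recoverable):
        return HITL.COLLABORATION
    # All four conditions hold: treat as a candidate for escalation.
    if task.high_volume and task.multi_system:
        return HITL.ESCALATION
    return HITL.APPROVAL


tasks = [
    Task("dependency bump", True, True, True, True),
    Task("one-off config change", False, True, True, False),
    Task("schema migration", False, True, False, True),
    Task("new feature design", False, False, True, True),
]
for t in tasks:
    print(f"{t.name}: {hitl_level(t).value}")
```

The point of writing it down, even this crudely, is that the team argues about the rule once instead of re-deciding the autonomy level on every pull request.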

There's another important finding. In 42% of cases model choice was interchangeable, and the durable competitive advantage was not the underlying large model (the foundation model) itself but the design of how you combine and use it, the orchestration layer. Which model you use is becoming a commodity for many use cases. What separates teams is how you decompose tasks, where you insert human involvement, and how you wire up multiple tools and data sources. In the report's words, the advantage isn't the model but how you compose it.

This is actually good news for the people doing the designing. You don't have to win the race of chasing the latest model; you can build an edge on the quality of task decomposition, HITL design, and tool integration. How you build the verification layer, the multi-stage scheme where AI writes tests and AI reviews, is exactly this orchestration-layer question, and it's the key to moving coding from collaboration to the next stage.

At the org-development level, the question is what happens after productivity rises. The report shows, with concrete cases, that what you do with the freed capacity is decided by organizational choice, not technology.

The report lays out three strategies for that capacity:

At one PE-owned company, an 88% productivity gain in coding led to cutting the development team from 7 to 3. At an edtech company, a 20-30% improvement in coding went not to layoffs but to accelerating the roadmap: with a large product backlog, shipping features faster was worth more than trimming the team. The report notes growth-stage companies lean toward acceleration, while cost-focused ownership (PE, turnaround) leans toward reduction. The same productivity gain can tip toward reduction or acceleration. What decides is not technology but organizational strategy.

A second case, security operations, reads as a model of redeployment. A 6-person SOC (security operations center) team at one tech company was buried under 1,500 alerts a month. After automating first-pass triage with AI, the required headcount dropped to the equivalent of 1.5 people, but no one was laid off. The freed 4.5 FTE were redeployed to proactive threat hunting, security-design review, and team skill-building. The executive who led it put it this way: "AI isn't replacing the person you have; it's replacing the person you don't need to hire." In areas like SRE and security, chronically understaffed with a backlog of "things we want to do but can't," redeployment rather than reduction is the natural choice.

Org-development also has to think about how to bring cautious departments along. Per the report, the most cautious about AI adoption were not frontline users (23%) but staff functions like legal, HR, risk, and compliance (35%). Different groups care about different things: executives want ROI they can see in numbers, staff functions worry about procedural risk and where the blame lands, and frontline staff fear losing their jobs. Each needs a different move. Spreading AI inside an engineering org likewise means looking past the dev team's own walls. How you bring legal, security, and HR along shapes how fast you can roll out. The report has several cases where handing those staff functions a governance role turned the cautious departments into active champions.

Executive involvement comes in stages too, the report says. Hands-on engagement, checking progress weekly and clearing blockers, accounted for 58% of the successful cases. The 7 cases that reached company-wide transformation all wired AI adoption into a corporate OKR and tied it to evaluation and compensation. For an engineering leader as well, the condition for making it stick is to connect AI use to organizational goals rather than leaving it as a "clever trick on the floor." And one more thing the report stresses repeatedly: a culture that doesn't punish failure. 61% of successful cases had a prior failure, and in none of the cases studied was anyone punished for a failed AI project. Since putting AI into production presupposes trial and error, building a culture that can forgive failure is one of the most important jobs at the org-development level.

I've walked through the report's findings. But what's in the report isn't enough on its own. To carry it into practice, a few things need reading-in.

First, selection bias. As noted, this is a study of successful deployments. The report itself admits it doesn't show "how common success is." Take it as a description of patterns shared by organizations that succeeded, not a guarantee that the same approach will work. MIT's "95% fail" number and this study's "51 that succeeded" are the same phenomenon seen from opposite sides. Only by overlaying both do you get the full picture.

Second, the handling of reliability. As one critique points out, the report claims "messy data isn't a blocker if you design around it," yet flags reliability problems in 27% of cases while never once using the word "hallucination."

To an engineer, "you can design around it" doesn't quite hold for messy data and model instability. In coding especially, hallucination shows up as "plausible but wrong code" that can slip past review. Take the report's view on board, but also look hard and soberly at your own data quality and model reliability. The "vague success criteria" and "unrecoverable errors" I named earlier as reasons coding stays in collaboration are, in fact, another face of this same reliability problem.

Third, the time axis. The report's interviews ran from August 2025 to February 2026, when agentic AI was still nascent in production, and at the time of the study agentic implementations were only 20% of cases. The report itself notes that the redeployment and hiring-freeze patterns observed here are characteristics of an early-adoption phase, and that the distribution may shift as models mature and cost pressure builds. It also references separate research by Brynjolfsson and colleagues showing that employment of younger workers in AI-exposed roles has already declined in relative terms, and warns this is an early sign of a larger shift. The coding-stays-in-collaboration picture here is likewise not fixed; read it as a snapshot of this moment. Given how fast the length of tasks AI can complete autonomously is growing (per METR), the boundaries should keep moving, from collaboration to approval, from approval to escalation.

Finally, separate from those three points, let me answer an objection likely to be aimed at this post's own argument. Even if coding stays in collaboration, why not just hand review to AI too? In fact, AI code review has spread fast, and AI now catches style violations and common bugs. For low-risk changes, some teams already run on AI review alone.

But having AI review AI-written code has a trap. The same model tends to share the same blind spots, so "plausible but wrong code" can be missed by both the author-AI and the reviewer-AI. The automation complacency described earlier compounds when one AI checks another. For now, the one who ultimately decides the merge and owns the soundness of the design is a human. So even as AI review advances, it's more realistic to expect that human review won't go to zero but will narrow to the high-risk areas. Low-risk changes go to AI; areas carrying unrecoverable risk stay with humans. That lines up exactly with this post's view: vary the HITL level by task.
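In practice that split can start as a path-based routing rule in the review pipeline. A minimal sketch, assuming a repository where hard-to-undo changes live in recognizable paths; the patterns and the `reviewer_for` function are hypothetical, not something the report specifies.

```python
from fnmatch import fnmatch

# Hypothetical patterns for changes whose mistakes are hard to undo.
HIGH_RISK_PATTERNS = ("*migrations/*", "*schema*", "*.tf", "*auth/*")


def reviewer_for(changed_paths: list[str]) -> str:
    """Return "human" if any changed file is high risk, else "ai"."""
    for path in changed_paths:
        if any(fnmatch(path, pattern) for pattern in HIGH_RISK_PATTERNS):
            return "human"
    return "ai"


print(reviewer_for(["docs/README.md", "src/utils/format.py"]))     # ai
print(reviewer_for(["src/app.py", "db/migrations/0042_add.sql"]))  # human
```

A single high-risk file pulls the whole change to a human, which errs on the side of the verification culture described earlier.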

Coding stays in human-AI collaboration not because the technology is immature or engineers are slow to change. Task complexity, the risk of unrecoverable errors, and the fact that software engineering already has a multi-layered culture of verification: these three overlap to keep coding at a stage where human judgment still governs quality.

That fact hands each vantage point a different assignment. The individual engineer: the responsibility to keep engaging with review that gets harder as AI advances. The lead: the role of deciding which task sits at which autonomy level and how to design the verification layer. The org-development owner: the work of choosing, as strategy, what to do with the freed capacity, and of building both a culture that forgives failure and org-wide adoption that sticks.

What Stanford's 51 cases show again and again is one thing: what separates outcomes is not technology but organization. When coding's autonomy moves to its next stage, what decides it won't be model performance but how the engineering org designs, judges, and builds in verification. The shift has already begun. The organizations that don't put off that design work are the ones that will pull ahead.

source & further reading

dev.to: original article
