The Operating Model Was the Upgrade, Not the AI


In July 2025, METR published a result that should have cooled a lot of overheated arguments. They ran a randomized controlled trial: sixteen experienced open-source developers, working in repositories they knew intimately, doing real tasks — some with early-2025 AI tooling allowed, some without. Before they started, the developers predicted the AI would make them about 24% faster. Afterward, they reported that it had. The measurements said the opposite: they were about 19% slower with the AI than without it (arXiv:2507.09089).

Sit with that gap, because it is the most honest finding in the paper. These were strong engineers. They expected a speedup, experienced a speedup, and were wrong about their own work. The tool felt like help while quietly costing them time.

We read that study the same season we measured the opposite outcome. At fortiss, a research institute in Munich, our platform engineering competence field had spent about a year building the Punctilious Platform, a production-ready, multi-tenant system that our team uses to rapidly build prototypes for customers across industries, with AI as the primary engineering harness. We did not get slower. Output and cadence rose sharply. (Cadence is a soft, easily-inflated signal; the durable, defensible numbers come later, each with its caveat. For now, treat it as “a lot,” not as proof.)

Same class of model that slowed METR’s experts down. Opposite sign — though theirs is a randomized trial and ours an observational n = 1, a gap we hold honestly all the way through. The temptation is to explain one result away. The explanation that holds is simpler:

The sign of AI’s effect is not a property of the model. It is a property of the system around the model — and of which work you point it at.

That system — the methods, structure, and governance wrapped around the model — is what we call the operating model. This article is the whole case for it, condensed from a seven-part series; each section links to the longer post if you want the expanded version.

The short version

— A 2025 RCT found expert devs ~19% slower with AI. On one codebase, with the same model class, our team got dramatically faster.

— The resolution: the leverage was never in the model — it was in the operating model we built around it. That artifact is transferable; the model is the cheap part.

— The honest numbers: ~40 services and ~3,100 reviewed merges to main (durable, load-bearing); a ~22–28× per-engineer-hour figure that is a label-imputed proxy, ~99% imputed, not a measurement; rework reported as a 3.5%–17.9% range with no trend; cost a self-reported, unaudited ceiling.

— The limit: n = 1, one site, one small operator team, platform engineering only. Whether it transfers is the open question — and the last section is an experiment you can run in two weeks.

Figure: The whole argument in one map, showing every concept in this piece and how it connects: the paradox at the top, the operating model as the pivot, and everything that follows from it being a method, not a model.

If you have followed the AI-in-engineering debate, you have watched it fracture into two camps talking past each other. One says AI is a force multiplier; the other says it is a confident liability that generates plausible nonsense. Both are describing real experiences. What they miss is that they are running different systems and crediting or blaming the model. The same tool is a multiplier for a team that surrounded it with structure and a tax for a team that dropped it onto a bare codebase and asked it to “be helpful.” Argue about the tool and you cannot explain that split. Look at the system around the tool and it explains itself. METR’s developers had a capable model and no system around it. We had the same capability and a system designed, over months, to make it pay. The variable that moved was not the silicon.

The rest of this is what that system is, where it came from, how it stays auditable, what it actually produced, and whether any of it reaches your team.

2 · Why an unscaffolded AI forgets everything you taught it yesterday

Picture the best contractor you have ever worked with. Sharp, fast, widely read. Now add one constraint: every morning they arrive having forgotten everything about your project. Not the language or the craft — those they keep. But your codebase, your conventions, the reason the payments module has that retry logic, the bug that bit you last Tuesday and how you fixed it: gone. A clean slate, every morning.

That is an unscaffolded AI. The intelligence is real and persists; the context evaporates between sessions. Give it a small task — “add error handling to this new path” — and watch what it does first. It opens files to rediscover how this codebase handles errors. It re-derives the convention you settled months ago. It re-explains the pattern it used last week. And if there was a hard-won lesson — “don’t catch this exception here, it swallows a failure we care about” — it re-learns it, possibly by reintroducing the exact bug the lesson warned against.

None of that work produces anything new. It is the cost of starting over. And it is invisible: re-doing looks exactly like working — text streaming, files opening, plausible motion. You don’t feel yourself paying twice; you just notice the help didn’t help as much as it should have. Which is roughly what METR’s developers reported.

The naive fix — “give the AI a memory” — is a trap, and worth saying so plainly. Stuff too much into a session and you drown the signal; the model attends to the wrong things and gets worse. And a wrong remembered lesson is more dangerous than none: a confidently-recalled “we always do it this way” that is no longer true steers every future session off a cliff, quietly. Memory that isn’t curated doesn’t compound; it rots. So the goal is not “remember everything.” It is “make the right prior knowledge cheap to retrieve, keep it true, and let nothing essential get rediscovered from scratch.” That is a system problem, not a model setting — and once you see the cost as repeated work, the objective flips: you stop coaxing more intelligence out of the model and start removing the classes of work it would otherwise redo. Each class you remove stays removed, so the saving compounds instead of resetting.
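One way to picture "cheap to retrieve, kept true" is a memory entry that carries its own verification date, so a stale lesson is held back for review instead of being recalled with confidence. What follows is a minimal sketch under stated assumptions: the `MemoryEntry` fields, the 90-day review window, and the topic-based lookup are all hypothetical illustrations, not a description of how the Punctilious Platform stores its knowledge.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical review window: entries not re-confirmed within it count as stale.
REVIEW_WINDOW = timedelta(days=90)


@dataclass(frozen=True)
class MemoryEntry:
    topic: str           # retrieval key, e.g. "error-handling"
    lesson: str          # the curated lesson itself
    last_verified: date  # when a human last confirmed it is still true


def retrieve(entries, topic, today):
    """Split the lessons for one topic into (fresh, stale).

    Only fresh lessons would be handed to a session; stale ones go back
    to a human to be re-verified or pruned.
    """
    fresh, stale = [], []
    for entry in entries:
        if entry.topic != topic:
            continue  # keep the context small: only the requested topic
        if today - entry.last_verified <= REVIEW_WINDOW:
            fresh.append(entry)
        else:
            stale.append(entry)
    return fresh, stale


# Made-up entries for illustration.
entries = [
    MemoryEntry("error-handling",
                "Do not catch this exception here; it swallows a failure we care about.",
                date(2025, 6, 1)),
    MemoryEntry("error-handling", "We always retry three times.", date(2024, 1, 15)),
    MemoryEntry("naming", "Service names use kebab-case.", date(2025, 5, 20)),
]

fresh, stale = retrieve(entries, "error-handling", today=date(2025, 7, 1))
print(len(fresh), "fresh lesson(s);", len(stale), "held back for review")
```

The design choice worth noticing is that staleness is a property of the entry, checked at retrieval time, so an unpruned lesson degrades into "needs review" rather than into confident misdirection.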

3 · Seven layers between a model and real leverage

The repeated work isn’t a vague blob; it sorts into named taxes, and each has a fix. Rather than list them in the abstract, follow one deliberately ordinary change all the way through — add a rate-limited variant of an existing API endpoint, the kind of task a platform does a hundred times — and watch each layer remove one tax.

The example isn’t hypothetical. That rate-limit change is pinned by real tests — one fails the build if a throttled route ever returns a 500 instead of a 429; a sibling asserts the health check is never gated. A definition of done, made executable. Accumulate enough of these and they become something worth naming: eval data — a growing body of checks that encodes what good means in your domain, to your standard rather than the vendor’s. Of everything in the stack, it’s the asset you own most cleanly. When the model underneath changes, your eval data is what tells you whether the new one is actually better for your work — the difference between adopting an upgrade and hoping for one.
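To make "a definition of done, made executable" concrete, here is a toy version of the two checks described above. It is a sketch, not the platform's test suite: the in-memory app, the `FixedWindowLimiter` class, and the `/items` and `/health` paths are hypothetical stand-ins chosen so the example runs on its own.

```python
class FixedWindowLimiter:
    """Toy limiter: allows the first `limit` calls, rejects the rest."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0

    def allow(self):
        self.count += 1
        return self.count <= self.limit


def make_app(limit):
    """Build a hypothetical request handler that returns an HTTP status code."""
    limiter = FixedWindowLimiter(limit)

    def handle(path):
        if path == "/health":
            return 200  # the health check bypasses the limiter entirely
        if not limiter.allow():
            return 429  # throttled: Too Many Requests, never a server error
        return 200

    return handle


def test_throttled_route_returns_429_not_500():
    app = make_app(limit=2)
    statuses = [app("/items") for _ in range(5)]
    assert statuses == [200, 200, 429, 429, 429]
    assert 500 not in statuses


def test_health_check_is_never_gated():
    app = make_app(limit=1)
    app("/items")
    app("/items")  # the limit is now exhausted
    assert all(app("/health") == 200 for _ in range(10))


# Run the checks directly; under a test runner such as pytest they would be collected by name.
test_throttled_route_returns_429_not_500()
test_health_check_is_never_gated()
```

Each test pins one sentence of the definition of done, which is what lets a growing pile of them serve as eval data when the model underneath changes.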

Add the layers up and you get more than a sum: they feed each other. CI delegation keeps the session’s attention free, which is exactly the attention retrieval then spends well. A lesson learned often enough becomes an encoded skill. Each removed tax makes the next removal cheaper — which is why seeing the effect took a coordinated change, not seven independent tweaks, and why a partial build often delivers nothing. The compounding doesn’t start until a core sits together: structured work items, CI delegation, and durable memory. Below that, the layers don’t compound; they just sit there.

And none of it is free. These layers are infrastructure, and infrastructure is front-loaded: someone has to write the definitions of done, encode the skills, curate the memory so it stays true, keep CI honest. That cost is real, lands before the payoff, and never stops — a memory base nobody prunes becomes the rotting well from the section above. The leverage is real, but it is earned, and the bill arrives first.

The model is the cheap part. The expensive, valuable, transferable thing is the system that keeps it from starting over.

4 · The advantage you rent vs. the one you own

Those seven layers could look like a pile of habits we happened to accumulate. They aren’t. They are an instance of a method our competence field has researched and published for years — and the interesting part was pointing it at ourselves.

When fortiss builds a platform for a domain — an energy grid, a public-administration process — we do three things in order. First we model the domain: write down its concepts, their relationships, and the rules that constrain them, precisely enough that both people and machines can use it. A “meter,” a “tariff,” a “consumption record,” and how they relate, stop living in tribal knowledge and become a shared vocabulary. Then we instantiate that model with a specific case’s real assets — this utility’s meters, this region’s tariffs. Then we apply it: software reasons and queries over the result to do something useful. Model, instantiate, apply — MIA, written up in our public “Platform Engineering” whitepaper (ISSN 2700–2977). We’re not coining anything; we’re telling you where the layers came from.
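The three MIA steps can be sketched in a few lines using the energy example from the paragraph above. This is an illustration under assumptions: the field names, the non-negative-consumption rule, and the cost query are invented for the example and are not taken from the whitepaper.

```python
from dataclasses import dataclass


# MODEL: domain concepts, their relationships, and a rule that constrains them.
@dataclass(frozen=True)
class Tariff:
    name: str
    cents_per_kwh: int  # hypothetical unit: integer euro cents per kWh


@dataclass(frozen=True)
class Meter:
    meter_id: str
    tariff: Tariff  # relationship: every meter is billed under one tariff


@dataclass(frozen=True)
class ConsumptionRecord:
    meter: Meter
    kwh: int

    def __post_init__(self):
        # Illustrative domain rule: consumption cannot be negative.
        if self.kwh < 0:
            raise ValueError("consumption must be non-negative")


# INSTANTIATE: fill the model with one specific case's assets (made-up values).
standard = Tariff("standard", cents_per_kwh=30)
meter_a = Meter("M-001", standard)
meter_b = Meter("M-002", standard)
records = [
    ConsumptionRecord(meter_a, 120),
    ConsumptionRecord(meter_a, 80),
    ConsumptionRecord(meter_b, 50),
]


# APPLY: software queries and reasons over the instantiated model.
def cost_in_cents(meter, records):
    return sum(r.kwh * r.meter.tariff.cents_per_kwh for r in records if r.meter == meter)


total = cost_in_cents(meter_a, records)
print(f"{meter_a.meter_id}: {total / 100:.2f} EUR")
```

The point of the ordering is that the rule lives in the model, so every instance and every query inherits it without restating it.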

The change that produced this series was aiming MIA at our own work — treating “building software with AI” as the domain to be modeled. Re-read the seven layers through that lens and they line up closely. We model the work itself in a form the harness can read: structured work items with explicit definitions of done, recorded decisions, rules encoded as tests (layers 1–2). We instantiate it with reusable skills, curated memory, and per-tenant configuration — the assets the AI draws on instead of re-deriving them (layers 5–7). We apply it as a governed, checked loop that produces verified output, and every lesson folds back into the model for next time (layer 4 and the review gates).

Turning your method on yourself is not a victory lap. It means subjecting your process to the scrutiny you sell to clients — writing down what you actually do, exposing where it’s vague, finding where your “method” was really just habit. We found plenty; our own memory curation was ad hoc long before we admitted it needed a discipline. But it buys two things. First, principled beats accidental: because the layers are an instance of a method, they can be explained, defended, and transferred — the difference between a system and a lucky streak. Second, and this is the strategic core: none of the method is model-specific. MIA describes work, not a particular AI. A new frontier model may need the skills re-tuned, but it inherits the structure; you reconfigure, you don’t rebuild.

If your AI advantage lives in the model, you’re renting it, and the rent resets every release. If it lives in your operating model, you own it.

So far this is about speed. The other half — in our world, the half that decides whether you’re allowed to play at all — is proof. The Punctilious Platform serves German public-sector institutions, including two Bavarian state ministries. There, “the AI built it quickly” is not, by itself, a point in your favor. The question that matters comes later — the kind the EU AI Act anticipates with its technical-documentation and record-keeping duties (Annex IV; Article 12): what changed, who decided it was allowed, and can you prove the system still obeys the rules it claimed to follow? If you can’t answer that, speed is irrelevant, because you can’t ship at all.

The operating model that produces the leverage is the same model that produces the auditability — not a trade-off you balance, but one discipline seen from two angles. Suppose an auditor asks: “You store this category of personal data this particular way — who decided that, on what basis, and can you prove the system still enforces it?” On most projects that triggers an archaeology dig through chat logs. Here it’s mechanical. The decision is a recorded, numbered, reviewed architectural decision with its rationale attached — so “who decided and why” is a lookup. The constraint that enforces it is a structural test that fails the build if violated — so “does the system still obey it” is answered by the fact that today’s build is green. Neither artifact was created for the auditor; they exist because they also make the AI faster — the recorded decision stops it re-litigating settled questions, the test stops it drifting off the rules. The audit answer falls out of the productivity machinery.
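The auditor's two questions map onto two small artifacts, which the sketch below pairs up: a recorded decision that answers "who decided and why" by lookup, and a structural test that answers "does the system still obey it" by passing or failing. Everything named here is an assumption made for illustration: the `ADR-0042` identifier, the decision text, the placeholder owner, and the schema layout are hypothetical, not the platform's actual records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    adr_id: str
    title: str
    decided_by: str  # a named, accountable human; placeholder value below
    rationale: str


# Hypothetical decision log: "who decided and why" becomes a dictionary lookup.
DECISIONS = {
    "ADR-0042": Decision(
        adr_id="ADR-0042",
        title="Personal data is stored encrypted at rest",
        decided_by="<named data-protection owner>",
        rationale="Limits exposure if the storage layer is compromised.",
    ),
}

# Hypothetical storage schema that the structural test inspects.
SCHEMA = {
    "email": {"personal": True, "encrypted": True},
    "postal_address": {"personal": True, "encrypted": True},
    "display_theme": {"personal": False, "encrypted": False},
}


def violations(schema):
    """Return the personal-data fields that are not stored encrypted."""
    return sorted(
        name for name, field in schema.items()
        if field["personal"] and not field["encrypted"]
    )


def test_personal_data_is_encrypted():
    decision = DECISIONS["ADR-0042"]
    bad = violations(SCHEMA)
    # A failure names the decision it enforces, so the trail stays traceable.
    assert not bad, f"{decision.adr_id} ({decision.title}) violated by: {bad}"


test_personal_data_is_encrypted()
```

Note what the test does not cover: it checks only the rule someone thought to encode, which is exactly the limit the next paragraph spells out.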

This is the spot where it would be easy to oversell, and a regulated audience can smell it. Governance as code is not a guarantee of good judgment. A structural test only checks the rule you thought to encode; the rule you didn’t imagine is unprotected, and the suite’s confidence can lull you into not looking for it. An audit trail records what happened, not whether it was wise. Recorded decisions can be well-documented and wrong. What the system gives you is not correctness but accountability: the ability to see, prove, and trace, with the accountable party a named human by design, never the AI. It constrains drift and makes judgment inspectable. It does not replace judgment, and any vendor who claims their governance layer does is selling you a comfortable story.

The hard part of AI in regulated work was never the intelligence. It’s the accountability — and accountability is a property of the operating model, not the model.

Five movements of argument; here’s where we owe you evidence. The honest move is to say what each number is worth, because the fastest way to lose a technical reader is one clean, flattering figure.

Lead with what stands, not how fast it moved. The load-bearing evidence is the standing system: roughly 40 services across 6 tenant overlays, 210 recorded architectural decisions, about 5,400 structural-invariant test functions, around 1,500 curated knowledge entries, 43 governance gates, on the order of 2.2 million tracked lines, and about 3,100 changes merged to the main branch, each reviewed. Direct counts of a durable system — evidence the work accumulated, which is not yet evidence of what produced it.

The reflexive objection to any AI output claim is “it’s just churn.” So here is the anatomy of a commit, as a distribution rather than an anecdote: the median commit is 80 lines across 2 files; only 6.1% are two lines or fewer; about 59% are larger than 50 lines; around 97% are conventionally typed; and there are 0.11 deletions for every insertion. That last ratio is the one we’d defend — low deletions-per-insertion means the codebase is accumulating, not thrashing. You can argue what the work is worth; you can’t call it churn with the numbers in front of you.
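If you want to compute the same kind of distribution for your own repository, a sketch like the following is enough. It makes several assumptions: commit size is taken as insertions plus deletions (the article does not say how it counts lines), the per-commit tuples would come from something like parsed `git log --numstat` output, and the five sample commits are invented, so the results are not the figures quoted above.

```python
from statistics import median

# (insertions, deletions, files_changed) per commit; invented sample data.
commits = [
    (70, 10, 2),
    (1, 1, 1),
    (200, 20, 5),
    (45, 5, 2),
    (90, 6, 3),
]


def commit_anatomy(commits):
    """Summarise commit sizes as a distribution rather than an anecdote."""
    sizes = [ins + dele for ins, dele, _ in commits]  # assumed size definition
    total_ins = sum(ins for ins, _, _ in commits)
    total_del = sum(dele for _, dele, _ in commits)
    n = len(commits)
    return {
        "median_lines": median(sizes),
        "median_files": median(files for _, _, files in commits),
        "share_two_lines_or_fewer": sum(s <= 2 for s in sizes) / n,
        "share_over_50_lines": sum(s > 50 for s in sizes) / n,
        # Low values suggest accumulation; values near 1 suggest rewriting in place.
        "deletions_per_insertion": total_del / total_ins,
    }


stats = commit_anatomy(commits)
for name, value in stats.items():
    print(f"{name}: {value:.3g}")
```

Reporting the whole dictionary, rather than one flattering ratio, is what lets a reader check the "it's just churn" objection for themselves.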

Now the soft numbers — each with its caveat in the same breath, because that’s the whole point.

And the caveat on the caveats: even the measured numbers are not proof of the thesis. They’re consistent with it. This is one site, observed over time, not a controlled comparison — the durable output is real, but it can’t on its own rule out that we’d have built something similar by other means.

A number you can’t caveat is a number you shouldn’t publish. The caveats aren’t the fine print here. They’re the finding.

This is the part where this kind of argument usually goes quietly vague. We’d rather be precise. Everything above is n = 1: a single site, a single codebase, the work directed by one human operator coordinating about seven configured AI session-identities — together roughly 95% of the commits, named human teammates the rest — which also means we can’t cleanly separate the operating model from the one operator who ran it. And it is platform engineering: composition, integration, configuration, governance, operation. It is not the whole software lifecycle — not requirements elicitation, large open-ended refactors, or novel algorithm design, none of which sat on the critical path. If your hard problem is one of those, we’ve shown you nothing about it.

We owe the opening an honest correction, too. The METR comparison is apples to oranges, and we’ve treated it that way on purpose: METR ran a randomized, controlled trial on repositories developers already knew; ours is an observational result from a single project with no control group. We are not refuting METR — you can’t refute a randomized trial with an anecdote, however well-instrumented. We’re offering an existence proof of a different operating point: different work, different system, different sign. “Both can be true” has been the load-bearing idea throughout, and this is where we make good on it.

So we’ll stop short of telling you it will work for you, and show you how to find out instead. Don’t adopt seven layers — that’s a months-long commitment and you shouldn’t take our word for it. Pick one: the layer that maps to wherever your AI sessions waste the most time. If they keep re-deriving what “done” means, start with structured work items. If they keep relearning what the team already knows, start with durable memory — and curate it so it stays true. Stand up that one layer, run it for two weeks, and measure one thing: the time your sessions spend rediscovering what was already known. One layer won’t give you the compounding — that needs the full set — but it tells you whether that class of waste is real for you, and removable.
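The measurement itself can be very plain. One possible approach, sketched below, is to hand-label what each session spent its time on and compare the rediscovery share before and after standing up the layer. The activity labels, the minute counts, and the two-log comparison are all assumptions for illustration; the data is made up and proves nothing about any real team.

```python
from collections import defaultdict


def rediscovery_share(log):
    """Fraction of logged minutes spent rediscovering already-known things.

    `log` holds (session_id, activity, minutes) tuples, where the activity
    label is assigned by a human reviewing the session afterwards.
    """
    totals = defaultdict(int)
    for _session_id, activity, minutes in log:
        totals[activity] += minutes
    all_minutes = sum(totals.values())
    return totals["rediscovery"] / all_minutes if all_minutes else 0.0


# Invented logs: a baseline period, then two weeks with one layer in place.
baseline = [
    ("s1", "rediscovery", 20), ("s1", "new_work", 40),
    ("s2", "rediscovery", 15), ("s2", "new_work", 45),
]
with_layer = [
    ("s3", "rediscovery", 5), ("s3", "new_work", 50),
    ("s4", "rediscovery", 4), ("s4", "new_work", 41),
]

before = rediscovery_share(baseline)
after = rediscovery_share(with_layer)
print(f"rediscovery share: {before:.0%} before, {after:.0%} after")
```

Hand-labelling is slow and subjective, so treat the resulting share the way the article treats its own soft numbers: as a signal with a caveat attached, not a measurement.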

Then tell us, in the responses: which class of repeated work did the layer remove — and, more usefully, which did it refuse to touch? That second answer is the one we can’t get from inside our own project. It’s how we find the real boundary of this idea, together.

The model is the cheap part. The operating model is the work. How far it transfers is a question we’d rather answer with you than pretend we’ve already settled.

If the operating model is the lever, the most valuable thing any of us can do isn’t to find a better model — it’s to build the system that makes the model we already have pay. We’ve shown you ours, seams visible. Go find the edges of yours.

---

This is the condensed story. Each part is a short standalone post with more examples and detail:

1. [Same Model, Opposite Sign](https://medium.com/@dian-balta/same-model-opposite-sign-ba5ab4f52a20) — the paradox and the thesis.

2. [Why AI Forgets Everything You Taught It Yesterday](https://medium.com/@dian-balta/why-ai-forgets-everything-you-taught-it-yesterday-451e831f6537) — the mechanism: repeated work.

3. [Seven Layers Between a Model and Real Leverage](https://medium.com/@dian-balta/seven-layers-between-a-model-and-real-leverage-a7859e0c1ead) — the system, followed through one real change.

4. [The AI Advantage You Rent vs. the One You Own](https://medium.com/@dian-balta/the-ai-advantage-you-rent-vs-the-one-you-own-a657370be95a) — the method (MIA), turned inward.

5. [How You Run AI Where Auditors Are Watching](https://medium.com/@dian-balta/how-you-run-ai-where-auditors-are-watching-db196d175f5e) — governance as code.

6. [How to Measure AI Leverage Without Lying to Yourself](https://medium.com/@dian-balta/how-to-measure-ai-leverage-without-lying-to-yourself-2ea7651b6c15) — measurement integrity.

7. [Does Any of This Transfer to Your Team?](https://medium.com/@dian-balta/does-any-of-this-transfer-to-your-team-f57ac11e702d) — the limits, and the experiment.

Going deeper: “Designing the Space, Not the Answer” lays out the conceptual foundation beneath the series, namely why context engineering is the deliberate design of a solution space rather than answering one prompt at a time.

---

Dian Balta leads the Platform Engineering competence field at fortiss, a research institute in Munich. This is the single-read companion to the seven-part series “The Operating Model,” itself the public companion to a forthcoming fortiss whitepaper.

The Operating Model Was the Upgrade, Not the AI was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
