The New Software Lifecycle

Google published a whitepaper co-written by Addy Osmani on the agentic software development lifecycle, detailing how agents function as a model plus a harness, with context engineering and verification as key differentiators. The paper introduces a static/dynamic context split that determines token costs and outlines how each SDLC phase changes when agents are involved.

June 16, 2026

I co-wrote a Google whitepaper on the agentic SDLC. Past the fundamentals, here are the concrete mechanisms that matter: why an agent is mostly harness, the static/dynamic context split that decides your token bill, verification as the real differentiator, how each SDLC phase changes, and the point where your prototype becomes the production agent. With six figures from the paper.

Google published a whitepaper I co-wrote this week, The New SDLC With Vibe Coding (https://www.kaggle.com/whitepaper-the-new-SDLC-with-vibe-coding), with Shubham Saboo (https://www.linkedin.com/in/shubhamsaboo/) and Sokratis Kartakis (https://www.linkedin.com/in/kartakis/). It is the first of a short series. A Day 1 paper has to cover the fundamentals: what an agent is, what vibe coding means, why the work is moving from writing code to judging it. If you read this blog you have all of that, so I will skip it and go straight to the mechanisms in the paper that are concrete enough to act on, plus six figures worth stealing for your own talks.

An agent is a model plus a harness

The most useful framing in the paper: an agent is a model plus a harness. The model is one input. Everything that makes it actually finish work is the harness: the instructions and rule files, the tools and MCP servers, the sandboxes it executes in, the orchestration logic that spawns sub-agents and routes between models, the hooks that fire deterministic code at lifecycle points, and the observability that tells you whether it is drifting.

The model is the engine. The harness is the car, the road, and the traffic laws.

Two public results make the size of the effect concrete. On Terminal Bench 2.0, a team moved a coding agent from outside the Top 30 into the Top 5 by changing only the harness, with the same model underneath. LangChain added 13.7 points on the same benchmark by changing only the system prompt, tools, and middleware around a fixed model.
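A minimal sketch of the "model plus a harness" framing, assuming a toy text protocol in which the model requests a tool by replying `CALL <tool> <argument>`. Every name here (`Harness`, `fixed_model`, `word_count`) is hypothetical and not from the paper; the only point is that the same model finishes or fails the task depending on what the harness provides.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical types for illustration; none of these names come from the paper.
Model = Callable[[str], str]   # prompt in, text out
Tool = Callable[[str], str]    # argument in, result out
Hook = Callable[[str], None]   # deterministic code run before a tool call


@dataclass
class Harness:
    instructions: str
    tools: dict[str, Tool] = field(default_factory=dict)
    pre_tool_hooks: list[Hook] = field(default_factory=list)
    trace: list[str] = field(default_factory=list)  # minimal observability

    def run(self, model: Model, task: str) -> str:
        reply = model(f"{self.instructions}\n\nTask: {task}")
        self.trace.append(f"model: {reply}")
        # Toy protocol (an assumption): "CALL <tool> <argument>" requests a tool.
        if not reply.startswith("CALL "):
            return reply
        _, name, arg = reply.split(" ", 2)
        if name not in self.tools:
            self.trace.append(f"missing tool: {name}")
            return f"error: no tool named {name}"
        for hook in self.pre_tool_hooks:
            hook(name)  # e.g. audit logging or a permission check
        result = self.tools[name](arg)
        self.trace.append(f"tool {name}: {result}")
        return result


def fixed_model(prompt: str) -> str:
    # Stand-in for a real model: it always asks for the same tool.
    return "CALL word_count the quick brown fox"


def word_count(text: str) -> str:
    return str(len(text.split()))


audit: list[str] = []
bare = Harness(instructions="You are a careful assistant.")
equipped = Harness(
    instructions="You are a careful assistant.",
    tools={"word_count": word_count},
    pre_tool_hooks=[audit.append],
)

print(bare.run(fixed_model, "Count the words."))      # error: no tool named word_count
print(equipped.run(fixed_model, "Count the words."))  # 4
```

The model stub is identical in both runs. The difference in outcome comes entirely from the tools and hooks the harness supplies, and the `trace` list records which one was missing.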
So when an agent fails, debug the harness first. The usual culprits are a missing tool, a rule written too loosely, a guardrail you never added, or a context window full of noise. Most agent failures are configuration failures, which is the optimistic read, because configuration is the part you can fix today without waiting on a new model. Models will keep getting swapped under your harness; the teams that compound value build the harness once and refine it across projects. I have written the long version as harness engineering (https://addyosmani.com/blog/agent-harness-engineering/) and the factory model (https://addyosmani.com/blog/factory-model/).

Context engineering has a shape, and it decides your bill

If the harness is the system, context engineering is the highest-leverage knob inside it. The paper breaks agent context into six types: instructions (role, goals, boundaries), knowledge (docs, diagrams, domain data), memory (session logs and persistent state), examples (few-shot demos and reference patterns), tools (the APIs and services it can call), and guardrails (hard constraints and safety rules).

The decision that actually moves your token bill is what goes in static versus dynamic context. Static context is reliable and expensive because it is present in every turn. Dynamic context is cheap because you only pay for what the task needs.

Static context is always loaded: system instructions, rule files (AGENTS.md, CLAUDE.md, GEMINI.md), global memory, core guardrails. Dynamic context is loaded on demand: skills triggered by a task match, tool results retrieved mid-execution, documents pulled from RAG.

Too much static context wastes tokens and dilutes signal; too little and the agent forgets the rules that keep it safe. The pattern worth adopting is to treat that boundary as a first-class architectural decision, reviewed in pull requests and versioned like config.
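The arithmetic behind the static/dynamic split can be sketched in a few lines. This is a deliberately simplified model with made-up numbers: it counts input tokens only, ignores prompt caching and accumulated conversation history, and assumes dynamic context is paid for only on the turn that loads it. The function name and all figures are assumptions for illustration, not values from the paper.

```python
def tokens_per_session(static_tokens: int, dynamic_tokens_per_turn: list[int]) -> int:
    """Input tokens for one session under a simplified cost model.

    Static context is resent on every turn; dynamic context is counted
    only on the turns that load it.
    """
    turns = len(dynamic_tokens_per_turn)
    return static_tokens * turns + sum(dynamic_tokens_per_turn)


# Hypothetical 10-turn session, everything loaded statically:
# 20,000 tokens of rules, docs, and skills present in every turn.
everything_static = tokens_per_session(20_000, [0] * 10)

# Same session with 2,000 tokens of core rules kept static and a
# 6,000-token skill loaded on demand in two of the ten turns.
lean = tokens_per_session(2_000, [0, 0, 6_000, 0, 0, 0, 6_000, 0, 0, 0])

print(everything_static)  # 200000
print(lean)               # 32000
```

Under these assumptions the lean layout uses about a sixth of the input tokens. The real ratio depends on caching, history, and how often dynamic context is needed, which is why the boundary is worth reviewing like any other config change.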
The mechanism that makes dynamic context scale is Agent Skills with progressive disclosure: the agent sees lightweight metadata at startup, loads full instructions when a task matches, and pulls deep reference material only when it is explicitly needed. That is how one agent can carry dozens of specialized capabilities while paying the token cost for only the one it is using right now. I have written more on this in agent skills (https://addyosmani.com/blog/agent-skills/).

Verification is the line between vibe coding and engineering

You can sit anywhere on the spectrum from vibe coding to agentic engineering with the same agent. What moves you along it is how outputs get verified.

The right position depends on the stakes. The skill is knowing where to draw the line for each task.

Two mechanisms do the work. Tests verify the deterministic parts: given this input, the function returns that output. Evals verify the parts that are not deterministic, and the paper splits them in a way I find useful: output evaluation asks whether the final artifact is correct, and trajectory evaluation asks whether the sequence of tool calls and intermediate reasoning was sound. Both matter, because a fluent output that skipped its verification steps is a more dangerous failure than one with a visible error.

The recommendation I would give a leader verbatim: set the bar at the eval, not the demo. A demo proves an agent can succeed once. A passing eval suite with a real rubric proves it succeeds reliably. This is the same argument I keep making in verification is the bottleneck (https://addyosmani.com/blog/verification-bottleneck/) and agentic code review (https://addyosmani.com/blog/agentic-code-review/).

How each SDLC phase actually changes

AI compresses the lifecycle unevenly, and the unevenness is the point. Implementation that took weeks takes hours.
Requirements, architecture, and verification stay human-paced because they are judgment work, so specification quality becomes the bottleneck and verification moves to the centre.

Same phases, different bottlenecks, different proportions.

Phase by phase, concretely:

- Requirements stop being a document handed between teams and become a conversation that produces a spec and an initial prototype at once. The agent drafts user stories from a brief, surfaces edge cases, and turns a description into a working prototype in minutes.
- Architecture is the most stubbornly human phase. Trade-offs like consistency versus availability depend on business context the model cannot fully grasp, so the developer’s job shifts to making and documenting the structural decisions the agent then implements.
- Implementation is where the gains and the caveats both live. Surveys report 25 to 39% productivity improvements, while a METR study (https://metr.org/blog/2026-02-24-uplift-update/) found experienced developers taking 19% longer on some tasks once you count verifying and correcting. The honest read: AI turns implementation from writing into reviewing.
- Testing and QA flips. Tests and evals become the primary way you communicate intent to the agent, wired into a continuous quality flywheel: evaluate against a benchmark, cluster the failures, fix the prompt or tool that caused them, verify against a regression suite, monitor production for new failure modes.
- Maintenance is the most underrated. Code that was “too risky to touch” because only its authors understood it can now be read, refactored, and modernized by an agent. Framework migrations and deprecation cleanups that never happened because they were tedious and risky now happen.
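To make the Testing and QA phase above concrete, here is a minimal sketch of the "evaluate, then cluster the failures" steps, using the output versus trajectory split described earlier. The `Run` record, the tool-call names, and the failure labels are all hypothetical; a real eval suite would use a rubric and recorded traces from an actual agent rather than exact string matching.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Run:
    """One recorded agent run (hypothetical shape, for illustration only)."""
    case_id: str
    output: str
    tool_calls: list[str]


def output_ok(run: Run, expected: str) -> bool:
    # Output evaluation: is the final artifact correct?
    return run.output.strip() == expected


def trajectory_ok(run: Run, required_steps: list[str]) -> bool:
    # Trajectory evaluation: did the required tool calls happen, in order?
    # `step in calls` consumes the iterator, so this checks for a subsequence.
    calls = iter(run.tool_calls)
    return all(step in calls for step in required_steps)


def evaluate(runs: list[Run], expected: dict[str, str],
             required_steps: list[str]) -> Counter:
    """Label every run so failures can be clustered by kind."""
    labels: Counter = Counter()
    for run in runs:
        if not output_ok(run, expected[run.case_id]):
            labels["wrong_output"] += 1
        elif not trajectory_ok(run, required_steps):
            labels["skipped_verification"] += 1
        else:
            labels["pass"] += 1
    return labels


runs = [
    Run("a", "42", ["read_file", "edit_file", "run_tests"]),
    Run("b", "42", ["read_file", "edit_file"]),  # right answer, never ran the tests
    Run("c", "41", ["read_file", "edit_file", "run_tests"]),
]
expected = {"a": "42", "b": "42", "c": "42"}

report = evaluate(runs, expected, required_steps=["edit_file", "run_tests"])
print(dict(report))
```

Run `b` is the case a demo would miss: its output is correct, but the trajectory check flags that it never ran the tests. Counting failures by label is the simplest form of clustering, and it points at which prompt or tool to fix first.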
The ceiling on all of it is still the 80% problem (https://addyo.substack.com/p/the-80-problem-in-agentic-coding): agents generate the first 80% of a feature fast, and the last 20%, the edge cases and the integration seams, demands the deep context current models often lack.

The economics: context and routing are financial levers

The metric that matters to a leader is not velocity, it is Total Cost of Ownership, and the AI era splits it in a way that inverts the usual intuition about which approach is cheap.

Past the crossover, vibe coding costs 3 to 10x more per feature. How long the code has to live decides whether you reach it.

Vibe coding is low CapEx, high OpEx. The barrier to entry is a subscription and a few prompts; the hidden cost is operational. There is token burn from dumping unstructured files into context and asking the model to fix its own mistakes, a maintenance tax when that ad-hoc code has to be reverse-engineered later, and security remediation when fast generation produces vulnerabilities at the same rate as features. Agentic engineering inverts the curve: higher CapEx upfront (schemas, test suites, structured context) for a low marginal cost per feature after.

I will be straight about the chart, since I helped build it: the crossover multiplier, vibe coding costing 3 to 10x more per feature, is illustrative rather than a measured constant. The shape is the claim.

The takeaway for developers is that context engineering and intelligent model routing are financial levers, not just technical ones. Passing a 100,000-token repository into every prompt is not viable at scale. A well-designed factory routes hard reasoning (requirements, architecture, initial implementation) to large models and routes deterministic, lower-complexity work (test generation, code review, CI monitoring) to smaller, cheaper, faster ones. You hold output quality while driving the token bill down.
That is the economic engine under what I have called the orchestration tax (https://addyosmani.com/blog/orchestration-tax/) and the innovation budget (https://addyosmani.com/blog/innovation-budget/).

Where this is going: the prototype becomes the production agent

This is the most forward-facing idea in the paper, and the one I would watch most closely. The same terminal workflow that produces a throwaway script now produces a production agent, in the same place, often by talking to the same coding agent you were already using. Building, evaluating, and deploying a real agent (persistent memory, scoped permissions, eval coverage, observability) used to mean a separate stack and a separate workflow. Now it collapses into the loop you already run.

Google’s Agents CLI (https://google.github.io/adk-docs/) is built around this: after a one-time install, the coding agent you prefer gains skills covering the full lifecycle, and you drive it in natural language.

One-time setup:

    uvx google-agents-cli setup

Then, in your coding agent:

    Build a support agent that answers questions from our docs.
    Evaluate it on the FAQ dataset.
    Deploy it to Agent Engine.

Behind that single instruction, the agent scaffolds the project, writes the code, generates an eval set, runs it, deploys to a managed runtime, and reports back. The prototype that ran on your laptop yesterday becomes the production agent serving users today, without a rewrite.

Coordination across agents rides on open standards: MCP for tool access, A2A for cross-agent delegation. The paper cites an Anthropic experiment where agent teams built a working C compiler in Rust over two weeks, with humans setting direction and reviewing rather than writing the implementation.
Day to day, you move between two modes the paper names: the conductor (real-time, in-IDE, keystroke-level control, best for exploration and unfamiliar code) and the orchestrator (async, multi-agent delegation, goal-level control, best for well-specified work like migrations and test generation). The tooling now supports both at once: inline completion in the editor, a goal handed to a terminal agent, and background agents that take a paragraph and return a pull request hours later. The conductor-to-orchestrator (https://addyosmani.com/blog/future-agentic-coding/) shift is a skills shift before it is a tooling one.

The slide for everyone else

The last figure is not for you. It is for the people you are bringing along: the exec who still thinks this is fancy autocomplete, the colleague who has not made the jump.

Each generation preserved what came before while raising the ceiling on what one engineer could accomplish.

The figure carries the adoption numbers that tend to end the “is this real yet” conversation: as of early 2026, 85% of professional developers use AI coding agents regularly, 51% daily, and roughly 41% of new code is AI-generated.

Where to start this week

The paper closes with recommendations for individuals, leaders, and organizations. The concrete ones I would act on first:

- Put an AGENTS.md in the repo. Ten lines is enough to start: stack, conventions, hard rules, workflow. Add a rule every time the agent does something it should not have.
- Write the tests and evals before the code. They are the contract with the agent, and a good eval suite communicates intent more precisely than any natural-language prompt.
- Pick one repetitive workflow and make it your first real agent, end to end. Use a coding agent for the prototype and graduate it to production when it earns its keep. Building one teaches more than reading about a hundred.
- Treat the harness, the eval suites, and the static context as shared, versioned team assets, not personal scripts.
The teams that compound the most value build their harness once and refine it many times.

The principle underneath all of it is the one I would put on the wall: AI amplifies your engineering culture, multiplying both your strengths and your weaknesses. Generation is increasingly solved. The work that is left, and the work worth getting good at, is specification, verification, and the systems that hold them together.

The full paper is here (https://www.kaggle.com/whitepaper-the-new-SDLC-with-vibe-coding).