Agentics: Cost to Implement vs Cost to Verify

Nori CTO Clifford introduces a framework for evaluating coding agent tasks based on cost to implement (Ci) and cost to verify (Cv), arguing that the industry's focus on model capabilities misses the real bottleneck of verification. He warns that when both costs are high, developers risk falling into a trap of over-reliance on agents without adequate verification.

Slop slop slop slop slop slop slop. A framework for how to avoid it.

Guest post from Clifford, my CTO and co-founder at Nori. His posts always do way better than mine, and it’s a shame he doesn’t write as much as I do, because he has real insights whereas I mostly just shout my opinions at the cloud. You can see the original here.

Also, I will be in SF from the 29th to the 3rd. DM me if you want to catch up.

Also, also, if you or your team is trying to figure out how to use AI at your company (if you are struggling with costs, or with security concerns, or with roll-out to nontechnical staff), take a look at Nori Sessions. It’s now generally available. On average, Nori makes teams 2-5x more productive across sales, ops, and eng. AI promises the future of work. Nori Sessions is how you get there.

The Wrong Scoreboard

The discourse on coding agents has spent the past year obsessing over the wrong question. The main focus has been what models can do: lines written, autonomous minutes, benchmark scores, model cards, percent of lines shipped by AI. These are all generalized measures of implementation throughput. They are useful for a bird’s-eye view of model progress, but they say almost nothing about where the actual bottlenecks now live.

The operative question for practitioners in 2026 is not what tools can do; it’s what you should ask them to do.

Answering the “should” question requires a different lens than the capability benchmarks provide. Every task you might hand to a coding agent has two costs that matter: the cost to implement, Ci (the time and expertise needed to produce the code), and the cost to verify, Cv (the time and expertise needed to confirm the code is correct). The relationship between these two variables determines whether delegation is a net win or a liability.

Aside: About “Delegation”

When I first outlined this in November 2025, I was comparing handcoded vs AI-delegated implementations.
My workflow has changed significantly since then: I rarely hand-write code. The relevant choice for me is now between pair programming with the agent (high-touch, Socratic, every structural decision is guided) and delegating (the agent leads research, planning, and implementation; you just review the output of each phase). The pair programming model is mentally just as involved as writing code, but mechanically faster. The delegation model is now very different, allowing you to run and ship five separate feature PRs in parallel (not some clickbait Xitter “I ran 100 agents in parallel today” make-work slop, but five actual product increments, in parallel, in a brownfield codebase). Whatever the threshold of delegation is, in my experience the framework below applies.

The Two-Variable Framework

When both costs are low, it doesn’t matter what approach you take; the task is trivial either way. When Ci is high but Cv is low, delegate freely; the implementation is a job for the agent, and you can cheaply confirm the result. The inverse is equally clear: when Ci is low but Cv is high, build a detailed mental model by taking part in every step of the process.

The dangerous quadrant is top-right. When both costs are high, there’s a huge incentive to spin the slot machine many times and see if the agent just happens to nail the task. Compared to hand coding, where you burn days or weeks to ascertain the quality, the agent might have a chance to succeed at the same or higher quality after just 60 minutes of work. For complex or off-distribution work, it may be a small chance... but that makes it even more tempting. By skipping the mental effort, you go in blind on an equally demanding task: verification.

This is the trap. The models have dramatically compressed Ci across the board. Cv has not moved at the same rate, and in many cases, without careful developer intervention, it has gotten worse.
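
The four quadrants above can be written down as a small decision helper. This is only a sketch under stated assumptions: the post does not define units or a cutoff for "high" and "low", so measuring both costs in hours and using a single `threshold_hours` value are illustrative choices, and the names `Task`, `Approach`, and `recommend` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Approach(Enum):
    EITHER = "trivial either way; any approach works"
    DELEGATE = "delegate freely and cheaply confirm the result"
    PAIR = "take part in every step and build a mental model"
    TRAP = "dangerous quadrant; verification is the bottleneck"


@dataclass(frozen=True)
class Task:
    name: str
    ci_hours: float  # assumed unit for cost to implement (Ci)
    cv_hours: float  # assumed unit for cost to verify (Cv)


def recommend(task: Task, threshold_hours: float = 4.0) -> Approach:
    """Map a task onto one of the four Ci/Cv quadrants.

    The threshold is an arbitrary illustrative cutoff, not a value
    taken from the post.
    """
    high_ci = task.ci_hours >= threshold_hours
    high_cv = task.cv_hours >= threshold_hours
    if high_ci and high_cv:
        return Approach.TRAP
    if high_ci:
        return Approach.DELEGATE
    if high_cv:
        return Approach.PAIR
    return Approach.EITHER


if __name__ == "__main__":
    for task in (
        Task("rename a config key", 0.5, 0.5),
        Task("port a well-tested module", 16, 1),
        Task("tweak a permission check", 1, 12),
        Task("rewrite the sync engine", 40, 40),
    ):
        print(f"{task.name}: {recommend(task).value}")
```

The example tasks and their hour estimates are made up for illustration. The useful part is the habit of estimating Cv explicitly before deciding how to work, not the specific numbers.
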

Vibecoding and the Unaddressed Variable

Vibecoding is the logical extreme of treating Ci as the only variable. Previously, architecture decisions were bottlenecked by implementation cost. Releasing that constraint completely, without addressing verification cost, is a big failure mode.

Any frequent flyer on Claude Code has experienced this as an end user of an entirely AI-coded application: the constant issues with UI bugs, unintended changes to history cells, broken permission models... I've written about the flickering issues before (https://clifford.ressel.fyi/blog/drawing-monospace-text), and I've been annoyed that sandboxing persistently pollutes the workspace with empty files (this issue has been recurring in different forms for three months now):

https://github.com/anthropics/claude-code/issues/16022
https://github.com/anthropics/claude-code/issues/17727
https://github.com/anthropics/claude-code/issues/28189
https://github.com/anthropics/claude-code/issues/29316
https://github.com/anthropic-experimental/sandbox-runtime/issues/139
https://github.com/anthropic-experimental/sandbox-runtime/issues/85

Users of the Claude Web environment, of Cursor, and of many other almost fully AI-coded products experience the exact same degradation of quality as the software grows so rapidly.

It’s not just that more features lead to proportionately more bugs. When you don’t build a mental model of the codebase, you’ve skipped your first pass on verifying the logic, and you’ve gone without a map of what parts need verification. The consequence isn’t just bugs; it’s verification blindness: you don’t know what you don’t know. This is a common failure mode that many teams have fallen into, particularly startups that feel the keenest urgency to ship faster.

Verification Debt

As a result, the most common form of tech debt in these highly agentic codebases comes from growing your feature surface area too fast and too loose.
Every agentic feature shipped without a corresponding verification investment degrades your ability to autonomously ship future features. This is a compounding liability, not a fixed cost: first it accumulates, and then, because these changes can have cross-cutting technical concerns or act as bad examples for future work, it compounds.

Unit and integration testing become slightly more important, to compensate. But E2E behavioral verification becomes far more important, because that’s the layer the agent generally cannot self-evaluate on its own. Skipping this investment creates verification debt.

Detour: What Spec-Driven Development Gets Right and Wrong

The popular framing of spec-driven development is wrong on two counts. It’s not about making prompt copy-paste easier, and it’s not about closing the loop on “Ralph Wiggum” workflows (generate, test, regenerate). These framings chase short-term speedups that don’t touch the real bottleneck.

The original insight of a specification is much more important: you cannot verify an implementation if you don’t know its intent. A specification is the textual or symbolic description by which different readers arrive at the same mental model. In distributed teams, software has long relied on PR review, ADRs, and box and sequence diagrams. This is fundamentally the sharing of intent. You must know the developer’s intent before you can review their outputs. This is not new; it’s just now more urgent.

The genuine unlock of specs is that they literalize the behavior you need to verify. Once combined with simulation environments (headless browsers, terminal puppeteering, API smoke tests), your specs become the instructions for agentic verification after agentic implementation. The loop closes not at the generation layer, but at the verification layer.

Beware Reflexivity

These two variables don’t stay independent; they affect each other over time. Shipping too quickly raises Cv as verification debt accumulates.
Higher Cv in turn raises future Ci: the agent’s implementation speedup erodes as the codebase becomes harder to reason about and harder to test against, and as bad patterns get committed and cargo culted. This is the mechanism by which agentic coding gains can crash down to earth, trash a codebase, and turn a team against AI coding tools entirely.

But reflexivity runs both ways. The virtuous path runs in the opposite direction: to properly ship faster, teams must aggressively use the low cost to implement new tooling as a lever to decrease the cost of verification.

Solutions: Making Verification Tractable

Three concrete approaches, in increasing specificity:

1. Simulation environments at full breadth. Standard integration/automation/E2E testing practices are table stakes, but the agent’s reach is now wider. Headless browsers, Chrome DevTools protocol, pseudo terminals, API smoke tests, microVMs, containers: the agent can drive all of these. The question is no longer “can we automate this” but “have we specified what to automate.”

2. Making the runtime legible. E2E tests don’t capture everything: service startup timing, internal program state, functional SLAs (“no UI interaction blocks for more than 2s”). An ephemeral, per-worktree observability stack (logs, metrics, traces) makes runtime behavior tractable to the agent. This is the difference between the agent knowing the tests pass and the agent understanding how the system is behaving.

3. Bespoke verification tooling is now nearly free. Personal example: while building a PTY proxy around several TUI tools, codifying system invariants and nightly fuzzing jobs cost only about two extra hours. I implemented fuzzing with reproducible seeds and system state captures at failure. I can put my tool through more comprehensive testing than I could have ever justified without the agentic contributions.
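
To make "fuzzing with reproducible seeds and system state captures at failure" concrete, here is a minimal sketch of the pattern. It is not the PTY proxy tooling described above: `BoundedBuffer` is a hypothetical toy system under test, and the invariant, operation mix, and names (`Failure`, `fuzz_once`, `fuzz`) are assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


class BoundedBuffer:
    """Hypothetical system under test: holds at most `capacity` bytes."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.data = bytearray()

    def write(self, chunk: bytes) -> None:
        self.data.extend(chunk)
        # Drop the oldest bytes so the buffer never exceeds capacity.
        overflow = len(self.data) - self.capacity
        if overflow > 0:
            del self.data[:overflow]

    def read(self, n: int) -> bytes:
        out = bytes(self.data[:n])
        del self.data[:n]
        return out

    def snapshot(self) -> dict:
        return {"capacity": self.capacity, "data": bytes(self.data)}


@dataclass
class Failure:
    seed: int         # replaying this seed regenerates the same operations
    step: int         # index of the operation that broke the invariant
    operations: list  # operations applied up to and including the failure
    state: dict       # snapshot of system state at the moment of failure


def invariant_holds(buf: BoundedBuffer) -> bool:
    """The codified system invariant: never hold more than capacity."""
    return len(buf.data) <= buf.capacity


def fuzz_once(
    seed: int,
    steps: int = 200,
    factory: Callable[[], BoundedBuffer] = lambda: BoundedBuffer(16),
) -> Optional[Failure]:
    rng = random.Random(seed)  # private seeded RNG: same seed, same run
    buf = factory()
    operations = []
    for step in range(steps):
        if rng.random() < 0.6:
            size = rng.randint(0, 32)
            chunk = bytes(rng.getrandbits(8) for _ in range(size))
            op = ("write", chunk)
            buf.write(chunk)
        else:
            op = ("read", rng.randint(0, 32))
            buf.read(op[1])
        operations.append(op)
        if not invariant_holds(buf):
            return Failure(seed, step, operations, buf.snapshot())
    return None


def fuzz(seeds: range) -> List[Failure]:
    """Run one fuzz pass per seed and collect every captured failure."""
    return [f for seed in seeds if (f := fuzz_once(seed)) is not None]
```

A nightly job could call `fuzz(range(start, start + n))` with a fresh seed range and persist any returned `Failure`. Because each run draws from its own `random.Random(seed)`, passing a recorded seed back to `fuzz_once` replays the identical operation sequence, and the captured `state` shows what the system looked like when the invariant broke.
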
The economics of verification tooling have shifted: the agent makes it cheap to build guardrails and course correction, so the only remaining question is where those guardrails should go.

The New Discipline

Coding agents haven’t changed what good software engineering is, but they have changed where the leverage point is. The developers extracting the most durable value from these tools are the ones who have reinvested implementation gains into verification infrastructure.

The question to ask before every agentic task is not “can the agent build this?” but “how will I verify what was built?”

If you want two immediate, concrete steps to improve your verification tools:

1. After you run through a cycle of research-plan-implement, your agent must go through the implementation using TDD. Here’s a general purpose agent skill to do so: https://noriskillsets.dev/skills/test-driven-development

2. If you want an agent to go through a manual end-to-end test on your project, consider giving it the tools to interact directly. Here’s a skill for the agent to puppet a browser with Playwright: https://noriskillsets.dev/skills/webapp-testing. Here’s a skill for the agent to puppet a TUI with tmux: https://noriskillsets.dev/skills/tui-puppeteering-with-tmux

I am hoping that this will be just the first of three core posts about the new practices in software engineering. To signpost properly where I think this is going, here are the three ideas that are changing how I think about software development:

1. Verification debt (this post).
2. Agent legibility.
3. Compounding correctness.

Lots of interesting things in this post that I agree with. See also my very old post on the point of code review, and Cliff’s presentation at the May Agentics NYC meetup.

Agentics is the study of how to use and reason about agents. If you are an expert in coding agents, or interested in learning more about agents, join our community Slack. More articles here.

Learn more about how Nori can bring your company into the glorious AI future at norisessions.com.