The Dirty Secret Behind Loop Engineering

Loop Engineering, a 2026 AI workflow pattern where developers build systems that prompt agents instead of doing it manually, has a dirty secret: without a proper evaluation component, the loop becomes an infinite runaway process. PostHog deployed it in production, achieving an 11% performance improvement and fixing a three-year-old defect. The key is to define minimal specs and let the agent iterate until the evaluation condition is met.

Everyone is talking about Loop Engineering. Apparently, you don't need to program anymore.

TL;DR: Loop Engineering is the hottest AI workflow pattern of 2026. But it hides a dirty secret.

In June 2026, Addy Osmani (https://addyo.substack.com/p/loop-engineering) and the PostHog team (https://newsletter.posthog.com/p/why-were-bullish-on-loops) published their takes on the same idea. Instead of prompting an AI agent manually, you build the system that prompts the agent for you. The metaprompt idea has a fancy name now: Loop Engineering.

"Loop engineering is replacing yourself as the person who prompts the agent. You design the system that does it instead." (Addy Osmani)

PostHog ran it in production. The result: an 11% performance improvement and a 3-year-old defect (https://dev.to/mcsee/stop-calling-them-bugs-57gl) fixed in the query engine, hands-off.

The internet got excited. Again. Rightly so. But there's a dirty secret hiding behind the vocabulary.

A functional loop has four parts, among them a goal (/goal) and an evaluation. The evaluation component is the key. Without it, you don't have a loop. You have an infinite runaway process. Good luck with your token bill.

You also need a harness (https://dev.to/mcsee/ai-coding-tip-022-give-ai-a-harness-to-work-with-274a): the scaffolding that contains the agent, enforces your rules (https://dev.to/mcsee/object-design-checklist-2p4), and gives the loop a safe boundary to operate within.

Here's where Loop Engineering gets interesting, and where most people get it wrong.

If you write an enormous spec covering all possible cases before the loop runs once (https://dev.to/mcsee/ai-coding-tip-008-use-spec-driven-development-with-ai-1k0f), you aren't doing Loop Engineering.
You're doing waterfall (https://dev.to/mcsee/coupling-the-one-and-only-software-design-problem-2pd7) with extra steps.

Think about what you want to verify. Not the whole system. One behavior.

Let's build a FIFA World Cup 2026 group standings simulator using Loop Engineering. Germany, Ivory Coast, Ecuador, and Curaçao in Group E. Three rounds of matches. The top two advance, and the best third-place teams also qualify.

What's the smallest possible spec? A team that wins a match gets 3 points. That's it. Not the whole group. Not the knockout bracket. One rule about one match.

This is the Spec-Driven approach (https://dev.to/mcsee/ai-coding-tip-008-use-spec-driven-development-with-ai-1k0f): you define intent before implementation, but you keep the scope surgical.

Here's your first loop cycle. You define the evaluation condition before any implementation exists.

```python
def test_win_gives_three_points():
    germany = Team("Germany 🇩🇪")
    curacao = Team("Curaçao 🇨🇼")
    match = Match(germany, curacao, home_goals=7, away_goals=1)
    standings = GroupStandings()
    standings.record(match)
    assert standings.points_for(germany) == 3
    assert standings.points_for(curacao) == 0
```

Run this. It fails. Team doesn't exist. Match doesn't exist. GroupStandings doesn't exist.

Germany beat Curaçao 7-1 on June 14, 2026. The spec matches reality.

The loop condition is red 🔴. This is the signal the loop needs. The evaluation says: not done yet. Keep running until you achieve the /goal.

Now you give the agent the minimal implementation to make the loop exit:

```python
class Team:
    def __init__(self, name):
        self.name = name


class Match:
    def __init__(self, home, away, home_goals, away_goals):
        self.home = home
        self.away = away
        self.home_goals = home_goals
        self.away_goals = away_goals


class GroupStandings:
    def __init__(self):
        # Maps each Team to its accumulated points.
        self._points = {}

    def record(self, match):
        # Only wins are handled so far; draws arrive in a later cycle.
        if match.home_goals > match.away_goals:
            self._points[match.home] = self._points.get(match.home, 0) + 3
            self._points[match.away] = self._points.get(match.away, 0)
        elif match.away_goals > match.home_goals:
            self._points[match.away] = self._points.get(match.away, 0) + 3
            self._points[match.home] = self._points.get(match.home, 0)

    def points_for(self, team):
        return self._points.get(team, 0)
```

Run the spec. Green 🟢. Loop exits. Not because you modeled every rule. Because you satisfied the single condition the loop was checking.

The loop restarts with a new goal:

```python
def test_draw_gives_one_point_each():
    ecuador = Team("Ecuador 🇪🇨")
    curacao = Team("Curaçao 🇨🇼")
    match = Match(ecuador, curacao, home_goals=0, away_goals=0)
    standings = GroupStandings()
    standings.record(match)
    assert standings.points_for(ecuador) == 1
    assert standings.points_for(curacao) == 1
```

Red 🔴. The record method doesn't handle draws.

Ecuador drew 0-0 with Curaçao on June 20. Again, the spec matches reality. The evaluation fails. Loop continues.

Add the draw case. Green 🟢. Loop exits.

Cycle by cycle, the spec expands. Each iteration follows the same pattern: write the evaluation condition first, run it (it fails), implement the minimum to pass, run again (green 🟢), move to the next cycle.

After 7 iterations, Group E final standings:

Group E - Final Standings

1. Germany: 6 pts, GD: +6, GF: 10
2. Ivory Coast: 6 pts, GD: +2, GF: 4
3. Ecuador: 4 pts, GD: 0, GF: 2
4. Curaçao: 1 pt, GD: -8, GF: 1

The same loop discipline applies to bracket generation.

```python
def test_group_winner_faces_different_group_runner_up():
    # completed_group_results, group_e_standings and group_f_standings
    # are assumed to be fixtures defined elsewhere in the suite.
    bracket = KnockoutBracket(completed_group_results)
    round_of_32 = bracket.round_of_32()
    assert round_of_32[0].home == group_e_standings.first_place()
    assert round_of_32[0].away == group_f_standings.second_place()
```

Red 🔴 first. Then green 🟢. Then the next spec.

The loop doesn't know the full bracket before it starts. It discovers the bracket one evaluation at a time.

None of this works without a structure that contains the agent, enforces your rules, and bounds the loop. That is the harness (https://dev.to/mcsee/ai-coding-tip-022-give-ai-a-harness-to-work-with-274a).
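To make the idea concrete, here is a minimal sketch of the smallest possible harness: an evaluation, an agent step, and a hard iteration budget. Everything in it is an assumption made for illustration. `run_loop`, `LoopResult`, and `max_iterations` are hypothetical names, not an API from Claude Code, Codex, or PostHog, and the "agent" is a stand-in function rather than a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    passed: bool
    iterations: int


def run_loop(
    evaluate: Callable[[], bool],
    agent_step: Callable[[int], None],
    max_iterations: int = 10,
) -> LoopResult:
    """Call agent_step until evaluate() passes or the budget runs out."""
    for iteration in range(max_iterations):
        if evaluate():
            return LoopResult(passed=True, iterations=iteration)
        agent_step(iteration)
    # Budget exhausted: report the final state instead of looping forever.
    return LoopResult(passed=evaluate(), iterations=max_iterations)


# Hypothetical stand-in for a real agent: each step "fixes" one failing check.
failing_checks = ["win gives 3 points", "draw gives 1 point each"]


def fake_evaluate() -> bool:
    return not failing_checks


def fake_agent_step(iteration: int) -> None:
    failing_checks.pop(0)


result = run_loop(fake_evaluate, fake_agent_step, max_iterations=5)
```

In a real setup, `evaluate` would typically run the test suite and `agent_step` would invoke the agent. The budget is what keeps a never-green evaluation from turning into a runaway process.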
The harness is what separates Loop Engineering from running Claude in a while True loop and hoping for the best.

Codex and Claude Code now ship with built-in loop infrastructure: /goal (https://code.claude.com/docs/en/goal), /loop, isolation: worktree, and sub-agents for separate verification. The harness is no longer something you build from scratch.

The agent that verifies runs in a clean sub-agent with no memory (https://dev.to/mcsee/ai-coding-tip-005-keep-context-fresh-220e) of what the implementer did. It is an independent inspector seeking Judgment Day moments. It can't grade its own work because it never saw the work being done. This is the same reason you don't ask a developer to review their own pull request.

Why is this getting attention now and not five years ago? Because the evaluation step (the part where the loop decides whether to continue) used to require a human. Now it doesn't. When models were weaker, the loop needed you to interpret the evaluation output. Now the evaluation can be the test suite itself, and the agent reads it directly.

You've been reading about Test-Driven Development (https://www.youtube.com/watch?v=Xahv9nMegXA).

The spec is the test. The evaluation is the test runner. The loop is the red 🔴-green 🟢-refactor 🔵 cycle. The goal is the failing assertion. "Loop exits when evaluation passes" means the test is green 🟢.

Kent Beck described this in 2003 (https://en.wikipedia.org/wiki/Test-driven_development). Ward Cunningham was doing it before that.

There's even a structured guide for choosing which goal to tackle next: the ZOMBIES framework (https://dev.to/mcsee/how-i-survived-the-zombie-apocalypse-59gj). Zero, One, Many, Boundary, Interface, Exceptional, Simple. That is your loop iteration order.

What changed isn't the technique. What changed is who runs the loop.

In 2003, the human developer wrote the test, ran it, read the red 🔴 output, wrote the minimum code, ran it again, saw green 🟢, and moved to the next test. That was the loop.
In 2026, the functional developer writes the spec, the agent runs the cycle, reads the red 🔴 output, writes the minimum code, runs the cycle again, sees green 🟢, and starts the next spec. That's still the loop.

The red 🔴-green 🟢-refactor 🔵 vocabulary wasn't memorable enough for 2026. So the industry renamed it. The evaluation is still the test. The cycle is still TDD. The discipline is exactly the same.

"Build the loop. But build it like someone who intends to stay the engineer, not just the person who presses go." (Addy Osmani)

Kent Beck said the same thing. He just called it something else.

A few extra tips:

Loop Engineering isn't only for greenfield code or fancy MVPs. It's also how you safely modernize systems that have no tests at all.

The trick is the same: write the spec first. On a legacy system, that spec describes behavior the system already has. You're not inventing new rules. You're pinning existing ones so the loop can't break them. Harnesses are even more critical on production legacy systems.

The loop then shrinks the untested surface one cycle at a time. Each green 🟢 spec is a behavior the agent can't accidentally destroy in the next iteration.

Squeezing TDD onto legacy systems (https://dev.to/mcsee/how-to-squeeze-test-driven-development-on-legacy-systems-8m9) works the same way whether a human runs the cycle or an agent does. The discipline is identical. What changes is the speed.

What are you waiting for? Build your harnesses. Start your loops.
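As a starting point for the legacy case, a pinning spec might look like the sketch below. `legacy_points` is a hypothetical stand-in for existing, untested production code, and the expected values are assumed to be recorded from what that code returns today, not taken from a requirements document.

```python
def legacy_points(home_goals, away_goals):
    # Hypothetical stand-in for existing, untested production code.
    if home_goals > away_goals:
        return (3, 0)
    if home_goals < away_goals:
        return (0, 3)
    return (1, 1)


# Observed behavior, captured by running the legacy code once.
PINNED_CASES = {
    (7, 1): (3, 0),
    (0, 0): (1, 1),
    (0, 2): (0, 3),
}


def test_legacy_points_behavior_is_pinned():
    # Any change that alters a pinned result turns the loop red.
    for (home_goals, away_goals), expected in PINNED_CASES.items():
        assert legacy_points(home_goals, away_goals) == expected
```

Once a spec like this is green, the agent can refactor `legacy_points` freely: any change that alters a pinned result fails the evaluation and keeps the loop from exiting.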