The Dirty Secret Behind Loop Engineering

wpnews.pro

Everyone is talking about Loop Engineering. Apparently, you don't need to program anymore.

TL;DR: Loop Engineering is the hottest AI workflow pattern of 2026. But it hides a dirty secret.


In June 2026, Addy Osmani and the PostHog team published their takes on the same idea.

Instead of prompting an AI agent manually, you build the system that prompts the agent for you.

The metaprompt idea has a fancy name now. Loop Engineering.

Loop engineering is replacing yourself as the person who prompts the agent. You design the system that does it instead.

Addy Osmani

PostHog ran it in production. The result: an 11% performance improvement and a 3-year-old defect fixed in the query engine, hands-off.

The internet got excited. Again. Rightly so.

But there's a dirty secret hiding behind the vocabulary.

A functional loop has four parts:

/goal

The evaluation component is the key. Without it, you don't have a loop. You have an infinite runaway process. Good luck with your token bill!

You also need a harness: the scaffolding that contains the agent, enforces your rules, and gives the loop a safe boundary to operate within.
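A minimal sketch of that idea: the loop exits only when an evaluation passes, and an iteration budget acts as the crudest possible harness boundary. The names run_loop, fake_agent, and is_goal_met are hypothetical stand-ins for illustration, not part of any real agent tool.

```python
from typing import Callable


def run_loop(
    attempt: Callable[[int], str],
    evaluate: Callable[[str], bool],
    max_iterations: int = 5,
) -> tuple[bool, int]:
    """Call attempt() until evaluate() accepts its output or the budget runs out."""
    for iteration in range(1, max_iterations + 1):
        candidate = attempt(iteration)  # stand-in for one agent turn
        if evaluate(candidate):         # the evaluation decides when to stop
            return True, iteration
    return False, max_iterations        # budget exhausted: stop instead of running forever


# Toy stand-ins (assumptions): the "agent" only gets it right on its third try.
def fake_agent(iteration: int) -> str:
    return "3 points" if iteration >= 3 else "0 points"


def is_goal_met(candidate: str) -> bool:
    return candidate == "3 points"


done, used = run_loop(fake_agent, is_goal_met)
```

Without the evaluate call the loop has no exit condition. Without max_iterations, a goal the agent can never reach would keep burning tokens.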

Here's where Loop Engineering gets interesting, and where most people get it wrong.

If you write an enormous spec covering all possible cases before the loop runs once, you aren't doing Loop Engineering. You're doing waterfall with extra steps.

Think about what you want to verify. Not the whole system. One behavior.

Let's build a FIFA World Cup 2026 group standings simulator using Loop Engineering.

Germany, Ivory Coast, Ecuador, and Curaçao in Group E. Three rounds of matches. The top two advance, and the eight best third-place teams across the twelve groups also qualify.

What's the smallest possible spec?

A team that wins a match gets 3 points.

That's it. Not the whole group. Not the knockout bracket. One rule about one match.

This is the Spec-Driven approach: you define intent before implementation, but you keep the scope surgical.

Here's your first loop cycle. You define the evaluation condition before any implementation exists.

def test_win_gives_three_points():
    germany = Team("Germany 🇩🇪")
    curacao = Team("Curaçao 🇨🇼")

    match = Match(germany, curacao, home_goals=7, away_goals=1)
    standings = GroupStandings()
    standings.record(match)

    assert standings.points_for(germany) == 3
    assert standings.points_for(curacao) == 0

Run this. It fails. Team doesn't exist. Match doesn't exist. GroupStandings doesn't exist.

(Germany beat Curaçao 7-1 on June 14, 2026. The spec matches reality.)

The loop condition is red 🔴.

This is the signal the loop needs. The evaluation says: not done yet. Keep running until you achieve the /goal.

Now you give the agent the minimal implementation to make the loop exit:

class Team:
    def __init__(self, name):
        self.name = name

class Match:
    def __init__(self, home, away, home_goals, away_goals):
        self.home = home
        self.away = away
        self.home_goals = home_goals
        self.away_goals = away_goals

class GroupStandings:
    def __init__(self):
        self._points = {}

    def record(self, match):
        if match.home_goals > match.away_goals:
            self._points[match.home] = self._points.get(match.home, 0) + 3
            self._points[match.away] = self._points.get(match.away, 0)
        elif match.away_goals > match.home_goals:
            self._points[match.away] = self._points.get(match.away, 0) + 3
            self._points[match.home] = self._points.get(match.home, 0)

    def points_for(self, team):
        return self._points.get(team, 0)

Run the spec. Green 🟢. Loop exits.

Not because you modeled every rule. Because you satisfied the single condition the loop was checking.

The loop restarts with a new goal:

def test_draw_gives_one_point_each():
    ecuador = Team("Ecuador 🇪🇨")
    curacao = Team("Curaçao 🇨🇼")

    match = Match(ecuador, curacao, home_goals=0, away_goals=0)
    standings = GroupStandings()
    standings.record(match)

    assert standings.points_for(ecuador) == 1
    assert standings.points_for(curacao) == 1

Red 🔴. The record method doesn't handle draws.

(Ecuador drew 0-0 with Curaçao on June 20. Again, the spec matches reality.)

The evaluation fails. Loop continues. Add the draw case. Green 🟢. Loop exits.
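One way the draw case might look, as a sketch rather than the article's own implementation. The _add helper is a hypothetical convenience that is not in the original code. Everything else mirrors the classes from the first cycle.

```python
class Team:
    def __init__(self, name):
        self.name = name


class Match:
    def __init__(self, home, away, home_goals, away_goals):
        self.home = home
        self.away = away
        self.home_goals = home_goals
        self.away_goals = away_goals


class GroupStandings:
    def __init__(self):
        self._points = {}

    def _add(self, team, points):
        # Hypothetical helper: accumulate points for a team.
        self._points[team] = self._points.get(team, 0) + points

    def record(self, match):
        if match.home_goals > match.away_goals:
            self._add(match.home, 3)
            self._add(match.away, 0)
        elif match.away_goals > match.home_goals:
            self._add(match.away, 3)
            self._add(match.home, 0)
        else:
            # New in this cycle: a draw gives one point to each side.
            self._add(match.home, 1)
            self._add(match.away, 1)

    def points_for(self, team):
        return self._points.get(team, 0)
```

The win branches are untouched, so the first spec stays green while the second one turns green.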

Cycle by cycle, the spec expands:

Each iteration follows the same pattern: write the evaluation condition first, run it (it fails), implement the minimum to pass, run again (green 🟢), move to the next cycle.

After 7 iterations, Group E final standings:

Group E - Final Standings
1. Germany       6 pts  GD: +6  GF: 10
2. Ivory Coast   6 pts  GD: +2  GF: 4
3. Ecuador       4 pts  GD:  0  GF: 2
4. Curaçao       1 pt   GD: -8  GF: 1
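A sketch of how the final ordering could be derived from the numbers in that table. The tie-break order used here (points, then goal difference, then goals scored) is an assumption for illustration, and the official tournament rules may order the criteria differently. Row and rank are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    team: str
    points: int
    goal_difference: int
    goals_for: int


def rank(rows):
    # Assumed tie-break order: points, then goal difference, then goals scored.
    return sorted(
        rows,
        key=lambda r: (r.points, r.goal_difference, r.goals_for),
        reverse=True,
    )


# Values copied from the standings table above, deliberately out of order.
rows = [
    Row("Curaçao", 1, -8, 1),
    Row("Ecuador", 4, 0, 2),
    Row("Ivory Coast", 6, 2, 4),
    Row("Germany", 6, 6, 10),
]
ranked = [row.team for row in rank(rows)]
```

Germany and Ivory Coast are level on 6 points, so this is the first place where a tie-break spec has to exist before the loop can go green.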

The same loop discipline applies to bracket generation.

def test_group_winner_faces_different_group_runner_up():
    bracket = KnockoutBracket(completed_group_results)

    round_of_32 = bracket.round_of_32()

    assert round_of_32[0].home == group_e_standings.first_place()
    assert round_of_32[0].away == group_f_standings.second_place()

Red 🔴 first. Then green 🟢. Then the next spec.

The loop doesn't know the full bracket before it starts. It discovers the bracket one evaluation at a time.

None of this works without a structure that:

That is the harness. The harness is what separates Loop Engineering from running Claude in a while True loop and hoping for the best.

Codex and Claude Code now ship with built-in loop infrastructure: /goal, /loop, isolation: worktree, and sub-agents for separate verification. The harness is no longer something you build from scratch.

The agent that verifies runs in a clean sub-agent with no memory of what the implementer did.

It is an independent inspector seeking Judgment Day moments.

It can't grade its own work because it never saw the work being done.

This is the same reason you don't ask a developer to review their own pull request.
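As a loose analogy in plain Python, and not a description of how any agent tool implements sub-agents, a verifier can run the spec in a brand-new interpreter that shares no state with whatever produced the code. The function name verify_in_fresh_process and the toy implementation and spec strings are assumptions for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def verify_in_fresh_process(implementation: str, spec: str, timeout: float = 30.0) -> bool:
    """Run spec against implementation in a new interpreter; True if it exits with 0."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "check.py"
        script.write_text(implementation + "\n\n" + spec + "\n", encoding="utf-8")
        result = subprocess.run(
            [sys.executable, str(script)],  # fresh process: no shared memory or globals
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    return result.returncode == 0


# Toy inputs (assumptions): a one-function implementation and a one-line spec.
implementation = "def points_for_win():\n    return 3\n"
spec = "assert points_for_win() == 3\n"
passed = verify_in_fresh_process(implementation, spec)
```

The verifier sees only the artifact and the spec. A failing assertion makes the child process exit non-zero, which is the only signal the caller gets.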

Why is this getting attention now and not five years ago?

Because the evaluation step (the part where the loop decides whether to continue) used to require a human. Now it doesn't.

When models were weaker, the loop needed you to interpret the evaluation output. Now the evaluation can be the test suite itself, and the agent reads it directly.

You've been reading about Test-Driven Development.

The spec is the test.

The evaluation is the test runner.

The loop is the red 🔴-green 🟢-refactor 🔵 cycle.

The goal is the failing assertion.

"The loop exits when the evaluation passes" means the test is green 🟢.

Kent Beck described this in 2003.

Ward Cunningham was doing it before that.

There's even a structured guide for choosing which goal to tackle next: the ZOMBIES framework. Zero, One, Many, Boundary, Interface, Exceptional, Simple. That is your loop iteration order.
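A sketch of the first three letters applied to the standings example. The simplified GroupStandings below, with a hypothetical record_win method, is a stand-in that only tracks wins. It is not the class built earlier.

```python
class GroupStandings:
    """Simplified stand-in (assumption): only records wins by team name."""

    def __init__(self):
        self._points = {}

    def record_win(self, winner):
        self._points[winner] = self._points.get(winner, 0) + 3

    def points_for(self, team):
        return self._points.get(team, 0)


def test_zero_matches_means_zero_points():
    assert GroupStandings().points_for("Germany") == 0


def test_one_win_gives_three_points():
    standings = GroupStandings()
    standings.record_win("Germany")
    assert standings.points_for("Germany") == 3


def test_many_wins_accumulate():
    standings = GroupStandings()
    standings.record_win("Germany")
    standings.record_win("Germany")
    assert standings.points_for("Germany") == 6


# Zero, then One, then Many: each entry is one loop cycle's goal.
ZOMBIES_ORDER = [
    test_zero_matches_means_zero_points,
    test_one_win_gives_three_points,
    test_many_wins_accumulate,
]
for spec in ZOMBIES_ORDER:
    spec()
```

Boundary, Interface, and Exceptional cases would follow the same way, one spec per cycle.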

What changed isn't the technique. What changed is who runs the loop.

In 2003, the human developer wrote the test, ran it, read the red 🔴 output, wrote the minimum code, ran it again, saw green 🟢, and moved to the next test.

That was the loop.

In 2026, the functional developer writes the spec, the agent runs the cycle, reads the red 🔴 output, writes the minimum code, runs the cycle again, sees green 🟢, and starts the next spec. That's still the loop.

The red 🔴-green 🟢-refactor 🔵 vocabulary wasn't memorable enough for 2026. So the industry renamed it.

The evaluation is still the test. The cycle is still TDD. The discipline is exactly the same.

Build the loop. But build it like someone who intends to stay the engineer, not just the person who presses go.

Addy Osmani

Kent Beck said the same thing. He just called it something else.

A few extra tips:

Loop Engineering isn't only for greenfield code or fancy MVPs. It's also how you safely modernize systems that have no tests at all.

The trick is the same: write the spec first.

On a legacy system, that spec describes behavior the system already has.

You're not inventing new rules. You're pinning existing ones so the loop can't break them.
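Pinning can be sketched as a characterization test: record what the code does today, then require every later version to reproduce it. The legacy_points function here is a hypothetical stand-in for untested legacy code, and the chosen scorelines are arbitrary.

```python
def legacy_points(home_goals, away_goals):
    """Hypothetical legacy function with no tests; stands in for existing code."""
    if home_goals > away_goals:
        return (3, 0)
    if home_goals < away_goals:
        return (0, 3)
    return (1, 1)


# Step 1: record what the system does today, without judging whether it is right.
SCORELINES = [(7, 1), (0, 0), (1, 2)]
PINNED = {score: legacy_points(*score) for score in SCORELINES}


# Step 2: any rewritten version must reproduce the pinned behavior exactly.
def behavior_is_unchanged(candidate):
    return all(candidate(*score) == expected for score, expected in PINNED.items())
```

Each pinned scoreline is a spec the loop must keep green while it changes the code underneath.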

Harnesses are even more critical on production legacy systems.

The loop then shrinks the untested surface one cycle at a time.

Each green 🟢 spec is a behavior the agent can't accidentally destroy in the next iteration.

Squeezing TDD onto legacy systems works the same way whether a human runs the cycle or an agent does. The discipline is identical. What changes is the speed.

What are you waiting for? Build your harnesses. Start your loops.

source & further reading

dev.to (original article)
