"An AI agent just built a production landing page, with GDPR audit logs and encryption baked in. I wasn't even at my desk."
That is not a lucky one-shot. It is a repeatable workflow. Piotr Karwatka recorded a full tutorial showing how to go from idea to a production-ready app on Open Mercato - the AI-Engineering Foundation Framework for CRM/ERP - with no babysitting and no ping-pong prompting.
This is the technical version: what the loop actually looks like, why it doesn't fall apart, and which patterns you can lift into your own stack.
The default AI coding loop is single-threaded and human-bound:
prompt -> generate -> you spot a bug -> correct -> re-prompt -> repeat
It holds for snippets. It collapses the moment the task touches real architecture - multi-tenancy, RBAC, event flow, encryption, audit logging. Corrections pile up in the context window, the agent loses the thread, and you are back to typing. You are the bottleneck, sitting in the inner loop.
The workflow in the tutorial moves you to the outer loop: you review a finished, tested PR instead of every keystroke.
goal -> agent: branch + implement + test + open PR -> you: review PR
The reason this is even possible on Open Mercato is that the hard architectural decisions are already encoded as conventions, specs and agent-readable skills (AGENTS.md, task routing, spec skills). The agent is not inventing how RBAC or GDPR logging should work - it reads the foundation and follows it.
The execution agent owns the full unit of work:
1. git checkout -b feat/lead-capture-landing
2. implement against framework conventions
3. run the test suite (Playwright integration tests included)
4. open a structured PR: what changed, why, how it was verified
You are no longer correcting tokens. The deliverable is a reviewable artifact. In the tutorial the output is concrete: a live site capturing leads straight into the Open Mercato CRM, with GDPR audit logs and encryption on by default - not bolted on after a compliance pass.
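The "structured PR" in step 4 can be pictured as a small data structure the agent must fill in completely before handing off. This is a minimal sketch, not Open Mercato's actual PR format: the class name `PullRequestSummary`, its fields, and the `is_reviewable` rule are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequestSummary:
    """Hypothetical container for the three things the PR must state."""

    branch: str
    what_changed: list[str]
    why: str
    verification: list[str] = field(default_factory=list)

    def is_reviewable(self) -> bool:
        # Assumption: a PR that does not say how it was verified
        # is not a finished unit of work and should not reach the human.
        return bool(self.what_changed and self.why.strip() and self.verification)

    def to_text(self) -> str:
        # Render the summary as the plain-text PR description.
        lines = [f"Branch: {self.branch}", "", "What changed:"]
        lines += [f"- {item}" for item in self.what_changed]
        lines += ["", "Why:", self.why, "", "How it was verified:"]
        lines += [f"- {item}" for item in self.verification]
        return "\n".join(lines)


pr = PullRequestSummary(
    branch="feat/lead-capture-landing",
    what_changed=["landing page with lead form"],
    why="capture leads into the CRM",
    verification=["integration test suite passed"],
)
print(pr.to_text())
```

The point of the check is that "reviewable" becomes a property you can test for, rather than something you discover by reading the diff.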
This is the part most people get wrong. One agent is trivial. N agents in parallel usually means file collisions and a corrupted main branch.
The fix is isolation by design - each agent on its own branch/worktree, never writing to main directly:
main
|-- agent-a -> feat/landing-page (worktree A)
|-- agent-b -> feat/crm-webhook (worktree B)
+-- agent-c -> feat/consent-logging (worktree C)
Parallelism is only useful if it is safe. Safety here is structural (separate branches/worktrees), not "hope the agents stay out of each other's way." This is what turns autonomous coding from a single-threaded demo into something that scales like a team.
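One way to make that isolation structural is to generate the worktree setup from the task list and refuse any plan where two agents share a branch or one targets main. The sketch below only builds `git worktree add -b <branch> <path> <start-point>` commands and does not run them; the function name, the `../worktrees` directory and the agent names are assumptions for illustration, not something the tutorial prescribes.

```python
from pathlib import PurePosixPath


def worktree_commands(
    tasks: dict[str, str],
    root: str = "../worktrees",  # hypothetical directory for agent worktrees
    base: str = "main",
) -> list[list[str]]:
    """Build one `git worktree add` command per agent.

    `tasks` maps an agent name to the feature branch it owns.
    """
    branches = list(tasks.values())
    if len(set(branches)) != len(branches):
        raise ValueError("each agent needs its own branch")
    if base in branches:
        raise ValueError(f"agents must not work directly on {base}")

    commands = []
    for agent, branch in tasks.items():
        path = str(PurePosixPath(root) / agent)
        # -b creates the new branch, starting from `base`, in its own directory.
        commands.append(["git", "worktree", "add", "-b", branch, path, base])
    return commands


plan = {
    "agent-a": "feat/landing-page",
    "agent-b": "feat/crm-webhook",
    "agent-c": "feat/consent-logging",
}
for cmd in worktree_commands(plan):
    print(" ".join(cmd))
```

Each resulting list can be passed to `subprocess.run` as-is. Because every agent gets a separate working directory, two agents never edit the same checked-out files, and main only changes through reviewed PRs.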
The highest-leverage step happens before any code is written. Autonomous output is only as good as the spec, so the workflow generates the spec in two passes.
Phase 1 - architecture-compliant draft. A spec-writing skill produces a spec that already respects framework conventions instead of fighting them.
spec-skill -> SPEC.md (modules, data model, routes, events, RBAC scope)
Phase 2 - adversarial / "philosophical" review. A second pass deliberately hunts for hidden gaps the first draft missed before a line of code is committed.
review pass -> checks: routing, caching, edge cases, failure modes, consent flow
Model pairing matters here: Claude and Codex are used across the phases so the spec is both convention-compliant and stress-tested. The cost of a wrong assumption is highest at the start, so that is where the scrutiny goes. By the time code is written, the thinking is done.
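The two passes can be sketched as a gate in front of implementation: draft first, then review the draft against each check, and only proceed when no gaps remain. This is an illustrative outline, not the tutorial's tooling. The `draft` and `review` callables stand in for whichever model or skill runs each phase, and the function names are hypothetical; the check list is taken from the review pass above.

```python
from typing import Callable

# The checks named for the review pass.
REVIEW_CHECKS = ["routing", "caching", "edge cases", "failure modes", "consent flow"]


def two_pass_spec(
    goal: str,
    draft: Callable[[str], str],
    review: Callable[[str, str], list[str]],
) -> tuple[str, dict[str, list[str]]]:
    """Run phase 1 (draft) and phase 2 (adversarial review).

    `draft(goal)` returns the spec text.
    `review(spec, check)` returns the gaps found for one check (empty if none).
    """
    spec = draft(goal)
    findings = {check: review(spec, check) for check in REVIEW_CHECKS}
    # Keep only the checks that actually surfaced a gap.
    open_gaps = {check: gaps for check, gaps in findings.items() if gaps}
    return spec, open_gaps


def ready_to_implement(open_gaps: dict[str, list[str]]) -> bool:
    # Assumption: no code is written while any review gap is still open.
    return not open_gaps


# Stub callables so the sketch runs without any model behind it.
spec, gaps = two_pass_spec(
    "lead capture landing page",
    draft=lambda goal: f"SPEC for {goal}",
    review=lambda spec, check: ["consent withdrawal not specified"]
    if check == "consent flow"
    else [],
)
print(ready_to_implement(gaps), gaps)
```

Keeping the reviewer as a separate callable is what lets a different model run the second pass than the one that wrote the draft.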
Agents run autonomously for hours, which exposes the real enemy of long agent sessions: context burnout. A single agent grinding a long task fills its window with history and loses coherence.
The fix is hierarchical orchestration:
          +---------------------+
          |  Coordinator agent  |   holds the plan, delegates, keeps context lean
          +----------+----------+
      +--------------+--------------+
      v              v              v
exec agent A   exec agent B   exec agent C
 (fresh ctx)    (fresh ctx)    (fresh ctx)
The coordinator owns the map; the workers own the tasks and run with fresh, scoped context. That separation is what makes unsupervised multi-hour runs possible without the quality collapse that usually follows.
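A rough sketch of that split, under stated assumptions: the coordinator keeps only the plan and a short summary per finished task, and each worker call receives a newly built context rather than the accumulated history. The `Coordinator` class, the `worker` callable and the 200-character summary cap are hypothetical choices for illustration; the tutorial does not specify an implementation.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Coordinator:
    """Holds the plan and short summaries, never full worker transcripts."""

    plan: list[str]
    max_summary_chars: int = 200  # assumed cap to keep coordinator context lean
    summaries: dict[str, str] = field(default_factory=dict)

    def run(self, worker: Callable[[dict], str]) -> dict[str, str]:
        for task in self.plan:
            # Fresh, scoped context: this task plus a copy of prior summaries.
            # Copying means a worker cannot mutate the coordinator's state.
            context = {"task": task, "done": dict(self.summaries)}
            result = worker(context)
            # Store only a truncated summary of what the worker reported.
            self.summaries[task] = result[: self.max_summary_chars]
        return self.summaries


def demo_worker(context: dict) -> str:
    # Stand-in for an execution agent; a real one would implement and test.
    return f"finished {context['task']} after {len(context['done'])} prior tasks"


coordinator = Coordinator(plan=["landing page", "crm webhook", "consent logging"])
for task, summary in coordinator.run(demo_worker).items():
    print(task, "->", summary)
```

This version runs workers one after another for clarity; the same shape applies when they run in parallel, as long as each one starts from its own scoped context.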
Strip away the demo and three engineering principles remain:
The detail that is easy to skip: compliance was not a phase, it was a property of the foundation. For anyone shipping CRM/ERP in regulated markets, that is the whole game.
What is the longest you have ever let an AI agent run unsupervised? Drop it in the comments.