Fire-and-forget AI engineering: letting agents ship a production app unsupervised

An AI agent autonomously built a production landing page with GDPR audit logs and encryption, requiring no human supervision. Developer Piotr Karwatka demonstrated a repeatable workflow on Open Mercato, an AI-Engineering Foundation Framework for CRM/ERP, where agents implement features on isolated branches and open structured PRs for review. The workflow uses hierarchical task decomposition and adversarial spec reviews to ensure architecture compliance and avoid context burnout.

"An AI agent just built a production landing page, with GDPR audit logs and encryption baked in. I wasn't even at my desk."

That is not a lucky one-shot. It is a repeatable workflow. Piotr Karwatka (https://www.linkedin.com/in/piotrkarwatka/) recorded a full tutorial showing how to go from idea to a production-ready app on Open Mercato (https://github.com/open-mercato/open-mercato) - the AI-Engineering Foundation Framework for CRM/ERP - with no babysitting and no ping-pong prompting.

This is the technical version: what the loop actually looks like, why it doesn't fall apart, and which patterns you can lift into your own stack.

The default AI coding loop is single-threaded and human-bound:

    prompt -> generate -> you spot a bug -> correct -> re-prompt -> repeat

It holds for snippets. It collapses the moment the task touches real architecture - multi-tenancy, RBAC, event flow, encryption, audit logging. Corrections pile up in the context window, the agent loses the thread, and you are back to typing. You are the bottleneck, sitting in the inner loop.

The workflow in the tutorial moves you to the outer loop: you review a finished, tested PR instead of every keystroke.

    goal -> agent: branch + implement + test + open PR -> you: review PR

This is only possible on Open Mercato because the hard architectural decisions are already encoded as conventions, specs and agent-readable skills (AGENTS.md, task routing, spec skills). The agent is not inventing how RBAC or GDPR logging should work - it reads the foundation and follows it.

The execution agent owns the full unit of work:

1. git checkout -b feat/lead-capture-landing
2. implement against framework conventions
3. run the test suite (Playwright integration tests included)
4. open a structured PR: what changed, why, how it was verified

You are no longer correcting tokens. The deliverable is a reviewable artifact.
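To make "structured PR" concrete, here is a minimal Python sketch of the three sections named in step 4. It is not Open Mercato's actual PR template: the `UnitOfWork` class, its field names, and the Markdown headings are all hypothetical, chosen only to show that a PR with a missing section should never reach the reviewer.

```python
from dataclasses import dataclass, field


@dataclass
class UnitOfWork:
    """One agent task, from branch to PR. All names here are hypothetical."""
    branch: str
    what_changed: list[str] = field(default_factory=list)
    why: str = ""
    verified_by: list[str] = field(default_factory=list)

    def is_reviewable(self) -> bool:
        # Every section must be filled in, and the work must not
        # have been done directly on main.
        return (
            self.branch != "main"
            and bool(self.what_changed)
            and bool(self.why.strip())
            and bool(self.verified_by)
        )

    def pr_body(self) -> str:
        # Refuse to render a PR description that a reviewer cannot act on.
        if not self.is_reviewable():
            raise ValueError("incomplete unit of work; do not open a PR")
        lines = ["## What changed"]
        lines += [f"- {item}" for item in self.what_changed]
        lines += ["", "## Why", self.why, "", "## How it was verified"]
        lines += [f"- {item}" for item in self.verified_by]
        return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values for illustration only, not taken from the tutorial.
    task = UnitOfWork(
        branch="feat/lead-capture-landing",
        what_changed=["<summary of the change>"],
        why="<motivation for the change>",
        verified_by=["<test suite that was run>"],
    )
    print(task.pr_body())
```

The design choice worth copying is the gate in `pr_body`: the agent either produces all three sections or produces nothing, so the reviewer never has to ask what was tested.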
In the tutorial the output is concrete: a live site capturing leads straight into the Open Mercato CRM, with GDPR audit logs and encryption on by default - not bolted on after a compliance pass.

This is the part most people get wrong. One agent is trivial. N agents in parallel usually means file collisions and a corrupted main branch. The fix is isolation by design - each agent on its own branch/worktree, never writing to main directly:

    main
    |-- agent-a -> feat/landing-page     (worktree A)
    |-- agent-b -> feat/crm-webhook      (worktree B)
    +-- agent-c -> feat/consent-logging  (worktree C)

Parallelism is only useful if it is safe. Safety here is structural (separate branches/worktrees), not "hope the agents stay out of each other's way." This is what turns autonomous coding from a single-threaded demo into something that scales like a team.

The highest-leverage step happens before any code is written. Autonomous output is only as good as the spec, so the workflow generates the spec in two passes.

Phase 1 - architecture-compliant draft. A spec-writing skill produces a spec that already respects framework conventions instead of fighting them.

    spec-skill -> SPEC.md (modules, data model, routes, events, RBAC scope)

Phase 2 - adversarial / "philosophical" review. A second pass deliberately hunts for hidden gaps the first draft missed before a line of code is committed.

    review pass -> checks: routing, caching, edge cases, failure modes, consent flow

Model pairing matters here: Claude and Codex are used across the phases so the spec is both convention-compliant and stress-tested. The cost of a wrong assumption is highest at the start, so that is where the scrutiny goes. By the time code is written, the thinking is done.

Agents run autonomously for hours, which exposes the real enemy of long agent sessions: context burnout. A single agent grinding a long task fills its window with history and loses coherence.
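A toy model makes the failure mode easier to reason about. The Python sketch below is an illustration, not a measurement: the `Session` class, the window size, and the per-step token cost are all made-up assumptions. It only shows that when every step's output stays in one session's history, the number of steps that fit is bounded by the window.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Toy model of one long-running agent session (hypothetical)."""
    window: int  # assumed context window size, in tokens
    history: list[int] = field(default_factory=list)

    def run_step(self, tokens: int) -> None:
        # Every prompt, diff, and correction stays in the history.
        self.history.append(tokens)

    @property
    def used(self) -> int:
        return sum(self.history)

    @property
    def overflowed(self) -> bool:
        return self.used > self.window


def steps_until_overflow(window: int, tokens_per_step: int) -> int:
    """Return how many steps fit before the history exceeds the window."""
    if tokens_per_step <= 0:
        raise ValueError("tokens_per_step must be positive")
    session = Session(window=window)
    steps = 0
    while True:
        session.run_step(tokens_per_step)
        if session.overflowed:
            return steps
        steps += 1


if __name__ == "__main__":
    # Illustrative numbers only: a 200k-token window, 8k tokens per step.
    print(steps_until_overflow(window=200_000, tokens_per_step=8_000))
```

With these assumed numbers the session fits 25 steps. Real sessions degrade before the hard limit, so treat the result as an upper bound on useful work per session rather than a target.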
The fix is hierarchical orchestration:

              +---------------------+
              |  Coordinator agent  |  holds the plan, delegates, keeps context lean
              +----------+----------+
          +--------------+--------------+
          v              v              v
    exec agent A   exec agent B   exec agent C
      fresh ctx      fresh ctx      fresh ctx

The coordinator owns the map; the workers own the tasks and run with fresh, scoped context. That separation is what makes unsupervised multi-hour runs possible without the quality collapse that usually follows.

Strip away the demo and three engineering principles remain: isolate parallel agents structurally on their own branches/worktrees, spend the scrutiny on the spec before any code is written, and keep context lean by splitting coordination from execution.

The detail that is easy to skip: compliance was not a phase, it was a property of the foundation. For anyone shipping CRM/ERP in regulated markets, that is the whole game.

What is the longest you have ever let an AI agent run unsupervised? Drop it in the comments.