How I Shipped 2,500+ Commits With AI Agents Using a 12-Phase Workflow

A developer has shipped over 2,500 commits across multiple projects using a refined 12-phase workflow that forces AI agents to document intent before writing code. The workflow, which combines a coding agent discipline framework with human-in-the-loop gatekeeping, has been used to build two public projects, free-context-hub and lore-weave, almost entirely through AI agents, with the developer acting as gatekeeper to approve specs and review diffs.

A practitioner's account: not a tutorial, not a sales pitch.

Quick screen: if you're writing throwaway scripts or solo prototypes, this workflow is overkill. Skip to the Cons and Who This Is For sections first.

I've been using a 12-phase workflow I've refined over time across [free-context-hub](https://github.com/letuhao/free-context-hub), [lore-weave](https://github.com/letuhao/lore-weave), and a handful of private internal systems. Both public projects are built almost entirely by AI agents, with me acting as the gatekeeper: approving specs, reviewing diffs, unblocking decisions. Across all of them, the workflow has accumulated 2,500+ commits and a trail of written specs and audit logs I can still query months after the sessions that produced them.

free-context-hub is a self-hosted persistent memory and semantic search layer for AI agents: MCP server, REST API, RAG pipelines, and a full Next.js review UI. 15 development phases delivered end-to-end.

lore-weave is a cloud-hosted multi-agent platform for multilingual novel workflows: translation, knowledge graph construction, glossary management, and AI-assisted writing. 19 microservices across Go, Python, and TypeScript.

I'm sharing the workflow because it's worked better than anything else I've tried, and because the honest trade-offs are worth knowing before you adopt it.
The files are in the repository: WORKFLOW.md, CLAUDE.md.snippet, and AMAW.md.

AI coding assistants are very good at generating plausible-looking code. They're much worse at:

The standard advice is "just review the diff." But reviewing a diff without having tracked the intent of the change is almost useless: you're comparing code to code, not code to requirements. The 12-phase workflow forces intent to be written down before the first line of code is written, which is what makes the diff review actually meaningful.

The workflow is an evolution of two ideas:

- Superpowers: a coding agent discipline framework that introduced the TDD protocol, the evidence gate (run verification fresh before claiming success), and the debugging protocol (no fix without root cause). I absorbed these directly. If you haven't read Superpowers, it's worth your time.
- Human-in-the-loop gatekeeping: my own addition. The core insight is that a human reading a short spec plus a single diff catches dramatically more than a human reading code cold. The workflow structures every task to produce exactly those artifacts, at exactly the right moment.

The combination took multiple iterations to stabilize. What's here is v2.2 (default mode), with an optional AMAW (Autonomous Multi-Agent Workflow) extension for high-stakes work.

| Phase | Role (default v2.2) | What happens |
|---|---|---|
| 1. CLARIFY | Architect + Human | Read context, write spec, expose assumptions |
| 2. DESIGN | Lead | API contract / data flow → DESIGN.md |
| 3. REVIEW | Adversarial self | Find gaps / contract holes in spec |
| 4. PLAN | Lead + Developer | Decompose into 2–5 min tasks → PLAN.md |
| 5. BUILD | Developer | TDD: red → green → refactor |
| 6. VERIFY | Developer | Run tests fresh, capture exit code + output |
| 7. REVIEW | Lead | Code vs spec: find exactly 3 divergences |
| 8. QC | Main session | Spec fingerprint vs implementation, AC coverage |
| 9. POST-REVIEW | Human checkpoint | Final gate: blocked on any unresolved issue |
| 10. SESSION | Scribe | SESSION_PATCH.md + DEFERRED.md + AUDIT_LOG |
| 11. COMMIT | Developer | Git commit |
| 12. RETRO | All | Record lessons + finalize audit log |

The phases look heavy on paper. In practice, for an XS task (single file, one logic change, no side effects) you're allowed to skip CLARIFY and PLAN and go straight to BUILD. The workflow is explicit about this via a mandatory task size classification step.

Before any work starts, you count three things:

| Metric | What you count |
|---|---|
| Files touched | How many files will be created or modified? |
| Logic changes | How many functions/handlers change behavior (not formatting)? |
| Side effects | API contract, DB schema, config, external behavior, types used by other files? |

| Size | Files | Logic | Side effects | Allowed skips |
|---|---|---|---|---|
| XS | 1 | 0–1 | None | CLARIFY + PLAN |
| S | 1–2 | 2–3 | None | PLAN only |
| M | 3–5 | 4+ | Maybe | None |
| L | 6+ | Any | Yes | None |
| XL | 10+ | Any | Yes | None |

You state the classification explicitly before work begins:

    Task: Fix pagination off-by-one
    Size: XS (1 file: src/api/routes/lessons.ts, 1 logic change: offset calc, 0 side effects)
    Skipping: CLARIFY, PLAN → straight to BUILD

The hard rule: if you haven't read the code yet, you don't know the size. Agents routinely call things XS that turn out to be M or L once you look. The classification forces the read to happen before the label is applied.

Every popular AI workflow has phases that agents skip "to save time."
This workflow makes the skip patterns explicit and calls them violations:

| Skip pattern | Why agents do it | Why it's forbidden |
|---|---|---|
| Skip CLARIFY, jump to BUILD | "Task seems obvious" | Unexamined assumptions cause rework |
| Skip PLAN, jump to BUILD | "It's a small change" | Small changes grow; no plan = no checkpoint |
| Skip VERIFY after BUILD | "Tests passed earlier" | Stale results are not evidence |
| Skip REVIEW after VERIFY | "I wrote it, I know it's correct" | Author blindness is real |
| Skip POST-REVIEW | "I reviewed in phase 7" | Phase 7 is code review; POST-REVIEW is the final conservative gate, with a different scope |
| Skip SESSION before COMMIT | "I'll update later" | You won't. Context is lost. |
| Combine multiple phases | "CLARIFY+DESIGN+PLAN in one go" | Each phase boundary is a deliberate pause point; skipping it removes the checkpoint |

Naming these patterns and treating them as violations changes the conversation. When the agent tries to jump phases, you have a handle to point at.

Phase 6 (VERIFY) has a 5-step gate that runs before any completion claim:

Red flags (stop immediately if you catch yourself):

This sounds obvious. It is not obvious when you're deep in a session and the previous test run was 20 minutes ago.

In v2.2 (default mode), there are two mandatory human checkpoints:

These are not optional. The whole model is that the human reads a short spec, not a long codebase. The AI builds the spec; the human approves it; the AI builds the code against the approved spec. The POST-REVIEW diff is then code-vs-approved-spec, which is a comparison a human can actually do.

For high-stakes work (data migrations, new service boundaries, security-critical paths), there's an optional extension: AMAW (Autonomous Multi-Agent Workflow). In AMAW mode, cold-start sub-agents replace or augment the human review gates:

The key insight is cold-start: each agent is spawned fresh with only file access.
It cannot inherit the main session's context rot or biases. It reads what's written; it can't be influenced by what was discussed in chat.

Note: AMAW removes the human from all review gates, including POST-REVIEW, which is held by the Scope Guard instead. At CLARIFY, rather than a human approving the spec, the Adversary challenges it at the next phase. In practice this means AMAW sessions can run with minimal human interaction, but they still require a human to kick off the task and review the final audit log. Pure fire-and-forget is not the design intent.

AMAW costs roughly $1–5 in sub-agent tokens and ~30 extra minutes per task. I use it for schema migrations and multi-system contracts. For everyday work, the human-in-loop default catches the same issues faster and cheaper.

Every phase transition and agent verdict appends to docs/audit/AUDIT_LOG.jsonl, one JSON line per event:

    {"ts":"2026-05-15T17:42:00Z","task":"phase-14-model-swap","phase":"review-design","agent":"adversary","action":"review","status":"REJECTED","findings_count":3,"block_count":2,"warn_count":1,"note":"..."}

Append-only. Never modified. Main session and sub-agents both write to it, and neither deletes or edits existing lines. This becomes the durable record of what was decided and why, something that doesn't exist in most AI coding setups, where everything lives in ephemeral chat.

On [free-context-hub](https://github.com/letuhao/free-context-hub) I've delivered 15 development phases covering:

On [lore-weave](https://github.com/letuhao/lore-weave) I've delivered 5 full vertical modules and am mid-way through a sixth, accumulating 1,497 commits since March 2026 across 19 microservices. The modules completed so far cover:

The current Phase 6 work spans usage-billing and a hierarchical book extraction engine, the kind of multi-service, cross-cutting work where the workflow's cross-phase checkpoints earn their keep.
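Because the audit trail is what makes a history like this queryable months later, here is a minimal Python sketch of an append-only JSON Lines log and a simple filter over it. It is an illustration under stated assumptions, not code from either repository: `append_event` and `query_events` are hypothetical names, the log path is whatever the caller passes in, and only the field names (`ts`, `task`, `phase`, `status`) come from the example event shown earlier.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_event(log_path: Path, **fields) -> dict:
    """Append one event as a single JSON line; existing lines are never rewritten."""
    # Hypothetical helper: stamps the event with a UTC timestamp in the same
    # shape as the example event, then adds whatever fields the caller supplies.
    event = {"ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"), **fields}
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:  # mode "a" only ever appends
        fh.write(json.dumps(event) + "\n")
    return event


def query_events(log_path: Path, **filters) -> list[dict]:
    """Return every logged event whose fields equal all of the given filters."""
    matches = []
    with log_path.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            event = json.loads(line)
            if all(event.get(key) == value for key, value in filters.items()):
                matches.append(event)
    return matches
```

A call such as `query_events(path, task="phase-14-model-swap", status="REJECTED")` would return the rejected reviews for that task in the order they were written. Opening the file in append mode keeps the code from overwriting history, but it does not stop anyone from editing the file by hand, so the "never modified" convention still has to be respected by every writer.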
That's 400+ commits on free-context-hub and 1,497 on lore-weave. The rest comes from private team projects also running this workflow, for a total of 2,500+ commits with a live audit trail I can query across sessions that ran months apart.

The hardest part was Phase 10 (SESSION): keeping the session patch updated after every sprint without skipping it. Once that became a habit, sessions started to feel continuous rather than amnesia-punctuated.

You understand your own system deeply. Because you write the spec and approve it, you can't hide behind "the AI built it." You actually know what was built and why the trade-offs were made. This is the biggest practical advantage for me: not velocity, but comprehension.

Architectural decisions have a paper trail. Every trade-off is in a spec file that was approved before code was written. When a future session revisits a design choice, the rationale is readable, not reconstructed from diff archaeology.

Context drift is visible. When an AI starts building something that wasn't in the spec, the spec fingerprint comparison at POST-REVIEW catches it. Without a written spec, you'd never notice until integration time.

Deferred items don't get lost. The workflow forces any "we'll do this later" to be written in DEFERRED.md with a specific trigger condition. Nothing lives only in chat: chat is ephemeral, files are truth.

It's incrementally adoptable. You can start with just CLARIFY + VERIFY and get substantial value. Add phases as your trust in the workflow grows.

Token usage is genuinely high. Each phase generates artifacts: spec files, plan files, audit events. AMAW mode multiplies this by spawning sub-agents. A single M-sized task with AMAW can burn 5,000–10,000 tokens before a line of code is written. At scale, this is a real budget consideration.

You clarify constantly, and it takes real time. Phase 1 (CLARIFY) is not a quick preamble.
For any task with real ambiguity (architecture decisions, new API contracts, trade-off calls), you're in a back-and-forth that can run 20–40 minutes before design starts. At a medium-sized project cadence (10–20 above-XS tasks per sprint), this adds up to multiple hours per sprint spent purely on scoping. This is actually the point of the workflow, but if you're used to "just build it," the overhead feels significant early on.

Human approval gates limit automation. Every architecture decision, trade-off, and scope call requires your explicit approval. You cannot queue up a batch of tasks and walk away. If you need fully autonomous overnight runs, this workflow is the wrong tool.

The discipline needs enforcement tooling to hold. Left to their own devices, agents will skip phases. The workflow holds together because of workflow-gate.sh (a pre-commit gate that blocks commits if VERIFY and SESSION aren't done) and the append-only AUDIT_LOG.jsonl. If you copy docs/WORKFLOW.md into your project without also setting up the enforcement layer, expect phases to get skipped within a few sessions. The tooling is in the repository (it's not hidden), but it's a real setup step, not just copy-paste.

Cold-start sub-agents (AMAW only) miss things said in chat. Because each AMAW sub-agent reads files from scratch, anything that was decided verbally in the session but never written to a file is invisible to them. This is a feature for preventing bias, but it means you must be disciplined about writing things down as you go. The Scribe sub-agent helps, but it can only record what's already in files.

Worth the overhead if:

Overkill if:

The workflow is designed for the first category. Using it for the second is just friction.

All workflow files live in the [agentic-workflow/](https://github.com/letuhao/free-context-hub/tree/main/agentic-workflow) folder of the free-context-hub repository.
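To give a feel for what mechanical enforcement means, here is a minimal Python sketch of the kind of check a commit gate could run against audit events. It is not a port of workflow-gate.sh, whose internals are not reproduced here: the phase labels `"verify"` and `"session"`, the `"PASS"` status value, and both function names are assumptions made for illustration. The only outside fact it relies on is that Git aborts a commit when the pre-commit hook exits with a non-zero status.

```python
# Assumed phase labels and status value; the real gate may record these differently.
REQUIRED_PHASES = ("verify", "session")
PASSING_STATUS = "PASS"


def missing_phases(events: list[dict], task: str) -> list[str]:
    """List the required phases that have no passing event for this task."""
    done = {
        event.get("phase")
        for event in events
        if event.get("task") == task and event.get("status") == PASSING_STATUS
    }
    return [phase for phase in REQUIRED_PHASES if phase not in done]


def gate_exit_code(events: list[dict], task: str) -> int:
    """Return 0 to let the commit through, 1 to block it."""
    missing = missing_phases(events, task)
    for phase in missing:
        print(f"BLOCKED: no passing {phase.upper()} event recorded for task '{task}'")
    return 1 if missing else 0
```

A hook script would load the events for the current task, call `gate_exit_code`, and pass the result to `sys.exit`. The point of the design is that the agent cannot talk its way past the gate: either the evidence is on disk or the commit does not happen.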
Start with the template:

1. Copy WORKFLOW.md into your project root, or paste the relevant sections into your CLAUDE.md / agent instructions. This is the full 12-phase spec.
2. Fill in the CUSTOMIZE sections for your stack: verification commands, test runner, any MCP tools you use. (MCP is the Model Context Protocol, an interface for giving AI agents access to external tools and knowledge stores; the workflow works without it.)
3. Add workflow-gate.sh from the same folder to enforce the phase gates mechanically. Without this, agents will skip phases.
4. See amaw-workflow.md for the AMAW multi-agent extension.

The workflow is model-agnostic. I use it with Claude Code but nothing in the spec requires it.

The 12-phase workflow is not magic. It's a way of making explicit things that were always implicit: what are we building, how big is it, what's the verification evidence, who approved it, what did we learn?

The AI does most of the work. The human stays in control of the decisions that actually matter.

The cost is real: more tokens, more time spent clarifying, more things requiring your approval before the AI proceeds. The benefit is also real: you end up with a system you understand deeply, and a trail of why it was built the way it was.

For me, after 2,500+ commits across multiple projects, that trade-off is still worth it.

Repositories: letuhao/free-context-hub · letuhao/lore-weave

WORKFLOW.md · AMAW.md · CLAUDE.md