# How I Shipped 2,500+ Commits With AI Agents Using a 12-Phase Workflow

> Source: <https://dev.to/letuhao/how-i-shipped-2500-commits-with-ai-agents-using-a-12-phase-workflow-4ap4>
> Published: 2026-05-25 15:19:30+00:00

*A practitioner's account — not a tutorial, not a sales pitch.*

**Quick screen:** if you're writing throwaway scripts or solo prototypes, this workflow is overkill — skip to the Cons and Who This Is For sections first.

I've been using a 12-phase workflow I've refined over time — across [free-context-hub](https://github.com/letuhao/free-context-hub), [lore-weave](https://github.com/letuhao/lore-weave), and a handful of private internal systems. Both public projects are built almost entirely by AI agents, with me acting as the gatekeeper — approving specs, reviewing diffs, unblocking decisions. Across all of them, the workflow has accumulated 2,500+ commits and a trail of written specs and audit logs I can still query months after the sessions that produced them.

**free-context-hub** is a self-hosted persistent memory and semantic search layer for AI agents — MCP server, REST API, RAG pipelines, and a full Next.js review UI. 15 development phases delivered end-to-end.

**lore-weave** is a cloud-hosted multi-agent platform for multilingual novel workflows: translation, knowledge graph construction, glossary management, and AI-assisted writing. 19 microservices across Go, Python, and TypeScript.

I'm sharing the workflow because it's worked better than anything else I've tried, and because the honest trade-offs are worth knowing before you adopt it.

The files are in the repository: `WORKFLOW.md`, `CLAUDE.md.snippet`, and `AMAW.md`.

AI coding assistants are very good at generating plausible-looking code. They're much worse at:

The standard advice is "just review the diff." But reviewing a diff without having tracked the *intent* of the change is almost useless — you're comparing code to code, not code to requirements. The 12-phase workflow forces intent to be written down before the first line of code is written, which is what makes the diff review actually meaningful.

The workflow is an evolution of two ideas:

**Superpowers**: a coding agent discipline framework that introduced TDD protocol, the evidence gate (run verification fresh before claiming success), and the debugging protocol (no fix without root cause). I absorbed these directly. If you haven't read Superpowers, it's worth your time.

**Human-in-the-loop gatekeeping** — my own addition. The core insight: a human reading a short spec + a single diff catches dramatically more than a human reading code cold. The workflow structures every task to produce exactly those artifacts, at exactly the right moment.

The combination took multiple iterations to stabilize. What's here is v2.2 (default mode) with an optional AMAW (Autonomous Multi-Agent Workflow) extension for high-stakes work.

```
Phase          │ Role (default v2.2)   │ What Happens
───────────────┼───────────────────────┼──────────────────────────────────────────
1. CLARIFY     │ Architect + Human     │ Read context, write spec, expose assumptions
2. DESIGN      │ Lead                  │ API contract / data flow → DESIGN.md
3. REVIEW      │ Adversarial self      │ Find gaps / contract holes in spec
4. PLAN        │ Lead + Developer      │ Decompose into 2–5 min tasks → PLAN.md
5. BUILD       │ Developer             │ TDD: red → green → refactor
6. VERIFY      │ Developer             │ Run tests fresh, capture exit code + output
7. REVIEW      │ Lead                  │ Code vs spec — find exactly 3 divergences
8. QC          │ Main session          │ Spec fingerprint vs implementation, AC coverage
9. POST-REVIEW │ Human checkpoint      │ Final gate — blocked on any unresolved issue
10. SESSION    │ Scribe                │ SESSION_PATCH.md + DEFERRED.md + AUDIT_LOG
11. COMMIT     │ Developer             │ Git commit
12. RETRO      │ All                   │ Record lessons + finalize audit log
```

The phases look heavy on paper. In practice, for an XS task (single file, one logic change, no side effects) you're allowed to skip CLARIFY and PLAN and go straight to BUILD — the workflow is explicit about this via a mandatory **task size classification** step.

Before any work starts, you count three things:

| Metric | What you count |
|---|---|
| Files touched | How many files will be created or modified? |
| Logic changes | How many functions/handlers change behavior? (not formatting) |
| Side effects | API contract, DB schema, config, external behavior, types used by other files? |

| Size | Files | Logic | Side effects | Allowed skips |
|---|---|---|---|---|
| XS | 1 | 0–1 | None | CLARIFY + PLAN |
| S | 1–2 | 2–3 | None | PLAN only |
| M | 3–5 | 4+ | Maybe | None |
| L | 6+ | Any | Yes | None |
| XL | 10+ | Any | Yes | None |

You state the classification explicitly before work begins:

```
Task: Fix pagination off-by-one
Size: XS (1 file: src/api/routes/lessons.ts, 1 logic change: offset calc, 0 side effects)
Skipping: CLARIFY, PLAN → straight to BUILD
```

The hard rule: **if you haven't read the code yet, you don't know the size.** Agents routinely call things XS that turn out to be M or L once you look. The classification forces the read to happen before the label is applied.
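As a rough illustration, the size table can be turned into a small lookup. This is a sketch, not the workflow's own tooling: `TaskCounts`, `classify`, and `ALLOWED_SKIPS` are hypothetical names, and two tie-breaking choices are my assumptions rather than rules stated above (a task that straddles two rows gets the larger size, and any side effect rules out XS and S).

```python
from dataclasses import dataclass

# Phases each size may skip, copied from the size table.
ALLOWED_SKIPS = {
    "XS": {"CLARIFY", "PLAN"},
    "S": {"PLAN"},
    "M": set(),
    "L": set(),
    "XL": set(),
}


@dataclass(frozen=True)
class TaskCounts:
    files_touched: int      # files created or modified
    logic_changes: int      # functions/handlers whose behavior changes
    has_side_effects: bool  # API contract, DB schema, config, shared types...


def classify(counts: TaskCounts) -> str:
    """Return a size label, rounding up when the counts straddle two rows."""
    if counts.files_touched >= 10:
        return "XL"
    if counts.files_touched >= 6:
        return "L"
    # Assumption: XS and S both require "None" side effects, so any
    # side effect pushes the task to at least M.
    if (counts.has_side_effects
            or counts.files_touched >= 3
            or counts.logic_changes >= 4):
        return "M"
    if counts.files_touched == 2 or counts.logic_changes >= 2:
        return "S"
    return "XS"


# The pagination fix from the example: 1 file, 1 logic change, no side effects.
size = classify(TaskCounts(files_touched=1, logic_changes=1, has_side_effects=False))
print(size, sorted(ALLOWED_SKIPS[size]))
```

The counts still have to come from reading the code first; the function only applies the label once the numbers are known.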

Every popular AI workflow has phases that agents skip "to save time." This workflow makes the skip patterns explicit and calls them violations:

| Skip pattern | Why agents do it | Why it's forbidden |
|---|---|---|
| Skip CLARIFY, jump to BUILD | "Task seems obvious" | Unexamined assumptions cause rework |
| Skip PLAN, jump to BUILD | "It's a small change" | Small changes grow; no plan = no checkpoint |
| Skip VERIFY after BUILD | "Tests passed earlier" | Stale results are not evidence |
| Skip REVIEW after VERIFY | "I wrote it, I know it's correct" | Author blindness is real |
| Skip POST-REVIEW | "I reviewed in phase 7" | Phase 7 is code review; POST-REVIEW is the final conservative gate — different scope |
| Skip SESSION before COMMIT | "I'll update later" | You won't. Context is lost. |
| Combine multiple phases | "CLARIFY+DESIGN+PLAN in one go" | Each phase boundary is a deliberate pause point; skipping it removes the checkpoint |

Naming these patterns and treating them as violations changes the conversation. When the agent tries to jump phases, you have a handle to point at.
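One way to make a skip concrete is to compare the phases a task actually completed against the full ordered list. The sketch below assumes hypothetical names (`PHASES`, `find_violations`). It also labels the two REVIEW phases `REVIEW-DESIGN` and `REVIEW-CODE` to tell them apart, which is my naming and not the workflow's. The caller passes the allowed skips for the task's size, since the text does not spell out every case.

```python
# The 12 phases in order. The two REVIEW phases get distinct labels here
# (an assumption made only so they can be told apart in a list).
PHASES = [
    "CLARIFY", "DESIGN", "REVIEW-DESIGN", "PLAN", "BUILD", "VERIFY",
    "REVIEW-CODE", "QC", "POST-REVIEW", "SESSION", "COMMIT", "RETRO",
]


def find_violations(completed: list[str], allowed_skips: set[str]) -> list[str]:
    """Return phases skipped without permission before the latest completed phase."""
    done = set(completed)
    if not done:
        return []
    # Index of the furthest phase reached; everything before it should be
    # either completed or explicitly allowed to be skipped.
    furthest = max(PHASES.index(phase) for phase in done)
    return [
        phase for phase in PHASES[:furthest]
        if phase not in done and phase not in allowed_skips
    ]


# An M-sized task (no skips allowed) that went from BUILD straight to COMMIT.
completed = ["CLARIFY", "DESIGN", "REVIEW-DESIGN", "PLAN", "BUILD", "COMMIT"]
print(find_violations(completed, allowed_skips=set()))
```

Here the result names every phase between BUILD and COMMIT that was skipped, which is the handle to point at when the agent jumps ahead.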

Phase 6 (VERIFY) has a 5-step gate that runs before any completion claim:

Red flags — stop immediately if you catch yourself:

This sounds obvious. It is not obvious when you're deep in a session and the previous test run was 20 minutes ago.
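The core of the gate, as the phase table describes it, is to run the tests fresh and capture the exit code and output. A minimal sketch of just that part follows. `Evidence`, `run_fresh`, and `may_claim_done` are hypothetical names, the command shown is a stand-in for whatever your test runner is, and the remaining steps of the gate are not modeled.

```python
import subprocess
import sys
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Evidence:
    command: list[str]
    exit_code: int
    output: str        # stdout followed by stderr
    captured_at: str   # UTC timestamp, so stale evidence is easy to spot


def run_fresh(command: list[str], timeout_s: float = 600) -> Evidence:
    """Run the verification command now and record what actually happened."""
    result = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout_s
    )
    return Evidence(
        command=command,
        exit_code=result.returncode,
        output=result.stdout + result.stderr,
        captured_at=datetime.now(timezone.utc).isoformat(),
    )


def may_claim_done(evidence: Evidence) -> bool:
    """Only a zero exit code from a fresh run counts as passing evidence."""
    return evidence.exit_code == 0


# Stand-in for a real test command such as a project's test runner.
evidence = run_fresh([sys.executable, "-c", "print('all tests passed')"])
print(evidence.exit_code, may_claim_done(evidence))
```

Recording the timestamp alongside the exit code is a design choice of this sketch: it makes a 20-minute-old result visibly old instead of silently reusable.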

In v2.2 (default mode), there are two mandatory human checkpoints:

These are not optional. The whole model is that the human reads a short spec, not a long codebase. The AI builds the spec; the human approves it; the AI builds the code against the approved spec. The POST-REVIEW diff is then code-vs-approved-spec, which is a comparison a human can actually do.

For high-stakes work — data migrations, new service boundaries, security-critical paths — there's an optional extension: **AMAW (Autonomous Multi-Agent Workflow)**. In AMAW mode, cold-start sub-agents replace or augment the human review gates:

The key insight is **cold-start**: each agent is spawned fresh with only file access. It cannot inherit the main session's context rot or biases. It reads what's written; it can't be influenced by what was discussed in chat.

Note: AMAW removes the human from all review gates, including POST-REVIEW, which is held by the Scope Guard instead. At CLARIFY, rather than a human approving the spec, the Adversary challenges it at the next phase. In practice this means AMAW sessions can run with minimal human interaction, but they still require a human to kick off the task and review the final audit log. Pure fire-and-forget is not the design intent.

AMAW costs roughly $1–5 in sub-agent tokens and ~30 extra minutes per task. I use it for schema migrations and multi-system contracts. For everyday work, the human-in-loop default catches the same issues faster and cheaper.

Every phase transition and agent verdict appends one JSON line per event to `docs/audit/AUDIT_LOG.jsonl`:

```
{"ts":"2026-05-15T17:42:00Z","task":"phase-14-model-swap","phase":"review-design","agent":"adversary","action":"review","status":"REJECTED","findings_count":3,"block_count":2,"warn_count":1,"note":"..."}
```

Append-only. Never modified. Main session and sub-agents both write to it, never delete or edit existing lines.
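A minimal sketch of an append-only writer, assuming the field names from the sample event above. `append_event` is a hypothetical helper, not the repository's implementation, and it writes only a subset of the sample's fields.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_event(log_path: Path, task: str, phase: str, agent: str,
                 action: str, status: str, note: str = "") -> dict:
    """Append one event as a single JSON line; existing lines are never touched."""
    event = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "task": task,
        "phase": phase,
        "agent": agent,
        "action": action,
        "status": status,
        "note": note,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" only ever adds to the end of the file, which is what keeps
    # the log append-only from the writer's side.
    with log_path.open("a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(event) + "\n")
    return event
```

Opening in append mode prevents this helper from rewriting history, but it does not stop another process from editing the file. That part remains a convention the agents have to follow.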

This becomes the durable record of what was decided and why — something that doesn't exist in most AI coding setups where everything lives in ephemeral chat.
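Because the log is plain JSON Lines, querying it months later needs nothing beyond the standard library. The sketch below assumes the `task` and `status` fields from the sample event. `read_events` and `rejected_reviews` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Iterator


def read_events(log_path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of the JSONL audit log."""
    with log_path.open(encoding="utf-8") as log_file:
        for line in log_file:
            line = line.strip()
            if line:
                yield json.loads(line)


def rejected_reviews(log_path: Path, task: str) -> list[dict]:
    """Return every REJECTED event recorded for one task, in log order."""
    return [
        event for event in read_events(log_path)
        if event.get("task") == task and event.get("status") == "REJECTED"
    ]
```

A call such as `rejected_reviews(Path("docs/audit/AUDIT_LOG.jsonl"), "phase-14-model-swap")` would return the adversary's rejection from the sample line, along with any others logged for that task.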

On [free-context-hub](https://github.com/letuhao/free-context-hub) I've delivered 15 development phases covering:

On [lore-weave](https://github.com/letuhao/lore-weave) I've delivered 5 full vertical modules and am mid-way through a sixth, accumulating 1,497 commits since March 2026 across 19 microservices. The modules completed so far cover:

The current Phase 6 work (the sixth module, not the workflow's VERIFY phase) spans usage-billing and a hierarchical book extraction engine, the kind of multi-service, cross-cutting work where the workflow's cross-phase checkpoints earn their keep.

That's 400+ commits on free-context-hub and 1,497 on lore-weave — the rest comes from private team projects also running this workflow — totaling 2,500+ commits with a live audit trail I can query across sessions that ran months apart.

The hardest part was Phase 10 (SESSION) — keeping the session patch updated after every sprint without skipping it. Once that became a habit, sessions started to feel continuous rather than amnesia-punctuated.

**You understand your own system deeply.** Because you write the spec and approve it, you can't hide behind "the AI built it." You actually know what was built and why the trade-offs were made. This is the biggest practical advantage for me — not velocity, but comprehension.

**Architectural decisions have a paper trail.** Every trade-off is in a spec file that was approved before code was written. When a future session revisits a design choice, the rationale is readable, not reconstructed from diff archaeology.

**Context drift is visible.** When an AI starts building something that wasn't in the spec, the spec fingerprint comparison at POST-REVIEW catches it. Without a written spec, you'd never notice until integration time.

**Deferred items don't get lost.** The workflow forces any "we'll do this later" to be written in `DEFERRED.md` with a specific trigger condition. Nothing lives only in chat: chat is ephemeral, files are truth.

**It's incrementally adoptable.** You can start with just CLARIFY + VERIFY and get substantial value. Add phases as your trust in the workflow grows.

**Token usage is genuinely high.** Each phase generates artifacts: spec files, plan files, audit events. AMAW mode multiplies this by spawning sub-agents. A single M-sized task with AMAW can burn 5,000–10,000 tokens before a line of code is written. At scale, this is a real budget consideration.

**You clarify constantly — and it takes real time.** Phase 1 (CLARIFY) is not a quick preamble. For any task with real ambiguity — architecture decisions, new API contracts, trade-off calls — you're in a back-and-forth that can run 20–40 minutes before design starts. At a medium-sized project cadence (10–20 above-XS tasks per sprint), this adds up to multiple hours per sprint spent purely on scoping. This is actually the point of the workflow, but if you're used to "just build it," the overhead feels significant early on.

**Human approval gates limit automation.** Every architecture decision, trade-off, and scope call requires your explicit approval. You cannot queue up a batch of tasks and walk away. If you need fully autonomous overnight runs, this workflow is the wrong tool.

**The discipline needs enforcement tooling to hold.** Left to their own devices, agents will skip phases. The workflow holds together because of `workflow-gate.sh` (a pre-commit gate that blocks commits if VERIFY and SESSION aren't done) and the append-only `AUDIT_LOG.jsonl`. If you copy `docs/WORKFLOW.md` into your project without also setting up the enforcement layer, expect phases to get skipped within a few sessions. The tooling is in the repository (it's not hidden), but it's a real setup step, not just copy-paste.
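To show the idea behind such a gate, here is a sketch in Python. It is not the contents of `workflow-gate.sh`, which I am not reproducing here. It assumes, hypothetically, that VERIFY and SESSION each leave an audit-log event whose `phase` field is `verify` or `session` for the task being committed. `missing_phases` and `gate` are illustrative names.

```python
import json
import sys
from pathlib import Path

# Assumption: these are the phase labels the log uses for VERIFY and SESSION.
REQUIRED_PHASES = ("verify", "session")


def missing_phases(log_path: Path, task: str) -> list[str]:
    """Return the required phases that have no logged event for this task."""
    seen = set()
    if log_path.exists():
        with log_path.open(encoding="utf-8") as log_file:
            for line in log_file:
                if not line.strip():
                    continue
                event = json.loads(line)
                if event.get("task") == task:
                    seen.add(str(event.get("phase", "")).lower())
    return [phase for phase in REQUIRED_PHASES if phase not in seen]


def gate(log_path: Path, task: str) -> int:
    """Return 0 to allow the commit, 1 to block it."""
    missing = missing_phases(log_path, task)
    if missing:
        print(
            f"commit blocked: no {', '.join(missing)} event for task '{task}'",
            file=sys.stderr,
        )
        return 1
    return 0
```

Git aborts a commit when its pre-commit hook exits with a non-zero status, so a hook script could end with `sys.exit(gate(log_path, task))`. How the hook learns the current task name is left open in this sketch.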

**Cold-start sub-agents (AMAW only) miss things said in chat.** Because each AMAW sub-agent reads files from scratch, anything that was decided verbally in the session but never written to a file is invisible to them. This is a feature for preventing bias, but it means you must be disciplined about writing things down as you go. The Scribe sub-agent helps, but it can only record what's already in files.

Worth the overhead if:

Overkill if:

The workflow is designed for the first category. Using it for the second is just friction.

All workflow files live in the [agentic-workflow/](https://github.com/letuhao/free-context-hub/tree/main/agentic-workflow) folder of the free-context-hub repository.

**Start with the template:**

- Copy `WORKFLOW.md` into your project root, or paste the relevant sections into your `CLAUDE.md` / agent instructions. This is the full 12-phase spec.
- Fill in the `[CUSTOMIZE]` sections for your stack: verification commands, test runner, and any MCP tools you use. (MCP is the Model Context Protocol, an interface for giving AI agents access to external tools and knowledge stores; the workflow works without it.)
- Copy `workflow-gate.sh` from the same folder to enforce the phase gates mechanically. Without this, agents will skip phases.
- See `amaw-workflow.md` for the AMAW multi-agent extension.

The workflow is model-agnostic. I use it with Claude Code but nothing in the spec requires it.

The 12-phase workflow is not magic. It's a way of making explicit things that were always implicit: what are we building, how big is it, what's the verification evidence, who approved it, what did we learn? The AI does most of the work. The human stays in control of the decisions that actually matter.

The cost is real — more tokens, more time spent clarifying, more things requiring your approval before the AI proceeds. The benefit is also real: you end up with a system you understand deeply, and a trail of why it was built the way it was.

For me, after 2,500+ commits across multiple projects, that trade-off is still worth it.

*Repositories: letuhao/free-context-hub · letuhao/lore-weave*

