I gave Claude / Codex a Figma file + a PRD and asked for 5-10 React pages of a working app. Single-page output is great. Multi-page output drifts in 4 specific ways. I spent ~3 months building a harness with 14 gates × auto-retry × handoff JSON to stop the drift. 10 demos, 54 screens, 4 unrelated business domains, build-green rate 100%.
Code: https://github.com/JiuwenDragon/harness-mini
Every "Figma to code with AI" demo on Twitter shows one screen. That's a real result: Claude vision is genuinely good at single-page UI. I verified this many times during my research. Giving Claude a screenshot plus a paragraph of PRD produces a page I'd score 70-80 out of 100 in 30 seconds.
The promise breaks at 5+ screens. Here are the 4 drift modes I measured.
| Screen 1 | Screen 2 | Screen 3 |
|---|---|---|
| Username: Zhang San | Username: Li Si | Username: Test User |
The LLM doesn't carry a "world state" across page generations. Unless that state is injected explicitly, it re-invents the data on every screen.
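A minimal sketch of what explicit injection could look like, assuming the shared state lives in one fixture object that is serialized into every page prompt. The `WorldState` shape, the sample values, and `buildPagePrompt` are illustrative names, not the harness's actual API.

```typescript
// Hypothetical shape of the shared fixture; field names are illustrative.
interface WorldState {
  user: { name: string; accountId: string };
  balance: number;
}

// Defined once per project and shallow-frozen so no stage swaps in new top-level values.
const world: Readonly<WorldState> = Object.freeze({
  user: { name: "Zhang San", accountId: "ACC-001" },
  balance: 12500,
});

// Build the prompt for one screen, always prepending the same serialized state.
function buildPagePrompt(screen: string, prdExcerpt: string, state: WorldState): string {
  return [
    "Shared world state (use these values verbatim, do not invent new ones):",
    JSON.stringify(state, null, 2),
    `Screen: ${screen}`,
    `PRD excerpt: ${prdExcerpt}`,
  ].join("\n");
}

const homePrompt = buildPagePrompt("home", "Show the account balance.", world);
const transferPrompt = buildPagePrompt("transfer", "Let the user send money.", world);
```

Because both prompts embed the same JSON, the username on screen 3 has the same source as the username on screen 1.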
```tsx
// Screen "transfer" generated:
<button onClick={() => router.push("/banking/home")}> // ← /banking
// Screen "home" actually at:
app/bank/home/page.tsx // ← /bank
```
Single-page review never catches this. Click-through breaks.
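A cross-page route check catches this class of bug mechanically. The sketch below assumes Next.js App Router paths of the simple `app/.../page.tsx` form (no route groups or dynamic segments) and only looks at string literals passed to `router.push`. The function names are hypothetical.

```typescript
// Convert an App Router file path to its URL route (simple case only).
function routeFromPagePath(filePath: string): string {
  const route = filePath.replace(/^app/, "").replace(/\/page\.tsx$/, "");
  return route === "" ? "/" : route;
}

// Collect every string literal passed to router.push(...) in one source file.
function pushedRoutes(source: string): string[] {
  const pattern = /router\.push\(\s*["']([^"']+)["']\s*\)/g;
  const found: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(source)) !== null) found.push(m[1]);
  return found;
}

interface RouteError {
  file: string;
  target: string;
  knownRoutes: string[];
}

// pages maps file path -> generated source for the whole app, not one screen.
function checkRoutes(pages: Record<string, string>): RouteError[] {
  const files = Object.keys(pages);
  const known = files.map(routeFromPagePath);
  const errors: RouteError[] = [];
  for (const file of files) {
    for (const target of pushedRoutes(pages[file])) {
      if (known.indexOf(target) === -1) {
        errors.push({ file, target, knownRoutes: known });
      }
    }
  }
  return errors;
}

const routeErrors = checkRoutes({
  "app/bank/home/page.tsx": "export default function Home() { return <main>Home</main>; }",
  "app/bank/transfer/page.tsx": '<button onClick={() => router.push("/banking/home")}>Back</button>',
});
```

The check only works because it sees all pages at once, which is exactly what a single-page review lacks.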
Take a zustand store with 5 keys (user, balance, lastTx, recent[], selected). By screen 4 the LLM forgets 2-3 of them and makes up new ones, so the same business concept ends up with three different variable names.
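One way to detect this is to compare the store keys a page reads against the keys frozen in the contract. This is a rough sketch: the hook name `useStore` and the `useStore((s) => s.key)` selector style are assumptions, and a regex is a stand-in for a real AST pass.

```typescript
// Keys frozen in the contract, taken from the example above.
const CONTRACT_KEYS: readonly string[] = ["user", "balance", "lastTx", "recent", "selected"];

// Find keys read through selectors like useStore((s) => s.key) or useStore(s => s.key).
function selectedKeys(source: string): string[] {
  const pattern = /useStore\(\s*\(?\s*(\w+)\s*\)?\s*=>\s*\1\.(\w+)/g;
  const keys: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(source)) !== null) keys.push(m[2]);
  return keys;
}

// Return every key the page reads that the contract does not define.
function unknownStoreKeys(source: string): string[] {
  return selectedKeys(source).filter((k) => CONTRACT_KEYS.indexOf(k) === -1);
}

const screen4 = [
  "const balance = useStore((s) => s.balance);",
  "const last = useStore((s) => s.lastTransaction);", // invented key, should be lastTx
  "const picked = useStore(s => s.selected);",
].join("\n");

const driftedKeys = unknownStoreKeys(screen4);
```

Here `driftedKeys` contains only the invented `lastTransaction`, which is the kind of specific finding that can be handed back to the model.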
> Codex: All 10 pages generated, ready to preview.
> me: npm run build
> 3 pages: red. 2 pages: empty <div /> stubs. 1 page: import path wrong.
This one is the most painful. Without an external check, "claimed done" ≠ done.
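The external check can be as simple as refusing to trust the model's summary and recording evidence per page instead. In this sketch `buildOk` is plain input data; in practice it would come from actually running the build. The stub detector is a crude heuristic, and all names are hypothetical.

```typescript
// One record per page, filled in by the harness, never by the model.
interface PageEvidence {
  path: string;
  source: string;
  buildOk: boolean; // result of a real build step, supplied by the caller
}

interface Verdict {
  path: string;
  ok: boolean;
  reason: string;
}

// Heuristic: a component that returns a bare <div /> is treated as a stub.
function isStub(source: string): boolean {
  return /return\s*\(?\s*<div\s*\/>/.test(source);
}

// A page counts as done only if it builds and is not a stub.
function verifyPages(pages: PageEvidence[]): Verdict[] {
  return pages.map((p) => {
    if (!p.buildOk) return { path: p.path, ok: false, reason: "build failed" };
    if (isStub(p.source)) return { path: p.path, ok: false, reason: "empty stub" };
    return { path: p.path, ok: true, reason: "" };
  });
}

const verdicts = verifyPages([
  { path: "app/bank/pay/page.tsx", source: "export default function P() {", buildOk: false },
  { path: "app/bank/cards/page.tsx", source: "export default function C() { return <div />; }", buildOk: true },
  { path: "app/bank/home/page.tsx", source: "export default function H() { return <main><h1>Home</h1></main>; }", buildOk: true },
]);
const doneCount = verdicts.filter((v) => v.ok).length;
```

The model's "all pages generated" message never enters the verdict; only the build result and the source do.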
```text
Figma + PRD
  ↓ intake (fixture split)
  ↓ contract (frozen spec)
  ↓ generate (codex / claude / gemini)
  ↓ 14 gates (semantic / PRD / spec / UI hygiene / build / cross-canvas)
  ↓ visual review (human)
  ↓ web-preview (clickable)
```
Each gate is scoped to one constraint. Why? See the Constraint Decay paper (arXiv 2605.06445), which reports that stuffing 10+ constraints into one prompt drops LLM performance by 30 percentage points.
The retry loop: when a gate fails, the gate's structured error report (not a vague "try again") is fed back to the LLM. Reflexion-style.
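The loop can be sketched roughly like this. The `GateReport` shape, `runWithRetry`, and the fake generator are illustrative assumptions. A real LLM call would be async; it is kept synchronous here to keep the sketch short.

```typescript
interface GateReport {
  gate: string;
  ok: boolean;
  errors: string[]; // specific findings, not a vague "try again"
}

type Gate = (code: string) => GateReport;
// Stand-in for the LLM call; feedback is null on the first attempt.
type Generate = (prompt: string, feedback: GateReport | null) => string;

interface RunResult {
  code: string;
  attempts: number;
  passed: boolean;
}

function runWithRetry(prompt: string, generate: Generate, gates: Gate[], maxAttempts = 3): RunResult {
  let feedback: GateReport | null = null;
  let code = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    code = generate(prompt, feedback);
    // Stop at the first failing gate so the next round fixes one thing at a time.
    let failed: GateReport | null = null;
    for (const gate of gates) {
      const report = gate(code);
      if (!report.ok) {
        failed = report;
        break;
      }
    }
    if (failed === null) return { code, attempts: attempt, passed: true };
    feedback = failed; // the structured report goes back into the next call
  }
  return { code, attempts: maxAttempts, passed: false };
}

// Example gate: reject the wrong route prefix from the drift example.
const routeGate: Gate = (code) => {
  const bad = code.indexOf("/banking/") !== -1;
  return { gate: "route", ok: !bad, errors: bad ? ['unknown prefix "/banking", expected "/bank"'] : [] };
};

// Fake generator: wrong on the first try, corrected once it receives a report.
const fakeLLM: Generate = (_prompt, feedback) =>
  feedback === null ? 'router.push("/banking/home")' : 'router.push("/bank/home")';

const retryResult = runWithRetry("generate the transfer screen", fakeLLM, [routeGate]);
```

With the fake generator above, the first attempt fails the route gate and the second passes, so `retryResult.attempts` is 2.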
The handoff: each stage emits a `*_status.json` file, so a new operator (or a new LLM session) can pick up without reading the conversation.
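To make the handoff idea concrete, here is a hypothetical status shape and a resume helper. The field names are assumptions for illustration, not the repo's actual schema; only the stage names come from the pipeline above.

```typescript
type Stage = "intake" | "contract" | "generate" | "gates" | "visual-review" | "web-preview";

// Hypothetical contents of one *_status.json file.
interface StageStatus {
  stage: Stage;
  state: "done" | "failed" | "pending";
  artifacts: string[]; // files this stage produced
  notes: string; // what the next operator needs to know
}

const STAGE_ORDER: Stage[] = ["intake", "contract", "generate", "gates", "visual-review", "web-preview"];

// Return the first stage that is missing or not done, or null if everything finished.
function nextStage(statuses: StageStatus[]): Stage | null {
  for (const stage of STAGE_ORDER) {
    const match = statuses.filter((s) => s.stage === stage)[0];
    if (!match || match.state !== "done") return stage;
  }
  return null;
}

// Status files as they might be read from disk (inlined here as strings).
const rawStatuses = [
  '{"stage":"intake","state":"done","artifacts":["fixtures.json"],"notes":""}',
  '{"stage":"contract","state":"done","artifacts":["contract.json"],"notes":""}',
  '{"stage":"generate","state":"failed","artifacts":[],"notes":"3 of 10 pages missing"}',
];
const parsedStatuses = rawStatuses.map((raw) => JSON.parse(raw) as StageStatus);
const resumeFrom = nextStage(parsedStatuses);
```

A fresh session only needs the status files to learn that it should resume at `generate`.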
Constraint Decay (arXiv 2605.06445) measured the drop directly.
Lost in the Middle (arXiv 2307.03172) shows that LLMs use information placed in the middle of a long prompt less reliably than information at the start or end.
So I push one check per gate, max ~3 constraints per LLM round.
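As a small illustration of the "max ~3 per round" rule, constraints can be split into fixed-size batches before each LLM round. `batchConstraints` and the sample constraint strings are hypothetical.

```typescript
// Split a list of constraints into rounds of at most perRound items each.
function batchConstraints(constraints: string[], perRound = 3): string[][] {
  if (perRound < 1) throw new Error("perRound must be at least 1");
  const rounds: string[][] = [];
  for (let i = 0; i < constraints.length; i += perRound) {
    rounds.push(constraints.slice(i, i + perRound));
  }
  return rounds;
}

const rounds = batchConstraints([
  "use routes from the contract only",
  "read store keys from the contract only",
  "use the shared user fixture",
  "no empty <div /> stubs",
  "all imports must resolve",
  "primary color comes from the theme",
  "every PRD field appears on screen",
]);
```

Seven constraints become three rounds of sizes 3, 3, and 1, so no single prompt carries the whole list.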
| Domain | Color | Screens | Build pass |
|---|---|---|---|
| Banking | Deep red | 10 | 10/10 |
| Fitness | Orange | 3 | 3/3 |
| Travel | Blue | 3 | 3/3 |
| Shoes | Black | 3 | 3/3 |
Same 14 gates. Same Codex/Claude/Gemini providers swapped via contract. No per-domain prompt tuning.
| Tool | Strength | Why it's not what I needed |
|---|---|---|
| Builder.io Visual Copilot | 2M+ training data, Mitosis IR | SaaS, no PRD dim, no audit trail |
| Locofy LDM | Large Design Model | SaaS, design system requires strict Auto Layout |
| Figma Make | Highest fidelity (EPAM benchmark) | No public API, browser-only, $16/mo seat |
| v0 (Vercel) | Tight shadcn/Next.js | Figma link silently downgrades to screenshot (loses metadata) |
These are all great for "single dev makes a pretty page." None give me multi-page consistency + PRD enforcement + audit log + on-prem + provider swap, which is the actual enterprise need.
https://github.com/JiuwenDragon/harness-mini
scripts/
MIT license (I still need to add the LICENSE file; PRs welcome).
Happy to answer questions in comments. The most useful feedback would be: "what other drift modes have you seen at >5 pages."