The Grilling

A developer has identified a structural blind spot in 16 spec-driven AI frameworks: none check the specification against attack before it is written. The developer built a process called "Grilling" as Phase 2 of the Heist Pipeline, which uses three subagents in a structured interrogation to determine whether a proposed solution should exist at all, rather than merely debating the best version of a chosen position. Unlike standard multi-agent debate, Grilling continues until both attacking and defending pressures are simultaneously exhausted, and "kill the idea entirely" is a legitimate output.

In Part 1 (https://dev.to/kucherenko/16-frameworks-one-blind-spot-20cg) I argued that every spec-driven AI framework on the market - sixteen of them in my survey - has the same structural blind spot. They all check the implementation against the spec. None of them check the spec against attack before it gets written.

Part 2 is the operational deep dive. What does the missing phase actually look like when you build it? How does it run? What are the agents, the prompts, the termination conditions, the artifacts? When should you not use it? This part assumes you’ve read Part 1, or at least bought the premise: the spec needs to be on trial before it becomes gospel.

A few research papers and a couple of frameworks have something they call multi-agent debate. Two agents argue, a third synthesizes. This is a real technique with real research behind it, and it’s a meaningful improvement over single-agent reasoning. It is not Grilling. The differences matter, and they’re worth being precise about.

Most debate setups in current frameworks operate on whatever’s in the prompt - they don’t first survey the codebase, the existing tests, the past failures, the applicable constraints. The result is two LLMs hallucinating at each other politely.
The Advocate invents objections that don’t apply to the actual system; the Proposer defends positions against attacks that wouldn’t matter even if they landed. Without a Recon Dossier in front of both agents, the debate is theater. It produces dialogue, not decisions. Grilling refuses to start until the ground truth is established and verified. That’s not a stylistic choice - it’s the only way the attacks have weight.

Standard debate optimizes for the best version of a chosen position. Two agents start with opposing views and the synthesizer extracts what’s strongest from each. This is genuinely useful when you’ve already decided to do something and you’re trying to figure out the best way. Grilling optimizes for a different thing entirely: whether the position should be held at all. The Proposer isn’t defending a position because it was assigned to them; they’re proposing a solution they actually think is correct, and the Advocate is trying to dismantle that proposal. A legitimate output of Grilling is “kill the idea entirely.” A legitimate output of standard debate is rarely “neither side has a point.”

And this might be the most important one. Standard debate ends when both sides have made their case - typically after a fixed number of rounds, or when the orchestrator decides the discussion has matured. That’s a procedural ending, not a substantive one. The debate stops because the schedule says it stops, not because the question has been resolved.

Grilling has a structural stopping condition: equilibrium between two opposing pressures. The attacker has nothing left. The Don has nothing left. Both pressures simultaneously exhausted. Until that condition is met, the rounds continue, up to the hard ceiling. After that condition is met, no more rounds - they’d add nothing. The stopping condition is the whole game.
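
The equilibrium rule above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the `RoundOutcome` shape, the `runGrilling` function, and the `maxRounds` parameter are all hypothetical names, and the value of the hard ceiling is an assumption because the article does not state one.

```typescript
// Hypothetical summary of one Grilling round; field names are illustrative only.
interface RoundOutcome {
  newValidObjections: number; // Advocate attacks that survived the Proposer's response
  donConcerns: number;        // concerns the Don still holds after this round
}

interface GrillingResult {
  roundsRun: number;
  stopReason: "equilibrium" | "ceiling";
}

// Equilibrium means both pressures are exhausted in the same round.
function atEquilibrium(outcome: RoundOutcome): boolean {
  return outcome.newValidObjections === 0 && outcome.donConcerns === 0;
}

// Runs rounds until equilibrium, or until the (assumed) hard ceiling is hit.
function runGrilling(
  runRound: (round: number) => RoundOutcome,
  maxRounds: number,
): GrillingResult {
  for (let round = 1; round <= maxRounds; round++) {
    if (atEquilibrium(runRound(round))) {
      return { roundsRun: round, stopReason: "equilibrium" };
    }
  }
  return { roundsRun: maxRounds, stopReason: "ceiling" };
}

// Scripted outcomes standing in for real agent rounds.
const scripted: RoundOutcome[] = [
  { newValidObjections: 3, donConcerns: 1 },
  { newValidObjections: 0, donConcerns: 1 }, // attacker is quiet, the Don is not: keep going
  { newValidObjections: 0, donConcerns: 0 }, // both exhausted: stop
];
const quiet: RoundOutcome = { newValidObjections: 0, donConcerns: 0 };
const result = runGrilling((round) => scripted[round - 1] ?? quiet, 5);
console.log(result.roundsRun, result.stopReason);
```

In the scripted run, round 2 is the interesting case: the attacker has nothing new, but the Don still has a concern, so the loop continues. The loop reports which condition ended it, so a session that hit the ceiling is never mistaken for one that reached equilibrium.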
If your debate stops on “we’re done arguing,” you’re polishing turds - you exit with whatever the agents converged on, regardless of whether what they converged on was correct. If it stops on “no new valid objection AND no remaining concerns,” you have something stronger: a verdict that survived attack, with the surviving objections explicitly logged.

Multi-agent debate is a useful tool. It’s just a different tool, solving a different problem.

Grilling sits as Phase 2 of the Heist Pipeline. It’s not a prompt and it’s not a standalone tool - it’s a phase with hard gates before it (Reconnaissance must complete and produce a Recon Dossier) and after it (the Don must sign off on the verdict before anything moves to the Sit-Down).

The process runs like a structured interrogation. Three subagents have specific roles, the Don (the user) participates in every round, and the rounds follow a fixed order.

The Proposer opens. It reads the Recon Dossier - the verified findings from Phase 1 - and proposes a solution: architecture, file changes, identified risks, expected behavior. The Proposer’s job is to put the strongest possible version of the idea on the table. Not the safest version, not the most diplomatic version. The strongest. If the idea is bad, you want it to die fighting, not die mumbling.

The Devil’s Advocate attacks. Architectural flaws. Security gaps. Constitution violations. Performance regressions. Scalability ceilings. Edge cases the Proposer didn’t think about. The Devil’s Advocate’s job - and this is important - is to find the failure mode. Not to be polite. Not to suggest improvements. To attack. If the Proposer says “we’ll cache this in Redis,” the Devil’s Advocate says: “What happens when Redis is down? What happens when the cache is poisoned? Have you measured the actual cache hit rate or are you guessing?” Bad attacks get filtered by the Proposer’s response. Good attacks force a revision.

The Don - that’s the user, you - weighs in every round.
One question at a time. Never bundled. This rule matters more than it sounds. If the Don asks three questions at once, the agent will answer the easy one fully, the medium one partially, and quietly skip the hard one. One question forces an actual answer. The Don’s questions are usually the most valuable in the whole Grilling, because the Don has context the agents don’t have - about the team, about the business, about the politics, about what’s been tried before that didn’t make it into the codebase.

The Synthesizer closes each round. It incorporates the valid attacks and the Don’s feedback and produces a revised solution. Not a defense of the original - a revision. If nothing valid came up that round, the revision is small. If something hit hard, the revision is structural. Sometimes the revision is “kill this idea entirely and propose a different approach,” and that’s a legitimate outcome. Then the next round begins.

Theory is cheap. Here’s a real Round 1, lightly edited for length, from a Grilling session on a small feature: adding a local high-score leaderboard to a browser Tetris game. The task sounds trivial. Watch how fast “trivial” falls apart under attack.

The Proposer opens. It reads the Recon Dossier - a three-layer pure state machine, zero DOM in the logic layer, 261 passing tests, a constitution whose first commandment is “game logic has zero DOM dependencies” - and puts a solution on the table:

Add a name-entry status to the state machine. Store the typed initials in state.nameEntry = { buffer: '', maxLen: 3 }. Detect game-over by reading the gameOver event from state.events, then route keystrokes into the buffer. Render the leaderboard as a DOM