{"slug": "the-grilling", "title": "The Grilling", "summary": "A developer has identified a structural blind spot in 16 spec-driven AI frameworks: none check the specification against attack before it is written. The developer built a process called \"Grilling\" as Phase 2 of the Heist Pipeline, which uses three subagents in a structured interrogation to determine whether a proposed solution should exist at all, rather than merely debating the best version of a chosen position. Unlike standard multi-agent debate, Grilling continues until the attacker has no new valid objection and the user has no remaining concerns, and \"kill the idea entirely\" is a legitimate outcome.", "body_md": "In [Part 1](https://dev.to/kucherenko/16-frameworks-one-blind-spot-20cg) I argued that every spec-driven AI framework on the market - sixteen of them in my survey - has the same structural blind spot. They all check the implementation against the spec. None of them check the spec against attack before it gets written.\n\n**Part 2 is the operational deep dive.**\n\nWhat does the missing phase actually look like when you build it?\n\nHow does it run?\n\nWhat are the agents, the prompts, the termination conditions, the artifacts?\n\nWhen should you *not* use it?\n\n*This part assumes you’ve read Part 1, or at least bought the premise: the spec needs to be on trial before it becomes gospel.*\n\nA few research papers and a couple of frameworks have something they call **multi-agent debate**. Two agents argue, a third synthesizes. This is a real technique with real research behind it, and it’s a meaningful improvement over single-agent reasoning.\n\nIt is not Grilling.\n\nThe differences matter, and they’re worth being precise about.\n\nMost debate setups in current frameworks operate on whatever’s in the prompt - they don’t first survey the codebase, the existing tests, the past failures, the applicable constraints. 
**The result is two LLMs hallucinating at each other politely.**\n\n**The Advocate** invents objections that don’t apply to the actual system; **the Proposer** defends positions against attacks that wouldn’t matter even if they landed. Without **a Recon Dossier** in front of both agents, the debate is theater. It produces dialogue, not decisions.\n\nGrilling refuses to start until the ground truth is established and verified. That’s not a stylistic choice - it’s the only way the attacks have weight.\n\nStandard debate optimizes for **the best version of a chosen position**. Two agents start with opposing views and the synthesizer extracts what’s strongest from each. This is genuinely useful when you’ve already decided to do **something** and you’re trying to figure out the best way.\n\nGrilling optimizes **for a different thing entirely: whether the position should be held at all.** The Proposer isn’t defending a position because it was assigned to them; they’re proposing a solution they actually think is correct, and **the Advocate** is trying to dismantle that proposal.\n\nThe legitimate output of Grilling is **kill the idea entirely**. The legitimate output of standard debate is rarely **neither side has a point.**\n\nAnd this might be the most important one. Standard debate ends when both sides have made their case - typically after a fixed number of rounds, or when the orchestrator decides the discussion has matured. That’s a procedural ending, not a substantive one. The debate stops because the **schedule says it stops**, not because the question has been resolved.\n\nGrilling has a structural stopping condition: equilibrium between two opposing pressures. The attacker has nothing left. The Don has nothing left. Both pressures simultaneously exhausted. Until that condition is met, the rounds continue (up to the hard ceiling). 
After that condition is met, no more rounds - they’d add nothing.\n\nThe stopping condition is the whole game.\n\nIf your debate stops on **we’re done arguing**, you’re polishing turds - you exit with whatever the agents converged on, regardless of whether what they converged on was correct.\n\nIf it stops on **no new valid objection AND no remaining concerns**, you have something stronger: a verdict that survived attack, with the surviving objections explicitly logged.\n\nMulti-agent debate is a useful tool. It’s just a different tool, solving a different problem.\n\nGrilling sits as Phase 2 of the Heist Pipeline. It’s not a prompt and it’s not a standalone tool - it’s **a phase with hard gates before it** (Reconnaissance must complete and produce a Recon Dossier) and after it (the Don must sign off on the verdict before anything moves to the Sit-Down).\n\nThe process runs like a structured interrogation. Three subagents have specific roles, the Don (the user) participates in every round, and the rounds follow a fixed order.\n\n**The Proposer** opens. It reads the Recon Dossier - the verified findings from Phase 1 - and proposes a solution: architecture, file changes, identified risks, expected behavior.\n\nThe Proposer’s job is to put the strongest possible version of the idea on the table. Not the safest version, not the most diplomatic version. The strongest. If the idea is bad, you want it to die fighting, not die mumbling.\n\n**The Devil’s Advocate** attacks. Architectural flaws. Security gaps. Constitution violations. Performance regressions. Scalability ceilings. Edge cases the Proposer didn’t think about. 
The Devil’s Advocate’s job - and this is important - is **to find the failure mode.**\n\nNot to be polite.\n\nNot to suggest improvements.\n\nTo attack.\n\nIf the Proposer says “we’ll cache this in Redis,” the Devil’s Advocate says:\n\nWhat happens when Redis is down?\n\nWhat happens when the cache is poisoned?\n\nHave you measured the actual cache hit rate or are you guessing?\n\nBad attacks get filtered by the Proposer’s response.\n\nGood attacks force a revision.\n\n**The Don** - that’s the user, you - weighs in every round. One question at a time. Never bundled. This rule matters more than it sounds.\n\nIf the Don asks three questions at once, the agent will answer the easy one fully, the medium one partially, and quietly skip the hard one. One question forces an actual answer. The Don’s questions are usually the most valuable in the whole Grilling, because the Don has context the agents don’t have - about the team, about the business, about the politics, about what’s been tried before that didn’t make it into the codebase.\n\n**The Synthesizer** closes each round. It incorporates the valid attacks and the Don’s feedback and produces a revised solution. Not a defense of the original - a revision. If nothing valid came up that round, the revision is small. If something hit hard, the revision is structural. Sometimes the revision is **kill this idea entirely and propose a different approach**, and that’s a legitimate outcome.\n\nThen the next round begins.\n\nTheory is cheap.\n\n**Here’s a real Round 1**, lightly edited for length, from a **Grilling session on a small feature**: adding a local high-score leaderboard to **a browser Tetris game**.\n\nThe task sounds trivial. Watch how fast “**trivial**” falls apart under attack.\n\n**The Proposer opens**. 
It reads the **Recon Dossier** - a three-layer pure state machine, zero DOM in the logic layer, 261 passing tests, a constitution whose first commandment is **game logic has zero DOM dependencies** - and puts a solution on the table:\n\nAdd a `name-entry` status to the state machine. Store the typed initials in `state.nameEntry = { buffer: '', maxLen: 3 }`. Detect game-over by reading the `gameOver` event from `state.events[]`, then route keystrokes into the buffer. Render the leaderboard as a DOM `<div>` overlay, mirroring the existing `settings.js`. A pure `isTopTen(score, entries)` function decides qualification.\n\nReasonable on its face. Mirrors existing patterns. Names real files. This is the **strongest** version, not the safest - exactly what the Proposer is supposed to deliver.\n\n**The Devil’s Advocate attacks.** It doesn’t suggest improvements. It goes for failure modes, citing the Dossier as ammunition:\n\nAttack 1: `state.nameEntry` violates Constitution Commandment 1. The name buffer is pure UI interaction state. It has zero effect on scoring, physics, or piece movement. Putting it in `state.js` pollutes the logic layer that the entire test suite depends on being DOM-free. It belongs in `main.js` as local orchestration state. Evidence: `state.js` is the logic layer; 261 tests import it headless.\n\nAttack 2: `state.events[]` is the wrong trigger. The Dossier describes that array as existing “for renderer/audio coupling.” It does not specify whether it’s drained per-frame or persistent. If the renderer or audio handler consumes it before `main.js` reads it, the game-over trigger is silently missed. That’s a race condition waiting to happen. Evidence: Dossier - “Events array for renderer/audio coupling,” no drain semantics specified.\n\nAttack 3: key suppression is unspecified. `S` (settings), `M` (mute), `P` (pause), `R` (restart), `L` (leaderboard) all have live bindings in `main.js`. 
Type a name containing “S” and the settings panel opens mid-entry. The Proposer flags this as a risk but offers no mechanism. Evidence: existing `main.js` key handlers.\n\nPlus a secondary finding the Proposer never mentioned: **names rendered from localStorage straight into the DOM are a stored XSS vector** unless the write uses `textContent` instead of `innerHTML`.\n\nVerdict: **CHALLENGE**. The idea is sound. The solution has three holes that need closing before anyone writes a spec.\n\n**The Don weighs in** - one question at a time, never bundled.\n\nDo you agree with the attack? → Agree fully.\n\nAnything the Devil’s Advocate missed? → Nothing missed.\n\nWant to override any part of the proposal before synthesis? → No override.\n\nThree forced answers. No place to hide an evasive “well, sort of.”\n\n**The Synthesizer revises.** Not a defense of the original - a revision that absorbs every valid hit:\n\n**Remove** `nameEntry` **from** `state.js` **entirely**. The buffer becomes `let nameBuffer = ''`, local to `main.js`. The state machine keeps only the `'name-entry'` status string - that genuinely *is* game state, because it gates input routing and rendering.\n\n**Abandon** `state.events[]` **for orchestration**. `main.js` tracks `prevStatus` across ticks; when status flips to game-over, the transition fires. No array drain, no race. The events array keeps its existing renderer/audio role untouched.\n\n**Hard early-return guard** at the top of the `keydown` handler: while `status === 'name-entry'`, all keys route to `handleNameEntryKey()` and nothing else can fire.\n\n**All DOM writes use** `textContent`. XSS closed.\n\n**One round.**\n\nThe “trivial” feature now has a clean separation between UI and logic state, a race condition designed out before it existed, and a security hole closed before a line of code was written. 
That’s the difference between **we agreed** and **it survived attack**.\n\nThis particular session ran a second round - the Devil’s Advocate came back with six more attacks (no max-length cap on the buffer, a pure-vs-DOM module split, JSON schema validation on load) - before reaching equilibrium. The full verdict, with every rejected alternative and every accepted risk, flowed straight into the Sit-Down as the basis for the Contract.\n\n**Hard rules on rounds.**\n\nMinimum 2 rounds - you can’t grill an idea in a single pass, because the first attack is always shallow. Default maximum 5 rounds - most ideas resolve here, either by surviving or by transforming into something different.\n\nHard ceiling 7 - the Don can extend, but not beyond, because past round 7 returns diminish sharply and you’re usually just rationalizing at that point. Early exit only after round 2, only by explicit Don call - used when convergence is genuinely fast and continuing would be theater.\n\n**Termination is not “we agreed.”**\n\nAgreement is the easiest thing in the world to manufacture between an LLM and another LLM, and between an LLM and a tired user. Termination is one of three structural conditions:\n\n**Nash Equilibrium** is the canonical one. The Devil’s Advocate raises no new valid objection AND the Don has no remaining concerns. Both attacking pressures have run out of ammunition simultaneously. The idea has genuinely survived attack - not because the attack stopped, but because the attack hit nothing that wasn’t already accounted for. This is the outcome you want.\n\n**Explicit consensus** is the fast-path. The Don ends the Grilling after round 2, declaring that the idea has been adequately tested. This is appropriate when the problem is genuinely simple, when the Recon Dossier already addressed the major risks, or when the team has high confidence from prior similar work. 
It’s a real exit, but it’s the Don’s call to make, not the agents’.\n\n**Round limit** is the safety valve. If neither equilibrium nor consensus is reached by round 7, the Grilling ends - but the unresolved objections don’t disappear. They get logged into the verdict as accepted risks. The Don is explicitly carrying them forward.\n\nThis matters: it means a forced termination doesn’t pretend the idea is clean. It just makes the dirt explicit. Six months later, when something breaks, you can look at the verdict and see exactly which risk was knowingly accepted.\n\nThe output of Grilling isn’t a spec. It’s a verdict - Key Decisions (and why), Rejected Alternatives (and why they were rejected), Unresolved Objections (and what risks the Don is carrying forward), and the Termination Reason.\n\nThe verdict is held in-context, not written to a file - it flows directly into the next phase, the Sit-Down, where the actual Contract gets drafted.\n\nOnly after Grilling does anything get written down as a Contract.\n\nOnly after the Contract gets signed does code get planned.\n\nOnly after the plan does code get written.\n\n**Five gates before a single line of implementation.** That sounds heavy, and on small tasks it is - which is why Grilling has explicit “skip” conditions, which I’ll get to below.\n\nBut for anything that’s actually load-bearing, the cost of skipping any of those gates is higher than the cost of running them. Always. The point of the pipeline is that the friction is **real** friction, not theatrical friction. It catches things.\n\nYes, this adds tokens. Recon plus Grilling cost real money on every feature, and on a moderate-sized change the overhead is non-trivial - I’ll publish hard numbers from instrumented runs separately. The bet is that the cost of arguing about a bad idea is always smaller than the cost of building one. So far that bet has held.\n\nI’m not going to pretend this is universal. It isn’t. 
Grilling is a serious tool with serious overhead, and it has clear failure modes when applied wrong.\n\n**The first failure mode** is using Grilling on changes that don’t deserve it. If the task is fixing a typo, bumping a dependency version, or renaming a variable - Recon plus 2 rounds of Grilling is absurd. You’ll spend more tokens debating the change than implementing it, and the agents will start manufacturing fake objections to fill the rounds because there genuinely aren’t real ones to raise.\n\nThe Devil’s Advocate will say something like *have we considered backwards compatibility for users who depend on this exact variable name?* and you’ll know the system has descended into theater.\n\n**The second failure mode** is using Grilling on pure refactors with a verified baseline. If the existing code already works, the tests already pass, and the goal is to clean up structure without changing behavior - the original decision was already grilled (or should have been) when the original code was written.\n\nRe-grilling at refactor time is litigating a settled question. The right thing in that case is a different gate: **a behavior-preservation check**, not **a should-this-exist** check.\n\n**The third failure mode** is using Grilling during exploratory prototyping, where the entire point is to fail fast and learn. If you’re spiking out three different approaches to see which one is even tractable, you don’t want each spike to get a full adversarial review - you want to throw cheap code at the problem and see what survives contact with reality. Grilling here actively kills the exploration.\n\n**The fourth failure mode** is using Grilling under genuine time pressure when the cost of being wrong is small. Production is on fire, the fix is small, you’re confident in the diagnosis, and the cost of an extra hour of debate is real customer pain. Skip it.\n\nDocument what you did. 
If the fix turns out to be wrong, that’s what the Ledger is for - you log the failure and feed it into Reconnaissance for next time.\n\n**So when should you grill?**\n\nUse Grilling for new features that touch architectural decisions - anything where the structural shape of the change matters, not just its correctness.\n\nUse Grilling for changes that introduce a new dependency, a new external integration, a new data model - these are the changes where the cost of getting it wrong propagates for years.\n\nUse Grilling for security-relevant changes, where the failure mode is *we shipped a vulnerability* - the Devil’s Advocate role is genuinely valuable here, because security failures are exactly the failures that careful, well-meaning people miss.\n\nUse Grilling any time the cost of building the wrong thing is meaningfully larger than the cost of arguing about it for an hour.\n\nThe decision rule is brutal but simple: **how much will it cost to undo this if you’re wrong?** If the answer is more than the Grilling itself, grill it. If the answer is less, don’t.\n\nThe hard part isn’t applying the rule.\n\nThe hard part is being honest about which side of the rule a given task falls on.\n\nMost engineers underestimate the cost of being wrong, because the cost is mostly invisible - it shows up later, in the form of technical debt, integration headaches, security audits that find old shortcuts, and refactors that take months to unwind.\n\nGrilling is the moment you pay that cost up front, in tokens and minutes, instead of paying it later in engineer-years.\n\nIf your framework doesn’t have a Grilling phase, your framework is a productivity tool for shipping bad ideas faster.\n\nThat’s a real product. There’s a market for it. Plenty of people want their bad idea shipped quickly and don’t want to be told it’s bad. Fine. Ship it. 
Sell it.\n\nTo be fair, most existing frameworks aren’t claiming to do this - they’re claiming to enforce rigor in **implementation**, and they do that genuinely well.\n\nSpec-Kit, MUSUBI, Tessl, the rest - within their scope, they’re honest about what they offer. The problem is the gap between what they offer and what users **think** they’re getting.\n\nIf you read the marketing, “spec-driven” sounds like the spec is the source of truth. It isn’t. The spec is just the input that the rigor machinery operates on. The spec itself was never on trial.\n\nThe next generation of AI frameworks won’t be the ones with more agents, longer context, or fancier orchestration. It’ll be the ones brave enough to tell the user **no** before writing a single line of spec.\n\nThat’s the bar. Almost everyone is below it. The hole is right there in the middle of every framework, and we’re all stepping around it pretending it isn’t there.\n\nStop pretending.\n\nRecon the ground. Grill the idea. Kill the bad ones. Build the survivors.\n\nThat’s the whole job.\n\n[Part 1](https://dev.to/kucherenko/16-frameworks-one-blind-spot-20cg) mapped the landscape and named the gap. Part 2 showed what filling the gap actually looks like - the agents, the rounds, the termination conditions, the failure modes.\n\nThe next pieces in this series will go deeper into the rest of the Heist Pipeline: the Sit-Down (where the Contract gets signed), Resource Development (where the plan gets built), the Hit (where code finally gets written), and Laundering (where everything gets verified and logged into the Ledger).\n\nEach of these phases has the same general design philosophy - explicit gates, named artifacts, no phase skippable - but they solve different problems.\n\nIf you build agentic systems and you’ve felt the productivity-tool-shipping-bad-ideas-faster problem, [Gangsta Agents](https://gangsta.page/) is open source. It’s a young project (first stable release in April 2026, v1.1.1). 
Issues, PRs, and adversarial critique of the framework itself are all welcome. Especially the last one - it would be embarrassing to ship a framework about Grilling without grilling the framework.\n\n[← Part 1: Sixteen Frameworks. One Blind Spot.](https://dev.to/kucherenko/16-frameworks-one-blind-spot-20cg)\n\n*Gangsta Agents is an open-source agentic framework built around a 6-phase Heist Pipeline: Reconnaissance → Grilling → Sit-Down → Resource Development → The Hit → Laundering. Every phase has a gate. No phase is skipped.*", "url": "https://wpnews.pro/news/the-grilling", "canonical_source": "https://dev.to/kucherenko/the-grilling-29d1", "published_at": "2026-05-28 14:00:37+00:00", "updated_at": "2026-05-28 14:24:43.265846+00:00", "lang": "en", "topics": ["ai-agents", "ai-safety", "ai-research", "large-language-models", "artificial-intelligence"], "entities": ["kucherenko"], "alternates": {"html": "https://wpnews.pro/news/the-grilling", "markdown": "https://wpnews.pro/news/the-grilling.md", "text": "https://wpnews.pro/news/the-grilling.txt", "jsonld": "https://wpnews.pro/news/the-grilling.jsonld"}}