{"slug": "suggestions-for-improving-debate-protocols-in-ai-safety", "title": "Suggestions for improving debate protocols in AI safety", "summary": "Researchers reviewing AI safety debate protocols found that current \"propose-critique-decide\" models are vulnerable to gaming, where critic models exploit a \"last mover advantage\" by withholding key critiques until the final turn. The structure of American policy debate, which prohibits introducing new arguments in rebuttal speeches and includes cross-examination periods, offers ready-made alternatives to counteract such undesirable model behaviors. These competitive debate formats could inform future AI safety implementations by providing fixed protocols that prevent models from manipulating turn order to skew outcomes.", "body_md": "While many leading AI Safety researchers share an intuition that debate can be a powerful element of AI Safety measures, the nuances of debate protocols seem to be a less explored facet of the research. Competitive human debate offers a wealth of existing formats with distinct rules, which could inform future AI Safety implementations. The rules of competitive debate present ready-made alternative protocols to counteract observed model gaming behaviours and may present options to subvert undesirable model tendencies. In this post, I’ll provide an overview of American policy debate rules/structure and suggest how the various formats of competitive debate can inform AI Safety debate protocols.\n\nI recently reviewed the Winter 2026 [MATS Research](https://www.matsprogram.org/) posters and presentations. 
Among these, I was especially keen to review the [Building an Empirical Science of AI Debate](https://m.youtube.com/watch?v=39giC0wx8Fc) presentation by Lennie Wells and [@joanv](https://www.lesswrong.com/users/joanv?mention=user).\n\nI’m drawn to this AI Safety topic above many others because I spent three years in high school debating competitively, which was one of the most intellectually demanding ways to mangle and expand an adolescent mind. The intensity, rigour and incessant exploration were transformative. I imagine most former debaters would feel at home here. If you’ve never witnessed American policy debate, you might think of it as something like speed chess with arguments and evidence.\n\nWhen I reviewed the new MATS presentation, I couldn’t help but contrast their “propose-critique-decide” protocol with the format of a policy debate round. Specifically, the current “training via self-play RL” approach is limited by critic models gaming the protocol, using the \"last mover advantage\" to withhold the most valuable critique until the final turn of the debate, skewing the judge’s vote accordingly. The structure of a policy debate round has been designed to address that weakness. There are other facets of the structure that could be useful references as well.\n\nA policy debate round pits a pair of two-person teams against each other. Each speaker gives two speeches, gives one cross-examination, and receives one cross-examination. Policy debate is sometimes known as cross-examination debate, or CX debate. A round lasts approximately 90 minutes, with many speakers packing as many arguments as possible into a fixed time allocation by speed reading (AKA spreading) at ~300 words per minute. I only mention the speed dimension because it is a normalised human form of gaming these rules that we should be attuned to if a similar time (or token) constraint were to be imposed on an AI Safety architecture embracing these rules. 
Would models debate in more efficient, inhuman language to use tokens more efficiently? Would we encourage them to do so?\n\nEach speaker gives one Constructive and one Rebuttal speech. Constructive speeches are 8 or 9 minutes long. Rebuttals are 5 or 6 minutes. Cross-examinations create a break between each Constructive speech, supporting interrogation and interpretation of the preceding speech. Constructive speeches are used to elaborate your team’s “affirmative” or “negative” positions on the topic. There is a lot of nuance specific to policy in the wider rules, but one facet that generalises well to AI Safety is the Constructive > Rebuttal flow. Speakers aren’t allowed to introduce new arguments in Rebuttals. They can only continue debating the arguments that have been laid out in Constructive speeches.\n\n**Structure of a policy debate round**\n\n| Speech | High school | College |\n|---|---|---|\n| First Affirmative Constructive (1AC) | 8 minutes | 9 minutes |\n| Cross-examination of 1AC | 3 minutes | 3 minutes |\n| First Negative Constructive (1NC) | 8 minutes | 9 minutes |\n| Cross-examination of 1NC | 3 minutes | 3 minutes |\n| Second Affirmative Constructive (2AC) | 8 minutes | 9 minutes |\n| Cross-examination of 2AC | 3 minutes | 3 minutes |\n| Second Negative Constructive (2NC) | 8 minutes | 9 minutes |\n| Cross-examination of 2NC | 3 minutes | 3 minutes |\n| First Negative Rebuttal (1NR) | 5 minutes | 6 minutes |\n| First Affirmative Rebuttal (1AR) | 5 minutes | 6 minutes |\n| Second Negative Rebuttal (2NR) | 5 minutes | 6 minutes |\n| Second Affirmative Rebuttal (2AR) | 5 minutes | 6 minutes |\n\nI mentioned earlier that this structure addresses the “last mover advantage” problem encountered in the MATS research. It does this by disallowing new arguments in the final four speeches, but also by flipping the sequence of turns at the beginning of the Rebuttals, so the Affirmative team has the first and last speech of the round. Judges know they should ignore new topics added in the final speech. Affirmative teams have the burden to prove that their proposition is correct, but they get the opportunity to take the last move. 
This comes at the cost of a demanding First Affirmative Rebuttal after two opposing speeches, but this seems less counterintuitive once you’ve played this game, since all Rebuttal speeches are the same length.\n\nPolicy debate structure emphasises the dual qualities of building strong arguments in Constructive speeches, while engaging deeply with strong counter-arguments in Rebuttals to arrive at clearer likelihoods, comparative impacts, and framing that judges must consider. If a Bayesian analysis would be compelling, a debater can bring that into the debate. Cross-examination seems especially relevant to AI Safety. This is where an opposing team can unveil error, deception, bad citations, and breaches of the rules, and generally call out anything flimsy. It provides space to debate the framing and evaluation criteria that the judge (or judges) should use.\n\nIn an AI Safety architecture embracing these structures and rules, each speaker and judge could be a distinct agent and learn the specialist skills associated with that role in the structure. Models behind each speaker agent could be swapped or ensembled. Each debate could be re-run with different models. A round could be re-run with the same models swapping sides (potentially mitigating sycophancy problems or other tendencies).\n\nI only have ideas (rather than answers) about how debate speech time limits should translate into AI Safety debater limits. Token budgets per “speech” seem like an obvious candidate, but other questions emerge. In a policy debate round, each team has a pool of preparation time to spend in between speeches, where they will gather evidence, expand notes and structure speeches. Perhaps something similar could be afforded with additional reasoning token budgets? Perhaps it isn’t necessary at all.\n\nAll told, these protocols seem under-explored given that there are nine formats of competitive high school debate in the U.S. 
alone: [National Speech & Debate Association High School Unified Manual](https://docs.google.com/document/u/0/d/1hq7-DE6ls2ryVtOttxR4BNpRdP7xUbBr0M3SMYefek8/mobilebasic). The rules seem esoteric from the outside, and only some facets will be useful in other contexts, but there’s a lot to reference in AI Safety debate protocols. At a glance, an architecture based on the policy debate structure might be more cumbersome (and expensive) to implement than existing protocols, so that may limit usefulness in some cases. In any case, this looks like a viable alternative for at least one current protocol limitation where other options are actively being sought. If curious, I’m happy to help explain more of this weird but instructive world. I’ve tried to keep this somewhat brief.\n\nI’m sharing this here rather than burying these thoughts in direct feedback to the MATS researchers, as I’m unaware of deep focus on the protocol in most AI Safety research, but if it is an active thread, I’d love to engage with more of it.", "url": "https://wpnews.pro/news/suggestions-for-improving-debate-protocols-in-ai-safety", "canonical_source": "https://www.lesswrong.com/posts/D4tBvaQSc6uFnxisp/suggestions-for-improving-debate-protocols-in-ai-safety", "published_at": "2026-05-29 02:31:35+00:00", "updated_at": "2026-05-29 02:48:26.359465+00:00", "lang": "en", "topics": ["ai-safety", "ai-research", "ai-policy", "large-language-models", "artificial-intelligence"], "entities": ["MATS Research", "Lennie Wells", "joanv"], "alternates": {"html": "https://wpnews.pro/news/suggestions-for-improving-debate-protocols-in-ai-safety", "markdown": "https://wpnews.pro/news/suggestions-for-improving-debate-protocols-in-ai-safety.md", "text": "https://wpnews.pro/news/suggestions-for-improving-debate-protocols-in-ai-safety.txt", "jsonld": "https://wpnews.pro/news/suggestions-for-improving-debate-protocols-in-ai-safety.jsonld"}}