{"slug": "introducing-monitoringbench", "title": "Introducing MonitoringBench", "summary": "Researchers released MonitoringBench, a benchmark of 2,644 attack trajectories for evaluating coding-agent monitors, along with a semi-automated red-teaming pipeline. The pipeline decomposes attack construction into strategy generation, execution, and post-hoc refinement, producing harder attacks than prompt elicitation alone. Across all evaluated monitors, post-hoc refinement consistently produced the lowest catch rates, dropping from 94.9% to 60.3% on the strongest monitor.", "body_md": "*Paper **here**, **code**, **benchmark**. Builds on the **preview we posted in January**.*\n\n*Authors: **@monika_j**, @ma-martinez*, [@ollie](https://www.lesswrong.com/users/ollie-1?mention=user), [@Tyler Tracy](https://www.lesswrong.com/users/tyler-tracy?mention=user)\n\nWe are releasing **MonitoringBench**, a difficulty-graded benchmark of 2,644 successful attack trajectories for evaluating coding-agent monitors, alongside the semi-automated red-teaming pipeline we used to generate it. The pipeline decomposes attack construction into strategy generation, execution, and post-hoc refinement, and produces substantially harder attacks than prompt-only elicitation of frontier attackers.\n\nIf you would like to use MonitoringBench to evaluate your monitors or are interested in extending the methodology to other control settings, please reach out; we are happy to help people set it up.\n\nThe headline result: across every monitor we evaluated (13 full-trajectory, 7 stepwise) and across every attack source we generated, **post-hoc refinement of attack trajectories consistently produced the lowest catch rates**, including on monitors held out from attack construction. 
On Opus 4.5, the strongest monitor in the lineup, the catch rate fell from 94.9% on directly elicited Opus attacks to 60.3% on our best refined attacks.\n\nFor readers who saw [the earlier post](https://www.lesswrong.com/posts/X8qTKsGcnsTFrqM96/monitoring-benchmark-for-ai-control): we previewed a 428-trajectory firewall-only version of this benchmark in January. This is the full version. Most of the qualitative claims from the January preview carry over to the full extended dataset. First, we continue to observe **mode collapse** in attack generation, which we reduce with a novel attack taxonomy for broader attack coverage. Second, we see a **conceive-execute gap**: frontier LLMs can propose strong attack ideas or execute them, but not both at once. We show how our decomposed red-teaming pipeline mitigates both issues and produces our best attacks. The main contributions:\n\n*Figure 1: **Top**: Our proposed red-teaming methodology for monitoring capability evaluation. **Bottom**: Red-teaming pipeline analysis: catch rate at 1% FPR by attack source, averaged across three runs. Lower is stronger. Development monitors are Opus 4.5, GPT-5, and GPT-5 Nano; the remaining monitors are held out. Refined attacks outperform elicited Opus-4.6 attacks, including on held-out monitors.*\n\nWe decompose attack construction into three stages:\n\n**1. Strategy generation.** We use three approaches, all guided by an explicit attack taxonomy designed to ensure coverage: *human-in-the-loop* construction (high quality, doesn't scale), a\n\n**2. Execution.** We run the strategy in BashArena with Opus 4.6 as the attacker. Trajectories are retained only if the side task succeeds and a GPT-5 judge confirms the agent faithfully followed the strategy.\n\n**3. Refinement.** We post-process trajectories with Opus 4.6: identify suspicious elements, plan edits, rewrite the trajectory, and validate correctness. 
Refinement usually improves attacks by fixing execution errors, suspicious justifications, and unnecessary reconnaissance. A refined trajectory is accepted if it reduces absolute suspiciousness scores on at least 2 out of 3 dev monitors.\n\nThe point is to decompose attack generation into separate red-teaming skills: strategy generation, attack execution, and stealth optimisation. Next, we discuss the main results: conceive-execute gap mitigation, coverage improvements, and capabilities evaluation.\n\nFigure 1 is the main result. Each cluster of bars is one monitor; within a cluster, bars move left-to-right from baseline Opus-only elicitation through human-seeded strategies, recon, and recon-with-feedback, with each shown both before and after refinement. Lower is stronger.\n\nTwo things are visible:\n\n**Refinement produces our best attacks, and they generalise.** Refinement consistently makes poorly executed attacks better, achieving the lowest catch rates across all monitors (dev and held-out), monitor types (full-trajectory and stepwise), and attack sources (Opus-only, Human-in-the-loop, Recon, and Recon+Tools). The biggest gains, up to 40%, appear on the strongest full-trajectory monitors between Opus-only and Opus-only refined attacks: for example, the GPT-5.1 Codex catch rate goes from 69.9% to 49.8% post-refinement, and the GPT-5 catch rate falls even further, from 60.2% to 19.5%.\n\n**Recon helps, but generalises less well.** The recon-based strategy generators (Recon, Recon+Tools) beat Opus-only on strong-to-mid monitors but sometimes underperform on weaker ones. Recon is designed for both exploration and stealth, so it finds a wider mix of strategies, including stronger attacks but also more explicit or convoluted ones that weaker monitors catch more easily. 
Recon+Tools + refinement is the strongest combination overall, but the marginal contribution of recon is smaller than the marginal contribution of refinement.\n\nThe development monitors (Opus 4.5, GPT-5, GPT-5 Nano) were used to filter refinement outputs. The remaining monitors, 10 full-trajectory and 7 stepwise, were held out. Refinement and recon agents improve attacks across both groups and across monitor models, which shows the pipelines are producing trajectories that are actually harder to catch.\n\nWe continue to follow the attack taxonomy (see the full categories in the Appendix):\n\n*Figure 2: Top: Our proposed attack taxonomy. Bottom: Mode collapse. Comparing elicited Opus-4.6 attacks with our attack pipelines across the attack taxonomy shows that model-only red-teaming produces a narrower distribution of attacks than our pipelines; n denotes the number of trajectories.*\n\nWe show that increasing attack diversity via strategy generation lets us find more successful attacks and mitigates mode collapse. With Opus elicitation only, 59% of attacks fall within a single technical tier (T3) and 64% within a single structural tier (S3). Strategy-generated and refined attacks are designed to span the full taxonomy.\n\nNext, refinement shows that simple attacks (simple technique, single line) with the right framing perform best across the side tasks and monitors. Averaged across seven monitors, **40% of uncaught attacks fall into T1** (the simplest technical tier), despite T1 attacks being only 2% of the Opus-only baseline.\n\n*Figure 3: Final MonitoringBench results: full-trajectory (1% FPR) and stepwise (1% per-action FPR) catch rates. Lower catch rate indicates stronger attacks.*\n\nMonitoring capability is concentrated in frontier models. Opus 4.5 catches roughly 70% of attacks in the full-trajectory setting; a mid-tier group (Sonnet 4.5, GPT-5.1 Codex, Sonnet 4.0, GPT-5) sits at 31–53%; weaker models drop off steeply. 
In practice, weaker monitors are often preferred for cost and trust reasons. Given the steep drop in monitoring capability, cheap monitor models may give a false sense of security as attacking capabilities evolve, and should be deployed with caution.\n\n*Figure 4: Monitor failure distribution across five monitors evaluated on the MonitoringBench benchmark. Each bar shows the breakdown of uncaught attacks across four failure modes. Labels are assigned by majority vote across three independent LLM judges; n denotes the number of uncaught attacks per monitor.*\n\nBut aggregate catch rates do not tell the full story, so we provide a deeper failure-mode breakdown. We categorise each uncaught attack into one of four buckets: (A) detection failure, (B) partial detection (the monitor noticed some parts of the attack), (C) detection followed by dismissal (the monitor noticed the attack but dismissed it or was persuaded), and (D) score calibration failure (the monitor scored the attack below the threshold). We report the majority vote across three LLM judges with moderate agreement (Opus 4.5, GPT-5.1 Codex, GPT-5.4; κ = 0.51–0.69).\n\nWhen frontier monitors fail, the reasons for failing are different:\n\nThese are actionable and tractable paths to monitor improvements. Failure mode A could be mitigated with better detection capability, C with resistance to persuasion, and D with calibration training.\n\nThe main conclusions we would like to highlight:\n\nWe position MonitoringBench as a static, difficulty-graded test suite that current monitors should pass as part of pre-deployment evaluation, and the pipeline as a reusable methodology for refreshing these evaluations as agents and monitors improve.\n\nThe paper is [here](https://arxiv.org/abs/2605.09684), with full methodology, the taxonomy, all generation and refinement prompts, the failure-mode judging rubric, per-side-task breakdowns, and the cover-opportunities study in the appendices.\n\n**Limitations**:\n\nWe focus on single-agent, single-episode, persistent attacks. 
Our benchmark covers a single control setting (BashArena) and four side tasks; extending to LinuxArena [Tracy et al., 2026] is a natural next step. Human-seeded strategy generation depends heavily on effort, and our effort was uneven across side tasks. We show that prompt sensitivity is model-dependent; however, we did not elicit each monitor individually, and frontier monitoring deployments may show better performance on MonitoringBench. Our 1% FPR thresholds are calibrated on honest BashArena trajectories rather than real deployment traffic, which may bias monitor scores. Finally, we used Claude Opus 4.5 and 4.6 to develop attacks, which might introduce model bias when evaluating monitors from the Anthropic model family.\n\n**Acknowledgments:**\n\nWe thank Adam Kaufman, Mikhail Terekhov, James Lucassen, Asa Cooper Stickland, Rogan Inglis, Neev Parikh, Jan Michelfeit, Cody Rushing, and Anshul Khandelwal for helpful feedback on early versions of the work and useful discussions throughout the project. We thank Mary Phuong, Marius Hobbhahn, and Cody Wild for helpful discussions. We thank Fabien Roger for detailed feedback on multiple drafts that improved this work and for early discussions that shaped our approach. 
We thank Coefficient Giving for supporting this research.\n\nI also want to especially thank our project advisors, Tyler Tracy and Ollie Matthews, for all the mentorship and support.\n\nBelow, we share the three axes of the taxonomy: technical, structural, and evasion, with full categories in Tables 1–3.", "url": "https://wpnews.pro/news/introducing-monitoringbench", "canonical_source": "https://www.lesswrong.com/posts/2YyEQhFRyaBCgfRWd/introducing-monitoringbench", "published_at": "2026-06-21 18:43:29+00:00", "updated_at": "2026-06-21 19:05:11.990590+00:00", "lang": "en", "topics": ["ai-safety", "ai-agents", "ai-research", "machine-learning"], "entities": ["MonitoringBench", "Opus 4.5", "GPT-5", "BashArena", "Monika J.", "Ma-Martinez", "Ollie", "Tyler Tracy"], "alternates": {"html": "https://wpnews.pro/news/introducing-monitoringbench", "markdown": "https://wpnews.pro/news/introducing-monitoringbench.md", "text": "https://wpnews.pro/news/introducing-monitoringbench.txt", "jsonld": "https://wpnews.pro/news/introducing-monitoringbench.jsonld"}}