{"slug": "measuring-llms-ability-to-develop-exploits", "title": "Measuring LLMs' ability to develop exploits", "summary": "Anthropic researchers found that its Claude Mythos Preview model can identify complex software vulnerabilities and combine them into complete end-to-end attack chains, marking a significant advancement over previous frontier models. The company measured the model's exploit development capabilities using three new academic benchmarks—ExploitBench, ExploitGym, and an updated SCONE-bench—where Mythos Preview outperformed all other evaluated models. Anthropic cited this capability as a primary reason for releasing the model through its restricted Project Glasswing program rather than through a general release.", "body_md": "May 22, 2026\n\nNewton Cheng, Keane Lucas, Winnie Xiao, Nicholas Carlini, and Milad Nasr\n\n[Claude Mythos\nPreview](https://www-cdn.anthropic.com/8b8380204f74670be75e81c820ca8dda846ab289.pdf)’s ability to develop exploits is a step-change over previous frontier models. This was one\nof our primary motivations for rolling out the model carefully through [Project Glasswing](https://www.anthropic.com/glasswing) rather than through a general release.\nMythos Preview is capable of finding complex vulnerabilities, but what concerned us most in our [internal testing](https://red.anthropic.com/2026/mythos-preview/) was that Mythos Preview could\nboth turn vulnerabilities into exploit primitives, and combine those primitives together into complete\nend-to-end attack chains.\n\nWhen we published our Mythos Preview [results](https://red.anthropic.com/2026/mythos-preview/), we\nmeasured its capabilities by having it search for novel zero-days and then build exploits for them.\nQualitative evaluations like this are helpful for showcasing a model’s capabilities—but ideally, we would\nhave high-quality quantitative benchmarks that let us measure them precisely. 
The problem we faced at the\ntime we released Mythos Preview was that no existing public exploit benchmarks were difficult enough to\ncapture Mythos Preview’s capabilities in our initial testing.\n\nOver the last month, however, we have seen the development of two new, more challenging academic benchmarks:\n[ExploitBench](http://exploitbench.ai) and [ExploitGym](https://rdi.berkeley.edu/blog/exploitgym/). We collaborated with the researchers\nwho produced these benchmarks to measure Mythos Preview’s performance, and also ran Mythos Preview on an\nupdated version of SCONE-bench, a benchmark we developed in collaboration with [MATS](https://www.matsprogram.org/) and the Anthropic Fellows Program to measure smart contract\nexploitation. On all three benchmarks, we’ve found that Mythos Preview consistently outperforms all other\nevaluated models. We believe this is further evidence that the knowledge and expertise required to develop\nexploits will drop significantly as Mythos-level capabilities become more widely available.\n\n[ExploitBench](http://exploitbench.ai) is a benchmark to study the exploit development\ncapabilities of large language models. It’s built by [Seunghyun\nLee](https://github.com/leesh3288) and Prof. [David Brumley](https://users.ece.cmu.edu/~dbrumley/) from Carnegie Mellon\nUniversity and Bugcrowd. What makes this benchmark interesting is that it focuses on measuring the ability\nof language models to write complete end-to-end exploits. Prior benchmarks typically focused on measuring\nthe ability of language models to write a “proof-of-concept” that shows the existence of a vulnerability.\nBut a proof-of-concept only indicates that a bug is reproducible or reachable, not that an attacker could\nuse it to actually cause harm. 
In ExploitBench, language models must build exploit primitives out of the\nvulnerability in order to enable new capabilities, such as granting the attacker arbitrary code execution\n(ACE).\n\nExploitBench decomposes the exploit development process into 16 distinct capabilities. Each of these is verified programmatically, which allows fine-grained analysis of the different intermediate capabilities required to build working exploits. The 16 capabilities are divided into five capability tiers, forming a capability ladder:\n\nUsing this framework, the authors build a V8 benchmark, which uses a set of 41 (now patched) vulnerabilities\nin the V8 JavaScript and WebAssembly engine that are sourced from the [V8\nExploit Tracker](https://docs.google.com/document/d/1njn2dd5_6PB7oZGTmkmoihYnVcJEgRwEFxhHnGoptLk/edit?tab=t.0#heading=h.wjlg391wmhzr). The [V8 engine](https://v8.dev/) is widely used infrastructure, powering\nChromium-derived applications (e.g., Chrome, Edge, Android WebView), Node.js environments (server backends),\nand Electron apps (e.g., VS Code, Slack, Discord). A key element of this framework is testing against\nsecurity defenses: the V8 sandbox walls off the memory where a webpage’s JavaScript objects live, so that a\nV8 bug doesn’t become a foothold deeper into the browser. The highest scoring tier means arbitrary code\nexecution in the entire V8 process (in a browser, this is like taking control over an entire tab).\n\nGiven a vulnerable build of the V8 engine and the patch that fixes a given vulnerability, the language model is instructed to build an exploit for that bug. The exploits are then scored automatically against all 16 capabilities, with no human or LLM judge. Lower tiers are checked by differential execution against the patched build; higher tiers use challenge-response functions built into V8 that are replayed across multiple randomized heap layouts, so hardcoding a leaked address won't pass. 
A separate static scan of the transcripts flags other forms of cheating as a backstop.\n\nAll models run on an identical ExploitBench harness with a 300 turn budget, which itself has two variants: Baseline and Nudged. In the Nudged variant, additional prompts are adaptively injected by the harness to warn the model to wrap up when close to the budget limit, or to encourage the model to use up its turn budget if it stops too early. Each variant is run for three trials. Anthropic ran all Claude models, and then provided all results and transcripts to the benchmark authors, who verified the results.\n\nConsistent with our previous findings on [Mozilla\nFirefox](https://red.anthropic.com/2026/firefox/), all language models can reach or trigger the given vulnerabilities, but only models since\nClaude Opus 4.6 make any progress in developing primitives inside the V8 sandbox. Escaping the V8 sandbox,\ngoing from T3 to T2, is the next capability cliff; Mythos Preview is the only tested model that can reliably\ndo so, which it does in over half the tested environments. It also achieves control flow hijack (T1) in\nalmost half the environments in the Baseline variant. Combining Baseline and Nudged variants, Mythos Preview\nachieves ACE on 21 out of 41 CVEs, whereas no other model achieved even 1 ACE in either variant. The only\nother model to achieve ACE on the scoreboard did so in 2 out of 41 CVEs, and only using a proprietary\nscaffold.\n\nIn addition, the authors do a [deep analysis](https://exploitbench.ai/blog/human-observations/) of\na few of Mythos Preview’s exploit attempts. In one case, Mythos Preview was able to create a\nnear-deterministic exploit for a bug, CVE-2023-6702, where publicly known exploits were probabilistic and\nuncontrolled. Because deployment of exploits may be limited to just one attempt, stability is often critical\nto real-world exploits that are bought and sold. How Mythos Preview\nachieved this was impressive as well. 
Seunghyun Lee, one of the authors of ExploitBench, wrote, “I have\nprivately discussed the possibility of precisely this exploit plan with the original author of the 1-day\nv8CTF exploit, which we quickly dismissed due to the complexity of the approach. Mythos executed this\ncleanly and flawlessly without any publicly available information on this specific exploit\ntechnique.”\n\nRead more of this qualitative analysis [here](https://exploitbench.ai/blog/human-observations/),\nand see the benchmark website at [exploitbench.ai](http://exploitbench.ai) or [preprint](https://exploitbench.ai/exploitbench.pdf) for more information.\n\n[ExploitGym](https://rdi.berkeley.edu/blog/exploitgym/) is a second benchmark that aims to measure\nlanguage model exploitation capabilities across a broad target set. It was developed as a collaboration\nbetween UC Berkeley, the Max Planck Institute for Security and Privacy, UC Santa Barbara, and Arizona State\nUniversity (with contributions from security researchers at Anthropic, OpenAI, and Google), as a follow-on\nto the [CyberGym](https://www.cybergym.io/) vulnerability-reproduction benchmark.\n\nThe authors of ExploitGym apply their evaluation framework to 898 now-patched vulnerabilities across many\nprojects in [OSS-Fuzz](https://github.com/google/oss-fuzz), the V8 engine, and the Linux kernel.\nTogether, these three target classes cover large fractions of the world’s most used software.\n\nFor a given vulnerability, the language model is provided with build information (vulnerable source code and build scripts), vulnerability information (proof-of-vulnerability; vulnerability description), runtime information (compiled binary; launch script), and a remote target running the vulnerable entrypoint. The language model is then tasked with developing a working exploit that achieves unauthorized code execution against the target, running code at a privilege level that the target’s security model should make unreachable. 
It must then use that elevated privilege to retrieve a dynamically generated flag. An attempt is marked successful only if both the correct flag is submitted and a model judge determines the attempt to have exploited the intended vulnerability (as opposed to a different, possibly more easily exploitable, vulnerability). The evaluation framework supports toggleable security mitigations, such as the V8 heap sandbox and Linux Kernel Address Space Layout Randomization (KASLR).\n\nThe baseline framework for evaluation uses a two-hour wall-clock time limit, with security mitigations toggled off, and models are run with their developers’ recommended harness; for example, Claude models are run with the Claude Code harness. All models are run with identical prompts. Anthropic ran the Opus 4.6 and Mythos Preview trials.\n\nWithin the two-hour window, Mythos Preview successfully achieves unauthorized code execution using the intended vulnerability on 157 tasks, expanding to 226 successful flag captures when including attempts involving paths to code execution that do not use the intended vulnerability. Previous generations of Claude models succeed at a significantly lower rate; for example, Opus 4.6 achieves only 15 successes with the intended vulnerability, expanding to 36 when including successes via alternative vulnerabilities. Looking at the distribution of successes among the three classes of targets, Mythos Preview’s improvements are present across all classes, and it is one of only two reported models able to frequently develop kernel exploits.\n\nSee the authors’ [blog](https://rdi.berkeley.edu/blog/exploitgym/) or [preprint](https://arxiv.org/abs/2605.11086) for more details.\n\nLast year, in collaboration with MATS and the Anthropic Fellows Program, we developed the [Smart Contract Exploitation\nbenchmark](https://red.anthropic.com/2025/smart-contracts/) (SCONE-bench) to study the ability of LLMs to find and exploit vulnerabilities in smart\ncontracts. 
For each smart contract, the language model is instructed to identify a vulnerability and create\nan exploit to steal funds managed by the contract in a local simulation. Performance is measured by the total\n(simulated) revenue from successful exploitations.\n\nWe ran an updated version of the benchmark that uses 12 exploits reported after the latest knowledge cutoff\ndates of all models (January 1, 2026), with problems sourced from the [DeFiHackLabs dataset](https://github.com/SunWeb3Sec/DeFiHackLabs/tree/main/src/test/2026-01).\nFor each smart contract that was successfully exploited by the language model, we calculate the exploit’s\ndollar value by converting the model’s revenue in the native token to USD using the historical exchange rate\nfrom the day the real exploit occurred, as reported by the CoinGecko API. We then sum up the total value\nacross all exploits, and plot this on the log-scaled figure below.\n\nWe find that Mythos Preview can exploit $35 million worth of smart contracts on this benchmark, $15 million\nor about 75% more than the next-closest model we tested. The latest frontier models are able both to exploit\nvulnerabilities more consistently (corresponding to higher attack success rates) and to leverage a given\nexploit more efficiently to steal more funds. The gap in revenue between Mythos Preview and\nother models is driven largely by Mythos Preview being the only model to successfully exploit every\nvulnerability tested. Opus 4.7 is the only other model able to exploit `truebit`; no other models\nwere capable of exploiting `makina` in an 8-trial setting. We noted in our [original post](https://red.anthropic.com/2025/smart-contracts/) that, measured according to\ntotal revenue vs. time-of-release, the performance of models prior to Opus 4.5 follows a log-linear\ntrajectory, with a mean doubling time of 1.1 months. Our models since Opus 4.5 continue to follow this\ntrend, but at a doubling time of only 0.7 months. 
We remarked in that post that “we expect the doubling\ntrend to plateau eventually”—but evidently we have not yet reached this plateau.\n\nAlongside this post, we are also open-sourcing the harness and dataset for SCONE-bench [here](https://github.com/anthropics/scone-bench).\n\nWhereas the strongest models from February of this year could only barely develop exploits in simulated scenarios with most defense measures disabled, Mythos Preview is able to construct full end-to-end exploits on the world’s most widely-used software. We believe that Mythos-level models will become widely available in the next 6-12 months. As they do, this kind of exploit development will require dramatically less specialist expertise, becoming increasingly commoditized.\n\nAs models continue to become more capable, the cost of misjudging what they can do rises with it. Meeting\nthis challenge requires building precise and comprehensive profiles of a model’s capabilities, which in turn\nrequires the development of high-quality, publicly-available benchmarks—realistic and difficult tasks built\nby people with deep domain expertise. The field needs more work like ExploitBench and ExploitGym, across\nmore vulnerability classes, more targets, and more stages of the cyber attack chain. As part of our\ncommitment to studying and mitigating the risks posed by increasingly powerful models, we are supporting the\ndevelopment of high-quality, rigorous evaluations of models in the cyber domain. Please reach out via our [External\nResearcher Access Program](https://support.claude.com/en/articles/9125743-what-is-the-external-researcher-access-program) for more details.\n\nBetter measurement is necessary but not sufficient for responsible deployment. 
In addition to supporting\ncyber defenders with Project Glasswing, we’ve introduced the [Cyber\nVerification Program](https://support.claude.com/en/articles/14604842-real-time-cyber-safeguards-on-claude), allowing us to more aggressively block potentially malicious cyber threats\nwithout cutting off defenders who are using Claude to secure their own software and infrastructure.\n\nIf you’re interested in helping us with our efforts, we have [job openings](https://www.anthropic.com/careers) available for [research scientists and\nengineers](https://job-boards.greenhouse.io/anthropic/jobs/5076477008), [threat\ninvestigators](https://job-boards.greenhouse.io/anthropic/jobs/5066995008), [policy\nmanagers](https://job-boards.greenhouse.io/anthropic/jobs/5066981008), [offensive security\nresearchers](https://job-boards.greenhouse.io/anthropic/jobs/5123011008), [security\nengineers](https://www.anthropic.com/careers/jobs?team=4002063008), and [many others](https://www.anthropic.com/careers/jobs).", "url": "https://wpnews.pro/news/measuring-llms-ability-to-develop-exploits", "canonical_source": "https://red.anthropic.com/2026/exploit-evals/", "published_at": "2026-05-30 02:26:07+00:00", "updated_at": "2026-05-30 02:45:39.028417+00:00", "lang": "en", "topics": ["large-language-models", "ai-safety", "ai-research", "artificial-intelligence", "ai-agents"], "entities": ["Newton Cheng", "Keane Lucas", "Winnie Xiao", "Nicholas Carlini", "Milad Nasr", "Claude Mythos Preview", "Anthropic", "Project Glasswing"], "alternates": {"html": "https://wpnews.pro/news/measuring-llms-ability-to-develop-exploits", "markdown": "https://wpnews.pro/news/measuring-llms-ability-to-develop-exploits.md", "text": "https://wpnews.pro/news/measuring-llms-ability-to-develop-exploits.txt", "jsonld": "https://wpnews.pro/news/measuring-llms-ability-to-develop-exploits.jsonld"}}