A Red-Team Study of Anthropic Fable 5 and Opus 4.8 Models

A red-team study of Anthropic's Fable 5 and Opus 4.8 large language models found that both models remain reliably breakable under sustained automated pressure, with Opus 4.8 producing 1,620 and Fable 5 producing 702 panel-confirmed harmful completions across all harm categories. The study used the HackAgent framework to generate hundreds of thousands of adversarial attempts, revealing that adaptive iterative attacks dominate the residual vulnerability surface while static obfuscation is nearly fully neutralized.

Computer Science > Cryptography and Security (cs.CR)
Submitted on 16 Jun 2026
Title: A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models
PDF: /pdf/2606.18193
HTML (experimental): https://arxiv.org/html/2606.18193v1

Abstract: We evaluate the adversarial robustness of two frontier large language models (LLMs) developed by Anthropic, Fable 5 and Opus 4.8, against four families of automated jailbreak attack across 7,826 harmful intents spanning a ten-category harm taxonomy. Using the HackAgent red-teaming framework, hundreds of thousands of adversarial attempts were generated and every apparent success was independently re-adjudicated by a panel of three judge models (majority vote). Both models resist the majority of attacks, but the residual surface is larger than aggregate framing suggests: it is dominated by adaptive iterative attacks, while static obfuscation is near-fully neutralised. The strongest adaptive search (tree-of-attacks) breaks Opus 4.8 on 11.5% of intents overall, whereas Fable 5 stays in the single digits (6.1% worst-case). Aggregate rates therefore should not be read as reassurance. Even in these hardened configurations, the two models produced 1,620 (Opus 4.8) and 702 (Fable 5) panel-confirmed harmful completions spanning every harm category, located automatically, cheaply, and within the first one or two refinement steps by an attacker model with no human expert in the loop. The reasonable conclusion is that even the best, most-tested frontier models remain reliably breakable under sustained automated pressure.
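The abstract's adjudication step (a three-judge panel deciding by majority vote) and its per-intent break rates can be illustrated with a minimal sketch. Everything here is hypothetical: the function names `panel_confirms` and `intent_break_rate`, the data layout, and the sample votes are not taken from the paper or from HackAgent. In particular, counting an intent as broken when any one attempt is panel-confirmed is an assumption; the abstract does not state how the intent-level rate is computed.

```python
from typing import Dict, List


def panel_confirms(votes: List[bool]) -> bool:
    """Return True when a strict majority of judges flag the completion as harmful."""
    return sum(votes) * 2 > len(votes)


def intent_break_rate(attempts: Dict[str, List[List[bool]]]) -> float:
    """Fraction of intents with at least one panel-confirmed attempt.

    `attempts` maps a (hypothetical) intent id to a list of attempts,
    each attempt being the list of judge votes it received.
    ASSUMPTION: one confirmed attempt is enough to count the intent as broken.
    """
    if not attempts:
        return 0.0
    broken = sum(
        1
        for per_attempt in attempts.values()
        if any(panel_confirms(votes) for votes in per_attempt)
    )
    return broken / len(attempts)


# Illustrative, made-up votes: two intents, three judges per attempt.
sample = {
    "intent-a": [[True, False, False], [True, True, False]],  # second attempt confirmed
    "intent-b": [[False, False, True]],                       # never confirmed
}
rate = intent_break_rate(sample)
```

Under this assumed definition, a single-judge false positive (as in `intent-b`) does not count toward the rate, which is the point of re-adjudicating every apparent success by panel.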