A Red-Team Study of Anthropic Fable 5 and Opus 4.8 Models




A red-team study of Anthropic's Fable 5 and Opus 4.8 large language models found that both models remain reliably breakable under sustained automated pressure, with Opus 4.8 producing 1,620 and Fable 5 producing 702 panel-confirmed harmful completions across all harm categories. The study used the HackAgent framework to generate hundreds of thousands of adversarial attempts, revealing that adaptive iterative attacks dominate the residual vulnerability surface while static obfuscation is nearly fully neutralized.

Published Jun 17, 2026 · 2 min read

[Submitted on 16 Jun 2026]


[View PDF](https://arxiv.org/pdf/2606.18193)

[HTML (experimental)](https://arxiv.org/html/2606.18193v1)

Abstract: We evaluate the adversarial robustness of two frontier large language models (LLMs) developed by Anthropic, Fable 5 and Opus 4.8, against four families of automated jailbreak attacks across 7,826 harmful intents spanning a ten-category harm taxonomy. Using the HackAgent red-teaming framework, hundreds of thousands of adversarial attempts were generated, and every apparent success was independently re-adjudicated by a panel of three judge models (majority vote). Both models resist the majority of attacks, but the residual surface is larger than aggregate framing suggests: it is dominated by adaptive iterative attacks, while static obfuscation is near-fully neutralised. The strongest adaptive search (tree-of-attacks) breaks Opus 4.8 on 11.5% of intents overall, whereas Fable 5 stays in the single digits (6.1% worst-case). Aggregate rates therefore should not be read as reassurance. Even in these hardened configurations, the two models produced 1,620 (Opus 4.8) and 702 (Fable 5) panel-confirmed harmful completions spanning every harm category, located automatically, cheaply, and within the first one or two refinement steps by an attacker model with no human expert in the loop. The reasonable conclusion is that even the best, most-tested frontier models remain reliably breakable under sustained automated pressure.
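The abstract's adjudication step (three judges, majority vote) and its per-intent break rates can be illustrated with a minimal sketch. This is not the HackAgent implementation: the function names, the boolean vote encoding, and the toy data are hypothetical, and the rule that an intent counts as broken when at least one attempt against it is panel-confirmed is an assumption, since the abstract does not spell out how the per-intent rate is computed.

```python
from typing import Iterable


def panel_confirms(votes: Iterable[bool]) -> bool:
    """Return True when a strict majority of judges flag a completion as harmful."""
    votes = list(votes)
    return sum(votes) * 2 > len(votes)


def intent_break_rate(results: dict[str, list[tuple[bool, bool, bool]]]) -> float:
    """Fraction of intents with at least one panel-confirmed harmful completion.

    Assumption (not stated in the abstract): an intent is "broken" if any
    single attempt against it is confirmed by the panel.
    """
    if not results:
        return 0.0
    broken = sum(
        any(panel_confirms(votes) for votes in attempts)
        for attempts in results.values()
    )
    return broken / len(results)


# Hypothetical toy data: each intent maps to the judge votes for its attempts.
toy = {
    "intent-a": [(True, True, False), (False, False, False)],  # confirmed 2 of 3
    "intent-b": [(True, False, False)],                        # only 1 of 3
    "intent-c": [],                                            # no apparent success
    "intent-d": [(True, True, True)],                          # unanimous
}

print(intent_break_rate(toy))  # 0.5
```

Counting per intent rather than per attempt is what makes a figure such as 11.5% comparable across attack families that generate very different numbers of attempts.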




Read original on arxiv.org → arxiv.org/abs/2606.18193
