Show HN: We're inviting Anthropic to put the real Mythos 5 on our open benchmark

wpnews.pro


Source: realvuln.com · Published: 2026-06-12T12:44Z · Topic: large-language-models


An open benchmark for code vulnerability scanners shows that LLM-based tools outperform rule-based systems on semantic flaws like SQL injection and command injection, while rule-based tools remain competitive only on syntactic patterns. The dataset includes 697 real vulnerabilities and 120 false positive traps across 26 repositories, with 24 scanners tested. The benchmark's creators are inviting Anthropic to submit their Mythos 5 model for evaluation.


Scanners: 24 (3 categories)

Repositories: 26 (Python · Type 1)

Best F3 (strict): 92.4 (Kolega Enterprise)

Highest recall: 95.3% (Kolega Enterprise)

Highest precision: 93.2% (Grok 4.20)

Leaderboard

[Interactive leaderboard table and charts: Precision vs. recall; Performance vs. cost (F3 vs. cost); Recall ranking (fraction of vulnerabilities found); Precision ranking (fraction of flags that were real); By category (three-tier summary).]

### Detection by vulnerability class

Recall %, best by approach.

▸ LLM-based scanners dominate classes that need semantic data-flow understanding: SQL injection, command injection, insecure deserialization.

▸ Rule-based tools stay competitive only on syntactic patterns, and even there overall recall remains low.

### Dataset composition

697 vulnerabilities · 120 FP traps · 26 repositories

#### Findings

Real vulnerabilities: 697

FP traps: 120 (14.7%)

CWE families: 18

Python LOC: 20,062

Scanner categories: 3

Scanners tested: 24

[Chart: Frameworks (26 repos)]

All figures are live RealVuln results across 24 scanners and 26 repositories. F3 is the F-beta score with beta = 3, so it weights recall nine times as heavily as precision; strict mode counts unfinished repositories as misses. Cost is the total spend for the scored run (rule-based tools are free or variably priced).
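The scoring described above can be sketched in a few lines. This is a minimal illustration, not RealVuln's actual implementation: it assumes the standard F-beta formula with beta = 3 (which matches "recall nine times over precision") and assumes strict mode simply treats every real vulnerability in an unfinished repository as a false negative. The names `f_beta`, `strict_scores`, `truth`, and `results` are hypothetical.

```python
def f_beta(precision: float, recall: float, beta: float = 3.0) -> float:
    """Standard F-beta score; with beta=3, recall is weighted beta**2 = 9x."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0.0:
        return 0.0  # no true positives at all
    return (1.0 + b2) * precision * recall / denom


def strict_scores(truth: dict, results: dict, beta: float = 3.0) -> tuple:
    """Return (precision, recall, F-beta) under an assumed 'strict' rule.

    truth:   repo name -> number of real vulnerabilities in that repo
    results: repo name -> (true positives, false positives), finished repos only
    A repo missing from `results` is unfinished, so all its vulnerabilities
    count as misses (false negatives).
    """
    tp = fp = fn = 0
    for repo, n_real in truth.items():
        r_tp, r_fp = results.get(repo, (0, 0))
        tp += r_tp
        fp += r_fp
        fn += n_real - r_tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f_beta(precision, recall, beta)


# Hypothetical data: the scanner finished repo "a" but not repo "b".
truth = {"a": 10, "b": 10}
results = {"a": (8, 2)}
precision, recall, f3 = strict_scores(truth, results)
print(f"precision={precision:.2f} recall={recall:.2f} F3={f3:.3f}")
```

Because beta is 3, the unfinished repository hurts the score mostly through recall: in the made-up data, precision stays at 0.80 while recall drops to 0.40, and F3 lands much closer to the recall figure than to the precision figure.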

source & further reading

realvuln.com — original article
