Source: realvuln.com · Topic: large-language-models

Show HN: We're inviting Anthropic to put the real Mythos 5 on our open benchmark

An open benchmark for code vulnerability scanners shows that LLM-based tools outperform rule-based systems on semantic flaws such as SQL injection and command injection, while rule-based tools remain competitive only on syntactic patterns. The dataset includes 697 real vulnerabilities and 120 false-positive traps across 26 repositories, with 24 scanners tested. The benchmark's creators are inviting Anthropic to submit its Mythos 5 model for evaluation.

1 min read · published Jun 12, 2026

Scanners: 24 (3 categories)
Repositories: 26 (Python · Type 1)
Best F3 (strict): 92.4 (Kolega Enterprise)
Highest recall: 95.3% (Kolega Enterprise)
Highest precision: 93.2% (Grok 4.20)

Leaderboard

| # | Scanner | F3 | Recall % | Prec % | Repos | Cost $ |
|---|---|---|---|---|---|---|

Precision vs. recall

### Performance vs. cost

F3 vs. cost

### Recall ranking

Fraction of vulnerabilities found

### Precision ranking

Fraction of flags that were real

### By category

Three-tier summary

### Detection by vulnerability class

Recall %, best by approach

▸ LLM-based scanners dominate classes that need semantic data-flow understanding: SQL injection, command injection, insecure deserialization.
▸ Rule-based tools stay competitive only on syntactic patterns, and even there overall recall remains low.
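To illustrate the distinction between syntactic patterns and semantic data flow, here is a minimal sketch. The regex rule, the `rule_based_scan` helper, and both code samples are hypothetical and far simpler than any real scanner in the benchmark; they only show why an injection that is assembled across several statements is invisible to a single-line pattern.

```python
import re

# Hypothetical syntactic rule: flag execute() calls whose SQL is built inline,
# either as an f-string or as a double-quoted literal followed by + or %.
INLINE_SQL = re.compile(r'\.execute\(\s*(f"|"[^"]*"\s*[+%])')


def rule_based_scan(source: str) -> list[int]:
    """Return the 1-based line numbers that match the inline-SQL pattern."""
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if INLINE_SQL.search(line)
    ]


# Injection visible on a single line: the pattern can see it.
SAME_LINE = '''
def get_user(cur, name):
    cur.execute("SELECT * FROM users WHERE name = '" + name + "'")
'''

# Same flaw, but the tainted value flows through a helper and a variable.
# Finding it requires tracking `name` from parameter to execute().
SPLIT_FLOW = '''
def build_filter(name):
    return "name = '" + name + "'"

def get_user(cur, name):
    query = "SELECT * FROM users WHERE " + build_filter(name)
    cur.execute(query)
'''

same_line_hits = rule_based_scan(SAME_LINE)
split_flow_hits = rule_based_scan(SPLIT_FLOW)
print(same_line_hits)   # [3]
print(split_flow_hits)  # []
```

Both samples contain the same SQL injection, but only the first is flagged. Catching the second means following the untrusted `name` argument through `build_filter` and `query`, which is the kind of data-flow reasoning the findings above attribute to the LLM-based scanners.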

Dataset composition

697 vulnerabilities · 120 FP traps · 26 repositories

#### Findings

Real vulnerabilities: 697
FP traps: 120 (14.7%)

CWE families: 18
Python LOC: 20,062
Frameworks: 5
Scanners tested: 24

Charts: Frameworks (26 repos) · Scanner categories
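The 14.7% figure is consistent with FP traps being counted as a share of all labeled findings, that is, real vulnerabilities plus traps. A quick check using only the counts stated above (the variable names are illustrative):

```python
real_vulnerabilities = 697
fp_traps = 120

# All labeled findings in the dataset: true vulnerabilities plus planted traps.
total_findings = real_vulnerabilities + fp_traps
fp_trap_share = fp_traps / total_findings

print(f"{total_findings} findings, {fp_trap_share:.1%} FP traps")
# 817 findings, 14.7% FP traps
```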

All figures are live RealVuln results across 24 scanners and 26 repositories. F3 is the F-beta score with beta = 3, which weights recall nine times as heavily as precision; strict mode counts unfinished repositories as misses. Cost is the total spend for the scored run (rule-based tools are free or variably priced).
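The scoring described here can be sketched as follows. This is a minimal illustration, assuming the standard F-beta formula and a simplified reading of strict mode in which every vulnerability in an unfinished repository counts as a false negative. The function names, the per-repository schema (`total`, `tp`, `fp`, `finished`), and the sample counts are hypothetical, not RealVuln's scoring code or results.

```python
def f_beta(precision: float, recall: float, beta: float = 3.0) -> float:
    """Standard F-beta score: recall is weighted beta**2 times as much as precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def score(repos: list[dict], strict: bool = True) -> tuple[float, float, float]:
    """Aggregate (precision, recall, F3) over repositories.

    Each repo dict (hypothetical schema) has:
      total    - number of real vulnerabilities in the repo
      tp, fp   - true and false positives reported by the scanner
      finished - whether the scanner completed the repo
    """
    tp = fp = fn = 0
    for repo in repos:
        if repo["finished"]:
            tp += repo["tp"]
            fp += repo["fp"]
            fn += repo["total"] - repo["tp"]
        elif strict:
            # Assumption: strict mode treats every vulnerability here as missed.
            fn += repo["total"]
        # In non-strict mode an unfinished repo is simply left out.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_beta(precision, recall)


# Hypothetical scanner run: one finished repo, one that timed out.
sample = [
    {"total": 10, "tp": 9, "fp": 3, "finished": True},
    {"total": 10, "tp": 0, "fp": 0, "finished": False},
]
strict_result = score(sample, strict=True)
lenient_result = score(sample, strict=False)
print(strict_result)
print(lenient_result)
```

With these made-up counts, precision is 0.75 in both modes, but recall drops from 0.9 to 0.45 under strict mode because the ten vulnerabilities in the unfinished repository all count as misses. Since F3 leans so heavily on recall, that drop carries through almost directly to the final score.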
