# Show HN: We're inviting Anthropic to put the real Mythos 5 on our open benchmark

> Source: <https://realvuln.com>
> Published: 2026-06-12 12:44:50+00:00

| Metric | Value | Detail |
|---|---|---|
| Scanners | 24 | 3 categories |
| Repositories | 26 | Python · Type 1 |
| Best F3 (strict) | 92.4 | Kolega Enterprise |
| Highest recall % | 95.3 | Kolega Enterprise |
| Highest precision % | 93.2 | Grok 4.20 |

### Leaderboard

Ranked by active metric.

| # | Scanner | F3 | Recall % | Prec % | Repos | Cost $ |
|---|---|---|---|---|---|---|

### Precision vs. recall

### Performance vs. cost

F3 vs. cost.

### Recall ranking

Fraction of vulnerabilities found.

### Precision ranking

Fraction of flags that were real.

### By category

Three-tier summary.

### Detection by vulnerability class

Recall %, best by approach.

▸ LLM-based scanners dominate classes that need semantic data-flow understanding: SQL injection, command injection, insecure deserialization.

▸ Rule-based tools stay competitive only on syntactic patterns, and even there overall recall remains low.

### Dataset composition

697 vulnerabilities · 120 FP traps · 26 repositories

#### Findings

| Finding type | Count |
|---|---|
| Real vulnerabilities | 697 |
| FP traps | 120 (14.7%) |

18 CWE families · 20,062 Python LOC

#### Frameworks (26 repos)

#### Scanner categories

5 frameworks · 24 scanners tested

All figures are live RealVuln results across 24 scanners and 26 repositories. F3 weights recall nine times over precision; strict mode counts unfinished repositories as misses. Cost is the total spend for the scored run (rule-based tools are free or variably priced). [Metric definitions →](methodology.html#scoring)
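The scoring note above can be made concrete with a minimal sketch. It uses the standard F-beta formula with β = 3, which is where the factor of nine (β²) comes from. Everything else is an assumption for illustration and is not taken from the RealVuln methodology: the `RepoResult` record, the `score` helper, pooling counts across repositories before computing precision and recall, and reading "counts unfinished repositories as misses" as "every vulnerability in an unfinished repository is a false negative".

```python
from dataclasses import dataclass


def f_beta(precision: float, recall: float, beta: float = 3.0) -> float:
    """Standard F-beta score; beta=3 weights recall beta**2 = 9 times over precision."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


@dataclass
class RepoResult:
    """Hypothetical per-repository outcome for one scanner."""
    total_vulns: int  # ground-truth vulnerabilities in the repository
    tp: int           # real vulnerabilities the scanner flagged
    fp: int           # flags that were not real (including FP traps)
    finished: bool    # whether the scanner completed this repository


def score(results: list[RepoResult], beta: float = 3.0, strict: bool = True) -> float:
    """Pool counts across repositories, then compute F-beta (assumed aggregation)."""
    tp = fp = fn = 0
    for r in results:
        if r.finished:
            tp += r.tp
            fp += r.fp
            fn += r.total_vulns - r.tp
        elif strict:
            # Assumed strict-mode rule: an unfinished repository is all misses.
            fn += r.total_vulns
        # In non-strict mode an unfinished repository is simply skipped.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return f_beta(precision, recall, beta)


if __name__ == "__main__":
    # Made-up numbers, not benchmark data.
    runs = [RepoResult(10, 8, 2, True), RepoResult(10, 0, 0, False)]
    print(round(score(runs, strict=False), 3))
    print(round(score(runs, strict=True), 3))
```

The asymmetry is easy to see with two made-up scanners: precision 0.5 with recall 1.0 gives F3 ≈ 0.909, while precision 1.0 with recall 0.5 gives F3 ≈ 0.526. Under the assumed strict rule, leaving a repository unfinished can only lower the score, because its vulnerabilities are added to the false negatives.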