Show HN: We're inviting Anthropic to put the real Mythos 5 on our open benchmark

An open benchmark for code vulnerability scanners shows that LLM-based tools outperform rule-based systems on semantic flaws like SQL injection and command injection, while rule-based tools remain competitive only on syntactic patterns. The dataset includes 697 real vulnerabilities and 120 false-positive traps across 26 repositories, with 24 scanners tested. The benchmark's creators are inviting Anthropic to submit its Mythos 5 model for evaluation.

RealVuln leaderboard: 24 scanners in 3 categories, 26 repositories (Python · Type 1)

Best F3 (strict): 92.4, Kolega Enterprise
Highest recall: 95.3%, Kolega Enterprise
Highest precision: 93.2%, Grok 4.20

Leaderboard, ranked by the active metric. Columns: Scanner, F3, Recall %, Prec %, Repos, Cost $

Charts:
▸ Precision vs. recall
▸ Performance vs. cost (F3 vs. cost)
▸ Recall ranking (fraction of vulnerabilities found)
▸ Precision ranking (fraction of flags that were real)
▸ By category (three-tier summary)
▸ Detection by vulnerability class (recall %, best by approach)

Detection by vulnerability class:
▸ LLM-based scanners dominate classes that need semantic data-flow understanding: SQL injection, command injection, insecure deserialization.
▸ Rule-based tools stay competitive only on syntactic patterns, and even there overall recall remains low.

Dataset composition: 697 vulnerabilities · 120 FP traps · 26 repositories
▸ Findings: real vulnerabilities vs. FP traps (FP traps 14.7%)
▸ 18 CWE families
▸ 20,062 Python LOC
▸ 5 frameworks (chart: frameworks across 26 repos)
▸ 24 scanners tested (chart: scanner categories)

All figures are live RealVuln results across 24 scanners and 26 repositories. F3 weights recall nine times over precision; strict mode counts unfinished repositories as misses. Cost is the total spend for the scored run (rule-based tools are free or variably priced). Metric definitions: methodology.html (scoring)
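
To make the scoring note concrete, here is a minimal Python sketch of an F3 score with a strict-mode tally. It uses the standard F-beta formula with beta = 3, which is where the ninefold weighting of recall comes from (beta squared = 9). The `RepoResult` type, its field names, and the `strict_scores` helper are hypothetical. The reading of strict mode (every vulnerability in an unfinished repository counts as missed) is an assumption, not the benchmark's documented implementation. Scores come out on a 0 to 1 scale, whereas the leaderboard appears to report them multiplied by 100.

```python
from dataclasses import dataclass


def f_beta(precision: float, recall: float, beta: float = 3.0) -> float:
    """Standard F-beta score; beta=3 weights recall beta**2 = 9 times over precision."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


@dataclass
class RepoResult:
    # Hypothetical per-repository tally; the names are illustrative only.
    total_vulns: int       # ground-truth vulnerabilities in the repository
    true_pos: int = 0      # real vulnerabilities the scanner flagged
    false_pos: int = 0     # flags that were not real (e.g. FP traps)
    finished: bool = True  # False if the scanner did not complete this repo


def strict_scores(results: list[RepoResult]) -> tuple[float, float, float]:
    """Return (precision, recall, F3) under the assumed strict-mode rule."""
    # Only finished repositories contribute flags.
    tp = sum(r.true_pos for r in results if r.finished)
    fp = sum(r.false_pos for r in results if r.finished)
    # Assumption: unfinished repositories still count in the denominator,
    # so all of their vulnerabilities are treated as misses.
    total = sum(r.total_vulns for r in results)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / total if total else 0.0
    return precision, recall, f_beta(precision, recall)


if __name__ == "__main__":
    runs = [
        RepoResult(total_vulns=10, true_pos=8, false_pos=2),
        RepoResult(total_vulns=10, finished=False),
    ]
    p, r, f3 = strict_scores(runs)
    print(f"precision={p:.3f} recall={r:.3f} F3={f3:.3f}")
```

Because recall dominates the denominator, a scanner with perfect precision but 50% recall scores well below one with 50% precision and perfect recall, which matches the stated intent of rewarding tools that find more vulnerabilities.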