{"slug": "show-hn-we-re-inviting-anthropic-to-put-the-real-mythos-5-on-our-open-benchmark", "title": "Show HN: We're inviting Anthropic to put the real Mythos 5 on our open benchmark", "summary": "An open benchmark for code vulnerability scanners shows that LLM-based tools outperform rule-based systems on semantic flaws like SQL injection and command injection, while rule-based tools remain competitive only on syntactic patterns. The dataset includes 697 real vulnerabilities and 120 false positive traps across 26 repositories, with 24 scanners tested. The benchmark's creators are inviting Anthropic to submit their Mythos 5 model for evaluation.", "body_md": "Scanners: 24 (3 categories)\n\nRepositories: 26 (Python · Type 1)\n\nBest F3 (strict): 92.4 (Kolega Enterprise)\n\nHighest recall %: 95.3 (Kolega Enterprise)\n\nHighest precision %: 93.2 (Grok 4.20)\n\n### Leaderboard\n\nranked by active metric\n\n| # | Scanner | F3 | Recall % | Prec % | Repos | Cost $ |\n|---|---|---|---|---|---|---|\n\n### Precision vs. recall\n\n### Performance vs. cost\n\nF3 vs cost\n\n### Recall ranking\n\nfraction of vulnerabilities found\n\n### Precision ranking\n\nfraction of flags that were real\n\n### By category\n\nthree-tier summary\n\n### Detection by vulnerability class\n\nrecall %, best by approach\n\n▸ LLM-based scanners dominate classes that need semantic data-flow understanding: SQL injection, command injection, insecure deserialization.\n\n▸ Rule-based tools stay competitive only on syntactic patterns, and even there overall recall remains low.\n\n### Dataset composition\n\n697 vulnerabilities · 120 FP traps · 26 repositories\n\n#### Findings\n\nReal vulnerabilities\nFP traps (14.7%)\n\nCWE families: 18\n\nPython LOC: 20,062\n\n#### Frameworks (26 repos)\n\n#### Scanner categories\n\nFrameworks: 5\n\nScanners tested: 24\n\nAll figures are live RealVuln results across 24 scanners and 26 repositories. 
F3 weights recall nine times over precision; strict mode counts unfinished repositories as misses. Cost is the total spend for the scored run (rule-based tools are free or variably priced). [Metric definitions →](methodology.html#scoring)", "url": "https://wpnews.pro/news/show-hn-we-re-inviting-anthropic-to-put-the-real-mythos-5-on-our-open-benchmark", "canonical_source": "https://realvuln.com", "published_at": "2026-06-12 12:44:50+00:00", "updated_at": "2026-06-12 12:48:47.145956+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "ai-safety", "ai-research", "ai-products"], "entities": ["Kolega Enterprise", "Grok 4.20", "Anthropic", "Mythos 5"], "alternates": {"html": "https://wpnews.pro/news/show-hn-we-re-inviting-anthropic-to-put-the-real-mythos-5-on-our-open-benchmark", "markdown": "https://wpnews.pro/news/show-hn-we-re-inviting-anthropic-to-put-the-real-mythos-5-on-our-open-benchmark.md", "text": "https://wpnews.pro/news/show-hn-we-re-inviting-anthropic-to-put-the-real-mythos-5-on-our-open-benchmark.txt", "jsonld": "https://wpnews.pro/news/show-hn-we-re-inviting-anthropic-to-put-the-real-mythos-5-on-our-open-benchmark.jsonld"}}