{"slug": "i-benchmarked-python-ai-app-security-scanners-here-s-what-each-catches", "title": "I benchmarked Python AI-app security scanners. Here's what each catches.", "summary": "A developer benchmarked four Python AI-application security scanners (Bandit, Semgrep, vulnhuntr, and getdebug) against ten hand-written Python fixtures and the simonw/llm codebase. Getdebug achieved 100% precision and recall on the test fixtures, catching all five AI-app vulnerability categories, while Bandit and Semgrep each detected only one category with 50% precision and 20% recall. Vulnhuntr failed to complete the benchmark due to crashes and rate-limit errors.", "body_md": "This week I shipped Python AI-app regex prefilters in getdebug 0.4.0 and benchmarked them against Bandit and Semgrep on real Python code. Here are the numbers and what each tool actually catches.\n\n**Bandit** (PyCQA): the Python-OSS standard security linter. Hand-written rules, free, fast, Python only.\n\n**Semgrep**: multi-language SAST with community rule packs. Hand-written rules, free, fast.\n\n**vulnhuntr** (Protect AI, open source): the stated category leader for LLM-driven AI-app static analysis. Python only.\n\n**getdebug**: pattern-based regex prefilters for JS/TS and Python (Python support is new in 0.4.0), plus optional LLM-based SAST, either local via Ollama (free, on-device) or hosted (paid).\n\nThe fixture corpus is 10 hand-written Python files: 5 vulnerable and 5 safe, one pair per AI-app category (pii-in-prompt, unsafe-role-merge, prompt-injection, unbounded-stream, unsafe-tool-output).\n\n```\nTool        TP  FP  FN   Precision  Recall\ngetdebug     5   0   0    100%       100%\nbandit       1   1   4    50%        20%\nsemgrep      1   1   4    50%        20%\nvulnhuntr    -   -   -    (unable to complete; see below)\n```\n\nBandit and Semgrep both catch the `unsafe-tool-output` fixture via their generic `subprocess.run(shell=True)` rules. That's a TP on the vulnerable variant. 
But they also fire on the **safe** variant, the allowlist-then-run pattern:\n\n```\nimport subprocess\n\n# Safe pattern: Bandit + Semgrep both flag this as a FP\nALLOWED = {\"hosts\": \"cat /etc/hosts\", \"uptime\": \"uptime\"}\ndef handle(tool_call):\n    # cmd comes from the static allowlist, never from model output\n    cmd = ALLOWED.get(tool_call.input.tag)\n    if not cmd:\n        return \"rejected\"\n    return subprocess.run(cmd, shell=True, capture_output=True).stdout\n```\n\nNeither tool knows `cmd` came from a static dict, not the model. They see `shell=True` and fire. getdebug's regex specifically requires the `tool_call.input.X` / `block.input.X` reference in the sink arg, so the allowlist-then-run pattern stays clean.\n\nBoth tools miss the other four behavioural categories (pii-in-prompt, unsafe-role-merge, prompt-injection, unbounded-stream) entirely. The rule packs don't contain patterns for `{\"role\": \"system\", \"content\": f\"...{name}...\"}`. That's the gap.\n\nWe ran all three (working) tools against [simonw/llm](https://github.com/simonw/llm), Simon Willison's clean CLI for LLMs (48 Python files).\n\n```\nTool        Total findings    Signal\nbandit      1,189            1,158 are 'assert_used' (pytest);\n                              zero AI-app coverage\nsemgrep     3                3 generic-SAST hits;\n                              zero AI-app coverage\ngetdebug    6                6 AI-app findings: 1 prompt-injection,\n                              5 unbounded-stream\n```\n\nBandit's 1,189 findings on 48 files are almost entirely the `assert_used` warning on pytest assertions (1,158 of them), a well-known default that real configs routinely disable. Semgrep's 3 findings are real, but none is AI-app specific. getdebug's 6 all fall into AI-app categories.\n\nvulnhuntr is the stated category leader. We wanted a clean cross-check. 
We couldn't get one:\n\n- `--llm claude-code` mode (the no-API-key option) crashes with `ModuleNotFoundError` in 1.2.2.\n- `--llm gpt` with `gpt-4o-mini` fails Pydantic validation on the response.\n- `--llm gpt` with `gpt-4o` hits OpenAI's default 30K TPM rate limit on small accounts.\n\nWe'll re-benchmark when its 2026 stack stabilises.\n\nIf you ship Python code that calls an LLM, run all three. They're complementary:\n\n```\nbandit -r .                              # general Python hygiene\nsemgrep --config auto .                  # cross-language SAST coverage\nnpx @getdebug/cli@0.4.0 analyze .       # AI-app behavioural patterns\n```\n\nNone of them subsumes the others. The first two catch general SAST issues; getdebug catches the \"serialised the whole user object into the prompt\" class that you can't hand-write a sustainable rule for in generic SAST.\n\nReproduce every number at [getdebug.dev/bench](https://www.getdebug.dev/bench). Corpus and methodology are open at [getdebug-ai/codesecbench](https://github.com/getdebug-ai/codesecbench).\n\nRead the full post at [https://www.getdebug.dev/blog/python-ai-app-prefilters](https://www.getdebug.dev/blog/python-ai-app-prefilters).", "url": "https://wpnews.pro/news/i-benchmarked-python-ai-app-security-scanners-here-s-what-each-catches", "canonical_source": "https://dev.to/onfafanutifafa/i-benchmarked-python-ai-app-security-scanners-heres-what-each-catches-49je", "published_at": "2026-06-05 13:12:00+00:00", "updated_at": "2026-06-05 13:42:44.873589+00:00", "lang": "en", "topics": ["ai-safety", "ai-tools", "ai-research", "artificial-intelligence", "mlops"], "entities": ["Bandit", "Semgrep", "vulnhuntr", "getdebug", "Protect AI", "PyCQA", "Ollama"], "alternates": {"html": "https://wpnews.pro/news/i-benchmarked-python-ai-app-security-scanners-here-s-what-each-catches", "markdown": "https://wpnews.pro/news/i-benchmarked-python-ai-app-security-scanners-here-s-what-each-catches.md", "text": 
"https://wpnews.pro/news/i-benchmarked-python-ai-app-security-scanners-here-s-what-each-catches.txt", "jsonld": "https://wpnews.pro/news/i-benchmarked-python-ai-app-security-scanners-here-s-what-each-catches.jsonld"}}