I benchmarked Python AI-app security scanners. Here's what each catches.

A developer benchmarked four Python AI-application security scanners—Bandit, Semgrep, vulnhuntr, and getdebug—against ten hand-written Python fixtures and the simonw/llm codebase. Getdebug achieved 100% precision and recall on the test fixtures, catching all five AI-app vulnerability categories, while Bandit and Semgrep each detected only one category with 50% precision and 20% recall. Vulnhuntr failed to complete the benchmark due to crashes and rate-limit errors.

This week I shipped Python AI-app regex prefilters in getdebug 0.4.0 and benchmarked them against Bandit and Semgrep on real Python code. Here are the numbers and what each tool actually catches.

- Bandit (PyCQA): the Python-OSS standard security linter. Hand-written rules, free, fast, Python only.
- Semgrep: multi-language SAST with community rule packs. Hand-written rules, free, fast.
- vulnhuntr (Protect AI, open source): the stated category leader for LLM-driven AI-app static analysis. Python only.
- getdebug: pattern-based regex prefilters in JS/TS + Python (new in 0.4.0), plus optional local-LLM SAST via Ollama (free, on-device) and hosted (paid).

The fixture corpus is 10 hand-written Python fixtures, 5 vulnerable + 5 safe, one pair per AI-app category (pii-in-prompt, unsafe-role-merge, prompt-injection, unbounded-stream, unsafe-tool-output).

| Tool | TP | FP | FN | Precision | Recall |
|---|---|---|---|---|---|
| getdebug | 5 | 0 | 0 | 100% | 100% |
| bandit | 1 | 1 | 4 | 50% | 20% |
| semgrep | 1 | 1 | 4 | 50% | 20% |
| vulnhuntr | n/a | n/a | n/a | unable to complete; see below | |

Bandit and Semgrep both catch the unsafe-tool-output fixture via their generic `subprocess.run(shell=True)` rules. That's a TP on the vulnerable variant. But they also fire on the safe variant, the allowlist-then-run pattern:

    import subprocess

    # Safe pattern: Bandit + Semgrep both flag this as a FP
    ALLOWED = {"hosts": "cat /etc/hosts", "uptime": "uptime"}

    def handle(tool_call):
        cmd = ALLOWED.get(tool_call.input.tag)
        if not cmd:
            return "rejected"
        return subprocess.run(cmd, shell=True, capture_output=True).stdout

Neither tool knows `cmd` came from a static dict, not the model. They see `shell=True` and fire. getdebug's regex specifically requires the `tool_call.input.X` / `block.input.X` reference in the sink arg, so the allowlist-then-run pattern stays clean.

Both tools miss the other four behavioural categories (pii-in-prompt, unsafe-role-merge, prompt-injection, unbounded-stream) entirely. The rule packs don't contain patterns for `{"role": "system", "content": f"...{name}..."}`. That's the gap.

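
The difference between a generic `shell=True` rule and a rule that also requires a model-controlled reference in the sink argument can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not getdebug's actual implementation: both regexes, the helper names `generic_shell_true` and `is_unsafe_tool_output`, and the two sample snippets are hypothetical and written for this example only.

```python
import re

# Hypothetical generic rule: fires on any subprocess.run(...) call with shell=True.
GENERIC_SHELL = re.compile(r"subprocess\.run\([^)]*shell\s*=\s*True[^)]*\)")

# Hypothetical AI-app rule (an assumption, not getdebug's real pattern):
# fires only when the call's arguments mention tool_call.input.<attr>
# or block.input.<attr>, i.e. a value that came from the model.
TAINTED_SINK = re.compile(
    r"subprocess\.run\([^)]*\b(?:tool_call|block)\.input\.\w+[^)]*\)"
)

# Made-up vulnerable snippet: model output goes straight into the shell.
VULNERABLE = """
def handle(tool_call):
    return subprocess.run(tool_call.input.cmd, shell=True, capture_output=True).stdout
"""

# The allowlist-then-run pattern: the sink argument is a value from a static dict.
SAFE = """
ALLOWED = {"hosts": "cat /etc/hosts", "uptime": "uptime"}

def handle(tool_call):
    cmd = ALLOWED.get(tool_call.input.tag)
    if not cmd:
        return "rejected"
    return subprocess.run(cmd, shell=True, capture_output=True).stdout
"""


def generic_shell_true(source: str) -> bool:
    """True if any subprocess.run call in the source uses shell=True."""
    return GENERIC_SHELL.search(source) is not None


def is_unsafe_tool_output(source: str) -> bool:
    """True if a subprocess.run call takes a model-controlled input directly."""
    return TAINTED_SINK.search(source) is not None


for label, src in (("vulnerable", VULNERABLE), ("safe", SAFE)):
    print(label, generic_shell_true(src), is_unsafe_tool_output(src))
```

The generic rule fires on both snippets, while the narrower rule fires only on the vulnerable one. A regex like this is deliberately shallow: it would miss a tainted value that is first assigned to a local variable, which is the usual trade-off of a prefilter against full data-flow analysis.
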
We ran all three working tools against simonw/llm (https://github.com/simonw/llm), Simon Willison's clean CLI for LLMs (48 Python files).

| Tool | Total findings | Signal |
|---|---|---|
| bandit | 1,189 | 1,158 are `assert_used` (pytest); zero AI-app coverage |
| semgrep | 3 | 3 generic-SAST hits; zero AI-app coverage |
| getdebug | 6 | 6 AI-app findings: 1 prompt-injection, 5 unbounded-stream |

Bandit's 1,189 findings on 48 files are almost entirely the `assert_used` warning on pytest assertions, a well-known default everyone disables in real configs. Semgrep's 3 findings are real, but none is AI-app specific. getdebug's 6 are all AI-app categorized.

vulnhuntr is the stated category leader. We wanted a clean cross-check. We couldn't get one:

- `--llm claude-code` mode (the no-API-key option) crashes with ModuleNotFoundError in 1.2.2.
- `--llm gpt` with gpt-4o-mini fails pydantic validation on the response.
- `--llm gpt` with gpt-4o hits OpenAI's default 30K TPM rate limit on small accounts.

We'll re-benchmark when its 2026 stack stabilises.

If you ship Python code that calls an LLM, run all three. They're complementary:

    bandit -r .                            # general Python hygiene
    semgrep --config auto .                # cross-language SAST coverage
    npx @getdebug/cli@0.4.0 analyze .      # AI-app behavioural patterns

None of them subsumes the others. The first two catch general SAST; getdebug catches the "serialised the whole user object into the prompt" class that you can't hand-write a sustainable rule for in generic SAST.

Reproduce every number at getdebug.dev/bench (https://www.getdebug.dev/bench). Corpus and methodology are open at getdebug-ai/codesecbench (https://github.com/getdebug-ai/codesecbench).

Read it here: https://www.getdebug.dev/blog/python-ai-app-prefilters
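
To check how much of a Bandit report comes from a single rule, as with the `assert_used` findings on simonw/llm, a per-rule tally is enough. The sketch below assumes a report produced with `bandit -r . -f json` and parsed into a dict, with findings under a `results` key and a `test_name` field on each finding. The helper names and the four-entry `SAMPLE_REPORT` are made up for illustration and are not the benchmark data.

```python
from collections import Counter


def tally_by_rule(report: dict) -> Counter:
    """Count findings per rule name in a parsed Bandit JSON report."""
    return Counter(item["test_name"] for item in report.get("results", []))


def dominant_share(counts: Counter) -> tuple[str, float]:
    """Return the most frequent rule and its share of all findings."""
    total = sum(counts.values())
    if total == 0:
        return ("", 0.0)
    name, n = counts.most_common(1)[0]
    return (name, n / total)


# Toy report (illustrative only): three assert findings and one shell finding.
SAMPLE_REPORT = {
    "results": [
        {"test_id": "B101", "test_name": "assert_used"},
        {"test_id": "B101", "test_name": "assert_used"},
        {"test_id": "B101", "test_name": "assert_used"},
        {"test_id": "B602", "test_name": "subprocess_popen_with_shell_equals_true"},
    ]
}

counts = tally_by_rule(SAMPLE_REPORT)
name, share = dominant_share(counts)
print(f"{name}: {share:.0%} of {sum(counts.values())} findings")
# prints "assert_used: 75% of 4 findings"
```

Applied to the figures in the table above, 1,158 of 1,189 findings works out to roughly 97% from one rule. Bandit's `--skip B101` option is one way to drop that rule before comparing tools.
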