This week I shipped Python AI-app regex prefilters in getdebug 0.4.0 and benchmarked them against Bandit and Semgrep on real Python code. Here are the numbers and what each tool actually catches.
Bandit (PyCQA) — the Python-OSS standard security linter. Hand-written rules, free, fast, Python only.
Semgrep — multi-language SAST with community rule packs. Hand-written rules, free, fast.
vulnhuntr (Protect AI, open source) — the stated category leader for LLM-driven AI-app static analysis. Python only.
getdebug — pattern-based regex prefilters in JS/TS + Python (new in 0.4.0). Plus optional local-LLM SAST via Ollama (free, on-device) and hosted (paid).
10 hand-written Python fixtures, 5 vulnerable + 5 safe, one pair per AI-app category (pii-in-prompt, unsafe-role-merge, prompt-injection, unbounded-stream, unsafe-tool-output).
Tool       TP   FP   FN   Precision   Recall
getdebug   5    0    0    100%        100%
bandit     1    1    4    50%         20%
semgrep    1    1    4    50%         20%
vulnhuntr  -    -    -    (unable to complete; see below)
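For readers who want to check the arithmetic, precision is TP / (TP + FP) and recall is TP / (TP + FN). The sketch below recomputes both from the counts in the table; the function and variable names are illustrative only and not part of any of the tools.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall); NaN when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall

# (TP, FP, FN) copied from the fixture table above.
RESULTS = {
    "getdebug": (5, 0, 0),
    "bandit": (1, 1, 4),
    "semgrep": (1, 1, 4),
}

for tool, counts in RESULTS.items():
    p, r = precision_recall(*counts)
    print(f"{tool:<9} precision={p:.0%} recall={r:.0%}")
```

With only 10 fixtures, each single finding moves these percentages a lot, so they are best read alongside the raw counts.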
Bandit and Semgrep both catch the unsafe-tool-output fixture via their generic subprocess.run(shell=True) rules. That's a TP on the vulnerable variant. But they also fire on the safe variant, the allowlist-then-run pattern:
import subprocess

# Static allowlist: the model only picks a tag, never the command string.
ALLOWED = {"hosts": "cat /etc/hosts", "uptime": "uptime"}

def handle(tool_call):
    cmd = ALLOWED.get(tool_call.input.tag)
    if not cmd:
        return "rejected"
    return subprocess.run(cmd, shell=True, capture_output=True).stdout
Neither tool knows cmd came from a static dict, not the model. They see shell=True and fire. getdebug's regex specifically requires the tool_call.input.X / block.input.X reference in the sink arg, so the allowlist-then-run pattern stays clean.
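The exact pattern getdebug ships is not reproduced in this post. As a minimal sketch of the idea, assuming a line-oriented regex and a hypothetical set of sink names, a prefilter could require the model-controlled reference to appear inside the sink's argument list:

```python
import re

# Hypothetical pattern, NOT getdebug's actual rule: match a shell/process sink
# only when tool_call.input.X or block.input.X appears among its arguments
# on the same line.
TAINTED_SINK = re.compile(
    r"\b(?:subprocess\.(?:run|call|Popen|check_output)|os\.system)\s*\("
    r"[^)\n]*\b(?:tool_call|block)\.input\.\w+"
)

def flags_unsafe_tool_output(source: str) -> bool:
    """True if any line passes model-controlled input straight into a sink."""
    return any(TAINTED_SINK.search(line) for line in source.splitlines())

VULNERABLE = "subprocess.run(tool_call.input.cmd, shell=True)"
SAFE = (
    "cmd = ALLOWED.get(tool_call.input.tag)\n"
    "return subprocess.run(cmd, shell=True, capture_output=True).stdout"
)

print(flags_unsafe_tool_output(VULNERABLE))
print(flags_unsafe_tool_output(SAFE))
```

The trade-off of a regex this narrow is that it will not follow a tainted value through an intermediate variable; it only sees the direct reference in the sink call.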
Bandit and Semgrep both miss the other four behavioural categories (pii-in-prompt, unsafe-role-merge, prompt-injection, unbounded-stream) entirely. The rule packs don't contain patterns for {"role": "system", "content": f"...{name}..."}. That's the gap.
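To make that shape concrete, here is a rough sketch of a regex that looks for a system-role message whose content is an f-string with an interpolated value. It is a hypothetical pattern written for this illustration, not one taken from getdebug, Bandit, or Semgrep, and it only handles the single-line dict-literal form.

```python
import re

# Hypothetical prefilter: "role": "system" followed by "content": f"...{expr}..."
SYSTEM_FSTRING = re.compile(
    r"""["']role["']\s*:\s*["']system["']\s*,\s*["']content["']\s*:\s*f["'][^"'\n]*\{[^}\n]+\}"""
)

def has_interpolated_system_prompt(line: str) -> bool:
    """True if the line builds a system message from an interpolated f-string."""
    return SYSTEM_FSTRING.search(line) is not None

INTERPOLATED = '{"role": "system", "content": f"You are helping {name}."}'
STATIC = '{"role": "system", "content": "You are a helpful assistant."}'
USER_ROLE = '{"role": "user", "content": f"{question}"}'

for sample in (INTERPOLATED, STATIC, USER_ROLE):
    print(has_interpolated_system_prompt(sample), sample)
```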
We ran all three (working) tools against simonw/llm, Simon Willison's clean CLI for LLMs, 48 Python files.
Tool       Total findings   Signal
bandit     1,189            1,158 are 'assert_used' (pytest); zero AI-app coverage
semgrep    3                3 generic-SAST hits; zero AI-app coverage
getdebug   6                6 AI-app findings: 1 prompt-injection, 5 unbounded-stream
Bandit's 1,189 findings on 48 files are almost entirely (1,158 of them) the assert_used warning on pytest assertions, a default check that is commonly disabled in real configs. Semgrep's 3 findings are real, but none is AI-app specific. getdebug's 6 all fall into AI-app categories.
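One way to see how much of a Bandit run is a single check is to tally its JSON report (as produced by bandit -r . -f json) by test ID. The sketch below assumes the report's "results" entries carry "test_id" and "test_name" keys; the sample report is made-up data for illustration, not output from the simonw/llm run.

```python
from collections import Counter

def tally_by_test(report: dict) -> Counter:
    """Count findings per Bandit check, keyed as '<test_id> <test_name>'."""
    return Counter(
        f'{r["test_id"]} {r["test_name"]}' for r in report.get("results", [])
    )

# Made-up sample with the same shape as a Bandit JSON report.
SAMPLE_REPORT = {
    "results": [
        {"test_id": "B101", "test_name": "assert_used", "filename": "tests/test_a.py"},
        {"test_id": "B101", "test_name": "assert_used", "filename": "tests/test_b.py"},
        {"test_id": "B602", "test_name": "subprocess_popen_with_shell_equals_true",
         "filename": "app.py"},
    ]
}

for check, count in tally_by_test(SAMPLE_REPORT).most_common():
    print(f"{count:>4}  {check}")

# Share of assert_used in the run reported above: 1,158 of 1,189 findings.
print(f"{1158 / 1189:.0%}")
```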
vulnhuntr is the stated category leader. We wanted a clean cross-check. We couldn't get one:
--llm claude-code mode (the no-API-key option) crashes with ModuleNotFoundError in 1.2.2.
--llm gpt with gpt-4o-mini fails pydantic validation on the response.
--llm gpt with gpt-4o hits OpenAI's default 30K TPM rate limit on small accounts.
We'll re-benchmark when its 2026 stack stabilises.
If you ship Python code that calls an LLM, run all three. They're complementary:
bandit -r . # general Python hygiene
semgrep --config auto . # cross-language SAST coverage
npx @getdebug/cli@0.4.0 analyze . # AI-app behavioural patterns
None of them subsumes the others. The first two catch general SAST issues; getdebug catches the "serialised the whole user object into the prompt" class, for which you can't hand-write a sustainable rule in generic SAST.
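As an illustration of that class, the sketch below contrasts serialising a whole user record into a prompt with passing only an explicit set of fields. The User type, field names, and helper functions are hypothetical and are not taken from the benchmark fixtures.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class User:
    name: str
    email: str
    ssn: str
    plan: str

def build_prompt_vulnerable(user: User, question: str) -> str:
    # The whole record, PII included, ends up in the prompt.
    return f"User profile: {json.dumps(asdict(user))}\nQuestion: {question}"

# Only these fields are ever allowed into a prompt.
PROMPT_FIELDS = ("name", "plan")

def build_prompt_safe(user: User, question: str) -> str:
    profile = {field: getattr(user, field) for field in PROMPT_FIELDS}
    return f"User profile: {json.dumps(profile)}\nQuestion: {question}"

user = User(name="Ada", email="ada@example.com", ssn="000-00-0000", plan="pro")
print(build_prompt_vulnerable(user, "Why was I charged twice?"))
print(build_prompt_safe(user, "Why was I charged twice?"))
```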
Reproduce every number at getdebug.dev/bench. Corpus and methodology are open at getdebug-ai/codesecbench.
Read the full post here: https://www.getdebug.dev/blog/python-ai-app-prefilters