# I benchmarked Python AI-app security scanners. Here's what each catches.

> Source: <https://dev.to/onfafanutifafa/i-benchmarked-python-ai-app-security-scanners-heres-what-each-catches-49je>
> Published: 2026-06-05 13:12:00+00:00

This week I shipped Python AI-app regex prefilters in getdebug 0.4.0 and benchmarked them against Bandit and Semgrep on real Python code. Here are the numbers and what each tool actually catches.

**Bandit** (PyCQA) — the Python-OSS standard security linter. Hand-written rules, free, fast, Python only.

**Semgrep** — multi-language SAST with community rule packs. Hand-written rules, free, fast.

**vulnhuntr** (Protect AI, open source) — the stated category leader for LLM-driven AI-app static analysis. Python only.

**getdebug** — pattern-based regex prefilters in JS/TS + Python (new in 0.4.0). Plus optional local-LLM SAST via Ollama (free, on-device) and hosted (paid).

The fixture set is 10 hand-written Python files: 5 vulnerable and 5 safe, one pair per AI-app category (pii-in-prompt, unsafe-role-merge, prompt-injection, unbounded-stream, unsafe-tool-output).

```
Tool        TP  FP  FN   Precision  Recall
getdebug     5   0   0    100%       100%
bandit       1   1   4    50%        20%
semgrep      1   1   4    50%        20%
vulnhuntr    —   —   —    (unable to complete; see below)
```
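The precision and recall columns follow directly from the TP/FP/FN counts. A minimal sketch of that arithmetic, using the counts from the table above (the function name `precision_recall` is only for illustration):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) as fractions in [0, 1]."""
    # Precision: share of reported findings that are real.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of real vulnerabilities that were reported.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# (TP, FP, FN) per tool, copied from the table above.
results = {
    "getdebug": (5, 0, 0),
    "bandit": (1, 1, 4),
    "semgrep": (1, 1, 4),
}

for tool, (tp, fp, fn) in results.items():
    p, r = precision_recall(tp, fp, fn)
    print(f"{tool:<10} precision={p:.0%} recall={r:.0%}")
```

With only five vulnerable fixtures, each missed category moves recall by 20 points, so these percentages are coarse.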

Bandit and Semgrep both catch the `unsafe-tool-output` fixture via their generic `subprocess.run(shell=True)` rules. That's a TP on the vulnerable variant. But they also fire on the **safe** variant, the allowlist-then-run pattern:

```
# Safe pattern: Bandit + Semgrep both flag this as a FP
import subprocess

# Static allowlist: the model picks a tag, never the command string itself
ALLOWED = {"hosts": "cat /etc/hosts", "uptime": "uptime"}

def handle(tool_call):
    cmd = ALLOWED.get(tool_call.input.tag)
    if not cmd:
        return "rejected"
    return subprocess.run(cmd, shell=True, capture_output=True).stdout
```

Neither tool knows `cmd` came from a static dict, not the model. They see `shell=True` and fire. getdebug's regex specifically requires the `tool_call.input.X` / `block.input.X` reference in the sink arg, so the allowlist-then-run pattern stays clean.
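To make the difference concrete, here is a minimal sketch of a prefilter that only fires when a model-controlled reference appears inside the sink's arguments. The pattern is an assumption written for illustration, not getdebug's actual rule; `TAINTED_SINK` and `flags` are hypothetical names.

```python
import re

# Hypothetical pattern: subprocess.run(...) whose argument list mentions
# tool_call.input.<field> or block.input.<field> on the same line.
TAINTED_SINK = re.compile(
    r"subprocess\.run\([^)\n]*\b(?:tool_call|block)\.input\.\w+"
)


def flags(source: str) -> bool:
    """Return True if any line passes model-controlled input to the sink."""
    return any(TAINTED_SINK.search(line) for line in source.splitlines())


# Vulnerable: the model-supplied value goes straight into the shell.
vulnerable = "subprocess.run(tool_call.input.cmd, shell=True)"

# Safe: the model-supplied tag only selects an entry from a static dict.
safe = (
    "cmd = ALLOWED.get(tool_call.input.tag)\n"
    "subprocess.run(cmd, shell=True, capture_output=True)"
)

print(flags(vulnerable))  # True
print(flags(safe))        # False
```

The trade-off of a line-local check like this is that it does not follow data flow: if the model input is first assigned to a variable and that variable is passed to the sink, this sketch stays silent.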

Both tools miss the other four behavioural categories (pii-in-prompt, unsafe-role-merge, prompt-injection, unbounded-stream) entirely. The rule packs don't contain patterns for `{"role": "system", "content": f"...{name}..."}`. That's the gap.
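A rule for that system-prompt shape might look like the sketch below. It is an assumed illustration: the regex and the name `SYSTEM_FSTRING` are hypothetical and not taken from getdebug, Bandit, or Semgrep.

```python
import re

# Hypothetical pattern: a dict literal with role "system" whose content is an
# f-string containing at least one {interpolation}.
SYSTEM_FSTRING = re.compile(
    r"""["']role["']\s*:\s*["']system["']\s*,\s*"""
    r"""["']content["']\s*:\s*f["'][^"'\n]*\{[^}\n]+\}"""
)

# Interpolates a variable into the system message: should be flagged.
interpolated = '{"role": "system", "content": f"You are helping {name}."}'

# Plain string literal in the system message: should stay clean.
static = '{"role": "system", "content": "You are a helpful assistant."}'

print(bool(SYSTEM_FSTRING.search(interpolated)))  # True
print(bool(SYSTEM_FSTRING.search(static)))        # False
```

This only matches the single-line dict-literal form with `role` before `content`; message objects built other ways would need their own patterns.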

We ran the three working tools against [simonw/llm](https://github.com/simonw/llm), Simon Willison's clean CLI for LLMs (48 Python files).

```
Tool        Total findings    Signal
bandit      1,189            1,158 are 'assert_used' (pytest);
                              zero AI-app coverage
semgrep     3                3 generic-SAST hits;
                              zero AI-app coverage
getdebug    6                6 AI-app findings: 1 prompt-injection,
                              5 unbounded-stream
```

Bandit's 1,189 findings on 48 files are almost entirely the `assert_used` warning on pytest assertions, a well-known default everyone disables in real configs. Semgrep's 3 findings are real, but none is AI-app specific. getdebug's 6 are all AI-app categorized.
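To see how much of a Bandit report is `assert_used` noise, you can tally findings by rule ID from its JSON output. The sketch below assumes the `results` / `test_id` / `test_name` layout of `bandit -f json`; the report itself is a tiny synthetic stand-in, not the findings from simonw/llm.

```python
import json
from collections import Counter

# Synthetic stand-in for `bandit -r . -f json` output. These three entries
# are invented for illustration only.
report = json.loads("""
{"results": [
  {"test_id": "B101", "test_name": "assert_used", "filename": "tests/test_a.py"},
  {"test_id": "B101", "test_name": "assert_used", "filename": "tests/test_b.py"},
  {"test_id": "B602", "test_name": "subprocess_popen_with_shell_equals_true", "filename": "app/run.py"}
]}
""")

# Count findings per rule ID, most frequent first.
by_rule = Counter(r["test_id"] for r in report["results"])

# Drop B101 (assert_used) to see what is left once the pytest noise is gone.
without_asserts = [r for r in report["results"] if r["test_id"] != "B101"]

print(by_rule.most_common())
print(len(without_asserts))
```

In a real config the usual fix is to suppress the rule at the source, for example with `bandit -r . --skip B101`.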

vulnhuntr is the stated category leader. We wanted a clean cross-check. We couldn't get one:

- `--llm claude-code` mode (no-API-key option) crashes with `ModuleNotFoundError` in 1.2.2.
- `--llm gpt` with `gpt-4o-mini` fails pydantic validation on the response.
- `--llm gpt` with `gpt-4o` hits OpenAI's default 30K TPM rate limit on small accounts.

We'll re-benchmark when its 2026 stack stabilises.

If you ship Python code that calls an LLM, run all three. They're complementary:

```
bandit -r .                              # general Python hygiene
semgrep --config auto .                  # cross-language SAST coverage
npx @getdebug/cli@0.4.0 analyze .       # AI-app behavioural patterns
```

None of them subsumes the others. Bandit and Semgrep cover general SAST; getdebug catches the "serialised the whole user object into the prompt" class, for which you can't hand-write a sustainable rule in generic SAST.

Reproduce every number at [getdebug.dev/bench](https://www.getdebug.dev/bench). Corpus and methodology are open at [getdebug-ai/codesecbench](https://github.com/getdebug-ai/codesecbench).

Read the full post at [https://www.getdebug.dev/blog/python-ai-app-prefilters](https://www.getdebug.dev/blog/python-ai-app-prefilters).