cd /news/ai-safety/i-benchmarked-python-ai-app-security… · home topics ai-safety article
[ARTICLE · art-22545] src=dev.to pub= topic=ai-safety verified=true sentiment=· neutral

I benchmarked Python AI-app security scanners. Here's what each catches.

A developer benchmarked four Python AI-application security scanners—Bandit, Semgrep, vulnhuntr, and getdebug—against ten hand-written Python fixtures and the simonw/llm codebase. Getdebug achieved 100% precision and recall on the test fixtures, catching all five AI-app vulnerability categories, while Bandit and Semgrep each detected only one category with 50% precision and 20% recall. Vulnhuntr failed to complete the benchmark due to crashes and rate-limit errors.

read3 min publishedJun 5, 2026

This week I shipped Python AI-app regex prefilters in getdebug 0.4.0 and benchmarked them against Bandit and Semgrep on real Python code. Here are the numbers and what each tool actually catches.

Bandit (PyCQA) — the Python-OSS standard security linter. Hand-written rules, free, fast, Python only.

Semgrep — multi-language SAST with community rule packs. Hand-written rules, free, fast.

vulnhuntr (Protect AI, open source) — the stated category leader for LLM-driven AI-app static analysis. Python only.

getdebug — pattern-based regex prefilters in JS/TS + Python (new in 0.4.0). Plus optional local-LLM SAST via Ollama (free, on-device) and hosted (paid).

10 hand-written Python fixtures, 5 vulnerable + 5 safe, one pair per AI-app category (pii-in-prompt, unsafe-role-merge, prompt-injection, unbounded-stream, unsafe-tool-output).

Tool        TP  FP  FN   Precision  Recall
getdebug     5   0   0    100%       100%
bandit       1   1   4    50%        20%
semgrep      1   1   4    50%        20%
vulnhuntr    —   —   —    (unable to complete; see below)

Bandit and Semgrep both catch the unsafe-tool-output

fixture via their generic subprocess.run(shell=True)

rules. That's a TP on the vulnerable variant. But they also fire on the safe variant — the allowlist-then-run pattern:

ALLOWED = {"hosts": "cat /etc/hosts", "uptime": "uptime"}
def handle(tool_call):
    cmd = ALLOWED.get(tool_call.input.tag)
    if not cmd: return "rejected"
    return subprocess.run(cmd, shell=True, capture_output=True).stdout

Neither tool knows cmd

came from a static dict, not the model. They see shell=True

and fire. getdebug's regex specifically requires the tool_call.input.X

/ block.input.X

reference in the sink arg, so the allowlist-then-run pattern stays clean.

Both tools miss the other four behavioural categories (pii-in-prompt, unsafe-role-merge, prompt-injection, unbounded-stream) entirely. The rule packs don't contain patterns for {"role": "system", "content": f"...{name}..."}

. That's the gap.

We ran all three (working) tools against simonw/llm, Simon Willison's clean CLI for LLMs, 48 Python files.

Tool        Total findings    Signal
bandit      1,189            1,158 are 'assert_used' (pytest);
                              zero AI-app coverage
semgrep     3                3 generic-SAST hits;
                              zero AI-app coverage
getdebug    6                6 AI-app findings: 1 prompt-injection,
                              5 unbounded-stream

Bandit's 1,189 findings on 48 files is almost entirely the assert_used

warning on pytest assertions — a well-known default everyone disables in real configs. Semgrep's 3 findings are real but none AI-app specific. getdebug's 6 are all AI-app categorized.

vulnhuntr is the stated category leader. We wanted a clean cross-check. We couldn't get one:

--llm claude-code

mode (no-API-key option) crashes with ModuleNotFoundError

in 1.2.2.--llm gpt

with gpt-4o-mini

fails pydantic-validation on the response.--llm gpt

with gpt-4o

hits OpenAI's default 30K TPM rate limit on small accounts.We'll re-benchmark when its 2026 stack stabilises.

If you ship Python code that calls an LLM, run all three. They're complementary:

bandit -r .                              # general Python hygiene
semgrep --config auto .                  # cross-language SAST coverage
npx @getdebug/cli@0.4.0 analyze .       # AI-app behavioural patterns

None of them subsume the others. The first two catch general SAST; getdebug catches the "serialised the whole user object into the prompt" class that you can't hand-write a sustainable rule for in generic SAST.

Reproduce every number at getdebug.dev/bench. Corpus and methodology are open at getdebug-ai/codesecbench.

read it here https://www.getdebug.dev/blog/python-ai-app-prefilters

── more in #ai-safety 4 stories · sorted by recency
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/i-benchmarked-python…] indexed:0 read:3min 2026-06-05 ·