
Show HN: An open source benchmark for prompt-injection detectors

An open-source benchmark for prompt-injection detectors has been released, measuring both attack catch rate and false positives on real traffic in a threshold-agnostic way. The benchmark evaluates ten detectors across four adversarial datasets, with reproducible results from raw scores. Bastion Soft's 70M-parameter model leads with 0.991 AUC and 1.24% false positives at threshold 0.5, while Meta's deprecated prompt-guard shows 0.314 AUC and 88.30% false positives.

Published Jun 29, 2026

An open, model-agnostic benchmark for prompt-injection detectors — measured on both axes (attack catch-rate and false positives on real traffic), threshold-agnostically, and reproducible from raw scores.

Most prompt-injection benchmarks measure one thing: can a detector spot an attack? That's half the story. A detector that flags one in four normal user messages is an outage, not a guardrail — and a detector tuned to look great at one threshold can fall apart at another. This benchmark measures both axes and compares detectors at the same catch rate, so the ranking doesn't depend on where any model's 0.5 happens to fall.

This benchmark is maintained by Bastion Soft, who also ship one of the evaluated detectors (bastion-prompt-protection). We keep it honest by design:

- Every number is reproducible from committed raw scores with no GPU: rerun and check.
- Bastion's model is scored through the identical generic path as every other model (a plain HuggingFace classifier, no special handling).
- The interpretation doc documents where our own model is weak (see results/FINDINGS.md).

Contributions and criticism are welcome: add your detector, propose a methodology change, or challenge a result.

The false-alarm rate (X axis) vs detection rate (Y axis) for different detectors. You want to benchmark against attack datasets that make sense for your domain and against your real historic traffic.
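A minimal sketch of how the area under such a curve (the AUC) could be computed from raw scores. It assumes each prompt has a single score where higher means more attack-like. The function name and sample scores are hypothetical, and this is not necessarily how the benchmark's own scripts compute it.

```python
def roc_auc(attack_scores, benign_scores):
    """Probability that a random attack outscores a random benign prompt.

    Equivalent to the area under the ROC curve; ties count as half a win.
    Quadratic in the number of prompts, which is fine for a sketch.
    """
    wins = 0.0
    for a in attack_scores:
        for b in benign_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(attack_scores) * len(benign_scores))


# Hypothetical scores, not taken from the benchmark.
attacks = [0.9, 0.8, 0.4]
benign = [0.1, 0.3, 0.6]
auc = roc_auc(attacks, benign)  # 8 of the 9 attack/benign pairs are ranked correctly
```

An AUC of 0.5 means the detector ranks attacks no better than chance, and a value below 0.5 means it tends to score benign prompts higher than attacks.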

An article on Medium provides more context on the need for LLM prompt-injection detectors in 2026 and the motivation behind this kind of benchmark.

Ten open detectors, four held-out adversarial benchmarks. Full tables + latency in results/leaderboard.md; these are seed results, so add your model.

| Detector | Params | Detection (avg AUC) | False positives (real traffic, @0.5) | FPR @ 95% catch |
|---|---|---|---|---|
| bastion-prompt-protection | 70M | 0.991 | 1.24% | 7.71% |
| sentinel (qualifire) | 395M | 0.955 | 23.60% | 46.30% |
| wolf-defender | 0.3B | 0.954 | 24.03% | 34.63% |
| hlyn judge | 70M | 0.950 | 21.67% | 77.12% |
| wolf-defender-small | 0.1B | 0.941 | 28.79% | 43.79% |
| proventra mdeberta | 280M | 0.843 | 21.83% | 82.22% |
| protectai v2 | 184M | 0.820 | 8.82% | 100.00% |
| deepset injection | 184M | 0.766 | 65.89% | 69.44% |
| fmops distilbert | 67M | 0.700 | 64.98% | 74.64% |
| meta prompt-guard† | 86M | 0.314 | 88.30% | 85.77% |

What this table is for: notice that detectors close on AUC are nowhere close on false positives, and that some low-FPR-at-0.5 numbers are bought by under-catching (visible in the "@95% catch" column). **Read results/FINDINGS.md for the honest interpretation**, including each detector's weak spots.

† meta prompt-guard is a deprecated, over-firing model kept for context (see FINDINGS).

*False positives on real traffic as the decision threshold moves: a flat line is threshold-robust, a steep one is brittle. This is why a single fixed-threshold number can mislead, and why we also compare at a fixed catch rate. Full reading + the operating curve: results/FINDINGS.md.*

It's not another attack dataset; it's a methodology (see METHODOLOGY.md):

- **Two axes.** Detection (does it catch attacks?) and false positives on real chat traffic (WildChat + LMSYS), not synthetic benigns. A number on one axis alone is close to meaningless.
- **Threshold-agnostic.** Beyond the fixed-0.5 view, we report **FPR at a fixed detection rate** (tune each detector to catch 95% of attacks, compare the false-alarm cost), **EER**, and full **operating curves**, so no detector is helped or hurt by where its 0.5 falls.
- **Indirect / structured injection.** A separate axis for injection hidden inside data (documents, JSON, tool output), with a structured-data false-positive measure.
- **Reproducible from raw scores.** Every detector's per-prompt scores are committed (results/scores/), so all tables, curves, and operating points recompute offline with no GPU. The exact published numbers don't depend on trusting a GPU run.
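The threshold-agnostic measures above can be sketched in plain Python. This is an illustrative reading of "FPR at a fixed detection rate" and "EER", assuming a prompt is flagged when its score is at or above the threshold. All function names and sample scores are hypothetical; the repository's scripts may differ in details such as tie handling or interpolation.

```python
import math


def threshold_for_catch_rate(attack_scores, catch_rate=0.95):
    """Highest threshold that still flags at least `catch_rate` of the attacks."""
    ranked = sorted(attack_scores, reverse=True)
    needed = math.ceil(catch_rate * len(ranked))  # attacks that must be caught
    return ranked[needed - 1]


def false_positive_rate(benign_scores, threshold):
    """Share of benign prompts flagged at this threshold."""
    return sum(s >= threshold for s in benign_scores) / len(benign_scores)


def fpr_at_catch_rate(attack_scores, benign_scores, catch_rate=0.95):
    """False-alarm cost after tuning the detector to the given catch rate."""
    threshold = threshold_for_catch_rate(attack_scores, catch_rate)
    return false_positive_rate(benign_scores, threshold)


def equal_error_rate(attack_scores, benign_scores):
    """Error rate at the threshold where FPR and miss rate are closest."""
    best_gap, best_rate = None, None
    for t in sorted(set(attack_scores) | set(benign_scores)):
        fpr = false_positive_rate(benign_scores, t)
        miss = sum(s < t for s in attack_scores) / len(attack_scores)
        gap = abs(fpr - miss)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (fpr + miss) / 2
    return best_rate


# Hypothetical scores, not taken from the benchmark.
attacks = [0.9, 0.8, 0.7, 0.2]
benign = [0.1, 0.3, 0.75, 0.6]
cost = fpr_at_catch_rate(attacks, benign, catch_rate=0.75)
eer = equal_error_rate(attacks, benign)
```

Because the threshold is derived from each detector's own attack scores, two detectors with different calibration are compared at the same catch rate rather than at an arbitrary 0.5.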

git clone https://github.com/bastion-soft/pi-detector-bench.git
cd pi-detector-bench
pip install -e .
huggingface-cli login          # optional — only for gated entries/datasets

python -m scripts.run_leaderboard          --dump-scores results/scores            # detection
python -m scripts.measure_false_positives  --dump-scores results/scores            # false positives
python -m scripts.eval_indirect            --dump-scores results/scores_indirect   # indirect/structured

python -m scripts.rebuild_results_from_scores
python -m scripts.analyze_operating_points
python -m scripts.analyze_operating_points --scores-dir results/scores_indirect --within-set --label indirect
python -m scripts.plot_operating_points    # optional curves (pip install -e ".[plot]")

No GPU? Run the whole suite on a free Colab T4 — open notebooks/benchmark_colab.ipynb.

Adding a detector is a one-file PR. Append your model to models.yaml:

  - name: "my-detector (220M)"
    hf_id: "my-org/my-prompt-injection-detector"
    attack_label: 1        # softmax index meaning "attack" (or a list to sum)
    params: "220M"

…then run the harness and include the new result rows + scores. Full guide — including how to propose a methodology change or add a dataset — in CONTRIBUTING.md.
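To illustrate what the `attack_label` comment in the entry above describes, here is a rough sketch of turning a classifier's logits into a single attack score. It assumes the score is the softmax probability at the given index, or the sum over a list of indices. The function names are hypothetical and not the harness's actual code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attack_score(logits, attack_label):
    """Probability mass on the class index (or indices) that mean "attack"."""
    probs = softmax(logits)
    indices = [attack_label] if isinstance(attack_label, int) else list(attack_label)
    return sum(probs[i] for i in indices)


# Binary classifier: index 1 means "attack".
binary = attack_score([0.0, 0.0], attack_label=1)

# Three-class classifier where two classes both count as attacks
# (for example "injection" and "jailbreak"), summed into one score.
multi = attack_score([0.0, 0.0, 0.0], attack_label=[1, 2])
```

Summing several indices lets a multi-class detector be scored on the same single-score path as a binary one.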

- METHODOLOGY.md: how detectors are scored, and why (the two-axis / threshold-agnostic design).
- results/FINDINGS.md: honest interpretation of the seed results, with graphs and per-detector weak spots.
- CONTRIBUTING.md: add a detector, a dataset, or a methodology change.

Code: MIT (see LICENSE). Evaluation datasets retain their own licenses — some are gated and require accepting terms on the HuggingFace Hub. Committed results contain only per-prompt scores and labels, never dataset prompt text.
