{"slug": "show-hn-an-open-source-benchmark-for-prompt-injection-detectors", "title": "Show HN: An open source benchmark for prompt-injection detectors", "summary": "An open-source benchmark for prompt-injection detectors has been released, measuring both attack catch rate and false positives on real traffic in a threshold-agnostic way. The benchmark evaluates ten detectors across four adversarial datasets, with reproducible results from raw scores. Bastion Soft's 70M-parameter model leads with 0.991 AUC and 1.24% false positives at threshold 0.5, while Meta's deprecated prompt-guard shows 0.314 AUC and 88.30% false positives.", "body_md": "**An open, model-agnostic benchmark for prompt-injection detectors — measured on both axes (attack catch-rate and false positives on real traffic), threshold-agnostically, and reproducible from raw scores.**\n\nMost prompt-injection benchmarks measure one thing: can a detector spot an attack? That's half the story. A detector that flags one in four *normal* user messages is an outage, not a guardrail — and a detector tuned to look great at one threshold can fall apart at another. This benchmark measures **both axes** and compares detectors **at the same catch rate**, so the ranking doesn't depend on where any model's 0.5 happens to fall.\n\nThis benchmark is maintained by Bastion Soft, who also ship one of the evaluated detectors (`bastion-prompt-protection`). We keep it honest by design:\n\n- Every number is reproducible from committed raw scores with no GPU — rerun and check.\n- Bastion's model is scored through the identical generic path as every other model (a plain HuggingFace classifier — no special handling).\n- The interpretation doc documents where our own model is weak (see `results/FINDINGS.md`).\n\nContributions and criticism are welcome — add your detector, propose a methodology change, or challenge a result.\n\nThe false-alarm rate (X axis) vs detection rate (Y axis) for different detectors. 
You want to benchmark against attack datasets that make sense for your domain and against your real historical traffic.\n\nAn [article on Medium](https://medium.com/@mantas.urbonas/measuring-prompt-injection-defences-e79b79471846) provides more context about the need for LLM prompt injection detectors in 2026 and the motivation behind this kind of benchmark.\n\nTen open detectors, four held-out adversarial benchmarks. Full tables + latency in [results/leaderboard.md](/bastion-soft/pi-detector-bench/blob/main/results/leaderboard.md); these are **seed results** — [add your model](/bastion-soft/pi-detector-bench/blob/main/CONTRIBUTING.md).\n\n| Detector | Params | Detection (avg AUC) | False positives (real traffic, @0.5) | FPR @ 95% catch |\n|---|---|---|---|---|\n| bastion-prompt-protection | 70M | 0.991 | 1.24% | 7.71% |\n| sentinel (qualifire) | 395M | 0.955 | 23.60% | 46.30% |\n| wolf-defender | 0.3B | 0.954 | 24.03% | 34.63% |\n| hlyn judge | 70M | 0.950 | 21.67% | 77.12% |\n| wolf-defender-small | 0.1B | 0.941 | 28.79% | 43.79% |\n| proventra mdeberta | 280M | 0.843 | 21.83% | 82.22% |\n| protectai v2 | 184M | 0.820 | 8.82% | 100.00% |\n| deepset injection | 184M | 0.766 | 65.89% | 69.44% |\n| fmops distilbert | 67M | 0.700 | 64.98% | 74.64% |\n| meta prompt-guard† | 86M | 0.314 | 88.30% | 85.77% |\n\nWhat this table is *for*: notice that detectors close on AUC are nowhere close on false positives, and that some low-FPR-at-0.5 numbers are bought by under-catching (visible in the \"@95% catch\" column). **Read `results/FINDINGS.md` for the honest interpretation** — including each detector's weak spots.\n\n† `meta prompt-guard` is a deprecated, over-firing model kept for context (see FINDINGS).\n\n*False positives on real traffic as the decision threshold moves — a flat line is threshold-robust, a steep one is brittle. This is why a single fixed-threshold number can mislead, and why we also compare at a fixed catch rate. 
Full reading + the operating curve: `results/FINDINGS.md`.*\n\nIt's not another attack dataset — it's a **methodology** (see [METHODOLOGY.md](/bastion-soft/pi-detector-bench/blob/main/METHODOLOGY.md)):\n\n- **Two axes.** Detection (does it catch attacks?) **and** false positives on **real chat traffic** (WildChat + LMSYS), not synthetic benigns. A number on one axis alone is close to meaningless.\n- **Threshold-agnostic.** Beyond the fixed-0.5 view, we report **FPR at a fixed detection rate** (tune each detector to catch 95% of attacks, compare the false-alarm cost), **EER**, and full **operating curves** — so no detector is helped or hurt by where its 0.5 falls.\n- **Indirect / structured injection.** A separate axis for injection hidden inside data (documents, JSON, tool output), with a structured-data false-positive measure.\n- **Reproducible from raw scores.** Every detector's per-prompt scores are committed (`results/scores/`), so all tables, curves, and operating points recompute offline with no GPU. The exact published numbers don't depend on trusting a GPU run.\n\n```\ngit clone https://github.com/bastion-soft/pi-detector-bench.git\ncd pi-detector-bench\npip install -e .\nhuggingface-cli login          # optional — only for gated entries/datasets\n\npython -m scripts.run_leaderboard          --dump-scores results/scores            # detection\npython -m scripts.measure_false_positives  --dump-scores results/scores            # false positives\npython -m scripts.eval_indirect            --dump-scores results/scores_indirect   # indirect/structured\n\n# Post-processing — no GPU, recomputes every table from the dumped scores:\npython -m scripts.rebuild_results_from_scores\npython -m scripts.analyze_operating_points\npython -m scripts.analyze_operating_points --scores-dir results/scores_indirect --within-set --label indirect\npython -m scripts.plot_operating_points    # optional curves (pip install -e \".[plot]\")\n```\n\nNo GPU? 
Run the whole suite on a free Colab T4 — open [notebooks/benchmark_colab.ipynb](/bastion-soft/pi-detector-bench/blob/main/notebooks/benchmark_colab.ipynb).\n\nAdding your detector is a one-file PR. Append your model to [models.yaml](/bastion-soft/pi-detector-bench/blob/main/models.yaml):\n\n```\n  - name: \"my-detector (220M)\"\n    hf_id: \"my-org/my-prompt-injection-detector\"\n    attack_label: 1        # softmax index meaning \"attack\" (or a list to sum)\n    params: \"220M\"\n```\n\n…then run the harness and include the new result rows + scores. Full guide — including how to propose a **methodology change** or add a **dataset** — in [CONTRIBUTING.md](/bastion-soft/pi-detector-bench/blob/main/CONTRIBUTING.md).\n\n- `METHODOLOGY.md` — how detectors are scored, and why (the two-axis / threshold-agnostic design).\n- `results/FINDINGS.md` — honest interpretation of the seed results, with graphs and per-detector weak spots.\n- `CONTRIBUTING.md` — add a detector, a dataset, or a methodology change.\n\nCode: **MIT** (see [LICENSE](/bastion-soft/pi-detector-bench/blob/main/LICENSE)). Evaluation datasets retain their own licenses — some are gated and require accepting terms on the HuggingFace Hub. 
Committed results contain only per-prompt scores and labels, never dataset prompt text.", "url": "https://wpnews.pro/news/show-hn-an-open-source-benchmark-for-prompt-injection-detectors", "canonical_source": "https://github.com/bastion-soft/pi-detector-bench", "published_at": "2026-06-29 10:18:13+00:00", "updated_at": "2026-06-29 10:29:28.032429+00:00", "lang": "en", "topics": ["ai-safety", "large-language-models", "generative-ai", "ai-research", "machine-learning"], "entities": ["Bastion Soft", "Meta", "HuggingFace", "WildChat", "LMSYS", "Bastion Prompt Protection", "Sentinel", "Wolf Defender"], "alternates": {"html": "https://wpnews.pro/news/show-hn-an-open-source-benchmark-for-prompt-injection-detectors", "markdown": "https://wpnews.pro/news/show-hn-an-open-source-benchmark-for-prompt-injection-detectors.md", "text": "https://wpnews.pro/news/show-hn-an-open-source-benchmark-for-prompt-injection-detectors.txt", "jsonld": "https://wpnews.pro/news/show-hn-an-open-source-benchmark-for-prompt-injection-detectors.jsonld"}}