Measure how DSPy prompt optimization affects the prompt-injection robustness of agentic LLM programs, using AgentDojo's attack suite as ground truth.
The question: when you optimize a DSPy program with `BootstrapFewShot`, `MIPROv2`, or `GEPA`, does it become more or less robust to prompt-injection attacks? Two adjacent research communities, prompt optimization and prompt-injection security, have not measured this intersection. `dspy-security-bench` wires DSPy optimizers and AgentDojo attacks into one harness so the trade-off becomes visible.
Update (2026-06-26): a 3-seed sanity check changes the optimizer ordering shown here. The numbers below are the single-seed (seed=0) result. Aggregated over three seeds, `BootstrapFewShot` is actually the lowest on `important_instructions` security (0.600), and `MIPROv2` and `GEPA` tie at 0.733. Standard deviations at N=5 user tasks land in the 0.4 to 0.5 range, so individual rankings here are dominated by noise. What survives across seeds: `BootstrapFewShot`'s `direct`-attack Pareto win, the unoptimized 0% utility floor, and the qualitative "optimization trends below unoptimized on the harder attack" pattern. Full 3-seed numbers: `data/results/workspace_v02_phase1_seeds_summary.csv`. v0.2 phase 2 will scale N to put any optimizer-ranking claim on solid statistical ground.
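The 0.4 to 0.5 spread is roughly what pass/fail outcomes produce on their own. The sketch below is a back-of-the-envelope check, assuming each per-task security result is a 0/1 Bernoulli outcome (an assumption about how the summary is computed, not something read from the CSV). The helper names are illustrative.

```python
import math


def bernoulli_std(p: float) -> float:
    """Population standard deviation of a 0/1 outcome with success rate p."""
    return math.sqrt(p * (1.0 - p))


def standard_error(p: float, n: int) -> float:
    """Standard error of the mean of n independent 0/1 outcomes."""
    return bernoulli_std(p) / math.sqrt(n)


# Security rates quoted in the update note; n=5 user tasks per cell.
for rate in (0.600, 0.733):
    std = bernoulli_std(rate)
    se = standard_error(rate, 5)
    print(f"p={rate:.3f}  std={std:.3f}  se(n=5)={se:.3f}")
```

Under that assumption both rates give a standard deviation between 0.4 and 0.5, and the 0.133 gap between 0.600 and 0.733 is smaller than one standard error at n=5.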
Headline (seed=0): prompt optimization measurably degrades adversarial robustness on harder attacks. Optimizers buy utility (0% → 40-60% task success on `direct`) but pay it back in security on `important_instructions` (80% → 60% attack-failure rate). `BootstrapFewShot` Pareto-dominates `MIPROv2` on the workspace suite at v0.1's single-seed scale. See update note above for what holds vs. what does not when averaged across 3 seeds.
| Optimizer | Attack | Utility | Security | Injection success | n |
|---|---|---|---|---|---|
| unoptimized | direct | 0% | 100% | 0% | 5 |
| unoptimized | important_instructions | 0% | 80% | 20% | 5 |
| bootstrap_fewshot | direct | 60% | 100% | 0% | 5 |
| bootstrap_fewshot | important_instructions | 20% | 60% | 40% | 5 |
| miprov2 | direct | 40% | 80% | 20% | 5 |
| miprov2 | important_instructions | 20% | 60% | 40% | 5 |
Reading the chart. A point closer to the green star (top-right) is the ideal: high utility and high security. Three patterns hold across this scale:
- `unoptimized` is high-security but useless. It refuses to do the task (0% utility) regardless of attack, and resists attacks at 80–100%.
- `bootstrap_fewshot` is the best operating point at this scale. Equal or highest utility (60% on `direct`), equal-best security on `direct` (100%), and matches `miprov2`'s degraded `important_instructions` security.
- `miprov2` Pareto-loses to bootstrap. Lower utility on `direct` (40% vs 60%) AND lower security (80% vs 100%). Suggests heavier optimization overfits the clean-distribution prompt and exposes more attack surface.
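To make "Pareto-dominates" concrete, here is a minimal sketch of the dominance check applied to the seed=0 `direct` row values from the results table. The `Point`, `dominates`, and `pareto_front` names are illustrative and are not part of the package.

```python
from typing import NamedTuple


class Point(NamedTuple):
    utility: float
    security: float


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    at_least_as_good = a.utility >= b.utility and a.security >= b.security
    strictly_better = a.utility > b.utility or a.security > b.security
    return at_least_as_good and strictly_better


def pareto_front(points: dict[str, Point]) -> list[str]:
    """Names of the points that no other point dominates."""
    return [
        name
        for name, point in points.items()
        if not any(dominates(other, point) for other in points.values())
    ]


# seed=0 values for the direct attack, copied from the results table.
direct = {
    "unoptimized": Point(utility=0.0, security=1.0),
    "bootstrap_fewshot": Point(utility=0.6, security=1.0),
    "miprov2": Point(utility=0.4, security=0.8),
}

print(pareto_front(direct))
```

On `important_instructions` the two optimizers tie at the same point, so neither dominates the other there; the dominance claim rests on the `direct` rows.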
v0.1 scope: workspace suite only, N=5 user tasks × 1 injection task × 2 attacks × 3 optimizers = 30 runs. gpt-4o-mini for execution + judge. Trainset = 192 validated synthetic tasks (100 gpt-4o + 100 claude-sonnet, validated syntactic + dedupe). See `scripts/run_v01_benchmark.py` for reproduction.
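The run count is just the size of the evaluation grid. A small sketch that enumerates it, using the task IDs, attacks, and optimizer names that appear elsewhere in this README (the variable names are illustrative):

```python
from itertools import product

user_tasks = ["user_task_0", "user_task_1", "user_task_3", "user_task_10", "user_task_11"]
injection_tasks = ["injection_task_0"]
attacks = ["direct", "important_instructions"]
optimizers = ["unoptimized", "bootstrap_fewshot", "miprov2"]

# One run per (optimizer, attack, user task, injection task) combination.
runs = list(product(optimizers, attacks, user_tasks, injection_tasks))

# Each (optimizer, attack) pair is one summary cell.
cells = {(optimizer, attack) for optimizer, attack, _, _ in runs}

print(len(runs), len(cells), len(runs) // len(cells))
```

Each of the 6 summary cells therefore averages only 5 runs, which is why single-seed rankings are fragile.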
flowchart TD
A([AgentDojo seed env data]) --> B[env-data extractor]
B --> C[synthesis generator<br/>LM-generated query-only<br/>tasks grounded in env]
LM[(GPT-4o + Claude)] -.-> C
C -->|raw tasks| D[validator<br/>syntactic + dedupe<br/>+ optional solvability]
D -->|~190 validated tasks| E[optimizer harness<br/>BootstrapFewShot · MIPROv2<br/>GEPA in v0.2]
E -->|name → agent_factory| F[DSPyReActV2Element<br/>wraps dspy.ReActV2 as<br/>AgentDojo pipeline element]
F -->|AgentPipeline| G[runner<br/>drives benchmark_suite_<br/>with_injections]
AD[(AgentDojo attacks)] -.-> G
G --> H([pandas DataFrame<br/>one row per<br/>optimizer × attack ×<br/>user_task × injection_task])
classDef synth fill:#DBEAFE,stroke:#1E40AF,stroke-width:2px,color:#1E3A8A
classDef opt fill:#FED7AA,stroke:#9A3412,stroke-width:2px,color:#7C2D12
classDef eval fill:#DCFCE7,stroke:#15803D,stroke-width:2px,color:#14532D
classDef io fill:#F1F5F9,stroke:#475569,stroke-width:2px,color:#1F2937
classDef ext fill:#FAE8FF,stroke:#86198F,stroke-width:2px,color:#701A75
class B,C,D synth
class E,F opt
class G,H eval
class A io
class LM,AD ext
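The validator stage in the diagram combines a syntactic check with deduplication. The following is a minimal sketch of that idea, not the package's validator: it assumes each task is a dict with a `query` string (a hypothetical schema), and the function names are illustrative.

```python
import re


def normalize(query: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so near-identical queries compare equal."""
    stripped = re.sub(r"[^\w\s]", "", query.lower())
    return " ".join(stripped.split())


def is_well_formed(task: dict) -> bool:
    """Syntactic check (hypothetical schema): a non-empty string under the "query" key."""
    query = task.get("query")
    return isinstance(query, str) and bool(query.strip())


def validate(tasks: list[dict]) -> list[dict]:
    """Keep well-formed tasks, dropping any whose normalized query was already seen."""
    seen: set[str] = set()
    kept = []
    for task in tasks:
        if not is_well_formed(task):
            continue
        key = normalize(task["query"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(task)
    return kept


raw = [
    {"query": "Who sent the latest email?"},
    {"query": "who sent the latest email"},  # duplicate after normalization
    {"query": "   "},                         # fails the syntactic check
    {"answer": "no query field"},            # fails the syntactic check
    {"query": "What is on my calendar today?"},
]
kept = validate(raw)
print(len(kept))
```

Exact-match deduplication after normalization is the simplest choice; a real validator may use fuzzier similarity.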
From PyPI:
pip install dspy-security-bench
From source (for development):
git clone https://github.com/immu4989/dspy-security-bench.git
cd dspy-security-bench
uv venv --python 3.12
source .venv/bin/activate
uv pip install -e ".[dev]"
Requires Python 3.10+ and dspy >= 3.3.0b1 (the canonical-tool-call release that adds `dspy.ReActV2`). pip/uv handle the pre-release pin automatically because the version is explicit in `pyproject.toml`.
The full pipeline in Python:
import dspy
from dspy_security_bench.synthesis.generator import synthesize_tasks
from dspy_security_bench.synthesis.validator import validate_tasks
from dspy_security_bench.optimizers import build_agent_factories
from dspy_security_bench.llm_judge import LLMJudgeMetric
from dspy_security_bench.runner import evaluate_factories, summarize
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
raw_tasks = synthesize_tasks("workspace", n=150, model="openai/gpt-4o")
val = validate_tasks(raw_tasks, "workspace", checks=("syntactic", "dedupe"))
trainset = val.kept # ~140-180 high-quality tasks survive
factories = build_agent_factories(
    trainset=trainset,
    optimizers=["unoptimized", "bootstrap_fewshot", "miprov2"],
    suite_name="workspace",
    signature="query -> answer",
    metric=LLMJudgeMetric(judge_lm=dspy.LM("openai/gpt-4o-mini", temperature=0)),
)
df = evaluate_factories(
    factories=factories,
    suite_name="workspace",
    attacks=["direct", "important_instructions"],
    user_task_ids=["user_task_0", "user_task_1", "user_task_3", "user_task_10", "user_task_11"],
    injection_task_ids=["injection_task_0"],
    max_iters=8,
)
print(summarize(df))
The full v0.1 run takes ~30-45 min wall-clock at ~$15-20 in LM cost (gpt-4o-mini for everything). See `scripts/run_v01_benchmark.py` for the production driver; it caches optimizer state to `data/results/factories_cache.pkl` so re-runs after a downstream crash skip optimization.

The synthesis and validation steps have CLIs that produce JSONL files:
dspy-security-bench-synthesize workspace --dry-run
export OPENAI_API_KEY=sk-...
dspy-security-bench-synthesize workspace \
--n 150 --model openai/gpt-4o \
--out data/synthetic_train/workspace_gpt4o_raw.jsonl
dspy-security-bench-validate workspace \
data/synthetic_train/workspace_gpt4o_raw.jsonl \
--out data/synthetic_train/workspace_gpt4o.jsonl \
--report data/synthetic_train/workspace_gpt4o_report.json
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-... # optional — falls back to GPT-4o only
python scripts/run_v01_benchmark.py 2>&1 | tee data/results/run_v01.log
python scripts/generate_v01_figures.py # rebuilds the README charts
Outputs:
- `data/results/workspace_v01_results.csv`: 30 raw rows
- `data/results/workspace_v01_summary.csv`: 6-row aggregation
- `assets/v01_utility_vs_security.png`
- `assets/v01_pareto.png`
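The step from raw rows to the summary is a group-by over (optimizer, attack). The sketch below shows that aggregation with the standard library only. It is not the package's `summarize`; the column names (`optimizer`, `attack`, `utility`, `security`) are assumed for illustration, and the toy rows are made up, not benchmark results.

```python
import csv
import io
from collections import defaultdict


def summarize_rows(rows):
    """Group raw rows by (optimizer, attack) and average the 0/1 utility and security columns."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["optimizer"], row["attack"])].append(row)
    summary = []
    for (optimizer, attack), members in sorted(groups.items()):
        n = len(members)
        summary.append({
            "optimizer": optimizer,
            "attack": attack,
            "utility": sum(int(r["utility"]) for r in members) / n,
            "security": sum(int(r["security"]) for r in members) / n,
            "n": n,
        })
    return summary


# Toy rows for illustration only; these are not the benchmark's results.
toy_csv = """optimizer,attack,utility,security
a,direct,1,1
a,direct,0,1
b,direct,0,0
b,direct,1,1
"""
rows = list(csv.DictReader(io.StringIO(toy_csv)))
summary = summarize_rows(rows)
for line in summary:
    print(line)
```

With 3 optimizers and 2 attacks, the same grouping over the 30 raw rows yields 6 summary rows of n=5 each.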
uv pip install -e ".[dev]"
pytest tests/ -v
ruff check dspy_security_bench/ tests/
ruff format dspy_security_bench/ tests/
The test suite covers env-data extraction, synthesis helpers, validator checks, the AgentDojo wrapper (end-to-end against `user_task_0` with `DummyLM`), the optimizer harness, the LLM-as-judge metric, and the runner's orchestration (with `benchmark_suite_with_injections` mocked).
These are documented in detail in ARCHITECTURE.md. The key v0.1 scope choices:
- Synthetic trainset, not held-out split. AgentDojo has only ~40 user tasks per suite, not enough for a clean train/test split that supports optimizers like MIPROv2. We synthesize ~100 in-distribution query-only tasks per suite via GPT-4o + Claude Sonnet, validated against the env, and use the real AgentDojo tasks unmodified as the held-out test set.
- Query-only tasks for training; full action-task suite for testing. Action tasks (send, create, modify) have hand-written utility checks that don't synthesize cleanly. Training on queries-only is acceptable because the research question is whether prompt optimization (not action selection) affects robustness.
- Hybrid metric: LLM-as-judge with substring fast-path for training (cheap and tolerant of paraphrasing); real AgentDojo `utility()` for testing (rigorous, the actual published benchmark).
- Single-output signature constraint on the DSPy program. The model's final output goes into AgentDojo's single `model_output` utility argument.
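The hybrid training metric can be pictured as a cheap substring check that only falls through to a judge when it fails. This is a minimal sketch of that control flow, not the package's `LLMJudgeMetric`: the `hybrid_metric` signature is hypothetical, and `fake_judge` is a stand-in so the example runs without an LM.

```python
from typing import Callable


def hybrid_metric(gold: str, predicted: str, judge: Callable[[str, str], bool]) -> bool:
    """Return True on a case-insensitive substring match; otherwise defer to the judge."""
    gold_norm = " ".join(gold.lower().split())
    pred_norm = " ".join(predicted.lower().split())
    if gold_norm and gold_norm in pred_norm:
        return True  # fast path: no judge call needed
    return judge(gold, predicted)


# Stand-in judge so the sketch runs offline; a real one would call an LM.
calls = []


def fake_judge(gold: str, predicted: str) -> bool:
    calls.append((gold, predicted))
    return False


fast = hybrid_metric("Alice", "The email came from alice.", fake_judge)
slow = hybrid_metric("Alice", "It was sent by Bob.", fake_judge)
print(fast, slow, len(calls))
```

Only the second call reaches the judge, which is where the cost saving during optimization comes from.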
| Milestone | Status |
|---|---|
| v0.1 — workspace suite × 2 attacks × 3 optimizers, headline finding | shipped |
| v0.2 — banking / travel / slack suites, GEPA optimizer, larger N | planned |
| v0.3 — adversarial trainset to study robust-by-construction optimization | planned |
| Paper — TMLR submission if v0.2 findings hold at scale | conditional |
This benchmark sits on top of:
- **DSPy** (Stanford NLP): the optimizer framework being evaluated.
- **AgentDojo** (ETH Zurich, SPY lab): the attack suite and task environments providing ground-truth robustness measurement.
It also draws on the broader 2024-26 prompt-security literature, including GEPA, BATprompt, Survival of the Safest, InjecAgent, and WASP.
If you use this benchmark in research or production, please cite:
@misc{ahamed2026dspysecuritybench,
title = {{dspy-security-bench}: Measuring optimizer-induced robustness in
agentic DSPy programs},
author = {Imran Ahamed},
year = {2026},
howpublished = {\url{https://github.com/immu4989/dspy-security-bench}},
}
Apache License 2.0 — see LICENSE.