Does DSPy prompt optimization weaken adversarial robustness?

wpnews.pro

Measure how DSPy prompt optimization affects the prompt-injection robustness of agentic LLM programs, using AgentDojo's attack suite as ground truth.

The question: when you optimize a DSPy program with BootstrapFewShot

, MIPROv2

, or GEPA

, does it become more or less robust to prompt-injection attacks? Two adjacent research communities — prompt optimization and prompt-injection security — have not measured this intersection. dspy-security-bench

wires DSPy optimizers and AgentDojo attacks into one harness so the trade-off becomes visible.

Update (2026-06-26): a 3-seed sanity check changes the optimizer ordering shown here.The numbers below are the single-seed (seed=0) result. Aggregated over three seeds,BootstrapFewShot

is actually thelowestonimportant_instructions

security (0.600), andMIPROv2

andGEPA

tie at 0.733. Standard deviations at N=5 user tasks land in the 0.4 to 0.5 range, so individual rankings here are dominated by noise. What survives across seeds:BootstrapFewShot

'sdirect

-attack Pareto win, the unoptimized 0% utility floor, and the qualitative "optimization trends below unoptimized on the harder attack" pattern. Full 3-seed numbers:[. v0.2 phase 2 will scale N to put any optimizer-ranking claim on solid statistical ground.]data/results/workspace_v02_phase1_seeds_summary.csv

Headline (seed=0):prompt optimization measurably degrades adversarial robustness on harder attacks.Optimizers buy utility (0% → 40-60% task success ondirect

) but pay it back in security onimportant_instructions

(80% → 60% attack-failure rate).BootstrapFewShot

Pareto-dominatesMIPROv2

on the workspace suite at v0.1's single-seed scale. See update note above for what holds vs. what does not when averaged across 3 seeds.

Optimizer	Attack	Utility	Security	Injection success	n
unoptimized
direct	0%
100%
0%	5
unoptimized
important_instructions	0%
80%
20%	5
bootstrap_fewshot
direct	60%
100%
0%	5
bootstrap_fewshot
important_instructions	20%
60%
40%	5
miprov2
direct	40%
80%
20%	5
miprov2
important_instructions	20%
60%
40%	5

Reading the chart. A point closer to the green star (top-right) is the ideal — high utility and high security. Three patterns hold across this scale:

It refuses to do the task (0% utility) regardless of attack, and resists attacks at 80–100%.unoptimized

is high-security but useless.Equal or highest utility (60% onbootstrap_fewshot

is the best operating point at this scale.direct

), equal-best security ondirect

(100%), and matchesmiprov2

's degradedimportant_instructions

security.Lower utility onmiprov2

Pareto-loses to bootstrap.direct

(40% vs 60%) AND lower security (80% vs 100%). Suggests heavier optimization overfits the clean-distribution prompt and exposes more attack surface.

v0.1 scope: workspace suite only, N=5 user tasks × 1 injection task × 2 attacks × 3 optimizers = 30 runs. gpt-4o-mini for execution + judge. Trainset = 192 validated synthetic tasks (100 gpt-4o + 100 claude-sonnet, validated syntactic + dedupe). See

[for reproduction.]scripts/run_v01_benchmark.py

flowchart TD
    A([AgentDojo seed env data]) --> B[env-data extractor]
    B --> C[synthesis generator<br/>LM-generated query-only<br/>tasks grounded in env]
    LM[(GPT-4o + Claude)] -.-> C
    C -->|raw tasks| D[validator<br/>syntactic + dedupe<br/>+ optional solvability]
    D -->|~190 validated tasks| E[optimizer harness<br/>BootstrapFewShot · MIPROv2<br/>GEPA in v0.2]
    E -->|name → agent_factory| F[DSPyReActV2Element<br/>wraps dspy.ReActV2 as<br/>AgentDojo pipeline element]
    F -->|AgentPipeline| G[runner<br/>drives benchmark_suite_<br/>with_injections]
    AD[(AgentDojo attacks)] -.-> G
    G --> H([pandas DataFrame<br/>one row per<br/>optimizer × attack ×<br/>user_task × injection_task])

    classDef synth fill:#DBEAFE,stroke:#1E40AF,stroke-width:2px,color:#1E3A8A
    classDef opt fill:#FED7AA,stroke:#9A3412,stroke-width:2px,color:#7C2D12
    classDef eval fill:#DCFCE7,stroke:#15803D,stroke-width:2px,color:#14532D
    classDef io fill:#F1F5F9,stroke:#475569,stroke-width:2px,color:#1F2937
    classDef ext fill:#FAE8FF,stroke:#86198F,stroke-width:2px,color:#701A75

    class B,C,D synth
    class E,F opt
    class G,H eval
    class A io
    class LM,AD ext

From PyPI:

pip install dspy-security-bench

From source (for development):

git clone https://github.com/immu4989/dspy-security-bench.git
cd dspy-security-bench
uv venv --python 3.12
source .venv/bin/activate
uv pip install -e ".[dev]"

Requires Python 3.10+ and dspy >= 3.3.0b1 (the canonical-tool-call release that adds dspy.ReActV2

). pip/uv handle the pre-release pin automatically because the version is explicit in pyproject.toml

.

The full pipeline in Python:

import dspy
from dspy_security_bench.synthesis.generator import synthesize_tasks
from dspy_security_bench.synthesis.validator import validate_tasks
from dspy_security_bench.optimizers import build_agent_factories
from dspy_security_bench.llm_judge import LLMJudgeMetric
from dspy_security_bench.runner import evaluate_factories, summarize

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

raw_tasks = synthesize_tasks("workspace", n=150, model="openai/gpt-4o")

val = validate_tasks(raw_tasks, "workspace", checks=("syntactic", "dedupe"))
trainset = val.kept  # ~140-180 high-quality tasks survive

factories = build_agent_factories(
    trainset=trainset,
    optimizers=["unoptimized", "bootstrap_fewshot", "miprov2"],
    suite_name="workspace",
    signature="query -> answer",
    metric=LLMJudgeMetric(judge_lm=dspy.LM("openai/gpt-4o-mini", temperature=0)),
)

df = evaluate_factories(
    factories=factories,
    suite_name="workspace",
    attacks=["direct", "important_instructions"],
    user_task_ids=["user_task_0", "user_task_1", "user_task_3", "user_task_10", "user_task_11"],
    injection_task_ids=["injection_task_0"],
    max_iters=8,
)

print(summarize(df))

The full v0.1 run takes ~30-45 min wall-clock at ~$15-20 in LM cost (gpt-4o-mini for everything). See scripts/run_v01_benchmark.py for the production driver — it caches optimizer state to

data/results/factories_cache.pkl

so re-runs after a downstream crash skip optimization.The synthesis and validation steps have CLIs that produce JSONL files:

dspy-security-bench-synthesize workspace --dry-run

export OPENAI_API_KEY=sk-...
dspy-security-bench-synthesize workspace \
    --n 150 --model openai/gpt-4o \
    --out data/synthetic_train/workspace_gpt4o_raw.jsonl

dspy-security-bench-validate workspace \
    data/synthetic_train/workspace_gpt4o_raw.jsonl \
    --out data/synthetic_train/workspace_gpt4o.jsonl \
    --report data/synthetic_train/workspace_gpt4o_report.json
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...  # optional — falls back to GPT-4o only

python scripts/run_v01_benchmark.py 2>&1 | tee data/results/run_v01.log
python scripts/generate_v01_figures.py     # rebuilds the README charts

Outputs:

data/results/workspace_v01_results.csv

— 30 raw rowsdata/results/workspace_v01_summary.csv

— 6-row aggregationassets/v01_utility_vs_security.png

assets/v01_pareto.png

uv pip install -e ".[dev]"

pytest tests/ -v

ruff check dspy_security_bench/ tests/
ruff format dspy_security_bench/ tests/

The test suite covers env-data extraction, synthesis helpers, validator checks, the AgentDojo wrapper (end-to-end against user_task_0

with DummyLM

), the optimizer harness, the LLM-as-judge metric, and the runner's orchestration (with benchmark_suite_with_injections

mocked).

These are documented in detail in ARCHITECTURE.md. The key v0.1 scope choices:

Synthetic trainset, not held-out split. AgentDojo has only ~40 user tasks per suite — not enough for a clean train/test split that supports optimizers like MIPROv2. We synthesize ~100 in-distribution query-only tasks per suite via GPT-4o + Claude Sonnet, validated against the env, and use the real AgentDojo tasks unmodified as the held-out test set.Query-only tasks for training; full action-task suite for testing. Action tasks (send, create, modify) have hand-written utility checks that don't synthesize cleanly. Training on queries-only is acceptable because the research question is whetherprompt optimization(not action selection) affects robustness.Hybrid metric: LLM-as-judge with substring fast-path for training (cheap- tolerant of paraphrasing); real AgentDojo utility()

for testing (rigorous, the actual published benchmark).

tolerant of paraphrasing); real AgentDojo Single-output signature constraint on the DSPy program. The model's final output goes into AgentDojo's singlemodel_output

utility argument.

Milestone	Status
v0.1 — workspace suite × 2 attacks × 3 optimizers, headline finding	shipped
v0.2 — banking / travel / slack suites, GEPA optimizer, larger N	planned
v0.3 — adversarial trainset to study robust-by-construction optimization	planned
Paper — TMLR submission if v0.2 findings hold at scale	conditional

This benchmark sits on top of:

(Stanford NLP) — the optimizer framework being evaluated.DSPy(ETH Zurich, SPY lab) — the attack suite and task environments providing ground-truth robustness measurement.** AgentDojo**

It also draws on the broader 2024-26 prompt-security literature, including GEPA, BATprompt, Survival of the Safest, InjecAgent, and WASP.

If you use this benchmark in research or production, please cite:

@misc{ahamed2026dspysecuritybench,
  title = {{dspy-security-bench}: Measuring optimizer-induced robustness in
           agentic DSPy programs},
  author = {Imran Ahamed},
  year = {2026},
  howpublished = {\url{https://github.com/immu4989/dspy-security-bench}},
}

Apache License 2.0 — see LICENSE.

source & further reading

github.com — original article

Does DSPy prompt optimization weaken adversarial robustness?

Run your AI side-project on zahid.host