{"slug": "does-dspy-prompt-optimization-weaken-adversarial-robustness", "title": "Does DSPy prompt optimization weaken adversarial robustness?", "summary": "A new benchmark, dspy-security-bench, reveals that DSPy prompt optimization degrades adversarial robustness against harder prompt-injection attacks. Testing with AgentDojo's attack suite, optimizers like BootstrapFewShot and MIPROv2 improved utility on direct attacks but reduced security on important_instructions attacks, with BootstrapFewShot Pareto-dominating MIPROv2 at single-seed scale.", "body_md": "Measure how DSPy prompt optimization affects the prompt-injection robustness of\nagentic LLM programs, using [AgentDojo's](https://github.com/ethz-spylab/agentdojo)\nattack suite as ground truth.\n\n**The question:** when you optimize a DSPy program with\n`BootstrapFewShot`, `MIPROv2`, or `GEPA`, does it become *more* or *less*\nrobust to prompt-injection attacks? Two adjacent research communities, prompt\noptimization and prompt-injection security, have not measured this\nintersection. `dspy-security-bench` wires DSPy optimizers and AgentDojo\nattacks into one harness so the trade-off becomes visible.\n\n**Update (2026-06-26):** a 3-seed sanity check changes the optimizer ordering shown here. The numbers below are the single-seed (seed=0) result. Aggregated over three seeds, `BootstrapFewShot` is actually the lowest on `important_instructions` security (0.600), and `MIPROv2` and `GEPA` tie at 0.733. Standard deviations at N=5 user tasks land in the 0.4 to 0.5 range, so individual rankings here are dominated by noise. What survives across seeds: `BootstrapFewShot`'s `direct`-attack Pareto win, the unoptimized 0% utility floor, and the qualitative \"optimization trends below unoptimized on the harder attack\" pattern. Full 3-seed numbers: `data/results/workspace_v02_phase1_seeds_summary.csv`. 
v0.2 phase 2 will scale N to put any optimizer-ranking claim on solid statistical ground.\n\n**Headline (seed=0):** prompt optimization measurably degrades adversarial robustness on harder attacks. Optimizers buy utility (0% → 40-60% task success on `direct`) but pay it back in security on `important_instructions` (80% → 60% attack-failure rate). `BootstrapFewShot` Pareto-dominates `MIPROv2` on the workspace suite at v0.1's single-seed scale. See the update note above for what holds and what does not when averaged across 3 seeds.\n\n| Optimizer | Attack | Utility | Security | Injection success | n |\n|---|---|---|---|---|---|\n| unoptimized | direct | 0% | 100% | 0% | 5 |\n| unoptimized | important_instructions | 0% | 80% | 20% | 5 |\n| bootstrap_fewshot | direct | 60% | 100% | 0% | 5 |\n| bootstrap_fewshot | important_instructions | 20% | 60% | 40% | 5 |\n| miprov2 | direct | 40% | 80% | 20% | 5 |\n| miprov2 | important_instructions | 20% | 60% | 40% | 5 |\n\n**Reading the chart.** The green star (top-right) marks the ideal, high utility\n*and* high security, so points closer to it are better. Three patterns hold at this\nscale:\n\n- **`unoptimized` is high-security but useless.** It refuses to do the task (0% utility) regardless of attack, and resists attacks at 80–100%.\n- **`bootstrap_fewshot` is the best operating point at this scale.** Equal or highest utility (60% on `direct`), equal-best security on `direct` (100%), and matches `miprov2`'s degraded `important_instructions` security.\n- **`miprov2` Pareto-loses to bootstrap.** Lower utility on `direct` (40% vs 60%) AND lower security (80% vs 100%). This suggests heavier optimization overfits the clean-distribution prompt and exposes more attack surface.\n\nv0.1 scope: workspace suite only, N=5 user tasks × 1 injection task × 2 attacks × 3 optimizers = 30 runs. gpt-4o-mini for execution + judge. 
Trainset = 192 validated synthetic tasks (100 gpt-4o + 100 claude-sonnet generated, then filtered by syntactic validation + dedupe). See `scripts/run_v01_benchmark.py` for reproduction.\n\n``` mermaid\nflowchart TD\n    A([AgentDojo seed env data]) --> B[env-data extractor]\n    B --> C[synthesis generator<br/>LM-generated query-only<br/>tasks grounded in env]\n    LM[(GPT-4o + Claude)] -.-> C\n    C -->|raw tasks| D[validator<br/>syntactic + dedupe<br/>+ optional solvability]\n    D -->|~190 validated tasks| E[optimizer harness<br/>BootstrapFewShot · MIPROv2<br/>GEPA in v0.2]\n    E -->|name → agent_factory| F[DSPyReActV2Element<br/>wraps dspy.ReActV2 as<br/>AgentDojo pipeline element]\n    F -->|AgentPipeline| G[runner<br/>drives benchmark_suite_<br/>with_injections]\n    AD[(AgentDojo attacks)] -.-> G\n    G --> H([pandas DataFrame<br/>one row per<br/>optimizer × attack ×<br/>user_task × injection_task])\n\n    classDef synth fill:#DBEAFE,stroke:#1E40AF,stroke-width:2px,color:#1E3A8A\n    classDef opt fill:#FED7AA,stroke:#9A3412,stroke-width:2px,color:#7C2D12\n    classDef eval fill:#DCFCE7,stroke:#15803D,stroke-width:2px,color:#14532D\n    classDef io fill:#F1F5F9,stroke:#475569,stroke-width:2px,color:#1F2937\n    classDef ext fill:#FAE8FF,stroke:#86198F,stroke-width:2px,color:#701A75\n\n    class B,C,D synth\n    class E,F opt\n    class G,H eval\n    class A io\n    class LM,AD ext\n```\n\nFrom PyPI:\n\n```\npip install dspy-security-bench\n# or:  uv pip install dspy-security-bench\n```\n\nFrom source (for development):\n\n```\ngit clone https://github.com/immu4989/dspy-security-bench.git\ncd dspy-security-bench\nuv venv --python 3.12\nsource .venv/bin/activate\nuv pip install -e \".[dev]\"\n```\n\nRequires **Python 3.10+** and **dspy >= 3.3.0b1** (the canonical-tool-call\nrelease that adds `dspy.ReActV2`). 
pip/uv handle the pre-release pin\nautomatically because the version is explicit in `pyproject.toml`.\n\nThe full pipeline in Python:\n\n``` python\nimport dspy\nfrom dspy_security_bench.synthesis.generator import synthesize_tasks\nfrom dspy_security_bench.synthesis.validator import validate_tasks\nfrom dspy_security_bench.optimizers import build_agent_factories\nfrom dspy_security_bench.llm_judge import LLMJudgeMetric\nfrom dspy_security_bench.runner import evaluate_factories, summarize\n\ndspy.configure(lm=dspy.LM(\"openai/gpt-4o-mini\"))\n\n# 1. Generate a synthetic trainset grounded in the workspace suite's seed env\nraw_tasks = synthesize_tasks(\"workspace\", n=150, model=\"openai/gpt-4o\")\n\n# 2. Filter for validity and dedupe against real test tasks\nval = validate_tasks(raw_tasks, \"workspace\", checks=(\"syntactic\", \"dedupe\"))\ntrainset = val.kept  # ~140-180 high-quality tasks survive\n\n# 3. Run optimizers; this produces a factory per optimizer\nfactories = build_agent_factories(\n    trainset=trainset,\n    optimizers=[\"unoptimized\", \"bootstrap_fewshot\", \"miprov2\"],\n    suite_name=\"workspace\",\n    signature=\"query -> answer\",\n    metric=LLMJudgeMetric(judge_lm=dspy.LM(\"openai/gpt-4o-mini\", temperature=0)),\n)\n\n# 4. Evaluate against AgentDojo's attack suite\ndf = evaluate_factories(\n    factories=factories,\n    suite_name=\"workspace\",\n    attacks=[\"direct\", \"important_instructions\"],\n    user_task_ids=[\"user_task_0\", \"user_task_1\", \"user_task_3\", \"user_task_10\", \"user_task_11\"],\n    injection_task_ids=[\"injection_task_0\"],\n    max_iters=8,\n)\n\n# 5. Aggregate\nprint(summarize(df))\n```\n\nThe full v0.1 run takes ~30-45 min wall-clock at ~$15-20 in LM cost\n(gpt-4o-mini for everything). 
See\n[`scripts/run_v01_benchmark.py`](/immu4989/dspy-security-bench/blob/main/scripts/run_v01_benchmark.py) for the\nproduction driver, which caches optimizer state to `data/results/factories_cache.pkl`\nso re-runs after a downstream crash skip optimization.\n\nThe synthesis and validation steps have CLIs that produce JSONL files:\n\n```\n# Synthesize (dry-run prints the prompt without calling the API)\ndspy-security-bench-synthesize workspace --dry-run\n\n# Real synthesis (requires OPENAI_API_KEY / ANTHROPIC_API_KEY)\nexport OPENAI_API_KEY=sk-...\ndspy-security-bench-synthesize workspace \\\n    --n 150 --model openai/gpt-4o \\\n    --out data/synthetic_train/workspace_gpt4o_raw.jsonl\n\n# Validate\ndspy-security-bench-validate workspace \\\n    data/synthetic_train/workspace_gpt4o_raw.jsonl \\\n    --out data/synthetic_train/workspace_gpt4o.jsonl \\\n    --report data/synthetic_train/workspace_gpt4o_report.json\n# After installing: synthesizes, validates, optimizes, evaluates, saves CSVs.\n# Caches optimized state to data/results/factories_cache.pkl so reruns are fast.\nexport OPENAI_API_KEY=sk-...\nexport ANTHROPIC_API_KEY=sk-ant-...  
# optional; falls back to GPT-4o only\n\npython scripts/run_v01_benchmark.py 2>&1 | tee data/results/run_v01.log\npython scripts/generate_v01_figures.py     # rebuilds the README charts\n```\n\nOutputs:\n\n- `data/results/workspace_v01_results.csv`: 30 raw rows\n- `data/results/workspace_v01_summary.csv`: 6-row aggregation\n- `assets/v01_utility_vs_security.png`\n- `assets/v01_pareto.png`\n\n```\n# install with dev extras (pytest, ruff, pytest-cov)\nuv pip install -e \".[dev]\"\n\n# run the full test suite (61 tests, all offline / mocked; no API key needed)\npytest tests/ -v\n\n# linting\nruff check dspy_security_bench/ tests/\nruff format dspy_security_bench/ tests/\n```\n\nThe test suite covers env-data extraction, synthesis helpers, validator\nchecks, the AgentDojo wrapper (end-to-end against `user_task_0` with\n`DummyLM`), the optimizer harness, the LLM-as-judge metric, and the\nrunner's orchestration (with `benchmark_suite_with_injections` mocked).\n\nThe design decisions are documented in detail in [ARCHITECTURE.md](/immu4989/dspy-security-bench/blob/main/ARCHITECTURE.md). The key\nv0.1 scope choices:\n\n- **Synthetic trainset, not held-out split.** AgentDojo has only ~40 user tasks per suite, not enough for a clean train/test split that supports optimizers like MIPROv2. We synthesize ~100 in-distribution query-only tasks per suite via GPT-4o + Claude Sonnet, validated against the env, and use the real AgentDojo tasks unmodified as the held-out test set.\n- **Query-only tasks for training; full action-task suite for testing.** Action tasks (send, create, modify) have hand-written utility checks that don't synthesize cleanly. 
Training on query-only tasks is acceptable because the research question is whether *prompt optimization* (not action selection) affects robustness.\n- **Hybrid metric**: LLM-as-judge with substring fast-path for training (cheap, tolerant of paraphrasing); real AgentDojo `utility()` for testing (rigorous, the actual published benchmark).\n- **Single-output signature constraint** on the DSPy program. The model's final output goes into AgentDojo's single `model_output` utility argument.\n\n| Milestone | Status |\n|---|---|\n| v0.1: workspace suite × 2 attacks × 3 optimizers, headline finding | shipped |\n| v0.2: banking / travel / slack suites, GEPA optimizer, larger N | planned |\n| v0.3: adversarial trainset to study robust-by-construction optimization | planned |\n| Paper: TMLR submission if v0.2 findings hold at scale | conditional |\n\nThis benchmark sits on top of:\n\n- **DSPy** (Stanford NLP): the optimizer framework being evaluated.\n- **AgentDojo** (ETH Zurich, SPY lab): the attack suite and task environments providing ground-truth robustness measurement.\n\nIt also draws on the broader 2024-26 prompt-security literature, including\n[GEPA](https://arxiv.org/abs/2507.19457),\n[BATprompt](https://arxiv.org/abs/2412.18196),\n[Survival of the Safest](https://arxiv.org/abs/2410.09652),\n[InjecAgent](https://arxiv.org/abs/2403.02691), and\n[WASP](https://arxiv.org/abs/2504.18575).\n\nIf you use this benchmark in research or production, please cite:\n\n```\n@misc{ahamed2026dspysecuritybench,\n  title = {{dspy-security-bench}: Measuring optimizer-induced robustness in\n           agentic DSPy programs},\n  author = {Imran Ahamed},\n  year = {2026},\n  howpublished = {\\url{https://github.com/immu4989/dspy-security-bench}},\n}\n```\n\nApache License 2.0; see [LICENSE](/immu4989/dspy-security-bench/blob/main/LICENSE).", "url": "https://wpnews.pro/news/does-dspy-prompt-optimization-weaken-adversarial-robustness", 
"canonical_source": "https://github.com/immu4989/dspy-security-bench", "published_at": "2026-06-29 17:02:40+00:00", "updated_at": "2026-06-29 17:20:11.443110+00:00", "lang": "en", "topics": ["large-language-models", "ai-safety", "ai-agents"], "entities": ["DSPy", "AgentDojo", "BootstrapFewShot", "MIPROv2", "GEPA", "ETH Zurich", "GPT-4o", "Claude"], "alternates": {"html": "https://wpnews.pro/news/does-dspy-prompt-optimization-weaken-adversarial-robustness", "markdown": "https://wpnews.pro/news/does-dspy-prompt-optimization-weaken-adversarial-robustness.md", "text": "https://wpnews.pro/news/does-dspy-prompt-optimization-weaken-adversarial-robustness.txt", "jsonld": "https://wpnews.pro/news/does-dspy-prompt-optimization-weaken-adversarial-robustness.jsonld"}}