{"slug": "red-team-ai-benchmark-v1-9-0-why-we-added-an-ethical-use-policy-to-an-open-tool", "title": "Red Team AI Benchmark v1.9.0: Why We Added an Ethical Use Policy to an Open-Source Tool", "summary": "The Red Team AI Benchmark v1.9.0 introduces a modular scoring architecture, unified provider interface, and YAML-native configuration to improve usability for legitimate researchers. The update also adds an ethical use policy to the README, explicitly restricting use to authorized red team labs, commercial security assessments, AI-security research, and educational environments, despite the MIT license. This move aims to prevent misuse by uncensored model validators and offensive toolkit integrators.", "body_md": "*A look at the structural improvements in version 1.9.0, and why an MIT-licensed red teaming framework now explicitly demands authorized use.*\n\nThis week we merged [PR #6](https://github.com/toxy4ny/redteam-ai-benchmark/pull/6), a major structural overhaul of the `redteam-ai-benchmark` framework. The headline is version 1.9.0, but the real story is in the details.\n\nHere is what actually landed:\n\n| Change | Impact |\n|---|---|\n| Modular scoring architecture | Four scorers (`keyword`, `semantic`, `hybrid`, `llm_judge`) now live in `scoring/` and can be swapped via `--scorer` |\n| Unified provider interface | `models/base.py` defines `APIClient`; adding a new backend means implementing three methods |\n| YAML-native configuration | `config.yaml` replaces scattered CLI flags; scoring, export, optimization, and Langfuse all live in one file |\n| Semantic scoring on CPU by default | `Qwen/Qwen3-Embedding-0.6B` runs on CPU to avoid CUDA OOM on busy systems; GPU override available |\n| Export flexibility | JSON, CSV, or both; custom basenames; optional response inclusion |\n| AGENTS.md + CLAUDE.md | First-class AI-agent documentation so contributors and automated tools know the codebase |\n\nThese are not cosmetic changes. 
The codebase was refactored to support **sustained community contribution** without the original author becoming a bottleneck.\n\nBuried in the README update is a single line that redefines the project's relationship with its users:\n\n\"MIT. Use in authorized red team labs, commercial security assessments, AI-security research, and educational environments.\"\n\nThis is not a license change. The license remains MIT. It is a **statement of intent**.\n\nOver the past year, the benchmark has been cited in three distinct contexts:\n\n**Defensive research**: Eddie Oz's [\"LLMs Under Siege\"](https://www.eddieoz.com/llms-under-siege-the-red-team-reality-check-of-2026/) used the framework to evaluate 30 models and argue for AI-driven defensive strategies. This is the use case the tool was built for.\n\n**Uncensored model validation**: Some model cards began citing benchmark scores as proof that their weights bypass safety filters. The score was treated as a feature, not a vulnerability.\n\n**Offensive toolkit integration**: A closed-source framework forked the benchmark into a broader attack toolkit, stripping the defensive context.\n\nThe first context validates the tool. The second and third exploit it.\n\nWe cannot prevent misuse with an MIT license. But we can **refuse to be silent about intent**.\n\nThe README now closes with this paragraph:\n\n\"Use in authorized red team labs, commercial security assessments, AI-security research, and educational environments.\"\n\nThis is deliberately narrow. It does not say \"use however you want.\" It names four specific settings.\n\nThis is not legally enforceable. The MIT license grants use without restriction, so it cannot carry such a limit. 
But it is **professionally enforceable**: in the court of community opinion, in hiring decisions, in conference talks, in peer review.\n\nThe v1.9.0 refactor makes the tool **more useful for legitimate researchers** while making misuse **harder to justify**:\n\nWith four scorers exposed via `--scorer`, users can no longer hide behind a single opaque metric:\n\n```\n# Keyword scoring: fast, deterministic, dependency-free\nuv run run_benchmark.py run ollama -m \"llama3.1:8b\" --scorer keyword\n\n# Semantic scoring: understands paraphrased correct answers\nuv run run_benchmark.py run ollama -m \"llama3.1:8b\" --scorer semantic\n\n# Hybrid scoring: combines both for maximum accuracy\nuv run run_benchmark.py run ollama -m \"llama3.1:8b\" --scorer hybrid\n\n# LLM judge: external model evaluates quality (requires OpenRouter)\nuv run run_benchmark.py run openrouter -m \"anthropic/claude-3.5-sonnet\" --scorer llm_judge\n```\n\nEach scorer can produce different results. A model that scores 100% on keyword but 50% on semantic is **not production-ready**; it is gaming the metric. This transparency forces honest evaluation.\n\nThe new `config.yaml` structure means benchmark runs are **reproducible and auditable**:\n\n```\nscoring:\n  method: semantic\n  semantic_model: Qwen/Qwen3-Embedding-0.6B\n\nexport:\n  formats: [json, csv]\n  output_dir: ./results\n  include_response: true\n\noptimization:\n  enabled: false\n```\n\nWhen a researcher publishes results, they can share the config file. When a bad actor publishes results, the config reveals their intent.\n\nThe `--optimize-prompts` flag remains available, but it is now **explicitly optional and logged**. The `optimized_prompts_{model}_{timestamp}.json` file creates an audit trail.\n\nThis is not a jailbreak tool. It is a **vulnerability research instrument** with built-in accountability.\n\nThe AI security field in 2026 faces a credibility crisis. 
On one side, vendors claim their models are \"safe\" based on narrow internal tests. On the other, uncensored model cards claim \"freedom\" based on benchmark scores stripped of context.\n\nBoth sides are wrong.\n\n**Safety is not the absence of capability.** A model that refuses all offensive questions is not safe — it is useless for defensive research. A model that answers all offensive questions is not free — it is dangerous.\n\n**The benchmark exists to measure the gap between these extremes.** Version 1.9.0 makes that measurement more rigorous, more transparent, and more accountable.\n\nRespect to [Edilson Osorio Jr.](https://www.eddieoz.com/) for the original \"LLMs Under Siege\" research that proved this benchmark produces actionable, real-world insights.\n\nRespect to [POXEK, POXEK-AI](https://github.com/szybnev) for the v1.9.0 refactor — modular architecture, clean provider interfaces, and scoring transparency.\n\n```\ngit clone https://github.com/toxy4ny/redteam-ai-benchmark.git\ncd redteam-ai-benchmark\nuv sync\nuv run run_benchmark.py --help\n```\n\nIssues and PRs welcome. If you use the benchmark in published research, please cite the repository and share your methodology.\n\n*The author is a certified offensive security professional and the maintainer of the redteam-ai-benchmark open-source framework. 
Views expressed are personal and do not represent any employer or client.*", "url": "https://wpnews.pro/news/red-team-ai-benchmark-v1-9-0-why-we-added-an-ethical-use-policy-to-an-open-tool", "canonical_source": "https://dev.to/toxy4ny/red-team-ai-benchmark-v190-why-we-added-an-ethical-use-policy-to-an-open-source-tool-1gkf", "published_at": "2026-06-15 10:40:18+00:00", "updated_at": "2026-06-15 10:44:52.736821+00:00", "lang": "en", "topics": ["ai-safety", "ai-research", "ai-tools", "developer-tools"], "entities": ["Red Team AI Benchmark", "Eddie Oz", "Qwen", "Langfuse", "OpenRouter"], "alternates": {"html": "https://wpnews.pro/news/red-team-ai-benchmark-v1-9-0-why-we-added-an-ethical-use-policy-to-an-open-tool", "markdown": "https://wpnews.pro/news/red-team-ai-benchmark-v1-9-0-why-we-added-an-ethical-use-policy-to-an-open-tool.md", "text": "https://wpnews.pro/news/red-team-ai-benchmark-v1-9-0-why-we-added-an-ethical-use-policy-to-an-open-tool.txt", "jsonld": "https://wpnews.pro/news/red-team-ai-benchmark-v1-9-0-why-we-added-an-ethical-use-policy-to-an-open-tool.jsonld"}}