{"slug": "benchmarking-open-source-safety-guard-models-a-comprehensive-evaluation", "title": "Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation", "summary": "A comprehensive evaluation of 14 open-source safety guard models on a benchmark of 79,331 samples found that Qwen Guard, a 4-billion-parameter model, achieved the highest recall at 83.97%, while larger models like Llama Guard (12B) and GPT-OSS Safeguard (20B) missed up to 75% of unsafe content. The study, which tested models across eight NIST AI Risk Framework safety categories, determined that model size does not correlate with safety detection performance and that general-purpose guard models outperform specialized ones. These findings provide practical guidance for selecting safety guard models in production deployments.", "body_md": "arXiv:2605.28830v1\nAbstract: As Large Language Models (LLMs) are increasingly deployed in safety-critical applications, robust content moderation becomes essential. We present a comprehensive evaluation of 14 open-source safety guard models on a curated benchmark of 79,331 samples spanning 8 NIST AI Risk Framework safety categories. Our benchmark aggregates four diverse datasets (HarmBench, StrongREJECT, RealToxicityPrompts, and BeaverTails), filtered to focus exclusively on safety-relevant content (violence, hate speech, harassment, sexual content, suicide/self-harm, profanity, threats, and health misinformation). We find that recall is the critical metric for safety applications, as missing harmful content poses greater risk than false positives. Our evaluation reveals surprising results: Qwen Guard (4B parameters) achieves the highest recall (83.97%) while larger models like Llama Guard (12B) and GPT-OSS Safeguard (20B) exhibit conservative behavior, missing up to 75% of unsafe content. 
We demonstrate that model size does not correlate with safety detection performance and that general-purpose guard models outperform specialized ones. These findings provide practical guidance for selecting safety guard models in production deployments.", "url": "https://wpnews.pro/news/benchmarking-open-source-safety-guard-models-a-comprehensive-evaluation", "canonical_source": "https://arxiv.org/abs/2605.28830", "published_at": "2026-05-29 04:00:00+00:00", "updated_at": "2026-05-29 04:24:37.892429+00:00", "lang": "en", "topics": ["ai-safety", "large-language-models", "artificial-intelligence", "machine-learning", "ai-research"], "entities": ["Qwen Guard", "Llama Guard", "GPT-OSS Safeguard", "HarmBench", "StrongREJECT", "RealToxicityPrompts", "BeaverTails", "NIST"], "alternates": {"html": "https://wpnews.pro/news/benchmarking-open-source-safety-guard-models-a-comprehensive-evaluation", "markdown": "https://wpnews.pro/news/benchmarking-open-source-safety-guard-models-a-comprehensive-evaluation.md", "text": "https://wpnews.pro/news/benchmarking-open-source-safety-guard-models-a-comprehensive-evaluation.txt", "jsonld": "https://wpnews.pro/news/benchmarking-open-source-safety-guard-models-a-comprehensive-evaluation.jsonld"}}