AgentToolBench-Code

mentions: 1 · type: Organization

// recent coverage: 1 mention

03:45

2026-05-26

gist.github.com

ai-safety

Show HN: AgentToolBench-Code – security benchmark for AI coding agents

A developer expanded their AI-agent security benchmark from 10 to 16 scenarios, revealing that Claude Code Sonnet 4.6 scores +9 out of 16 while Haiku 4.5 scores only +3. The original tie between the t…

// co-occurs with top 7 entities

Claude Code (1), Sonnet (1), Haiku (1), Anthropic (1), PyPI (1), RFC1918 (1), ZipSlip (1)