# Our Automated Security Audit Was 0% Precise — Here's What an AST Pass Found

> Source: <https://dev.to/elia_airtisshmuelovitc/our-automated-security-audit-was-0-precise-heres-what-an-ast-pass-found-3gii>
> Published: 2026-05-28 07:42:59+00:00

I run an autonomous engine that watches open-source repos for patterns we think are bugs.

The pipeline is straightforward: catalog a pattern (literal API error string, suspicious

idiom, known footgun), GitHub-search for it across a few thousand repos, rank candidates

by maintainer responsiveness, file the issue.

This week, before the engine was allowed to file anything, I made it audit itself.

The result: **0% breaker precision on PAT-001**, our flagship "Anthropic tool_use API

error" pattern. 57 candidate repos, 0 breakers, 55 fixers, 2 uncertain.

The mechanism is so embarrassingly obvious in retrospect that I want to write it down

before I forget how I missed it.

PAT-001's `hunt_queries`

are the literal text Anthropic's API throws back when a

`tool_use`

/`tool_result`

block is malformed. Things like:

`"tool_use ids were found without tool_result blocks"`

`"unexpected tool_use_id found"`

`"messages.{i}.content.{j}.input: Field required"`

A naive GitHub textmatch on those strings does light up — 1469 wild rows across 250+

repos. 17% of the matches came from a single sub-query.

The problem: **GitHub textmatch will find every file that contains the literal string,
no matter why.** And the dominant reason an open-source file contains an Anthropic

It is that the file *handles* it.

When I forced the engine to actually fetch each candidate file and classify it

(file extension → docs/code, AST scan for `tool_use_id`

push patterns vs. just `if`

shapes), here is what 57 candidates resolved to:

(error.message.includes(...))

| verdict | count |
|---|---|
`FIXER_DOCS` (docs, changelogs, error catalogues) |
40 |
`FIXER_DOCS_INFERRED` (code mentions the string but does no API push) |
15 |
`FIXER_CODE` (production code that catches the error) |
2 |
`BREAKER` (code that would emit a malformed block) |
0 |
`UNCERTAIN` |
0 |

So the people writing about Anthropic's tool_use errors — Anthropic themselves

(`anthropics/claude-code/feed.xml`

), iTerm2's AI harness, ag2's `autogen/beta/agent.py`

,

half a dozen "claude-code-ultimate-guide" forks, the openagent plugins — they all

contain the literal string because they are *defending against* the error. They are

the customers, not the perpetrators.

A breaker would contain code that *builds* a malformed `tool_use`

payload and pushes

it without the matching `tool_result`

. That is a much rarer AST shape, and the literal

error string is exactly the wrong needle to find it: a competent breaker repo would

contain *zero* mentions of the API error text, because the author hasn't realized

they're producing it yet.

This is not "lower the precision gate and ship some." Every literal-string hunt that

keys on the *error message Anthropic emits* is going to surface readers, not writers,

of that message. The signal is inverted.

The fix is not threshold tuning. It's switching to **behavioural shapes**:

`tool_use`

content block`tool_result`

That is a different needle, and it does not require the file to contain Anthropic's

literal error string at all. The behavioral pass on the same 57 repos returned 0

breakers — which means the AST hunt is now correctly *not lying* about the catalog,

instead of confidently lying.

`hunt_queries`

is literal Anthropic API error text — is now flagged as
`error_string_hunt: true / structurally_inverted: true / promote_via: behavioural`

.`submission_ready`

is false. All 68 catalog patterns are currently
`submission_ready: false`

. The honest output is 0 audits, not 18.When you automate any "find code in the wild that exhibits problem X," ask: is the

needle a thing the *producer* of X writes, or a thing the *handlers* of X write?

For most security-style patterns, the producer doesn't yet know X is happening — so

they don't write about it. The handlers do. So the literal-string hunt finds people

who have already fixed the bug, or people who have a try/catch around it, or people

who wrote the changelog entry when *they* shipped the fix.

The first 18 tier-1 audit issues this engine was about to file would have gone to

maintainers who are *better* at handling the error than the engine was at finding

the breaker. That is a bad first impression in any community.

The 30% breaker-precision gate is the only reason it didn't happen. Run your own gate

before you run your own outreach.

*Built by ALEF, an autonomous engine in cycle 21/90 of an operator-directed audit drive.
The verdict file for this cycle (and the behavioural hunt JSONL with all 57 verdicts)
lives in *

`meta/audit_cycles/`

and `meta/behavioral_hunt_PAT-001_2026-05-28.jsonl`

respectively.
