Source: dev.to · Topic: ai-agents

Auditing the auditor with four AI agents

Turva.dev, an audit business, ran its own site through four AI agents running Claude Fable 5. A line-by-line review of the public surface, including about 5,400 lines of Worker source, produced 91 findings. Most were minor drift. Of the findings marked HIGH, the false alarms fell once they were verified against primary sources, and one real gap survived: platform observability logging was enabled despite a promise of no logging, and it was fixed by disabling observability. The exercise shows that scanner scores miss discrepancies a line-by-line read catches, and that findings must be verified against primary sources before anyone acts on them.

3 min read · Published Jul 4, 2026

The company page of turva.dev tells a buyer they can read every line before hiring me. An audit business should survive its own promise, so I pointed it at my own site. Four AI agents, all running Claude Fable 5, read the public surface line by line: the Worker source that renders turva.dev, about 5,400 lines of it, the MCP server behind mcp.turva.dev, and the READMEs of the public repos. They came back with 91 findings.

Most were the drift every living codebase accumulates. One surface advertised RS256 and ES256 for verification while the site's actual key is Ed25519. A response header named x-markdown-tokens carried a word count. A guide expanded MPP to the wrong protocol name. A table in one guide had never rendered as a table, because the renderer did not support tables. The legal page called this a registered company when it is a registered business. None of these move a scanner.
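The algorithm drift is the kind of mismatch a short script can surface. What follows is a minimal sketch, assuming the verification key is published as a JWK and the advertised algorithms are JOSE names. The function names and the sample key are hypothetical, and the mapping covers only the key types mentioned here.

```python
# Hypothetical check: can the published key serve the advertised algorithms?
# The key-type mapping follows the JOSE registrations (RFC 7518, RFC 8037).

def algorithms_for_jwk(jwk: dict) -> set:
    """Return the JOSE signature algorithms a public JWK can serve."""
    kty, crv = jwk.get("kty"), jwk.get("crv")
    if kty == "OKP" and crv == "Ed25519":
        return {"EdDSA"}
    if kty == "EC" and crv == "P-256":
        return {"ES256"}
    if kty == "RSA":
        return {"RS256", "RS384", "RS512", "PS256", "PS384", "PS512"}
    return set()  # unknown key type: nothing can be confirmed


def advertised_drift(advertised: list, jwk: dict) -> list:
    """List advertised algorithms that the published key cannot serve."""
    usable = algorithms_for_jwk(jwk)
    return [alg for alg in advertised if alg not in usable]


# Placeholder key material, not the site's real key.
site_key = {"kty": "OKP", "crv": "Ed25519", "x": "<public-key-bytes>"}
drift = advertised_drift(["RS256", "ES256"], site_key)
print(drift)
```

An empty list means the advertisement and the key agree. Anything else is drift of the kind described above.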

About 60 fixes shipped, and both scanners were re-run after the deploys: startuphub.ai reads 100/100, grade A+, with all six categories at 100, and isitagentready.com reads Level 5. The scores were the same before most of these fixes, and that is the point. A scanner cannot see whether the key algorithm you advertise is the one you use. Line-by-line reading is the layer under the score.

The agents marked four findings HIGH. All four fell when verified, and they traced to two root causes.

The first: the site claims 100/100 verified by two independent scanners, and the agents knew that one of those scanners, isitagentready.com, grades sites on levels, 0 to 5. A percentage from a level-based scanner reads like an invented number, so the claim was flagged as false advertising on the audit's own subject matter. The scanner's own scorecard settles it. Run the scan and the report shows 100/100 for this site next to Level 5. The claim stands as written.

The second: an agent fetched the live MCP server card and read version 1.1.0 where the source says 1.2.0. Deployed code that trails its repo is a real problem anywhere, so HIGH was the right severity for the claim. It was still wrong. The fetch had come through a cache, and pulling the deployed Worker straight from the Cloudflare API showed 1.2.0, identical to the source. The finding described the measuring instrument, and the deployment was never out of sync.
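One way to keep a cached fetch from turning into a finding is to look for cache evidence before trusting a version mismatch. The sketch below is an assumption about how that could be done, not the method used here: the article does not say which cache was involved or which headers the response carried. It checks the standard `Age` header and Cloudflare's `CF-Cache-Status` header, and the function names are hypothetical.

```python
# Hypothetical triage of a version mismatch seen over HTTP.

# CF-Cache-Status values that mean the body came out of Cloudflare's cache.
SERVED_FROM_CACHE = {"HIT", "STALE", "REVALIDATED", "UPDATING"}


def cache_evidence(headers: dict) -> list:
    """Return reasons to suspect the response was served from a cache."""
    normalized = {k.lower(): str(v).strip() for k, v in headers.items()}
    reasons = []
    age = normalized.get("age", "")
    if age.isdigit() and int(age) > 0:
        reasons.append(f"Age: {age}s")
    status = normalized.get("cf-cache-status", "").upper()
    if status in SERVED_FROM_CACHE:
        reasons.append(f"CF-Cache-Status: {status}")
    return reasons


def version_finding(source_version: str, fetched_version: str, headers: dict) -> str:
    """Classify a fetched version against the version in the source."""
    if fetched_version == source_version:
        return "in sync"
    if cache_evidence(headers):
        return "unverified: fetch was cached, check the deployed artifact"
    return "mismatch: confirm against the deployed artifact before reporting"


result = version_finding("1.2.0", "1.1.0", {"CF-Cache-Status": "HIT", "Age": "3600"})
print(result)
```

Either way, a mismatch only becomes a finding after the deployed artifact itself has been read from the platform.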

One HIGH survived. The MCP server's README promised that the service does no logging, and the Worker configuration had platform observability switched on, which stored a log line for every call. Promise and code disagreed, and this is the exact class of gap the audit exists to catch. The repair went the honest way around. Reality changed to match the words: observability is off, and the README now also says out loud that platform logs are disabled. Rewriting the README to say minimal logging would have been faster to ship, and worth less to anyone who reads it.
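A promise like this can be tested against the configuration mechanically. The following is a minimal sketch, assuming the Worker is configured through a Wrangler JSON file with an `observability.enabled` setting and that the README phrases its promise in one of a few obvious ways. The regular expression, the sample README text, and the function names are hypothetical, and treating a missing setting as disabled is an assumption as well.

```python
import json
import re

# Hypothetical phrasings of a no-logging promise; a real README may word it differently.
NO_LOGGING_CLAIM = re.compile(
    r"\bno logging\b|\bdoes not log\b|\blogs are disabled\b", re.IGNORECASE
)


def observability_enabled(config: dict) -> bool:
    """True if the config switches platform observability on (missing means off here)."""
    return bool(config.get("observability", {}).get("enabled", False))


def promise_holds(readme: str, config_text: str) -> bool:
    """False only when the README promises no logging while observability is on."""
    promises_no_logging = bool(NO_LOGGING_CLAIM.search(readme))
    config = json.loads(config_text)
    return not (promises_no_logging and observability_enabled(config))


readme = "This service does no logging of requests."
before = '{"name": "mcp", "observability": {"enabled": true}}'
after = '{"name": "mcp", "observability": {"enabled": false}}'
print(promise_holds(readme, before), promise_holds(readme, after))
```

The check can fail in two directions, and the repair described above chose to change the configuration and leave the promise intact.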

A finding is a claim, and a claim gets the same treatment as marketing copy. Verify it against the primary source or drop it. Acting on the dead alerts here would have made the site worse, because fixing a correct claim plants a real error where a false alarm used to be. Read the scanner's own scorecard instead of assuming its scale, and pull the deployed artifact from the platform instead of trusting a cached fetch. Minutes of checking killed four HIGHs.
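The verify-or-drop rule can be written down as a small triage step. This is a sketch under stated assumptions: the `Finding` shape and status names are hypothetical, and the three sample entries paraphrase findings described in this article for illustration only. They are not the actual findings list.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    UNVERIFIED = "unverified"
    CONFIRMED = "confirmed"   # checked against the primary source and it held
    REFUTED = "refuted"       # checked against the primary source and it fell


@dataclass
class Finding:
    title: str
    severity: str
    status: Status = Status.UNVERIFIED
    evidence: str = ""        # where the primary-source check was made


def triage(findings: list) -> tuple:
    """Keep only confirmed findings; report how many were dropped."""
    survivors = [f for f in findings if f.status is Status.CONFIRMED]
    return survivors, len(findings) - len(survivors)


raw = [
    Finding("100/100 claimed from a level-based scanner", "HIGH",
            Status.REFUTED, "scanner's own scorecard"),
    Finding("Server card version trails the source", "HIGH",
            Status.REFUTED, "deployed Worker pulled from the platform API"),
    Finding("README promises no logging, observability is on", "HIGH",
            Status.CONFIRMED, "Worker configuration"),
]
survivors, dropped = triage(raw)
print(len(survivors), dropped)
```

Unverified findings are dropped along with refuted ones, so nothing reaches the report on an agent's word alone.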

The same discipline applies when you buy an audit. The report that reaches you should be the survivors, and a useful question for any auditor is how many findings were dropped between the raw scan and the written report. A report where the answer is zero usually means nobody checked.

For an agent-readiness audit where the findings are verified before you read them, contact info@turva.dev. Originally published at https://turva.dev/blog/auditing-the-auditor
