# I tested 4 AI agent-governance tools against an open spec - here's the matrix

The article evaluates four AI agent-governance tools against an open specification called AgentBoundary, finding that all tools fail to produce portable, verifiable audit receipts that non-engineers can use to reconstruct agent actions. The author tested each vendor's audit artifacts against 40 conformance scenarios, revealing that most tools record decisions (allow/deny) but omit critical details like policy versions, argument hashes, and execution outcomes needed for independent verification. The AgentBoundary spec aims to solve this by defining a deterministic JSON receipt format that allows third parties to verify agent actions without access to internal databases or engineering teams.

## The scenario

Your AI agent just deleted a customer record. Three months later, an auditor asks you to prove:

- What tool actually ran: not "the agent made a deletion call", but the precise tool, version, and capability
- With what arguments: the exact customer ID, scoped fields, and options, byte-for-byte
- Who approved it: which human, or which automated policy rule
- Against which version of which policy: the literal policy bundle the runtime evaluated, not "the policy at the time, probably"
- Whether it actually succeeded: not "we said allow", but "the downstream system confirmed the row is gone"

You open your audit log. It says: `delete_customer approved, run_id=xyz, decision=allow`. The arguments are in a different table. The policy version isn't recorded anywhere; you'd have to `git log` your settings file. The execution outcome lives in your application logs, which roll over after 30 days. And the auditor has no way to verify any of this without an engineer walking them through every join.

This gap shows up the moment an agent does something consequential and a non-engineer needs to understand what happened.
It's the same gap regardless of which framework you used. Approval is not proof.

## What's actually missing

The pattern across every agent-governance tool I looked at is the same: they're built around the decision (allow / deny / require-approval) and treat the action itself as an implementation detail. So the audit log records "the policy fired", but there is no single record carrying everything a third party needs to reconstruct what actually happened.

A useful audit artifact has to meet five requirements:

- It can be verified without trusting the runtime that produced it. If your auditor has to call your engineers to interpret the log, the log is testimony, not evidence.
- The arguments and the decision are cryptographically bound. If args mutate between approval and execution, the audit must show it.
- The policy version is in the record. Not "the policy at the time", but the literal bundle identifier.
- The execution outcome is in the record. Approval ≠ execution. Both belong in the same artifact.
- The chain of receipts is tamper-evident. Deleting a row from history must break something a verifier can detect.

A receipt that does all five becomes a single evidence record you can hand to an auditor, regulator, insurer, or a compliance team six months later, without them needing access to your database, your cloud creds, or your engineering team.

## What I built

AgentBoundary is an open spec for that kind of receipt. v0.1 is stable; the v0.2-alpha draft adds the optional provenance block and singly-linked chain shown in the example below. Same JSON document, deterministic schema, hash-bound to its arguments.
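To make "hash-bound to its arguments" concrete, here is a minimal Python sketch of argument binding. It is an illustration under stated assumptions, not the spec's algorithm: the canonicalisation used here (sorted keys, compact separators, UTF-8) is a stand-in for whatever rule AgentBoundary actually defines, and the argument names (`customer_id`, `fields`, `hard_delete`) are hypothetical.

```python
import hashlib
import json


def canonical_json(value) -> bytes:
    """Stand-in canonicalisation (an assumption, not the spec's rule):
    sorted keys, no insignificant whitespace, UTF-8 bytes."""
    text = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def arguments_hash(args: dict) -> str:
    """Hex SHA-256 digest of the canonicalised arguments."""
    return hashlib.sha256(canonical_json(args)).hexdigest()


# Hypothetical arguments captured at approval time.
approved_args = {"customer_id": "cust_123", "fields": ["email", "name"], "hard_delete": True}
approved_hash = arguments_hash(approved_args)

# Same arguments with keys in a different order: canonicalisation keeps the hash stable.
reordered_args = {"hard_delete": True, "fields": ["email", "name"], "customer_id": "cust_123"}

# One value changed between approval and execution: the hash no longer matches.
mutated_args = {**approved_args, "customer_id": "cust_999"}

binding_holds = arguments_hash(reordered_args) == approved_hash
mutation_detected = arguments_hash(mutated_args) != approved_hash
print(binding_holds, mutation_detected)  # True True
```

The point of hashing a canonical form rather than the raw request body is that harmless differences (key order, whitespace) do not change the digest, while any change to a value does.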
Here's a receipt that a Discord agent I run in production emitted on 2026-05-21; the agent files GitHub issues on behalf of users:

```json
{
  "version": "agentboundary/v0.2-alpha",
  "receipt_id": "f04df972-f9fc-4624-83cb-0ed3682297cf",
  "issued_at": "2026-05-21T06:54:39.251Z",
  "actor": {
    "type": "agent",
    "id": "agent:jambot:discord:user:aa74fa40751b528f"
  },
  "tool": {
    "name": "github-rest",
    "version": "2022-11-28",
    "capability": "github.issues.create"
  },
  "target": {
    "system": "github.com/jamjet-labs/jamjet-discord-bot",
    "environment": "prod"
  },
  "arguments_hash": "2d257d4e72f62afa112766154b9b5ac0dd98ae79ee7c2758563a4363a0fb4bdf",
  "policy": {
    "name": "jambot.file-issue.v1",
    "version": "1",
    "decision": "allow"
  },
  "execution": {
    "status": "success",
    "completed_at": "2026-05-21T06:54:40.103Z",
    "result_ref": "github://issues/1"
  },
  "prior_receipt": {
    "receipt_id": "cab5eff7-…",
    "receipt_hash": "3e7f5a93…"
  },
  "completeness_score": 0.913,
  "receipt_hash": "..."
}
```

A verifier with only this JSON (no database, no Fly.io credentials, no GitHub token, no Discord session) can run six independent checks:

- Tamper-evidence. Re-canonicalise the body without `receipt_hash`, take SHA-256, and confirm it matches the stored hash.
- Argument binding. Re-canonicalise the arguments separately, take SHA-256, and confirm it matches `arguments_hash`. If anything mutated between approval and execution, this fails.
- Spec compliance. Fetch the public JSON Schema and validate the receipt structurally.
- Chain integrity. Fetch the receipt at `prior_receipt.receipt_id` and confirm its hash matches the link.
- Emitter honesty. Recompute `completeness_score` from the provenance block using the deterministic formula in the spec. This catches an emitter that lies about how confident it was in each field.
- Execution proof. Follow `execution.result_ref` to a real downstream artifact (in this case, a public GitHub issue) and read it.
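The tamper-evidence and chain-integrity checks can be sketched in a few lines of Python. This is a hedged illustration, not the reference verifier: the canonicalisation (sorted keys, compact separators) stands in for the spec's actual rule, the underscore key spellings (`receipt_hash`, `prior_receipt`, `receipt_id`) are assumed, and the helper names (`seal`, `verify_receipt`, `verify_link`) and the toy receipts are hypothetical.

```python
import hashlib
import json

HASH_FIELD = "receipt_hash"  # assumed key spelling


def canonical_json(value) -> bytes:
    """Stand-in canonicalisation (assumption): sorted keys, compact, UTF-8."""
    text = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def compute_receipt_hash(receipt: dict) -> str:
    """SHA-256 over the receipt body with the hash field itself excluded."""
    body = {k: v for k, v in receipt.items() if k != HASH_FIELD}
    return hashlib.sha256(canonical_json(body)).hexdigest()


def seal(body: dict) -> dict:
    """Return a copy of the body with its hash attached (emitter side)."""
    return {**body, HASH_FIELD: compute_receipt_hash(body)}


def verify_receipt(receipt: dict) -> bool:
    """Tamper-evidence: the stored hash must match the recomputed one."""
    return receipt.get(HASH_FIELD) == compute_receipt_hash(receipt)


def verify_link(receipt: dict, prior: dict) -> bool:
    """Chain integrity: the link must name the prior receipt and its hash."""
    link = receipt.get("prior_receipt") or {}
    return (
        verify_receipt(prior)
        and link.get("receipt_id") == prior.get("receipt_id")
        and link.get("receipt_hash") == prior.get(HASH_FIELD)
    )


# Toy receipts with only the fields these two checks need.
first = seal({"receipt_id": "r-1", "policy": {"decision": "allow"}})
second = seal({
    "receipt_id": "r-2",
    "policy": {"decision": "allow"},
    "prior_receipt": {"receipt_id": first["receipt_id"], "receipt_hash": first["receipt_hash"]},
})

# Flip the decision after sealing: the stored hash no longer matches the body.
tampered = {**second, "policy": {"decision": "deny"}}

print(verify_receipt(second), verify_link(second, first), verify_receipt(tampered))
# True True False
```

Neither check needs anything beyond the receipts themselves, which is what lets a third party run them without access to the emitting system.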
## How existing tools do against the bar

I built one adapter per vendor, translating their normative artifact (or, where they don't have one, the developer-recommended capture shape) into an AgentBoundary v0.2-alpha receipt. Then I ran all 40 conformance scenarios against the adapter-produced receipts.

| Vendor | PASS | PARTIAL | DOCS-ONLY | NOT COVERED | N/A |
|---|---|---|---|---|---|
| JamJet reference | 40 | 0 | 0 | 0 | 0 |
| Anthropic permission policy | 12 | 9 | 3 | 14 | 2 |
| Cloudflare HITL Agents | 5 | 7 | 1 | 25 | 2 |
| LangSmith Gateway | 15 | 14 | 1 | 8 | 2 |
| Microsoft AGT | 17 | 5 | 1 | 15 | 2 |

Reference implementation first; vendors alphabetical. Not ranked.

The PASS counts collapse meaningful categorical differences. Each vendor is solving for a different layer of the stack:

- Anthropic's is the richest runtime evaluation pipeline of the four: layered hooks, scoped tool patterns, permission modes, the permission policy `canUseTool` callback. But the audit log from Anthropic's Managed Agents Console isn't a published schema, so there's no portable artifact a third party can verify. That's why it has 3 DOCS-ONLY (the highest of any vendor) and 14 NOT COVERED.
- Cloudflare HITL is a workflow primitive: durable approval gates with multi-day windows and external notifications. It's deliberately not an emitted-artifact format. The 25 NOT COVERED reflects that their recommended audit table is 6 columns and doesn't model the things conformance is asking about.
- LangSmith is an observability platform. The Run object captures the data, but where in the Run it lives varies by team convention: one team puts the decision in `tags`, another in `feedback_stats`. A cross-team auditor can't reliably extract it. That's why it has 14 PARTIAL.
- Microsoft AGT is the closest peer: also an artifact format, also designed for verifiable evidence, with a Merkle-chained audit log that's structurally stronger than AgentBoundary's current singly-linked design. The 15 NOT COVERED rows are deliberate scoping decisions, not bugs.

Per-vendor breakdowns with structural reasoning live in adapters/
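As a quick sanity check on the matrix above, every row should account for all 40 conformance scenarios across the five outcome columns. The short Python sketch below re-enters the published counts and confirms that; the variable names are illustrative, and the PASS share it prints is plain arithmetic on the table, not an additional result.

```python
COLUMNS = ("PASS", "PARTIAL", "DOCS-ONLY", "NOT COVERED", "N/A")
TOTAL_SCENARIOS = 40

# Counts copied from the matrix, in COLUMNS order.
matrix = {
    "JamJet reference": (40, 0, 0, 0, 0),
    "Anthropic permission policy": (12, 9, 3, 14, 2),
    "Cloudflare HITL Agents": (5, 7, 1, 25, 2),
    "LangSmith Gateway": (15, 14, 1, 8, 2),
    "Microsoft AGT": (17, 5, 1, 15, 2),
}

row_totals = {vendor: sum(counts) for vendor, counts in matrix.items()}

for vendor, counts in matrix.items():
    # Every scenario must land in exactly one outcome column.
    assert row_totals[vendor] == TOTAL_SCENARIOS, vendor
    pass_share = counts[COLUMNS.index("PASS")] / TOTAL_SCENARIOS
    print(f"{vendor}: {row_totals[vendor]} scenarios, {pass_share:.1%} PASS")
```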