Confidence is enough to decide. It's not enough to do.

wpnews.pro

A classifier confidence of 0.99 is enough to decide a tier. It is not enough to send an email you can't unsend.

Those are two different bars, and most "autonomous" systems use the first one to clear the second. That's the bug.

This is the third post in a series that started as a cheap-model brag and turned into an architecture argument. Post one: a cheap model beat GPT-4o on email triage. Post two: the model only scores four features, and a deterministic rule picks the tier. A commenter, @hannune, pointed at one of those four features:

Your reversibility signal is something I have not seen named explicitly before but it is exactly the right axis for anything that touches irreversible state.

He's right, and it's the cleanest way into the last piece of the design. So: what `reversibility` actually routes.

Most of what a mail agent does is reversible. Archive, un-archive. Trash, restore. Apply a label, remove it. Mark read, mark unread. Re-tier. Snooze. Every one of those is a single click away from undone, so every one of those rides on exactly what post two described — classifier confidence plus a hash of the input bytes that drove the decision. If the model's confident and the inputs are pinned, ship it.
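Here is a minimal sketch of that reversible path. It assumes a hypothetical `TierDecision` shape and an arbitrary 0.9 threshold, because post two's actual rule and field names aren't reproduced here. The point is the conjunction: a confident score alone isn't enough if the bytes it was computed on have changed since.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape of a decision on a reversible action; names are illustrative.
interface TierDecision {
  action: "archive" | "trash" | "label" | "mark_read" | "snooze" | "retier";
  confidence: number; // classifier confidence in [0, 1]
  inputHash: string; // sha256 of the input bytes the decision was made on
}

// Assumed threshold, for illustration only.
const CONFIDENCE_THRESHOLD = 0.9;

function hashInput(bytes: string): string {
  return createHash("sha256").update(bytes, "utf8").digest("hex");
}

// Reversible path: confident model AND inputs unchanged since the decision.
function canAutoApply(decision: TierDecision, currentInput: string): boolean {
  return decision.confidence >= CONFIDENCE_THRESHOLD && decision.inputHash === hashInput(currentInput);
}

const raw = "From: alice@example.com\r\nSubject: lunch?\r\n\r\nFree Thursday?";
const decision: TierDecision = { action: "archive", confidence: 0.97, inputHash: hashInput(raw) };

console.log(canAutoApply(decision, raw)); // true
console.log(canAutoApply(decision, raw + " (edited)")); // false: inputs drifted
```

If either half fails, the action isn't lost; it just waits for a fresh decision on the current bytes, which is cheap precisely because the action can be undone.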

Three actions are not like the others:

```typescript
export const FLOOR_ACTIONS = ["send_email", "delete_permanent", "forward_external"] as const;
```

Send (Gmail's undo-send window tops out at 30 seconds; after that, it's gone). Permanent delete (skips Trash, no recovery path). Forward to an external party (same effect as a send: once it's out, it's out). For these, `reversibility` scores near zero, and near-zero reversibility is the signal that says: confidence is necessary but no longer sufficient. You need something the probabilistic layer can't give you.
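One way to turn that list into a routing decision is sketched below. `FLOOR_ACTIONS` is copied from the post. Deriving `FloorAction` from it with `typeof` and the `isFloorAction`/`trustModelFor` helpers are assumptions for illustration. This sketch routes on the action name alone, matching the post's static list of three.

```typescript
const FLOOR_ACTIONS = ["send_email", "delete_permanent", "forward_external"] as const;
type FloorAction = (typeof FLOOR_ACTIONS)[number];

// Type guard: narrows an arbitrary action name to FloorAction.
function isFloorAction(action: string): action is FloorAction {
  return (FLOOR_ACTIONS as readonly string[]).includes(action);
}

type TrustModel = "probabilistic" | "floor";

// Near-zero reversibility goes to the floor; everything else stays probabilistic.
function trustModelFor(action: string): TrustModel {
  return isFloorAction(action) ? "floor" : "probabilistic";
}

console.log(trustModelFor("archive")); // probabilistic
console.log(trustModelFor("forward_external")); // floor
```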

The failure mode here has a name I borrowed from people doing this in crypto: agent-vs-ABI mismatch. The agent narrates a high-level intent — "I sent a polite follow-up to Alice" — and the thing that actually executed did something the narration glossed over: wrong recipient, an edited body, a different attachment. The agent isn't lying. Natural language is lossy by definition; the description and the bytes are allowed to drift.

The cure isn't to verify the narration harder. It's to stop signing on the narration and sign on the deterministic artifact — the actual bytes that will travel to Gmail.

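```typescript
import * as crypto from "node:crypto";

// Canonicalize exactly the fields that will travel to Gmail, then hash them.
// RECEIPT_SCHEMA_VERSION is defined elsewhere in the module (not shown).
export function sendEmailPayloadHash(input: { to: string; subject: string; body: string }): string {
  const canonical = {
    v: RECEIPT_SCHEMA_VERSION, // bumping this invalidates every pending receipt
    action: "send_email" as const, // binds the hash to this action type
    to: input.to.normalize("NFC").trim().toLowerCase(), // address casing/whitespace is cosmetic
    subject: input.subject.normalize("NFC"),
    body: input.body.normalize("NFC"),
  };
  // Key order is fixed by the literal above, so JSON.stringify is deterministic here.
  return crypto.createHash("sha256").update(JSON.stringify(canonical)).digest("hex");
}
```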

When you approve a send, the system mints an `ActionReceipt` that pins this hash: the bytes you actually approved, normalized so a cosmetic edit to the recipient (`Alice@Example.com` vs `alice@example.com`) doesn't false-alarm, and NFC-normalized so composed and decomposed Unicode hash identically (this matters the moment a body has Korean in it). At execute time, it recomputes the hash from the about-to-send bytes and checks:
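To see the normalization do its job, here is a standalone copy of `sendEmailPayloadHash` with `RECEIPT_SCHEMA_VERSION` set to an assumed value of 1. It hashes the same message with the recipient's casing and trailing space changed and the Korean body in decomposed (NFD) form, and then once more with a real edit to the body.

```typescript
import { createHash } from "node:crypto";

// Assumed value; the real constant isn't shown in the post.
const RECEIPT_SCHEMA_VERSION = 1;

// Same canonicalization as the post's sendEmailPayloadHash.
function sendEmailPayloadHash(input: { to: string; subject: string; body: string }): string {
  const canonical = {
    v: RECEIPT_SCHEMA_VERSION,
    action: "send_email" as const,
    to: input.to.normalize("NFC").trim().toLowerCase(),
    subject: input.subject.normalize("NFC"),
    body: input.body.normalize("NFC"),
  };
  return createHash("sha256").update(JSON.stringify(canonical)).digest("hex");
}

// Precomposed Hangul syllables (NFC) vs. the same text as decomposed jamo (NFD).
const composed = "안녕하세요";
const decomposed = composed.normalize("NFD");
console.log(composed === decomposed); // false: different code points

const a = sendEmailPayloadHash({ to: "Alice@Example.com ", subject: "Hi", body: composed });
const b = sendEmailPayloadHash({ to: "alice@example.com", subject: "Hi", body: decomposed });
const c = sendEmailPayloadHash({ to: "alice@example.com", subject: "Hi", body: composed + "!" });

console.log(a === b); // true: casing, whitespace and Unicode form are cosmetic
console.log(a === c); // false: a real body edit changes the hash
```

Only the recipient is case-folded and trimmed. Subject and body keep their case and whitespace, so any edit there counts as a real change.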

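```typescript
// Called at execute time with the hash recomputed from the about-to-send bytes.
// ActionReceipt, FloorAction and the error classes are defined elsewhere in the module (not shown).
export function verifyReceipt(receipt: ActionReceipt, expected: { action: FloorAction; currentPayloadHash: string }): void {
  // Stale schema: a version bump forces a re-approve under the new shape.
  if (receipt.v !== RECEIPT_SCHEMA_VERSION) throw new ActionReceiptSchemaError(receipt);
  // Wrong action: a send_email receipt can't authorize a delete_permanent.
  if (receipt.action !== expected.action) throw new ActionReceiptMismatchError(receipt, expected.currentPayloadHash);
  // Byte drift between approve and execute: refuse.
  if (receipt.payloadHash !== expected.currentPayloadHash) throw new ActionReceiptMismatchError(receipt, expected.currentPayloadHash);
}
```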

Any drift between approve and execute throws, and the action is refused. Reusing a `send_email` receipt to authorize a `delete_permanent` throws on the action check. Bumping the schema version deliberately invalidates every pending receipt and forces a re-approve under the new shape. The autonomous path fails closed: no valid receipt, no irreversible action.
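The three refusal paths can be exercised end to end in a sketch like the one below. `verifyReceipt` is the post's code. The `ActionReceipt` fields, the error messages, the generic `payloadHash`, `mintReceipt`, and the `check` helper are assumptions standing in for parts the post doesn't show.

```typescript
import { createHash } from "node:crypto";

// Assumed definitions: the post shows verifyReceipt but not these.
const RECEIPT_SCHEMA_VERSION = 1;
type FloorAction = "send_email" | "delete_permanent" | "forward_external";
interface ActionReceipt {
  v: number;
  action: FloorAction;
  payloadHash: string;
}

class ActionReceiptSchemaError extends Error {
  constructor(receipt: ActionReceipt) {
    super(`receipt schema v${receipt.v} is not v${RECEIPT_SCHEMA_VERSION}; re-approve`);
    this.name = "ActionReceiptSchemaError";
  }
}

class ActionReceiptMismatchError extends Error {
  constructor(receipt: ActionReceipt, currentPayloadHash: string) {
    super(`receipt for ${receipt.action} does not match payload ${currentPayloadHash.slice(0, 12)}`);
    this.name = "ActionReceiptMismatchError";
  }
}

// Hypothetical generic hash over version + action + payload.
function payloadHash(action: FloorAction, payload: unknown): string {
  return createHash("sha256").update(JSON.stringify({ v: RECEIPT_SCHEMA_VERSION, action, payload })).digest("hex");
}

// Hypothetical mint step: runs at approve time on the exact bytes the user saw.
function mintReceipt(action: FloorAction, payload: unknown): ActionReceipt {
  return { v: RECEIPT_SCHEMA_VERSION, action, payloadHash: payloadHash(action, payload) };
}

// verifyReceipt as shown in the post.
function verifyReceipt(receipt: ActionReceipt, expected: { action: FloorAction; currentPayloadHash: string }): void {
  if (receipt.v !== RECEIPT_SCHEMA_VERSION) throw new ActionReceiptSchemaError(receipt);
  if (receipt.action !== expected.action) throw new ActionReceiptMismatchError(receipt, expected.currentPayloadHash);
  if (receipt.payloadHash !== expected.currentPayloadHash) throw new ActionReceiptMismatchError(receipt, expected.currentPayloadHash);
}

// Execute-time check: recompute from the about-to-run payload; "ok" or the refusal's name.
function check(receipt: ActionReceipt, action: FloorAction, payload: unknown): string {
  try {
    verifyReceipt(receipt, { action, currentPayloadHash: payloadHash(action, payload) });
    return "ok";
  } catch (e) {
    return (e as Error).name;
  }
}

const approved = { to: "alice@example.com", subject: "Follow-up", body: "Thanks again!" };
const receipt = mintReceipt("send_email", approved);
const edited = { ...approved, body: "Thanks again! P.S. wrong attachment" };

console.log(check(receipt, "send_email", approved)); // ok
console.log(check(receipt, "send_email", edited)); // ActionReceiptMismatchError
console.log(check(receipt, "delete_permanent", approved)); // ActionReceiptMismatchError
console.log(check({ ...receipt, v: 0 }, "send_email", approved)); // ActionReceiptSchemaError
```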

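That's the whole point of naming `reversibility` as a first-class feature instead of folding it into "risk." It's not decoration on the tier decision; it's the axis that decides which trust model an action even gets. Reversible actions (archive, trash, label, mark read, re-tier, snooze) ride on classifier confidence plus a pinned input hash. The three floor actions additionally need a receipt over the exact payload, verified at execution.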

Two layers, and the feature score picks which one applies. The probabilistic layer is allowed to stay probabilistic precisely because the floor catches the cases where probability isn't enough.

Fair question: is this actually wired, or just a module with a TODO? It's wired. The receipt is minted at `/approve` from the exact bytes you clicked on, `executeToolCall` refuses any floor action that arrives without a verified receipt (`FloorReceiptRequiredError`), and `send_email` re-checks the payload hash in its own path before anything leaves.
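Here is a sketch of what a central guard in front of tool dispatch can look like. `executeToolCall` and `FloorReceiptRequiredError` are names from the post, but their signatures here are hypothetical, and the `verify` callback stands in for a `verifyReceipt`-style check. Because the guard runs before the `switch`, a floor action without a receipt throws before dispatch is ever reached.

```typescript
const FLOOR_ACTIONS = ["send_email", "delete_permanent", "forward_external"] as const;
type FloorAction = (typeof FLOOR_ACTIONS)[number];

// Hypothetical shapes; the post doesn't show these signatures.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}
interface ActionReceipt {
  v: number;
  action: FloorAction;
  payloadHash: string;
}
type Verifier = (receipt: ActionReceipt, action: FloorAction) => void; // throws on drift

class FloorReceiptRequiredError extends Error {
  constructor(tool: string) {
    super(`${tool} is a floor action and needs a verified receipt`);
    this.name = "FloorReceiptRequiredError";
  }
}

function isFloorAction(name: string): name is FloorAction {
  return (FLOOR_ACTIONS as readonly string[]).includes(name);
}

function executeToolCall(call: ToolCall, receipt: ActionReceipt | undefined, verify: Verifier): string {
  const tool = call.name;
  // Central guard runs before dispatch, so a floor action with no tool case yet
  // still fails closed instead of reaching any side effect.
  if (isFloorAction(tool)) {
    if (receipt === undefined) throw new FloorReceiptRequiredError(tool);
    verify(receipt, tool);
  }
  switch (tool) {
    case "archive":
      return "archived"; // reversible: no receipt needed
    case "send_email":
      return "sent"; // per the post, the real send path re-checks its payload hash here too
    default:
      throw new Error(`no tool case for ${tool}`);
  }
}

const noop: Verifier = () => {};
try {
  executeToolCall({ name: "delete_permanent", args: {} }, undefined, noop);
} catch (e) {
  console.log((e as Error).name); // FloorReceiptRequiredError
}
console.log(executeToolCall({ name: "archive", args: {} }, undefined, noop)); // archived
```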

The honest edges, because there are always some: of the three floor actions, only `send_email` is a callable tool today. `delete_permanent` and `forward_external` aren't wired as tool cases yet, but the central guard already fails them closed, so a future case physically can't ship a receipt-less side effect. And the autonomous agent runs in SUGGEST mode by default: read-only tools plus proposals, no mutating power until you opt into AUTO, and even then the floor stands in front of the irreversible three. The brake goes in before the autonomous engine gets switched on, which is the only order that isn't reckless. I'd rather show you the guard and its TODOs than claim more than the code does.
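The mode split can be read as a small gate function. This is a hypothetical sketch: `gate`, its return values, and the read-only tool names are illustrative, not klorn's API. Only the SUGGEST/AUTO modes and the three floor actions come from the post.

```typescript
type AgentMode = "SUGGEST" | "AUTO";
type Decision = "run" | "propose" | "run_with_receipt";

// Hypothetical read-only tools; the floor set matches the post's FLOOR_ACTIONS.
const READ_ONLY_TOOLS = new Set(["search_inbox", "read_thread"]);
const FLOOR_ACTIONS = new Set(["send_email", "delete_permanent", "forward_external"]);

function gate(mode: AgentMode, tool: string): Decision {
  if (READ_ONLY_TOOLS.has(tool)) return "run"; // reads are allowed in either mode
  if (mode === "SUGGEST") return "propose"; // default: no mutating power
  if (FLOOR_ACTIONS.has(tool)) return "run_with_receipt"; // AUTO still hits the floor
  return "run"; // AUTO + reversible mutation
}

console.log(gate("SUGGEST", "archive")); // propose
console.log(gate("AUTO", "archive")); // run
console.log(gate("AUTO", "send_email")); // run_with_receipt
```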

Separate "confident enough to decide" from "verified enough to do." For anything your system can't undo with one user click, don't trust the model's description of what it's about to do — hash the deterministic artifact at approval, verify it at execution, and fail closed on any drift. Confidence is a fine reason to decide. It is never, by itself, a reason to do something you can't take back.

The whole floor is ~210 readable lines in the open, AGPLv3, at **github.com/k08200/klorn** in `packages/api/src/attention-floor.ts`. Three posts, one idea: keep the model in the perception layer, and put everything you actually stand behind in code you can read.

Source: original article on dev.to.
