This series started as a cheap-model brag and keeps getting better comments than posts. Three readers (@nazar_boyko, @txdesk, and @jugeni) independently converged on the same seam, and @jugeni put it in one line I can't improve on:
AUTO wants a corroborator the model cannot write, not a confidence it can.
Here's what that means, and why it's the sharpest critique this design has taken.
Quick recap of the earlier posts: the LLM scores four features per email (`confidence`, `senderTrust`, `reversibility`, `urgency`) and a deterministic rule maps those to a tier. The model perceives; a rule I can read decides.
But those four aren't the same kind of thing. Three of them describe the world:

- `senderTrust`
- `reversibility`
- `urgency`

**confidence** is different in kind. It's the model grading its own work: "how sure am I about the other three?" There is no source outside the model's opinion of itself. And in my rule, the AUTO branch gates when `confidence >= 0.85` (alongside the others).
The dangerous email isn't the one the model is unsure about; the low-confidence floor already routes that to the queue. It's the one the model is confidently wrong about. A polished impersonation that reads as a trusted sender is exactly a high-confidence, high-`senderTrust`, reversible-looking email. It walks toward AUTO through the one feature the model authors about itself, and self-graded confidence is the gate that structurally can't catch a confident lie.
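
To make the blind spot concrete, here is a minimal sketch of a rule with this shape. Only the `confidence >= 0.85` threshold comes from the rule described above. The other cutoffs, the 0.5 floor, the `QUEUE` tier name, and the function name `decide_tier` are assumptions for illustration, not the actual klorn rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    """The four model-authored features, each assumed to lie in [0, 1]."""
    confidence: float
    senderTrust: float
    reversibility: float
    urgency: float


def decide_tier(s: Scores) -> str:
    """Deterministic rule: the model perceives, this function decides."""
    # Low-confidence floor: anything the model is unsure about is queued.
    if s.confidence < 0.5:  # hypothetical floor value
        return "QUEUE"
    # AUTO branch: 0.85 is from the post, the other cutoffs are placeholders.
    if (
        s.confidence >= 0.85
        and s.senderTrust >= 0.8
        and s.reversibility >= 0.8
        and s.urgency <= 0.3
    ):
        return "AUTO"
    return "QUEUE"


# An email the model is unsure about: caught by the floor.
unsure = Scores(confidence=0.40, senderTrust=0.9, reversibility=0.9, urgency=0.1)

# A polished impersonation and a genuine trusted email can produce the
# same scores, because every input here was written by the model.
impersonation = Scores(confidence=0.95, senderTrust=0.9, reversibility=0.9, urgency=0.1)
genuine = Scores(confidence=0.95, senderTrust=0.9, reversibility=0.9, urgency=0.1)

print(decide_tier(unsure))
print(decide_tier(impersonation))
print(decide_tier(impersonation) == decide_tier(genuine))
```

The rule has no input that distinguishes the last two cases, so no threshold tuning can separate them.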
I want to be precise about the blast radius, because it's smaller than that paragraph sounds.
AUTO is classify-only in the current build: an AUTO classification sets a tier and triggers no action. When execution does run, AUTO only ever maps to reversible, internal actions (archive, mark-read). And the three irreversible actions (send, hard-delete, forward-external) sit behind a deterministic floor that ignores every score. So a confident impersonation that reaches AUTO gets quietly handled in a recoverable way, never anything you can't undo.
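
A floor like that might look like the sketch below. The action names come from the paragraph above. The function name `may_run_unattended` and the choice to model the floor as "never runs without a human" are assumptions about how such a floor could be wired, not a description of the real code.

```python
# Action sets taken from the post.
REVERSIBLE_INTERNAL = frozenset({"archive", "mark-read"})
IRREVERSIBLE = frozenset({"send", "hard-delete", "forward-external"})


def may_run_unattended(action: str, tier: str) -> bool:
    """Return True only if `action` may execute with no human in the loop.

    The signature takes no scores on purpose: nothing the model wrote
    can be passed in, so nothing the model wrote can move the floor.
    """
    # Deterministic floor, checked first and unconditionally.
    if action in IRREVERSIBLE:
        return False
    # AUTO only ever maps to reversible, internal actions.
    return tier == "AUTO" and action in REVERSIBLE_INTERNAL


print(may_run_unattended("archive", "AUTO"))
print(may_run_unattended("send", "AUTO"))
print(may_run_unattended("archive", "QUEUE"))
```

Unknown actions fall through to `False`, so adding a new action type fails closed until someone classifies it.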
The seam is real; it just can't currently reach anything unrecoverable. But "bounded by the floor" is not the same as "designed right." The day AUTO starts taking even reversible actions on its own, leaning on a number the model wrote about itself is the wrong gate.
@jugeni's line is the spec: gate on corroboration the model can't author.
- `senderTrust` … a `manualOverrides >= N` pins it, instead of merely suggesting it to the model in the prompt.
- `reversibility` from the …
- `confidence` as a tiebreaker, never as the thing that promotes to AUTO on its own.

The pattern generalizes past email. Any time you let a model score features for a decision, sort the features by whether their source is independent of the model's self-assessment. Gate on the ones that are. The self-graded one is scenery: useful context, never authorization.
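
One way to read that as code is sketched below. Several things in it are assumptions of the sketch: `N_OVERRIDES = 3` is a stand-in for the unspecified N, `manualOverrides` is taken to count times the user manually confirmed the sender, reversibility is read from a fixed per-action table, and "tiebreaker" is interpreted as "confidence may hold an email back but may never promote it".

```python
from dataclasses import dataclass

N_OVERRIDES = 3          # hypothetical value for N
CONFIDENCE_FLOOR = 0.5   # hypothetical; confidence can only veto below this

# Hypothetical fixed table: reversibility comes from the action, not the model.
ACTION_IS_REVERSIBLE = {"archive": True, "mark-read": True, "send": False}


@dataclass(frozen=True)
class Evidence:
    manualOverrides: int     # recorded user actions; the model cannot write this
    proposedAction: str      # looked up in the fixed table above
    modelConfidence: float   # self-graded, so context only


def decide_tier_corroborated(e: Evidence) -> str:
    # Gate on sources that are independent of the model's self-assessment.
    sender_pinned = e.manualOverrides >= N_OVERRIDES
    reversible = ACTION_IS_REVERSIBLE.get(e.proposedAction, False)
    if not (sender_pinned and reversible):
        return "QUEUE"  # no confidence value can override this
    # Confidence as tiebreaker: it may hold an email back, never promote it.
    if e.modelConfidence < CONFIDENCE_FLOOR:
        return "QUEUE"
    return "AUTO"


# Confident impersonation from a sender the user has never confirmed.
print(decide_tier_corroborated(Evidence(0, "archive", 0.99)))
# Known sender, reversible action, model reasonably sure.
print(decide_tier_corroborated(Evidence(5, "archive", 0.90)))
```

The first call stays in the queue no matter how high `modelConfidence` goes, which is the property the current gate lacks.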
I haven't done this yet. Today `confidence` still gates AUTO, and what makes that safe is the floor underneath, not the gate itself. The thing I owe is an adversarial eval: a high-confidence, polished impersonation, measured to see whether it actually reaches AUTO, turning "I think the floor saves us" into a number instead of a belief. That's next, and the eval set is in the open if you want to write the case before I do.
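
The measurement itself can be small. The sketch below assumes the pipeline can be wrapped as a function from one eval case to a tier string; `auto_reach_rate`, `fake_pipeline`, and the case format are hypothetical names, and the demo cases are made-up placeholders, not results.

```python
from typing import Callable, Sequence

Email = dict  # hypothetical: whatever the pipeline accepts as one eval case


def auto_reach_rate(cases: Sequence[Email], classify: Callable[[Email], str]) -> float:
    """Fraction of adversarial cases that the pipeline tiers as AUTO."""
    if not cases:
        raise ValueError("adversarial eval set is empty")
    reached = sum(1 for case in cases if classify(case) == "AUTO")
    return reached / len(cases)


# Placeholder wiring only: fake_pipeline stands in for model scoring plus
# the deterministic rule, and simply echoes a tier stored on the case.
def fake_pipeline(case: Email) -> str:
    return case["stub_tier"]


demo_cases = [
    {"id": "impersonation-1", "stub_tier": "AUTO"},
    {"id": "impersonation-2", "stub_tier": "QUEUE"},
]

rate = auto_reach_rate(demo_cases, fake_pipeline)
print(f"{rate:.0%} of demo cases reached AUTO")
```

Swapping `fake_pipeline` for the real scorer plus rule is what would turn the belief into a number.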
Three posts in, the lesson keeps being the same shape: keep the model in the perception layer, and make every decision answer to something the model can't quietly author. AGPLv3, the whole thing: **github.com/k08200/klorn**.