[ARTICLE · art-43400] src=dev.to pub= topic=large-language-models verified=true sentiment=neutral

Confidence is the one signal your model can't corroborate

A developer building an LLM-based email classifier identified a critical design flaw: the model's self-reported confidence score is used as a gate for autonomous actions, but a confident impersonation could bypass it. The developer argues that decisions should rely on corroborating signals the model cannot author, not on its own self-assessment. The current implementation limits autonomous actions to reversible ones, but the flaw remains a design concern.

read 3 min · views 1 · published Jun 29, 2026

This series started as a cheap-model brag and keeps getting better comments than posts. Three readers (@nazar_boyko, @txdesk, and @jugeni) independently converged on the same seam, and @jugeni put it in one line I can't improve on:

AUTO wants a corroborator the model cannot write, not a confidence it can.

Here's what that means, and why it's the sharpest critique this design has taken.

Quick recap of the earlier posts: the LLM scores four features per email (`confidence`, `senderTrust`, `reversibility`, `urgency`) and a deterministic rule maps those to a tier. The model perceives; a rule I can read decides.

But those four aren't the same kind of thing. Three of them describe the world:

- `senderTrust`
- `reversibility`
- `urgency`

**`confidence`** is different in kind. It's the model grading its own work: "how sure am I about the other three?" There is no source outside the model's opinion of itself. And in my rule, the AUTO branch opens only when `confidence >= 0.85` (alongside the others).

The dangerous email isn't the one the model is unsure about; the low-confidence floor already routes that to the queue. It's the one the model is confidently wrong about. A polished impersonation that reads as a trusted sender is exactly a high-`confidence`, high-`senderTrust`, reversible-looking email. It walks toward AUTO through the one feature the model authors about itself, and self-graded confidence is the gate that structurally can't catch a confident lie.
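To make the failure mode concrete, here is a minimal sketch of a rule shaped like the one described above. Only the four feature names, the AUTO tier, and the `confidence >= 0.85` gate come from this post; the `QUEUE` tier name, the 0..1 score scale, the other thresholds, and the `decideTier` helper are assumptions for illustration, not the project's actual code.

```typescript
// Hypothetical sketch: everything except the feature names, the AUTO tier,
// and the 0.85 confidence gate is assumed.
type Tier = "AUTO" | "QUEUE";

interface EmailFeatures {
  confidence: number;    // the model's grade of its own work (assumed 0..1)
  senderTrust: number;   // assumed 0..1
  reversibility: number; // assumed 0..1
  urgency: number;       // assumed 0..1; not used by this simplified rule
}

// Deterministic rule: the model scores, this function decides.
function decideTier(f: EmailFeatures): Tier {
  if (f.confidence < 0.5) return "QUEUE"; // assumed low-confidence floor
  const promoted =
    f.confidence >= 0.85 &&  // gate from the post
    f.senderTrust >= 0.8 &&  // assumed threshold
    f.reversibility >= 0.8;  // assumed threshold
  return promoted ? "AUTO" : "QUEUE";
}

// An email the model is unsure about never gets near AUTO...
const unsure: EmailFeatures = { confidence: 0.4, senderTrust: 0.9, reversibility: 0.9, urgency: 0.2 };
// ...but a polished impersonation yields scores the model wrote itself.
const impersonation: EmailFeatures = { confidence: 0.95, senderTrust: 0.9, reversibility: 0.9, urgency: 0.2 };

console.log(decideTier(unsure));        // QUEUE
console.log(decideTier(impersonation)); // AUTO
```

Nothing in `decideTier` can tell the second email from a genuine one, because every input it reads was produced by the same model that was fooled.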

I want to be precise about the blast radius, because it's smaller than that paragraph sounds.

AUTO is classify-only in the current build: an AUTO classification sets a tier and triggers no action. When execution does run, AUTO only ever maps to reversible, internal actions (archive, mark-read). And the three irreversible actions (send, hard-delete, forward-external) sit behind a deterministic floor that ignores every score. So a confident impersonation that reaches AUTO gets quietly handled in a recoverable way, never anything you can't undo.
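A floor of that kind might look roughly like the sketch below. The action names are the ones listed above; the `MailAction` type, the `authorize` helper, and the `"run"` / `"needs-human"` verdicts are hypothetical, and the sketch models the "when execution does run" case rather than the current classify-only build.

```typescript
// Hypothetical sketch: action names follow the post, the rest is assumed.
type MailAction = "archive" | "mark-read" | "send" | "hard-delete" | "forward-external";
type Tier = "AUTO" | "QUEUE";
type Verdict = "run" | "needs-human";

// Exhaustive table: adding a new action without classifying it fails to compile.
const IRREVERSIBLE: Record<MailAction, boolean> = {
  "archive": false,
  "mark-read": false,
  "send": true,
  "hard-delete": true,
  "forward-external": true,
};

// The floor takes no scores as input, so no score can argue its way past it.
function authorize(action: MailAction, tier: Tier): Verdict {
  if (IRREVERSIBLE[action]) return "needs-human"; // deterministic floor
  return tier === "AUTO" ? "run" : "needs-human";
}

console.log(authorize("archive", "AUTO")); // run
console.log(authorize("send", "AUTO"));    // needs-human
```

The design choice worth noting is the signature: `authorize` never receives `confidence` or any other model output, which is what makes the floor independent of the model.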

The seam is real; it just can't currently reach anything unrecoverable. But "bounded by the floor" is not the same as "designed right." The day AUTO starts taking even reversible actions on its own, leaning on a number the model wrote about itself is the wrong gate.

@jugeni's line is the spec: gate on corroboration the model can't author.

- `senderTrust`: a `manualOverrides >= N` pins it, instead of merely suggesting it to the model in the prompt.
- `reversibility` from the […]
- `confidence` as a tiebreaker, never as the thing that promotes to AUTO on its own.

The pattern generalizes past email. Any time you let a model score features for a decision, sort the features by whether their source is independent of the model's self-assessment. Gate on the ones that are. The self-graded one is scenery: useful context, never authorization.
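One way a corroboration-first gate could be written is sketched below. From the post it takes only two ideas: `manualOverrides >= N` pins sender trust, and `confidence` acts as a tiebreaker. Everything else is an assumption made for illustration: the value of `N`, the `SenderRecord` shape and its `pinnedTrust` field, the thresholds, and the idea that reversibility arrives as a boolean from somewhere other than the model.

```typescript
// Hypothetical sketch: N, the record shape, and all thresholds are assumed.
type Tier = "AUTO" | "QUEUE";

interface ModelScores {
  confidence: number; // self-graded, so it may demote or break ties, never promote alone
}

interface SenderRecord {
  manualOverrides: number; // times the user set this sender's trust by hand (assumed meaning)
  pinnedTrust: number;     // the value the user set, 0..1 (assumed field)
}

const N = 3; // assumed; the post leaves N open

// Trust counts as corroborated only when enough user history backs it.
function corroboratedTrust(rec: SenderRecord | undefined): number | undefined {
  return rec !== undefined && rec.manualOverrides >= N ? rec.pinnedTrust : undefined;
}

function decideTier(scores: ModelScores, rec: SenderRecord | undefined, actionIsReversible: boolean): Tier {
  const trust = corroboratedTrust(rec);
  if (trust === undefined || !actionIsReversible) return "QUEUE"; // no corroboration, no AUTO
  if (scores.confidence < 0.5) return "QUEUE";                    // assumed low-confidence floor
  if (trust >= 0.8) return "AUTO";                                // clear model-independent signal
  if (trust >= 0.6) return scores.confidence >= 0.85 ? "AUTO" : "QUEUE"; // tiebreaker band
  return "QUEUE";
}

// A lookalike address has no override history, however sure the model is.
console.log(decideTier({ confidence: 0.99 }, undefined, true)); // QUEUE
// A sender the user has pinned by hand is promoted on that history.
console.log(decideTier({ confidence: 0.7 }, { manualOverrides: 5, pinnedTrust: 0.9 }, true)); // AUTO
```

In this arrangement the impersonation from earlier fails at the first check: it can produce any `confidence` it likes, but it cannot produce override history for an address the user has never touched.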

I haven't done this yet. Today `confidence` still gates AUTO, and what makes that safe is the floor underneath, not the gate itself. The thing I owe is an adversarial eval: a high-confidence, polished impersonation, measured to see whether it actually reaches AUTO, turning "I think the floor saves us" into a number instead of a belief. That's next, and the eval set is in the open if you want to write the case before I do.
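The measurement itself can be small. Below is a sketch of what such a harness might compute, assuming a hypothetical `EvalCase` shape and an `autoReachRate` helper; the stub classifier and the two case ids are placeholders standing in for the real model-plus-rule pipeline and the real eval set, not results.

```typescript
// Hypothetical harness: the case shape, helper name, and stub are assumed.
type Tier = "AUTO" | "QUEUE";

interface EvalCase {
  id: string;
  body: string; // the adversarial email text
}

// Fraction of adversarial cases that end up in AUTO; 0 is the goal.
function autoReachRate(cases: EvalCase[], classify: (c: EvalCase) => Tier): number {
  if (cases.length === 0) return 0;
  const reached = cases.filter((c) => classify(c) === "AUTO").length;
  return reached / cases.length;
}

// Placeholder classifier that is fooled by exactly one of the two cases.
const stub = (c: EvalCase): Tier => (c.id === "ceo-lookalike" ? "AUTO" : "QUEUE");

const cases: EvalCase[] = [
  { id: "ceo-lookalike", body: "placeholder text" },
  { id: "vendor-invoice", body: "placeholder text" },
];

console.log(autoReachRate(cases, stub)); // 0.5
```

Passing the classifier in as a function keeps the harness independent of how tiers are decided, so the same cases can be replayed against the current gate and against a corroboration-based one.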

Three posts in, the lesson keeps being the same shape: keep the model in the perception layer, and make every decision answer to something the model can't quietly author. AGPLv3, the whole thing: **github.com/k08200/klorn**.
