I don't trust the LLM to classify my email. So I don't let it.

Source: dev.to · Published: 2026-06-25T05:57Z · Topic: large-language-models

A developer built an email triage system that uses an LLM only to score four features (confidence, sender trust, reversibility, urgency) rather than to classify emails directly. The actual tier decision (PUSH, QUEUE, SILENT, AUTO) is made by a deterministic rule-based function in a separate file, enabling auditability, unit testing, and easy policy changes. This architecture allows a cheap, fast model to outperform a frontier model like GPT-4o because consistency matters more than reasoning for the scoring task.

My classifier calls an LLM on every single email. The LLM is not allowed to classify the email.

That sounds like a contradiction. It's the most important design decision in the thing.

A reader named @nazar_boyko left a comment on my last post — the one where a cheap model beat GPT-4o on email triage — and put it better than I did:

Once the LLM is a feature scorer and not the decider, "consistency over genius" falls right out of it, and a cheap fast model is exactly what you want for reading the same four signals the same way every time.

The price upset was the fun headline. This is the actual thesis. So here it is on its own.

Every inbound email goes to the LLM with one job: read the message and return four scores between 0 and 1.

The response schema is literally:

{"confidence":0.0,"senderTrust":0.0,"reversibility":0.0,"urgency":0.0,"reason":"short phrase"}

No tier. The model never sees the words PUSH, QUEUE, SILENT, AUTO in its output contract. It reads an email and describes it along four axes. It does not get a vote on what happens next.
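A minimal sketch of how a reply could be checked against that contract before anything downstream trusts it. The names `Features` and `parseFeatures` are hypothetical and not taken from the repository; the only thing borrowed from the post is the shape of the schema.

```typescript
// Hypothetical type mirroring the response schema quoted above.
type Features = {
  confidence: number;
  senderTrust: number;
  reversibility: number;
  urgency: number;
  reason: string;
};

const SCORE_KEYS = ["confidence", "senderTrust", "reversibility", "urgency"] as const;

// Returns validated features, or null when the reply breaks the contract.
function parseFeatures(raw: string): Features | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof data !== "object" || data === null) return null;
  const obj = data as Record<string, unknown>;

  // Every score must be a finite number inside [0, 1].
  for (const key of SCORE_KEYS) {
    const v = obj[key];
    if (typeof v !== "number" || !Number.isFinite(v) || v < 0 || v > 1) return null;
  }
  if (typeof obj["reason"] !== "string") return null;

  // Copy only the five contract fields; any extra key is dropped here.
  return {
    confidence: obj["confidence"] as number,
    senderTrust: obj["senderTrust"] as number,
    reversibility: obj["reversibility"] as number,
    urgency: obj["urgency"] as number,
    reason: obj["reason"] as string,
  };
}
```

Because only the five contract fields are copied, a stray `tier` key in a reply would never reach the decision layer in this sketch.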

What happens next lives in one file, tier-policy.ts, in a function with no model in it:

// 1. Very low confidence → QUEUE. Hiding uncertain mail is the worst failure.
if (f.confidence < 0.5) return "QUEUE";

// 2. Urgent AND sure → wake the user.
if (f.urgency >= 0.7 && f.confidence >= 0.7) return "PUSH";

// 3. Anonymous, no clock, trivially reversible → SILENT (narrow: marketing only).
if (f.senderTrust < 0.2 && f.urgency < 0.2 && f.reversibility > 0.9) return "SILENT";

// 4. Reversible, very sure, not urgent, trusted → AUTO.
if (f.reversibility >= 0.85 && f.confidence >= 0.85 && f.urgency < 0.5 && f.senderTrust >= 0.5) return "AUTO";

// 5. Default → QUEUE. "I'll look at it on my own schedule" is the dominant bucket.
return "QUEUE";

That's the whole decider. Every threshold is a named constant in one object above it, not a magic number sprinkled through a prompt. Order matters — earlier branches win. I can read this in thirty seconds, write a unit test for each branch, and change the policy without touching the model or re-running an eval.
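A minimal sketch of that decider as a self-contained function, with one test case per branch. The threshold values are the ones in the excerpt; the constant names, the `T` object, and the `Features` and `Tier` types are hypothetical and not copied from tier-policy.ts.

```typescript
type Tier = "PUSH" | "QUEUE" | "SILENT" | "AUTO";
type Features = { confidence: number; senderTrust: number; reversibility: number; urgency: number };

// Hypothetical names; values taken from the excerpt above.
const T = {
  minConfidence: 0.5,
  pushUrgency: 0.7,
  pushConfidence: 0.7,
  silentMaxTrust: 0.2,
  silentMaxUrgency: 0.2,
  silentMinReversibility: 0.9,
  autoMinReversibility: 0.85,
  autoMinConfidence: 0.85,
  autoMaxUrgency: 0.5,
  autoMinTrust: 0.5,
} as const;

// Order matters: the first matching branch wins.
function decideTier(f: Features): Tier {
  if (f.confidence < T.minConfidence) return "QUEUE";
  if (f.urgency >= T.pushUrgency && f.confidence >= T.pushConfidence) return "PUSH";
  if (
    f.senderTrust < T.silentMaxTrust &&
    f.urgency < T.silentMaxUrgency &&
    f.reversibility > T.silentMinReversibility
  ) return "SILENT";
  if (
    f.reversibility >= T.autoMinReversibility &&
    f.confidence >= T.autoMinConfidence &&
    f.urgency < T.autoMaxUrgency &&
    f.senderTrust >= T.autoMinTrust
  ) return "AUTO";
  return "QUEUE";
}

// One illustrative input per branch, in the same order as the rule.
const cases: Array<[Features, Tier]> = [
  [{ confidence: 0.3, senderTrust: 0.9, reversibility: 0.9, urgency: 0.9 }, "QUEUE"], // 1: unsure beats urgent
  [{ confidence: 0.8, senderTrust: 0.1, reversibility: 0.1, urgency: 0.9 }, "PUSH"], // 2
  [{ confidence: 0.6, senderTrust: 0.1, reversibility: 0.95, urgency: 0.1 }, "SILENT"], // 3
  [{ confidence: 0.9, senderTrust: 0.8, reversibility: 0.9, urgency: 0.2 }, "AUTO"], // 4
  [{ confidence: 0.6, senderTrust: 0.8, reversibility: 0.5, urgency: 0.4 }, "QUEUE"], // 5: default
];

// Empty when every branch behaves as the table expects.
const failures = cases.filter(([f, want]) => decideTier(f) !== want);
```

Changing policy in a setup like this means editing a value in `T` and a row in `cases`; no prompt and no eval run is involved.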

Try doing any of that to "I asked GPT-4o to pick a tier and it picked QUEUE." You can't test it. You can't diff it. You can't explain to yourself why message #4,012 got hidden. The decision isn't anywhere — it's smeared across a weight matrix and a paragraph of prompt.

Once the model's only job is to score four signals, the question stops being "which model reasons best about email policy?" and becomes "which model reads the same four signals the same way every time?"

Those are different questions with different answers. The first one points you at the biggest, most expensive model. The second one points you at a cheap, fast, low-variance one, because a frontier model's extra reasoning, applied to a 30-word email, mostly buys you more ways to have an opinion. That is variance, and variance is the enemy when you've already moved the judgment into a rule. That's why the cheap model won in the last post. It wasn't a cost compromise. Splitting scorer from decider is what made the cheap model the correct choice, not just the affordable one.
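One way to put a number on "reads the same four signals the same way every time" is to score the same email several times and measure how much the output moves. The sketch below is an assumption about how that could be measured, not something the post describes; `maxSpread` and `modalShare` are hypothetical helper names.

```typescript
type Scores = { confidence: number; senderTrust: number; reversibility: number; urgency: number };

// Largest max-minus-min gap for any single feature across repeated runs on one email.
function maxSpread(runs: readonly Scores[]): number {
  const keys = ["confidence", "senderTrust", "reversibility", "urgency"] as const;
  let worst = 0;
  if (runs.length === 0) return worst;
  for (const k of keys) {
    const vals = runs.map((r) => r[k]);
    worst = Math.max(worst, Math.max(...vals) - Math.min(...vals));
  }
  return worst;
}

// Fraction of repeated runs that landed on the most common tier (1 means fully stable).
function modalShare(tiers: readonly string[]): number {
  if (tiers.length === 0) return 1;
  const counts: Record<string, number> = {};
  let best = 0;
  for (const t of tiers) {
    const n = (counts[t] || 0) + 1;
    counts[t] = n;
    if (n > best) best = n;
  }
  return best / tiers.length;
}
```

A small spread can still flip a tier when it straddles a threshold: urgency scores of 0.68 and 0.72 on the same email sit on opposite sides of the 0.7 PUSH gate, which is why low variance matters more than raw capability once the rule owns the decision.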

And because the contract is "four features → tier" and nothing else, the model isn't load-bearing for correctness — it's load-bearing for perception. Proof: when the LLM is down or rate-limited, a keyword fallback produces the same four features with zero model calls, and the exact same rule runs on top. The plumbing doesn't change. The only thing a better model buys you is sharper feature scores on the genuinely ambiguous mail — which is why the one place I'll spend a frontier model is a dial that escalates only the low-confidence tail, and nowhere else.
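A rough sketch of what a keyword fallback and an escalation dial could look like under that contract. Everything here is illustrative: the keyword lists, the fixed score values, and the names `keywordFeatures` and `shouldEscalate` are assumptions, not the repository's code. The one constraint taken from the post is that the fallback must not be able to reach PUSH, which this sketch enforces by pinning confidence below the 0.7 PUSH gate.

```typescript
type Features = {
  confidence: number;
  senderTrust: number;
  reversibility: number;
  urgency: number;
  reason: string;
};

// Hypothetical keyword lists, for illustration only.
const URGENT_WORDS = ["urgent", "asap", "deadline"];
const BULK_WORDS = ["unsubscribe", "newsletter"];

function containsAny(text: string, words: readonly string[]): boolean {
  return words.some((w) => text.indexOf(w) !== -1);
}

// Produces the same four features as the LLM scorer, with zero model calls.
function keywordFeatures(subject: string, body: string, senderKnown: boolean): Features {
  const text = (subject + " " + body).toLowerCase();
  return {
    // Fixed at 0.6: high enough to skip the low-confidence branch,
    // but below the 0.7 PUSH gate, so this fallback cannot emit PUSH.
    confidence: 0.6,
    senderTrust: senderKnown ? 0.7 : 0.1,
    reversibility: containsAny(text, BULK_WORDS) ? 0.95 : 0.5,
    urgency: containsAny(text, URGENT_WORDS) ? 0.8 : 0.1,
    reason: "keyword fallback",
  };
}

// Hypothetical dial: only scores below `dial` confidence go to a stronger model.
function shouldEscalate(f: Features, dial: number): boolean {
  return f.confidence < dial;
}
```

Under the rule quoted earlier, an urgent-looking email scored this way falls through to QUEUE rather than PUSH, so it stays visible instead of waking the user.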

Auditability. The policy is a file. Code review covers it. A regression test pins every branch.

Stable learning. When I correct a misclassification, the correction doesn't go fight the model for control of the answer. It becomes an example that nudges the feature scores toward the right values, and the rule — the spine — stays fixed. The thing that learns and the thing that decides are separated on purpose.

A blast radius you chose. AUTO's thresholds sit deliberately high (reversibility ≥ 0.85, confidence ≥ 0.85, trusted sender) so the system structurally cannot auto-handle a destructive or low-trust action. That floor is a number I can point at, not a behavior I'm hoping the model keeps exhibiting.
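Because the floor is a number, it can be pinned with a sweep over the score space. The sketch below restates the rule from the excerpt and walks a grid of the four scores, collecting any AUTO result that sits under the floor. The function name `sweepAuto` and the grid step are assumptions made for illustration.

```typescript
type Tier = "PUSH" | "QUEUE" | "SILENT" | "AUTO";
type Features = { confidence: number; senderTrust: number; reversibility: number; urgency: number };

// Restatement of the rule from the excerpt, with thresholds inlined.
function decideTier(f: Features): Tier {
  if (f.confidence < 0.5) return "QUEUE";
  if (f.urgency >= 0.7 && f.confidence >= 0.7) return "PUSH";
  if (f.senderTrust < 0.2 && f.urgency < 0.2 && f.reversibility > 0.9) return "SILENT";
  if (f.reversibility >= 0.85 && f.confidence >= 0.85 && f.urgency < 0.5 && f.senderTrust >= 0.5) return "AUTO";
  return "QUEUE";
}

// Walks an (n+1)^4 grid over [0, 1] and reports how often AUTO fires,
// plus every AUTO that falls below the stated floor.
function sweepAuto(n: number): { autoCount: number; violations: Features[] } {
  let autoCount = 0;
  const violations: Features[] = [];
  for (let a = 0; a <= n; a++) {
    for (let b = 0; b <= n; b++) {
      for (let c = 0; c <= n; c++) {
        for (let d = 0; d <= n; d++) {
          const f: Features = { confidence: a / n, senderTrust: b / n, reversibility: c / n, urgency: d / n };
          if (decideTier(f) !== "AUTO") continue;
          autoCount++;
          if (f.reversibility < 0.85 || f.confidence < 0.85 || f.senderTrust < 0.5) violations.push(f);
        }
      }
    }
  }
  return { autoCount, violations };
}

// Grid step of 0.05 (n = 20), about 194,000 combinations.
const sweep = sweepAuto(20);
```

The check looks redundant next to the rule it tests, and that is the point: if a later edit loosens a threshold or reorders a branch, `sweep.violations` stops being empty and the regression shows up in review.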

This doesn't make the model's judgment good — it makes the decision layer honest. Garbage feature scores still produce garbage tiers; the rule only guarantees that identical scores always map to the identical tier, and that I can see why. The thresholds were hand-tuned against 50 emails, and calibrating them from accumulated real corrections is still ahead of me, not behind. The keyword fallback, by design, can't emit PUSH — so a total LLM outage degrades urgent mail to "visible in the queue," never "silently hidden," but it does degrade. I'd rather write that down than pretend the split is free.

This isn't really about email. Any time you're handing an LLM a decision with consequences, you can ask the same question: does the model need to decide, or does it need to read? Separate "what the model perceives" from "what the system does about it." Put the second half in code you can read, test, and stand behind. You get auditability, you get to use a cheaper model, and you stop being surprised by your own product.

The judge, the rule, and the thresholds are all in the open under AGPLv3: github.com/k08200/klorn. The decider is packages/api/src/tier-policy.ts, about sixty readable lines. Go see how few of them there are.


Read original on dev.to → dev.to/k08200/i-dont-trust-the-llm-to-classify-m…
