My classifier calls an LLM on every single email. The LLM is not allowed to classify the email.
That sounds like a contradiction. It's the most important design decision in the thing.
A reader named @nazar_boyko left a comment on my last post (the one where a cheap model beat GPT-4o on email triage) and put it better than I did:
Once the LLM is a feature scorer and not the decider, "consistency over genius" falls right out of it, and a cheap fast model is exactly what you want for reading the same four signals the same way every time.
The price upset was the fun headline. This is the actual thesis. So here it is on its own.
Every inbound email goes to the LLM with one job: read the message and return four scores between 0 and 1.
The response schema is literally:
{"confidence":0.0,"senderTrust":0.0,"reversibility":0.0,"urgency":0.0,"reason":"short phrase"}
No tier. The model never sees the words PUSH, QUEUE, SILENT, AUTO in its output contract. It reads an email and describes it along four axes. It does not get a vote on what happens next.
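A minimal sketch of what enforcing that contract at the boundary could look like. The names `Features` and `parseFeatures` are hypothetical, not taken from the repo, and rejecting a stray `tier` field is an assumption about how strict you would want to be:

```typescript
// Hypothetical names: Features and parseFeatures are illustrative only.
interface Features {
  confidence: number;
  senderTrust: number;
  reversibility: number;
  urgency: number;
  reason: string;
}

const SCORE_KEYS = ["confidence", "senderTrust", "reversibility", "urgency"] as const;

// Parse the raw model output and refuse anything outside the four-score contract.
function parseFeatures(raw: string): Features {
  const data: unknown = JSON.parse(raw); // throws on malformed JSON
  if (typeof data !== "object" || data === null) {
    throw new Error("model output is not a JSON object");
  }
  const obj = data as Record<string, unknown>;

  // ASSUMPTION: a tier in the output is treated as a contract violation,
  // rather than silently ignored.
  if ("tier" in obj) throw new Error("model output must not contain a tier");

  for (const key of SCORE_KEYS) {
    const v = obj[key];
    if (typeof v !== "number" || !Number.isFinite(v) || v < 0 || v > 1) {
      throw new Error(`${key} must be a number between 0 and 1`);
    }
  }
  if (typeof obj.reason !== "string") throw new Error("reason must be a string");

  return {
    confidence: obj.confidence as number,
    senderTrust: obj.senderTrust as number,
    reversibility: obj.reversibility as number,
    urgency: obj.urgency as number,
    reason: obj.reason,
  };
}
```

Anything that fails this check never reaches the decision layer, so a model that starts volunteering opinions gets rejected instead of obeyed.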
What happens next lives in one file, tier-policy.ts, in a function with no model in it:
// 1. Very low confidence → QUEUE. Hiding uncertain mail is the worst failure.
if (f.confidence < 0.5) return "QUEUE";
// 2. Urgent AND sure → wake the user.
if (f.urgency >= 0.7 && f.confidence >= 0.7) return "PUSH";
// 3. Anonymous, no clock, trivially reversible → SILENT (narrow: marketing only).
if (f.senderTrust < 0.2 && f.urgency < 0.2 && f.reversibility > 0.9) return "SILENT";
// 4. Reversible, very sure, not urgent, trusted → AUTO.
if (f.reversibility >= 0.85 && f.confidence >= 0.85 && f.urgency < 0.5 && f.senderTrust >= 0.5) return "AUTO";
// 5. Default → QUEUE. "I'll look at it on my own schedule" is the dominant bucket.
return "QUEUE";
That's the whole decider. Every threshold is a named constant in one object above it, not a magic number sprinkled through a prompt. Order matters: earlier branches win. I can read this in thirty seconds, write a unit test for each branch, and change the policy without touching the model or re-running an eval.
Try doing any of that to "I asked GPT-4o to pick a tier and it picked QUEUE." You can't test it. You can't diff it. You can't explain to yourself why message #4,012 got hidden. The decision isn't anywhere; it's smeared across a weight matrix and a paragraph of prompt.
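Here is a self-contained sketch of that shape: a thresholds object, the five branches, and one test case per branch. The threshold values are copied from the branches above. The constant names, `decideTier`, and the test table are hypothetical and may not match what tier-policy.ts actually calls them:

```typescript
type Tier = "PUSH" | "QUEUE" | "SILENT" | "AUTO";

interface Features {
  confidence: number;
  senderTrust: number;
  reversibility: number;
  urgency: number;
}

// Hypothetical constant names; the values are the ones quoted in the post.
const T = {
  minConfidence: 0.5,
  pushUrgency: 0.7,
  pushConfidence: 0.7,
  silentMaxTrust: 0.2,
  silentMaxUrgency: 0.2,
  silentMinReversibility: 0.9,
  autoMinReversibility: 0.85,
  autoMinConfidence: 0.85,
  autoMaxUrgency: 0.5,
  autoMinTrust: 0.5,
} as const;

// Earlier branches win, so the order of these checks is part of the policy.
function decideTier(f: Features): Tier {
  if (f.confidence < T.minConfidence) return "QUEUE";
  if (f.urgency >= T.pushUrgency && f.confidence >= T.pushConfidence) return "PUSH";
  if (
    f.senderTrust < T.silentMaxTrust &&
    f.urgency < T.silentMaxUrgency &&
    f.reversibility > T.silentMinReversibility
  ) return "SILENT";
  if (
    f.reversibility >= T.autoMinReversibility &&
    f.confidence >= T.autoMinConfidence &&
    f.urgency < T.autoMaxUrgency &&
    f.senderTrust >= T.autoMinTrust
  ) return "AUTO";
  return "QUEUE";
}

// One illustrative case per branch (feature values are made up for the test).
const cases: Array<[string, Features, Tier]> = [
  ["low confidence beats urgency", { confidence: 0.4, senderTrust: 0.9, reversibility: 0.9, urgency: 0.9 }, "QUEUE"],
  ["urgent and sure", { confidence: 0.8, senderTrust: 0.5, reversibility: 0.1, urgency: 0.9 }, "PUSH"],
  ["anonymous marketing", { confidence: 0.9, senderTrust: 0.1, reversibility: 0.95, urgency: 0.1 }, "SILENT"],
  ["trusted and reversible", { confidence: 0.9, senderTrust: 0.8, reversibility: 0.9, urgency: 0.2 }, "AUTO"],
  ["default", { confidence: 0.6, senderTrust: 0.5, reversibility: 0.5, urgency: 0.3 }, "QUEUE"],
];

// Names of the cases whose tier did not match; empty when every branch behaves.
const failures = cases
  .filter(([, features, want]) => decideTier(features) !== want)
  .map(([name]) => name);
```

The first case is the one worth staring at: an email scored as very urgent still lands in QUEUE when confidence is low, because branch 1 runs before branch 2.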
Once the model's only job is to score four signals, the question stops being "which model reasons best about email policy?" and becomes "which model reads the same four signals the same way every time?"
Those are different questions with different answers. The first one points you at the biggest, most expensive model. The second one points you at a cheap, fast, low-variance one, because a frontier model's extra reasoning, applied to a 30-word email, mostly buys you more ways to have an opinion. That is variance, and variance is the enemy when you've already moved the judgment into a rule. That's why the cheap model won in the last post. It wasn't a cost compromise. Splitting scorer from decider is what made the cheap model the correct choice, not just the affordable one.
And because the contract is "four features → tier" and nothing else, the model isn't load-bearing for correctness; it's load-bearing for perception. Proof: when the LLM is down or rate-limited, a keyword fallback produces the same four features with zero model calls, and the exact same rule runs on top. The plumbing doesn't change. The only thing a better model buys you is sharper feature scores on the genuinely ambiguous mail, which is why the one place I'll spend a frontier model is a dial that escalates only the low-confidence tail, and nowhere else.
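To make the plumbing concrete, here is a rough sketch of a scoring pipeline with a fallback and an escalation dial. Everything in it is an assumption for illustration: the `Scorer` type, `keywordFeatures`, `scoreEmail`, the keyword lists, the score values, and the 0.5 default for the dial are all hypothetical, and the real fallback may work quite differently:

```typescript
interface Features {
  confidence: number;
  senderTrust: number;
  reversibility: number;
  urgency: number;
  reason: string;
}

// A scorer is anything that turns an email into features: a model call or plain code.
type Scorer = (email: string) => Promise<Features>;

// ASSUMPTION: keyword lists and score values are invented for this sketch.
function keywordFeatures(email: string): Features {
  const text = email.toLowerCase();
  const marketing = /unsubscribe|newsletter/.test(text);
  const urgent = /urgent|asap|deadline/.test(text);
  return {
    confidence: 0.6,
    senderTrust: marketing ? 0.1 : 0.5,
    reversibility: marketing ? 0.95 : 0.5,
    // ASSUMPTION: urgency is capped below the 0.7 PUSH threshold, which is one
    // possible way to guarantee a fallback can never produce PUSH.
    urgency: urgent ? 0.6 : 0.1,
    reason: "keyword fallback",
  };
}

// Score with the cheap model; escalate only when it reports low confidence.
// ASSUMPTION: escalateBelow = 0.5 is a placeholder value for the dial.
async function scoreEmail(
  email: string,
  cheap: Scorer,
  frontier: Scorer,
  escalateBelow = 0.5,
): Promise<Features> {
  let first: Features;
  try {
    first = await cheap(email);
  } catch (_err) {
    // Cheap model down or rate-limited: same four features, zero model calls.
    return keywordFeatures(email);
  }
  if (first.confidence >= escalateBelow) return first;
  try {
    return await frontier(email);
  } catch (_err) {
    // Frontier model unavailable: keep the cheap model's uncertain scores.
    return first;
  }
}
```

Whatever comes out of `scoreEmail` has the same shape in all three paths, so the rule that runs on top does not need to know which path produced it.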
Auditability. The policy is a file. Code review covers it. A regression test pins every branch.
Stable learning. When I correct a misclassification, the correction doesn't go fight the model for control of the answer. It becomes an example that nudges the feature scores toward the right values, and the rule (the spine) stays fixed. The thing that learns and the thing that decides are separated on purpose.
A blast radius you chose. AUTO's thresholds sit deliberately high (reversibility ≥ 0.85, confidence ≥ 0.85, trusted sender) so the system structurally cannot auto-handle a destructive or low-trust action. That floor is a number I can point at, not a behavior I'm hoping the model keeps exhibiting.
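One way to pin that floor in a regression test is to sweep the whole feature space and check that AUTO never appears below it. The sketch below restates the five branches from the post inline so it runs on its own; the function name, the `AUTO_FLOOR` object, and the 0.05 grid step are assumptions made for the example:

```typescript
type Tier = "PUSH" | "QUEUE" | "SILENT" | "AUTO";

interface Features {
  confidence: number;
  senderTrust: number;
  reversibility: number;
  urgency: number;
}

// The five branches as quoted in the post, restated so this sketch is self-contained.
function decideTier(f: Features): Tier {
  if (f.confidence < 0.5) return "QUEUE";
  if (f.urgency >= 0.7 && f.confidence >= 0.7) return "PUSH";
  if (f.senderTrust < 0.2 && f.urgency < 0.2 && f.reversibility > 0.9) return "SILENT";
  if (f.reversibility >= 0.85 && f.confidence >= 0.85 && f.urgency < 0.5 && f.senderTrust >= 0.5) return "AUTO";
  return "QUEUE";
}

// The floor stated independently of the decider, so a future edit to the
// branches (a reorder, a new branch) gets checked against it.
const AUTO_FLOOR = { reversibility: 0.85, confidence: 0.85, senderTrust: 0.5 };

let autoCount = 0;
let violations = 0;

// ASSUMPTION: a 0.05 grid is fine enough to exercise every threshold in the rule.
for (let c = 0; c <= 20; c++) {
  for (let s = 0; s <= 20; s++) {
    for (let r = 0; r <= 20; r++) {
      for (let u = 0; u <= 20; u++) {
        const f: Features = {
          confidence: c / 20,
          senderTrust: s / 20,
          reversibility: r / 20,
          urgency: u / 20,
        };
        if (decideTier(f) !== "AUTO") continue;
        autoCount++;
        const belowFloor =
          f.reversibility < AUTO_FLOOR.reversibility ||
          f.confidence < AUTO_FLOOR.confidence ||
          f.senderTrust < AUTO_FLOOR.senderTrust;
        if (belowFloor) violations++;
      }
    }
  }
}
```

With the rule as written, `violations` stays at zero by construction. The test earns its keep later, when someone adds a sixth branch or loosens a threshold and the sweep catches it before a user does.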
This doesn't make the model's judgment good; it makes the decision layer honest. Garbage feature scores still produce garbage tiers; the rule only guarantees that identical scores always map to the identical tier, and that I can see why. The thresholds were hand-tuned against 50 emails, and calibrating them from accumulated real corrections is still ahead of me, not behind. The keyword fallback, by design, can't emit PUSH, so a total LLM outage degrades urgent mail to "visible in the queue," never "silently hidden," but it does degrade. I'd rather write that down than pretend the split is free.
This isn't really about email. Any time you're handing an LLM a decision with consequences, you can ask the same question: does the model need to decide, or does it need to read? Separate "what the model perceives" from "what the system does about it." Put the second half in code you can read, test, and stand behind. You get auditability, you get to use a cheaper model, and you stop being surprised by your own product.
The judge, the rule, and the thresholds are all in the open under AGPLv3: github.com/k08200/klorn. The decider is packages/api/src/tier-policy.ts, about sixty readable lines. Go see how few of them there are.