Email Triage Taxonomies for LLM Classification

wpnews.pro

The most important design decision in an email classifier isn't the model — it's the label set, and here's the one I keep coming back to:

You triage email into one of four categories:

URGENT  — production incidents, executive requests; reply within 1 hour
ACTION  — code reviews, meeting follow-ups; reply same day
FYI     — informational, no response needed
NOISE   — newsletters, marketing, automated notifications

From:    {sender}
Subject: {subject}
Snippet: {snippet}

Return ONLY the category name. Nothing else.

That's the working prompt from the Nylas email triage recipe, and almost every line encodes a taxonomy-design lesson worth unpacking. Most people building email agents obsess over model choice and prompt phrasing. The recipe's quiet thesis is that the label set itself does the heavy lifting — get the taxonomy right and a cheap model classifies well; get it wrong and no model saves you.
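To make the "taxonomy as data" idea concrete, here is a minimal sketch of how that prompt could be assembled from a category table. The names CATEGORIES, SNIPPET_LIMIT, and build_prompt are mine, not the recipe's, and the separator between label and definition is approximated; the 200-character snippet cap comes from the input constraints discussed later in this article.

```python
# Sketch only: CATEGORIES, SNIPPET_LIMIT and build_prompt are illustrative
# names, not taken from the recipe.
CATEGORIES = {
    "URGENT": "production incidents, executive requests; reply within 1 hour",
    "ACTION": "code reviews, meeting follow-ups; reply same day",
    "FYI": "informational, no response needed",
    "NOISE": "newsletters, marketing, automated notifications",
}

SNIPPET_LIMIT = 200  # characters; the full body is never sent


def build_prompt(sender: str, subject: str, snippet: str) -> str:
    """Render the triage prompt from the category table and one message."""
    lines = ["You triage email into one of four categories:", ""]
    # Pad label names so the definitions line up, as in the original prompt.
    lines += [f"{name:<7} - {definition}" for name, definition in CATEGORIES.items()]
    lines += [
        "",
        f"From:    {sender.strip()}",
        f"Subject: {subject.strip()}",
        f"Snippet: {snippet.strip()[:SNIPPET_LIMIT]}",
        "",
        "Return ONLY the category name. Nothing else.",
    ]
    return "\n".join(lines)
```

Keeping the definitions in one table means customizing for a different inbox is an edit to four strings rather than a prompt rewrite.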

The recipe states it flatly: four is the right number. With three, you lose fidelity: everything important collapses into one overloaded bucket and you've built a binary classifier with extra steps. With five, the model starts confusing categories, because the boundaries between adjacent labels get too thin to express in a definition.

Notice what makes these four work. They aren't topics — they're response obligations. URGENT means "reply within the hour," ACTION means "reply today," FYI means "no response needed," NOISE means "archive." Each label maps to exactly one behavior. That's the test I'd apply to any email taxonomy: if two labels lead to the same action, merge them; if one label leads to two different actions depending on content, split it.

The same principle shows up in the sales context. The Agent Accounts overview describes an outreach agent classifying replies as interested / not now / unsubscribe — three labels, because the workflow has exactly three branches: book the meeting, schedule a follow-up, stop emailing. The taxonomy is the decision tree, flattened.
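One way to picture "the taxonomy is the decision tree, flattened" is a dispatch table with one handler per label. This is a rough sketch under my own assumptions: the handler functions and the helper that flags merge candidates are hypothetical stand-ins, not part of Agent Accounts or the recipe.

```python
from typing import Callable, Dict, List

# Hypothetical handlers: stand-ins for whatever your workflow actually does.
def book_meeting(reply_id: str) -> str:
    return f"book_meeting:{reply_id}"


def schedule_follow_up(reply_id: str) -> str:
    return f"schedule_follow_up:{reply_id}"


def stop_emailing(reply_id: str) -> str:
    return f"stop_emailing:{reply_id}"


# One label, one branch.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "interested": book_meeting,
    "not now": schedule_follow_up,
    "unsubscribe": stop_emailing,
}


def route(label: str, reply_id: str) -> str:
    """Run the single action attached to a label; refuse unknown labels."""
    handler = HANDLERS.get(label)
    if handler is None:
        raise ValueError(f"unknown label: {label!r}")
    return handler(reply_id)


def labels_sharing_an_action(handlers: Dict[str, Callable[[str], str]]) -> List[List[str]]:
    """Return groups of labels that map to the same handler (merge candidates)."""
    by_handler: Dict[Callable[[str], str], List[str]] = {}
    for label, handler in handlers.items():
        by_handler.setdefault(handler, []).append(label)
    return [labels for labels in by_handler.values() if len(labels) > 1]
```

If labels_sharing_an_action ever returns a non-empty list, that is the "merge them" half of the test showing up in code.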

The recipe makes this literal. Its entire action loop is a for loop over unread messages with one branch per label:

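for msg in fetch_unread():
    cat = classify(msg)  # one of URGENT, ACTION, FYI, NOISE
    if cat in ("URGENT", "ACTION"):
        draft_reply(msg)  # draft only; a human reviews and sends
    elif cat == "NOISE":
        archive(msg)
    # FYI: no branch, the message is left alone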

Two details here do more work than they appear to. First, the agent never sends: URGENT and ACTION produce drafts a human reviews, because the cost of a wrong send (wrong person, wrong tone) is far higher than the friction of one extra click. Second, the loop is idempotent by construction: it only pulls --unread messages, so anything already triaged falls out of the next run without a dedup table. The taxonomy didn't just classify the mail; it shaped the control flow into something you can run from cron unattended.

Drafting also runs at different settings than classification: temperature=0.7 with a "three sentences max" instruction, versus the classifier's temperature=0. Deterministic decisions, natural prose. Same pipeline, two different jobs, and the recipe is blunt that the sentence cap is load-bearing: without it you get drafts that read like a politely overcompensating intern.
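A small way to keep the two jobs from bleeding into each other is to pin their settings side by side. This is a sketch, not the recipe's code: CallSettings is a hypothetical container, the wording of the drafting instruction is my paraphrase of "three sentences max", and the draft token limit is left unset because the article does not state one. The temperatures, the classifier's max_tokens, and the model split are the values this article reports.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class CallSettings:
    """Hypothetical container for per-job model settings."""
    model: str
    temperature: float
    max_tokens: Optional[int]
    extra_instruction: str = ""

    def request_kwargs(self) -> Dict[str, Any]:
        """Generic keyword arguments; omit max_tokens when no cap is set."""
        kwargs: Dict[str, Any] = {"model": self.model, "temperature": self.temperature}
        if self.max_tokens is not None:
            kwargs["max_tokens"] = self.max_tokens
        return kwargs


# Classification: deterministic, one short label.
CLASSIFY = CallSettings(model="gpt-4o-mini", temperature=0.0, max_tokens=10)

# Drafting: more natural prose, capped by instruction rather than by tokens.
DRAFT = CallSettings(
    model="gpt-4o",
    temperature=0.7,
    max_tokens=None,  # assumption: the article gives no token cap for drafts
    extra_instruction="Reply in three sentences max.",
)
```

Freezing the dataclass is a deliberate choice here: a cron job that mutates its own sampling settings is a debugging session waiting to happen.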

Look again at the prompt. URGENT isn't defined as "very important and time-sensitive" — it's "production incidents, executive requests." Concrete instances, not abstract qualities. LLMs pattern-match far better against examples than against adjectives, and ambiguous adjectives are where classifiers drift: one model's "important" is another's "routine."

The deadline annotations ("reply within 1 hour," "reply same day") double as tie-breakers. When a message sits between two buckets, the model can ask the implicit question — does this need an answer in an hour or a day? — which is a much sharper discriminator than topical similarity.

Taxonomy design extends to the response format. The recipe runs classification at temperature=0 with max_tokens=10: deterministic output, one category name, no room for an explanatory paragraph. And it still validates: the code checks the response against the four valid strings and falls back to FYI on anything unrecognized, because LLMs occasionally invent a category. An unrecognized label defaulting to "leave it alone" is the safe failure; defaulting to NOISE would silently archive real mail.
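The validate-and-fall-back step might look something like the sketch below. The function name and the normalization (trimming whitespace and stray punctuation, upper-casing) are my assumptions about what a tolerant check could do; the recipe is only described as comparing against the four valid strings and defaulting to FYI.

```python
VALID_CATEGORIES = ("URGENT", "ACTION", "FYI", "NOISE")
FALLBACK = "FYI"  # the safe failure: leave the message alone


def parse_category(raw: str) -> str:
    """Map raw model output to a valid label, or fall back to FYI."""
    # Assumption: tolerate surrounding whitespace, quotes, and a trailing period.
    cleaned = raw.strip().strip(".\"'`").upper()
    return cleaned if cleaned in VALID_CATEGORIES else FALLBACK
```

Usage is a one-liner around whatever call returns the model's text, for example `cat = parse_category(response_text)`, so an invented label like "IMPORTANT" never reaches the branch that archives mail.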

Input is constrained just as aggressively: sender, subject, and a 200-character snippet — never the full body. That's enough for over 90% accuracy on this task, and it keeps costs almost ignorable. The recipe's math: GPT-4o-mini runs about $0.15 per million input tokens, a snippet plus prompt is roughly 150 tokens, so 100 emails cost around $0.002. Drafting uses the pricier GPT-4o, but only on the URGENT and ACTION subset — typically under 20% of the inbox — so a heavy 200-message day still costs about a nickel. Cheap classification is what makes the whole pattern viable as a cron job running every fifteen minutes rather than a precious resource you ration. And for mail that can't leave your infrastructure, the recipe's privacy mode swaps in a local Ollama endpoint: Llama 3.1 classifies nearly as well as GPT-4o-mini on this task, though drafting quality drops unless you're running a 70B+ parameter model.
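The classification half of that math is easy to check yourself. A minimal sketch using only the figures quoted above; it counts input tokens only and ignores the handful of output tokens, which is my simplification, and the function name is illustrative.

```python
INPUT_PRICE_PER_MILLION = 0.15  # USD per million input tokens, as quoted above
TOKENS_PER_EMAIL = 150          # prompt plus snippet, rough figure from above
RUNS_PER_DAY = 24 * 60 // 15    # a cron job every fifteen minutes


def classification_cost(n_emails: int) -> float:
    """Approximate USD cost of classifying n_emails (input tokens only)."""
    total_tokens = n_emails * TOKENS_PER_EMAIL
    return total_tokens * INPUT_PRICE_PER_MILLION / 1_000_000


if __name__ == "__main__":
    print(f"100 emails: ${classification_cost(100):.5f}")
    print(f"200 emails: ${classification_cost(200):.5f}")
    print(f"cron runs per day: {RUNS_PER_DAY}")
```

For 100 emails this works out to 15,000 tokens, or a little over two tenths of a cent, which lines up with the recipe's "around $0.002".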

The pushback I hear: a fixed taxonomy throws away nuance — why not let the model return free-form tags, or scores along multiple axes? Honestly, sometimes that's right. If you're building analytics over a support inbox, richer structure (category plus urgency plus confidence, as the multi-day support patterns do) earns its complexity, since downstream consumers can aggregate it.

But for an agent that has to act, free-form output is a liability. Every distinct output the model can produce is a code path you have to handle, and "handle" means test. Four labels means four branches you can reason about, load-test, and audit. Forty emergent tags means a routing layer that's effectively another model call. The recipe's discipline — closed vocabulary, validated output, one action per label — is what makes the agent's behavior predictable enough to run unattended against a real mailbox. (Agent Accounts, where you'd give that agent its own inbox to triage, are in beta — taxonomy patterns apply to any mailbox.)

One refinement worth planning for: taxonomies are per-inbox, not universal. The recipe notes that engineering inboxes hit URGENT differently than sales inboxes, so the category definitions — not the category count — are where you customize.

Here's an exercise that takes twenty minutes: pull the last 50 messages from whatever inbox your agent will manage and hand-label them with the four buckets above. Wherever you hesitate, write down why — those hesitations are your category boundaries telling you they need sharper example lists. What labels did your inbox force you to add, and what did they map to in terms of action?
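If you also run the classifier over those same 50 messages, a quick tally of where it disagrees with your hand labels points at the boundaries that need sharper examples. A minimal sketch, assuming you have two equal-length lists of labels; the function name is mine.

```python
from collections import Counter
from typing import Sequence, Tuple


def disagreement_pairs(hand: Sequence[str], model: Sequence[str]) -> "Counter[Tuple[str, str]]":
    """Count (hand_label, model_label) pairs where the two disagree."""
    if len(hand) != len(model):
        raise ValueError("hand and model label lists must be the same length")
    return Counter((h, m) for h, m in zip(hand, model) if h != m)


if __name__ == "__main__":
    hand_labels = ["URGENT", "ACTION", "FYI", "FYI"]
    model_labels = ["ACTION", "ACTION", "FYI", "NOISE"]
    for (h, m), count in disagreement_pairs(hand_labels, model_labels).most_common():
        print(f"you said {h}, model said {m}: {count}")
```

The pairs that show up most often are the definitions to rewrite first.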

source & further reading

dev.to (original article)
