AI Evals, Part 2: Error Analysis, the Unglamorous Superpower Behind Good Evals


*Part 2 of a series on building production AI on .NET. Part 1 covered what evals are and the Analyze → Measure → Improve lifecycle. This post is about the step everyone wants to skip: Analyze.*

When a team decides to "take evals seriously," the first thing they usually do is wrong. They open a dashboard tool, wire up a generic "correctness" score, and watch a number. It feels productive. It produces a chart. And it tells them almost nothing, because they skipped the step that decides what the chart should even measure.

That step is error analysis: reading your AI's actual outputs and naming, precisely, the ways they go wrong. It's unglamorous — no library, no dashboard, just you and a few dozen real examples. It is also, by a wide margin, the highest-leverage thing you will do in evals: error analysis is where the signal comes from. Everything downstream is just operationalising what you find here.

There's a gap between you and your running system that's easy to underestimate. Thousands of inputs flow through your AI feature daily, in shapes you never anticipated, and you have no realistic way to see them at scale. Call it the comprehension gap — the distance between the developer and a true understanding of what the data and the model are actually doing.

Metrics don't bridge that gap; they presuppose it's already bridged. To measure "conciseness" you must first have noticed that verbosity is a failure mode worth caring about. If you pick your metrics before you've read your data, you're measuring your assumptions, not your product. The classic result: a dashboard glowing green while users quietly churn over a problem your metrics were never designed to catch.

Error analysis is how you cross that gap. You trade scale for truth: you can't read everything, so you read a sample, carefully.

It's a three-move loop, and the moves are deliberately low-tech.

1. Get a starting dataset and read it. Pull a sample of real (or realistic) outputs — 50 to 100 is plenty to start. Not the happy-path demo cases; the real distribution, including the weird inputs. Then actually read them. Slowly.

2. Open-code the failures. For each output that's wrong, write a short, free-text note describing what specifically is wrong — in your own words, no fixed categories yet. "Explained the word using a dictionary definition instead of the meaning it has in this sentence." "Translation is correct but the tone is far too formal for a casual chat." "The quiz distractor is so obviously wrong it gives the answer away." This is open coding: you're labelling reality, not forcing it into boxes.

3. Cluster the notes into a taxonomy. Once you have 40–50 notes, patterns emerge. Group them. Those groups are your failure taxonomy — a ranked list of how your feature fails, with rough frequencies. Now you know what to fix first (the common, severe modes) and, crucially, what your metrics should measure.
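The three moves above need nothing more than a spreadsheet, but they can also be sketched in a few lines of Python. This is a minimal sketch, not a tool: the `Note` record, the function names, the output ids, and the category labels are all made up for illustration, and the reading, note-writing, and clustering are still done by a person.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass
class Note:
    output_id: str
    text: str           # free-text open-coding note, in your own words
    category: str = ""  # left empty until the clustering pass


def draw_sample(output_ids, k=50, seed=0):
    """Move 1: pull a random sample from the real distribution."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    ids = list(output_ids)
    return rng.sample(ids, min(k, len(ids)))


def build_taxonomy(notes):
    """Move 3: tally categorised notes into a ranked list with rough frequencies."""
    counts = Counter(n.category for n in notes if n.category)
    total = sum(counts.values())
    return [(cat, count, count / total) for cat, count in counts.most_common()]


# Move 2 happens by hand: read each sampled output and write a note.
notes = [
    Note("out-07", "Dictionary definition instead of the meaning in this sentence"),
    Note("out-12", "Translation correct but tone far too formal for a casual chat"),
    Note("out-31", "Quiz distractor so obviously wrong it gives the answer away"),
    Note("out-44", "Generic definition, ignores the surrounding passage"),
]

# Clustering is a human judgement too; these labels are hypothetical.
notes[0].category = "ignores-context"
notes[1].category = "wrong-register"
notes[2].category = "weak-distractor"
notes[3].category = "ignores-context"

for category, count, share in build_taxonomy(notes):
    print(f"{category:16} {count}  {share:.0%}")
```

The most frequent category comes out on top, which is the ranking the taxonomy is meant to give you. Severity is deliberately left out of the sketch; weighing it against frequency is still your call.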

That's the whole secret. The taxonomy is the output, and it's worth more than any single score, because every later step — the rubric, the golden set, the judge — is downstream of it.

The hard part of error analysis isn't mechanical, it's psychological. You will be tempted to immediately assign a 1–5 score, or to jump to "the fix is to add a line to the prompt." Resist both. Scoring too early collapses rich information ("it's a 2") into a number that hides why. Fixing too early means you patch the first failure you see instead of the most common one.

Stay descriptive for as long as you can. Your only job in this phase is to understand and categorise. Judgement and repair come later.

A second trap is doing it alone. When two people label the same outputs, they disagree — and the disagreements are gold, because they reveal that "good" isn't actually defined yet. A short alignment session to resolve them sharpens your definition of quality before you bake it into a rubric. (Solo founders can approximate this by labelling, sleeping on it, and re-labelling cold.)
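If two people have labelled the same sample, surfacing the disagreements is mechanical; resolving them is the part that needs a conversation. The sketch below assumes each labeller's work is a plain dict from output id to category. The names, ids, and labels are hypothetical.

```python
def find_disagreements(labels_a, labels_b):
    """Return outputs both people labelled differently, plus the raw agreement rate."""
    shared = sorted(set(labels_a) & set(labels_b))
    disagreements = [
        (oid, labels_a[oid], labels_b[oid])
        for oid in shared
        if labels_a[oid] != labels_b[oid]
    ]
    # None when there is no overlap, so an empty comparison is not read as 100%.
    agreement = 1 - len(disagreements) / len(shared) if shared else None
    return disagreements, agreement


first_labeller = {
    "out-07": "ignores-context",
    "out-12": "wrong-register",
    "out-31": "weak-distractor",
    "out-44": "ignores-context",
}
second_labeller = {
    "out-07": "ignores-context",
    "out-12": "fine",
    "out-31": "weak-distractor",
    "out-44": "too-verbose",
}

to_discuss, agreement = find_disagreements(first_labeller, second_labeller)
for output_id, a, b in to_discuss:
    print(f"{output_id}: {a!r} vs {b!r}")
```

The list of rows to discuss is the useful output here. The agreement rate is only a rough signal of how well "good" is defined so far.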

This isn't abstract for us. TextStack has seven AI surfaces, and every rubric we score against came directly out of reading failures, not out of a generic template.

Take Explain (tap a word, get a short in-context explanation). Reading real outputs surfaced a recurring failure: the model would produce a competent dictionary definition while ignoring the sentence the reader was actually looking at — useless for someone trying to understand this passage. That single observation is why the Explain rubric scores accuracy in context and usefulness to a learner as distinct axes, and explicitly penalises dictionary boilerplate under conciseness. The rubric is a direct transcription of the taxonomy.
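One way to keep a rubric honest about where it came from is to record, next to each axis, the failure modes it exists to penalise. The sketch below is an illustration only and not TextStack's actual rubric: the axis names follow the three axes mentioned above, but the `Axis` type, the question wording, and the failure-mode labels are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Axis:
    name: str
    question: str          # what a grader asks about the output
    penalises: tuple = ()  # taxonomy categories this axis is meant to catch


# Hypothetical rendering of the three Explain axes described in the text.
EXPLAIN_RUBRIC = (
    Axis(
        "accuracy_in_context",
        "Does the explanation match the meaning the word has in this sentence?",
        penalises=("ignores-context",),
    ),
    Axis(
        "usefulness_to_learner",
        "Would this help a reader understand the passage in front of them?",
    ),
    Axis(
        "conciseness",
        "Is it short, with no dictionary boilerplate?",
        penalises=("dictionary-boilerplate",),
    ),
)


def uncovered_failure_modes(taxonomy, rubric):
    """Failure modes from the taxonomy that no rubric axis penalises."""
    covered = {mode for axis in rubric for mode in axis.penalises}
    return [mode for mode in taxonomy if mode not in covered]


taxonomy = ["ignores-context", "dictionary-boilerplate", "wrong-register"]
print(uncovered_failure_modes(taxonomy, EXPLAIN_RUBRIC))
```

Anything the check returns is a failure mode you have observed but are not yet measuring, which is the mismatch error analysis is supposed to prevent.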

Other surfaces produced different taxonomies, and therefore different axes:

We didn't invent those dimensions in a meeting. We read outputs until the dimensions were obvious. And because every AI call is traced and viewable on an internal /ai-quality page, error analysis isn't a one-time exercise: new production failures keep feeding new categories back into the taxonomy.
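Folding newly labelled production failures back into an existing taxonomy can be as simple as updating a counter and flagging any category that was not there before. This is a rough sketch under assumed names; `fold_in` and the category labels are hypothetical, and it says nothing about how the traces themselves are stored or viewed.

```python
from collections import Counter


def fold_in(taxonomy, new_categories):
    """Add newly labelled failures; return updated counts and unseen categories."""
    updated = Counter(taxonomy)  # copy, so the original taxonomy is untouched
    updated.update(new_categories)
    unseen = sorted(set(new_categories) - set(taxonomy))
    return updated, unseen


taxonomy = Counter({"ignores-context": 12, "wrong-register": 5})
this_week = ["ignores-context", "invented-quote", "invented-quote"]

updated, unseen = fold_in(taxonomy, this_week)
print(updated.most_common())
print("new categories:", unseen)
```

A category that shows up in `unseen` is a prompt to go back and read those outputs, not a reason to add a metric straight away.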

Error analysis is the part of evals with no tooling, no dashboard, and the highest payoff — and that's exactly why it gets skipped. Read your failures, name them in plain language, and cluster them into a taxonomy. That taxonomy tells you what to fix and what to measure. Skip it and you'll build a beautiful measurement system pointed at the wrong target.

Next in the series: golden datasets that don't lie — turning your taxonomy into a curated set of cases you can score against, without quietly fooling yourself.

TextStack is a reader that helps you finish the dense technical book you keep quitting — it builds every modern AI primitive (observability, evals, RAG, agents) as a real production feature on .NET. Try it at textstack.app, or read the code at github.com/mrviduus/textstack.

Source: dev.to (original article)
