Show HN: AI Metrics, Visually

wpnews.pro

The metrics you meet when finetuning a model — grouped by when you use them, made playful and interactive.

Picture a fishing net dragged through a lake. Every metric here is just an anxious question someone asks about that net. Hold this picture and the rest follows.

"Everything I caught — is it actually fish, or did I also pull up boots?"

Of what I flagged positive, how much was right? Boots in the net (🥾) = false positives.

"All the fish in the lake — did I actually catch them, or did some swim through?"

Of what was really there, how much did I get? Escaped fish = false negatives.
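The two questions above reduce to two ratios over three counts. A minimal sketch, with a made-up haul and a hypothetical helper name (`precision_recall`); returning 0.0 when a denominator is empty is just one common convention:

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from raw counts.

    tp: fish in the net (true positives)
    fp: boots in the net (false positives)
    fn: fish that swam through (false negatives)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up haul: 8 fish caught, 2 boots pulled up, 4 fish escaped.
p, r = precision_recall(tp=8, fp=2, fn=4)
print(f"precision={p:.2f}  recall={r:.2f}")
```

Precision only looks inside the net; recall only looks at the fish. Neither one counts the boots you correctly left in the lake.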

Before judging quality, you watch whether the model is learning at all. During finetuning your dashboard shows loss dropping over time. The single most important habit for a beginner: watch the validation curve, not the training curve.

perplexity = e^loss. Read it as: "how many equally likely words is the model choosing between at each step?" Perplexity 1 = perfectly certain and correct. Perplexity 50 = as unsure as picking from 50 options. Lower is better.

Train loss usually keeps dropping, because the model can memorize its training data. When val loss turns back upward while train loss keeps falling, the model is memorizing, not learning. That widening gap is your signal to stop early.
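One way to act on that signal is patience-based early stopping: keep the checkpoint with the lowest validation loss and give up once it has failed to improve for a few evaluations. A rough sketch, where the loss values, the `patience` setting, and the helper name `best_epoch` are all illustrative assumptions:

```python
def best_epoch(val_losses, patience=2):
    """Index of the checkpoint to keep under simple patience-based stopping."""
    best, best_idx, waited = float("inf"), 0, 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_idx, waited = loss, i, 0
        else:
            waited += 1
            if waited >= patience:
                break  # val loss has stopped improving: stop training here
    return best_idx


# Made-up curves: train loss keeps falling, val loss turns around.
train = [2.9, 2.1, 1.6, 1.2, 0.9, 0.7, 0.5]
val = [3.0, 2.4, 2.0, 1.9, 2.0, 2.2, 2.5]

stop = best_epoch(val)
print(f"keep checkpoint {stop}: train={train[stop]}, val={val[stop]}")
```

Note that the training curve plays no part in the decision; only the validation curve does.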

perplexity = e^(loss): loss is the raw training signal; perplexity is the same quantity translated into a "number of choices" you can picture. Like °C vs an obscure unit: same temperature, but one you can feel.

Now the same idea with real numbers. A language model's job: given the words so far, predict the next one, assigning a probability to every word in its vocabulary. Perplexity asks:

"On average, how many words is the model unsure between at each step?"

The true next word is "mat". Drag how much probability the model gave it:

The remaining probability is spread over other words. Here's the model's guess distribution:

Training shows cross-entropy loss = average surprise, in abstract units (nats). perplexity = e^loss converts that surprise back into a tangible count of choices. Loss 0 → ppl 1. Loss 2.3 → ppl ≈ 10. Loss 4.6 → ppl ≈ 100. Same info, friendlier units.
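The conversion can be checked by hand. The sketch below assumes you already have the probability the model gave to each true next token (the lists here are made up) and that loss is measured in nats, as stated above:

```python
import math


def perplexity(true_token_probs):
    """exp of the mean negative log-probability given to the true tokens."""
    surprises = [-math.log(p) for p in true_token_probs]
    loss = sum(surprises) / len(surprises)  # cross-entropy loss, in nats
    return math.exp(loss)


# The three conversions quoted in the text.
for loss in (0.0, 2.3, 4.6):
    print(f"loss {loss:.1f} -> perplexity {math.exp(loss):.1f}")

# A model that always gives the true word probability 0.1 behaves as if
# it were choosing between 10 equally likely words.
print(perplexity([0.1, 0.1, 0.1]))
```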

What counts as a good perplexity depends on vocabulary and task, so a value is only meaningful relative to a baseline. A strong modern LLM on general English sits around perplexity 3–15. Uniform random guessing across a 50k vocabulary gives perplexity ≈ 50,000. You use it to compare: did finetuning lower my perplexity on my domain's text?
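The random-guessing baseline follows directly from the formula: a uniform guess gives every word probability 1/V, so the loss is ln(V) and the perplexity is V itself. A small sketch; the before and after losses are hypothetical numbers, not measurements:

```python
import math

vocab_size = 50_000  # the 50k vocabulary from the text

# Uniform guessing: loss is ln(V) nats, so perplexity is exactly V.
uniform_loss = math.log(vocab_size)
uniform_ppl = math.exp(uniform_loss)

# Hypothetical validation losses on domain text, before and after finetuning.
loss_before, loss_after = 2.9, 2.1
ppl_before, ppl_after = math.exp(loss_before), math.exp(loss_after)

print(f"uniform baseline: loss={uniform_loss:.2f}, ppl={uniform_ppl:,.0f}")
print(f"before finetuning: ppl={ppl_before:.1f}")
print(f"after finetuning:  ppl={ppl_after:.1f}")
```

The comparison is only fair if both numbers come from the same tokenizer and the same held-out text.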

Below are 12 items. The emoji is the truth: 🐟 is actually positive, 🥾 is actually negative. Click an item to toggle your model's prediction (a blue ring = "model predicts positive / caught in net"). Watch the matrix and metrics update.

Blue ring = predicted positive. Try to catch all the fish without grabbing boots.

Real models output a score (0–1), and you pick a threshold: score ≥ threshold → predict positive. Below, 14 items have fixed scores. Drag the threshold and watch precision and recall pull in opposite directions.

Low threshold = cast a wide net (high recall, low precision). High threshold = only the sure things (high precision, low recall).
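The same tug-of-war can be reproduced in a few lines. The scores and labels below are invented for illustration (they are not the 14 items from the interactive page), and `precision_recall_at` is a hypothetical helper:

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # convention when nothing is predicted
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up model scores; True marks an actual positive (a fish).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [True, True, False, True, True, False, True, False, False, False]

for t in (0.25, 0.55, 0.85):
    p, r = precision_recall_at(scores, labels, t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, raising the threshold from 0.25 to 0.85 lifts precision from 0.62 to 1.00 while recall falls from 1.00 to 0.40.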

Bold = actually positive (🐟). Blue outline = predicted positive at this threshold.

Accuracy = "what fraction did I get right?" — the most intuitive metric, and the most dangerous on imbalanced data. Here's a fraud detector where only a few transactions are actually fraud. Drag how rare fraud is:

Now meet the "lazy model" that just predicts "never fraud" for everything — it does zero real work:

This is the whole reason precision, recall & F1 exist: they ignore the easy majority class and ask "did you catch the thing that matters?" A model can have 98% accuracy and be useless.
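The 98% figure is easy to reproduce. A minimal sketch, assuming a made-up dataset of 1,000 transactions of which 20 are fraud:

```python
n_transactions = 1000
n_fraud = 20  # assumed fraud rate of 2%

labels = [1] * n_fraud + [0] * (n_transactions - n_fraud)
lazy_preds = [0] * n_transactions  # the lazy model: "never fraud"

correct = sum(p == y for p, y in zip(lazy_preds, labels))
accuracy = correct / n_transactions

caught = sum(p == 1 and y == 1 for p, y in zip(lazy_preds, labels))
recall = caught / n_fraud

print(f"accuracy={accuracy:.0%}  recall={recall:.0%}")
```

Accuracy rewards the 980 easy "not fraud" calls; recall shows that not a single fraud case was caught.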

"Of my catch, how much is fish?"

Use when false alarms are costly. Spam filter — blocking real email is worse than missing some spam.

"Of all the fish, how many did I catch?"

Use when misses are costly. Cancer screening — a false alarm beats missing a sick patient.

F1 is the harmonic mean of precision and recall: it collapses if either one is low.

Use when you need balance and don't want a high score from maximizing just one side.
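To see why the harmonic mean resists one-sided gaming, compare it with a plain average. A short sketch with illustrative precision and recall values:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A balanced model versus one that maximizes precision and neglects recall.
for p, r in [(0.9, 0.9), (1.0, 0.1)]:
    arithmetic = (p + r) / 2
    print(f"P={p:.1f} R={r:.1f}  arithmetic mean={arithmetic:.2f}  F1={f1(p, r):.2f}")
```

With precision 1.0 and recall 0.1, the arithmetic mean still reports 0.55, while F1 drops to about 0.18.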

When a model generates text (summaries, translations), the "fish in the lake" become the words in the reference. ROUGE asks: how much of the reference did the generated text catch?

Compare the reference "dog bites man" with the generated "man bites dog". ROUGE-1 says perfect (same 3 words!), but the meaning is reversed. ROUGE-2 (word pairs) and ROUGE-L (order) catch what ROUGE-1 misses. ROUGE and BLEU both count word overlap; they just lead with different anxieties from the fishing net:
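The "dog bites man" example can be worked through with a simplified, recall-only version of the three scores. This is a sketch using whitespace tokenization and hypothetical helper names, not the official ROUGE implementation (which also reports precision and F-measure and applies its own preprocessing):

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(reference, generated, n):
    """Share of the reference's n-grams that also appear in the generated text."""
    ref, gen = ngrams(reference.split(), n), ngrams(generated.split(), n)
    overlap = sum((ref & gen).values())  # clipped overlap count
    total = sum(ref.values())
    return overlap / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence (order matters, gaps allowed)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l_recall(reference, generated):
    ref, gen = reference.split(), generated.split()
    return lcs_length(ref, gen) / len(ref) if ref else 0.0


reference, generated = "dog bites man", "man bites dog"
print("ROUGE-1:", rouge_n_recall(reference, generated, 1))
print("ROUGE-2:", rouge_n_recall(reference, generated, 2))
print("ROUGE-L:", round(rouge_l_recall(reference, generated), 2))
```

ROUGE-1 comes out at 1.0, ROUGE-2 at 0.0, and ROUGE-L at 0.33, because no two words keep their original order.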

"Did I cover everything in the reference?"

ROUGE leads with recall and is the standard for summarization: a summary that drops key points is the failure you fear. Misses hurt.

"Is everything I generated actually correct?"

BLEU leads with precision and is the standard for translation: inventing words that don't belong is the failure you fear. BLEU adds a "brevity penalty" so you can't get a high score by outputting just one perfect word.
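The brevity penalty is easiest to appreciate on the one-word case. The sketch below is a deliberately reduced BLEU: unigram precision only and a single reference, whereas full BLEU combines 1- to 4-gram precisions with a geometric mean. The example sentences and helper names are made up:

```python
import math
from collections import Counter


def clipped_unigram_precision(reference, generated):
    """Share of generated words that also occur in the reference (counts clipped)."""
    ref, gen = Counter(reference.split()), Counter(generated.split())
    matched = sum((ref & gen).values())
    total = sum(gen.values())
    return matched / total if total else 0.0


def brevity_penalty(ref_len, gen_len):
    """1.0 for outputs longer than the reference, exp(1 - r/c) otherwise."""
    if gen_len == 0:
        return 0.0
    if gen_len > ref_len:
        return 1.0
    return math.exp(1 - ref_len / gen_len)


reference = "the cat sat on the mat"
for generated in ("the", "the cat sat on the mat"):
    p = clipped_unigram_precision(reference, generated)
    bp = brevity_penalty(len(reference.split()), len(generated.split()))
    print(f"{generated!r}: precision={p:.2f}  penalty={bp:.3f}  score={p * bp:.3f}")
```

The single word "the" has perfect precision, but the penalty of exp(-5) drags its score to below 0.01.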

ROUGE and BLEU count surface overlap. To them, "the film was great" and "the movie was excellent" share only the filler words "the" and "was" and not a single word pair, so they score poorly despite identical meaning. For finetuning a chatbot, that makes them weak judges.

Compares embeddings, not exact words. "film/movie", "great/excellent" score as near-matches. The meaning-aware fix for ROUGE/BLEU's blindness.

Ask a strong model (e.g. Claude) to score your finetuned model's answers for helpfulness, correctness, tone. The dominant method today for instruction-tuned models.

"Is answer A better than B?" across many prompts. Pairwise preference is what RLHF and Chatbot-Arena rankings use. Humans remain the gold standard for subjective quality.

source & further reading

barvhaim.github.io — original article
