# Show HN: AI Metrics, Visually

> Source: <https://barvhaim.github.io/visual-ai-metrics/>
> Published: 2026-06-15 14:57:16+00:00

The metrics you meet when finetuning a model — grouped by *when* you use them, made playful and interactive.

Picture a fishing net dragged through a lake that holds both fish and old boots. Every metric here is just an anxious question someone asks about that net. Hold this picture and the rest follows.

"Everything I caught — is it actually fish, or did I also pull up boots?"

That is **precision**: of what I flagged positive, how much was right? Boots in the net (🥾) = false positives.

"All the fish in the lake — did I actually catch them, or did some swim through?"

That is **recall**: of what was really there, how much did I get? Escaped fish = false negatives.

Before judging *quality*, you watch whether the model is learning at all. During finetuning your dashboard shows **loss** dropping over time. The single most important habit for a beginner: **watch the validation curve, not the training curve.**

`perplexity = e^loss`. Read it as: *"how many equally-likely words is the model choosing between at each step?"* Perplexity 1 = perfectly certain & correct. Perplexity 50 = as unsure as picking from 50 options. Lower is better.

Train loss almost always keeps dropping, because the model can memorize the training set. When **val loss turns back upward** while train loss keeps falling, the model is memorizing, not learning. That gap is your signal to **stop early**.
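
The stop-early rule above can be sketched as a small patience check. This is a minimal illustration, not any particular trainer's API: the function name `early_stop_epoch`, the `patience` parameter, and both loss curves are made up for the example.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the index of the best epoch once validation loss has failed
    to improve for `patience` epochs in a row; None if that never happens."""
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter.
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch
    return None


# Invented curves for illustration only, not real training logs.
train_losses = [2.9, 2.4, 2.0, 1.7, 1.4, 1.1, 0.9]    # keeps falling
val_losses = [3.0, 2.6, 2.3, 2.2, 2.25, 2.4, 2.6]     # turns back upward

print(early_stop_epoch(val_losses))  # prints 3
```

The train curve never triggers anything here; only the validation curve decides which checkpoint to keep.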

`perplexity = e^loss`: loss is the raw training signal; perplexity is that same thing translated into a "number of choices" you can picture. Like °C vs an obscure unit: same temperature, one you can feel.

Now the same idea with real numbers. A language model's job: **given the words so far, predict the next one**, with a probability for every word in its vocabulary. Perplexity asks:

*"On average, how many words is the model unsure between at each step?"*

The true next word is `mat`. **Drag how much probability the model gave it:**

The remaining probability is spread over other words. Here's the model's guess distribution:

Training shows **cross-entropy loss** = average surprise, in abstract units (nats). `perplexity = e^loss` converts that surprise back into a tangible *count of choices*. Loss 0 → ppl 1. Loss 2.3 → ppl ≈ 10. Loss 4.6 → ppl ≈ 100. Same info, friendlier units.
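
The conversion is a one-liner. The sketch below assumes loss is measured in nats (natural log), as stated above; the helper names `token_loss` and `perplexity_from_loss` are illustrative.

```python
import math


def token_loss(prob_of_true_word):
    """Cross-entropy for one prediction: surprise at the true word, in nats."""
    return -math.log(prob_of_true_word)


def perplexity_from_loss(loss_nats):
    """Translate a loss in nats into an 'effective number of choices'."""
    return math.exp(loss_nats)


for loss in (0.0, 2.3, 4.6):
    print(f"loss {loss:.1f} -> perplexity {perplexity_from_loss(loss):.1f}")

# If the model gave the true word `mat` a probability of 0.25,
# it behaves like a model choosing between 4 equally likely words.
print(perplexity_from_loss(token_loss(0.25)))
```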

What counts as a good perplexity depends on vocabulary & task, so it's only meaningful *relative* to a baseline. A strong modern LLM on general English sits around **perplexity 3–15**. Random guessing across a 50k vocab gives perplexity ≈ 50,000. **You use it to compare**: did finetuning lower my perplexity on my domain's text?
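
To make the comparison concrete, here is a minimal sketch that computes perplexity from the probability a model gave to each true next word. The per-token probabilities for the "base" and "tuned" models are hypothetical numbers chosen for illustration, not measurements.

```python
import math


def perplexity(true_token_probs):
    """exp of the mean negative log-probability given to the true tokens."""
    nll = [-math.log(p) for p in true_token_probs]
    return math.exp(sum(nll) / len(nll))


vocab_size = 50_000
# Random guessing: every true word gets probability 1 / vocab_size.
uniform = perplexity([1 / vocab_size] * 20)

# Hypothetical probabilities on the same four tokens of domain text.
base = perplexity([0.05, 0.10, 0.02, 0.20])
tuned = perplexity([0.20, 0.30, 0.10, 0.40])

print(f"uniform: {uniform:.0f}, base: {base:.1f}, tuned: {tuned:.1f}")
```

The absolute values matter less than the direction: the same text scored by both models should come out with a lower perplexity after finetuning.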

Below are 12 items. The **emoji is the truth**: 🐟 is actually positive, 🥾 is actually negative. **Click an item to toggle your model's prediction** (a blue ring = "model predicts positive / caught in net"). Watch the matrix and metrics update.

Blue ring = predicted positive. Try to catch all the fish without grabbing boots.
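
The matrix and metrics behind this game can be sketched in a few lines. The 12 items below are a hypothetical mix of 5 fish and 7 boots (the actual mix on the page is not known), and the function names are illustrative.

```python
def confusion_counts(truth, predicted):
    """Count the four confusion-matrix cells from boolean truth/prediction lists."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)          # fish in the net
    fp = sum(1 for t, p in pairs if not t and p)      # boots in the net
    fn = sum(1 for t, p in pairs if t and not p)      # fish that escaped
    tn = sum(1 for t, p in pairs if not t and not p)  # boots left in the lake
    return tp, fp, fn, tn


def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


# Hypothetical lake: True = fish (actually positive), False = boot.
truth = [True] * 5 + [False] * 7
# Hypothetical net: True = the model predicted positive.
predicted = [True, True, True, False, False,
             True, False, False, False, False, False, False]

tp, fp, fn, tn = confusion_counts(truth, predicted)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision(tp, fp):.2f} recall={recall(tp, fn):.2f}")
```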

Real models output a *score* (0–1), and you pick a **threshold**: score ≥ threshold → predict positive. Below, 14 items have fixed scores. Drag the threshold and watch precision and recall pull in opposite directions.

Low threshold = cast a wide net (high recall, low precision). High threshold = only the sure things (high precision, low recall).

**Bold** = actually positive (🐟). Blue outline = predicted positive at this threshold.
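
The tug-of-war can be reproduced with a small threshold sweep. The scores and labels below are invented for illustration; returning precision 1.0 when nothing is predicted positive is a convention assumed here, not something the page specifies.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    predicted = [s >= threshold for s in scores]
    pairs = list(zip(predicted, labels))
    tp = sum(1 for p, y in pairs if p and y)
    fp = sum(1 for p, y in pairs if p and not y)
    fn = sum(1 for p, y in pairs if not p and y)
    # Assumed convention: an empty net has no false alarms, so precision = 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical model scores and true labels (True = actually positive).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
labels = [True, True, False, True, True, False, True, False]

for threshold in (0.2, 0.5, 0.85):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold {threshold:.2f}: precision {p:.2f}, recall {r:.2f}")
```

On this toy data the low threshold catches every positive but lets boots in, while the high threshold has no false alarms and misses most of the fish.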

**Accuracy** = "what fraction did I get right?" — the most intuitive metric, and the most dangerous on **imbalanced data**. Here's a fraud detector where only a few transactions are actually fraud. Drag how rare fraud is:

Now meet the **"lazy model"** that just predicts *"never fraud"* for everything — it does zero real work:

This is the whole reason precision, recall & F1 exist: they ignore the easy majority class and ask *"did you catch the thing that matters?"* A model can have **98% accuracy and be useless.**
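
The lazy model's numbers are easy to verify by hand. The sketch below assumes a hypothetical batch of 1,000 transactions with 20 frauds; `lazy_model_report` is an illustrative name.

```python
def lazy_model_report(n_total, n_fraud):
    """Accuracy and recall of a model that predicts 'never fraud' for everything."""
    tp, fp = 0, 0                 # it never predicts positive
    fn = n_fraud                  # every fraud is missed
    tn = n_total - n_fraud        # every legitimate transaction is 'right'
    accuracy = (tp + tn) / n_total
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, recall


# Hypothetical data: 20 frauds among 1,000 transactions (2% fraud rate).
accuracy, recall = lazy_model_report(n_total=1000, n_fraud=20)
print(f"accuracy={accuracy:.0%} recall={recall:.0%}")
```

Accuracy is carried entirely by the true negatives, which is why it looks excellent while recall on the class that matters is zero.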

"Of my catch, how much is fish?"

**Use precision when** false alarms are costly. *Spam filter*: blocking real email is worse than missing some spam.

"Of all the fish, how many did I catch?"

**Use recall when** misses are costly. *Cancer screening*: a false alarm beats missing a sick patient.

**F1** is the harmonic mean of precision and recall; it collapses if *either* is low.

**Use when** you need balance and don't want a high score from maximizing just one side.
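
A short sketch shows why the harmonic mean punishes a lopsided model where a plain average would not. The precision and recall values are illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f1(0.8, 0.8)
lopsided = f1(1.0, 0.1)            # perfect precision, almost no recall
plain_average = (1.0 + 0.1) / 2    # what an arithmetic mean would report

print(f"balanced F1={balanced:.2f}")
print(f"lopsided F1={lopsided:.2f} vs arithmetic mean={plain_average:.2f}")
```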

When a model *generates* text (summaries, translations), the "fish in the lake" become the **words in the reference**. ROUGE asks: how much of the reference did the generated text catch?

Reference `dog bites man` and Generated `man bites dog`.

ROUGE-1 says perfect (same 3 words!), but the meaning is reversed. ROUGE-2 (word pairs) and ROUGE-L (order) catch what ROUGE-1 misses.

ROUGE and BLEU both count word overlap; they just lead with different anxieties from the fishing net:

"Did I cover everything in the reference?"

ROUGE is the standard for **summarization**: a summary that drops key points is the failure you fear, so it leads with recall. Misses hurt.
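
Here is a minimal recall-only sketch of ROUGE-1, ROUGE-2 and ROUGE-L, run on the `dog bites man` / `man bites dog` pair from earlier. It assumes whitespace tokenization and skips stemming and the precision/F-measure variants that ROUGE toolkits also report, so treat it as an illustration rather than a reference implementation.

```python
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_recall(reference, generated, n):
    """Share of the reference's n-grams that also appear in the generated text."""
    ref = Counter(ngrams(reference.split(), n))
    gen = Counter(ngrams(generated.split(), n))
    overlap = sum((ref & gen).values())   # multiset intersection clips counts
    total = sum(ref.values())
    return overlap / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_recall(reference, generated):
    ref, gen = reference.split(), generated.split()
    return lcs_length(ref, gen) / len(ref) if ref else 0.0


reference, generated = "dog bites man", "man bites dog"
print("ROUGE-1:", rouge_n_recall(reference, generated, 1))
print("ROUGE-2:", rouge_n_recall(reference, generated, 2))
print("ROUGE-L:", round(rouge_l_recall(reference, generated), 2))
```

ROUGE-1 comes out perfect, ROUGE-2 finds no shared word pair, and ROUGE-L only credits a single in-order word.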

"Is everything I generated actually correct?"

BLEU is the standard for **translation**: inventing words that don't belong is the failure you fear, so it leads with precision. It adds a "brevity penalty" so you can't get a high score by outputting just one perfect word.
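
The brevity penalty is easiest to see in a stripped-down version. The sketch below uses unigrams only and a single reference; full BLEU combines clipped precisions for 1- to 4-grams with a geometric mean, so this is a simplification with illustrative function names and an invented example sentence.

```python
import math
from collections import Counter


def clipped_unigram_precision(reference, generated):
    """Share of generated words found in the reference, each word counted
    at most as often as it occurs in the reference."""
    ref = Counter(reference.split())
    gen = Counter(generated.split())
    clipped = sum(min(count, ref[word]) for word, count in gen.items())
    total = sum(gen.values())
    return clipped / total if total else 0.0


def brevity_penalty(reference, generated):
    """1.0 for outputs longer than the reference, exp(1 - r/c) otherwise."""
    r, c = len(reference.split()), len(generated.split())
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)


def unigram_bleu(reference, generated):
    return (brevity_penalty(reference, generated)
            * clipped_unigram_precision(reference, generated))


reference = "the cat sat on the mat"
print(clipped_unigram_precision(reference, "the"))   # one perfect word: 1.0
print(round(unigram_bleu(reference, "the"), 4))      # penalty crushes it
print(clipped_unigram_precision(reference, "the the the the the the"))
```

Clipping stops a model from farming precision by repeating a safe word, and the brevity penalty stops it from saying almost nothing.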

ROUGE and BLEU count **surface overlap**. To them, `"the film was great"` and `"the movie was excellent"` share only the filler words `the` and `was`, with no matching word pairs, so they score low despite identical meaning. For finetuning a chatbot, that makes them *weak* judges.
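
A quick check of what the two sentences actually share at the surface level, assuming plain whitespace tokenization (the helper name `shared_ngrams` is illustrative):

```python
def shared_ngrams(a, b, n):
    """Set of n-grams (as tuples of words) that appear in both texts."""
    def grams(text):
        tokens = text.split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return grams(a) & grams(b)


first = "the film was great"
second = "the movie was excellent"

print(sorted(shared_ngrams(first, second, 1)))  # only the filler words
print(sorted(shared_ngrams(first, second, 2)))  # no shared word pairs
```

The content words, which carry the meaning, contribute nothing to the overlap.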

Embedding-based metrics (BERTScore is a common example) compare **embeddings**, not exact words. "film/movie", "great/excellent" score as near-matches. The meaning-aware fix for ROUGE/BLEU's blindness.
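
The idea can be sketched with greedy cosine matching: each reference word is paired with its most similar generated word, and the similarities are averaged. The vectors below are hand-made toy values invented purely for illustration; a real metric would take contextual embeddings from a pretrained model, and its scores would differ.

```python
import math

# Toy 4-dimensional vectors, made up for this example only.
TOY_EMBEDDINGS = {
    "the": (1.0, 0.0, 0.0, 0.0),
    "was": (0.0, 1.0, 0.0, 0.0),
    "film": (0.0, 0.0, 1.0, 0.1),
    "movie": (0.0, 0.0, 0.9, 0.2),
    "great": (0.1, 0.0, 0.1, 1.0),
    "excellent": (0.2, 0.0, 0.1, 0.9),
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_recall(reference, generated):
    """Average, over reference words, of the best cosine match in the output."""
    ref = [TOY_EMBEDDINGS[w] for w in reference.split()]
    gen = [TOY_EMBEDDINGS[w] for w in generated.split()]
    return sum(max(cosine(r, g) for g in gen) for r in ref) / len(ref)


score = greedy_match_recall("the film was great", "the movie was excellent")
print(f"embedding-based recall: {score:.3f}")
```

With these toy vectors the pair scores close to 1, even though only two of the four words match exactly.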

LLM-as-judge: ask a strong model (e.g. Claude) to *score* your finetuned model's answers for helpfulness, correctness, and tone. The dominant method today for instruction-tuned models.

"Is answer A better than B?" across many prompts. Pairwise preference is what RLHF and Chatbot-Arena rankings use. Humans remain the gold standard for subjective quality.