{"slug": "show-hn-ai-metrics-visually", "title": "Show HN: AI Metrics, Visually", "summary": "A developer created an interactive visual guide to AI metrics used in model finetuning, explaining concepts like loss, perplexity, precision, recall, and F1 through playful analogies and hands-on demos. The tool helps beginners understand when to use each metric and why accuracy can be misleading on imbalanced data.", "body_md": "The metrics you meet when finetuning a model, grouped by *when* you use them, made playful and interactive.\n\nPicture a fishing net dragged through a lake that holds both fish and old boots. Every metric here is just an anxious question someone asks about that net. Hold this picture and the rest follows.\n\n**Precision:** \"Everything I caught: is it actually fish, or did I also pull up boots?\"\n\nOf what I flagged positive, how much was right? Boots in the net (🥾) = false positives.\n\n**Recall:** \"All the fish in the lake: did I actually catch them, or did some swim through?\"\n\nOf what was really there, how much did I get? Escaped fish = false negatives.\n\nBefore judging *quality*, you watch whether the model is learning at all. During finetuning your dashboard shows **loss** dropping over time. The single most important habit for a beginner: **watch the validation curve, not the training curve.**\n\n`perplexity = e^loss`. Read it as: *\"how many equally-likely words is the model choosing between at each step?\"* Perplexity 1 = perfectly certain & correct. Perplexity 50 = as unsure as picking from 50 options. Lower is better.\n\nTrain loss usually keeps dropping, because the model can memorize the training data. When **val loss turns back upward** while train loss still falls, the model is memorizing, not learning. That gap is your signal to **stop early**.\n\n`perplexity = e^loss`: loss is the raw training signal; perplexity is that same thing translated into \"number of choices\" you can picture. Like °C vs an obscure unit: same temperature, one you can feel.\nNow the same idea with real numbers. 
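The relationship `perplexity = e^loss` can be checked with a few numbers. What follows is a minimal Python sketch, not the page's own code: the function names and the list of probabilities are hypothetical, chosen only for illustration, and the loss is assumed to be mean cross-entropy measured in nats.

```python
import math

def mean_cross_entropy(true_token_probs):
    # Average surprise in nats: mean of -ln(p), where p is the probability
    # the model gave to the correct next word at each step.
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)

def perplexity(loss_nats):
    # perplexity = e^loss turns average surprise into a count of choices.
    return math.exp(loss_nats)

# Hypothetical probabilities assigned to the true next word at four steps.
probs = [0.5, 0.25, 0.5, 0.25]
loss = mean_cross_entropy(probs)
ppl = perplexity(loss)
print(round(loss, 3), round(ppl, 3))

# A model that spreads probability evenly over 50 words has perplexity 50.
uniform_ppl = perplexity(mean_cross_entropy([1 / 50] * 10))
print(round(uniform_ppl, 3))
```

Perplexity here is the geometric mean of `1 / p` over the steps, which is why a model that is always choosing between 2 and 4 equally likely words lands between 2 and 4.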
A language model's job: **given the words so far, predict the next one**, with a probability for every word in its vocabulary. Perplexity asks:\n\n*\"On average, how many words is the model unsure between at each step?\"*\n\nThe true next word is `mat`. **Drag how much probability the model gave it:**\n\nThe remaining probability is spread over other words. Here's the model's guess distribution:\n\nTraining shows **cross-entropy loss** = average surprise, in abstract units (nats). `perplexity = e^loss` converts that surprise back into a tangible *count of choices*. Loss 0 → ppl 1. Loss 2.3 → ppl ≈ 10. Loss 4.6 → ppl ≈ 100. Same info, friendlier units.\n\nWhat counts as a good perplexity depends on vocabulary & task, so it's only meaningful *relative* to a baseline. A strong modern LLM on general English sits around **perplexity 3–15**. Random guessing across a 50k vocab gives perplexity ≈ 50,000. **You use it to compare**: did finetuning lower my perplexity on my domain's text?\n\nBelow are 12 items. The **emoji is the truth**: 🐟 is actually positive, 🥾 is actually negative. **Click an item to toggle your model's prediction** (a blue ring = \"model predicts positive / caught in net\"). Watch the matrix and metrics update.\n\nBlue ring = predicted positive. Try to catch all the fish without grabbing boots.\n\nReal models output a *score* (0–1), and you pick a **threshold**: score ≥ threshold → predict positive. Below, 14 items have fixed scores. Drag the threshold and watch precision and recall pull in opposite directions.\n\nLow threshold = cast a wide net (high recall, low precision). High threshold = only the sure things (high precision, low recall).\n\n**Bold** = actually positive (🐟). Blue outline = predicted positive at this threshold.\n\n**Accuracy** = \"what fraction did I get right?\" It is the most intuitive metric, and the most dangerous on **imbalanced data**. Here's a fraud detector where only a few transactions are actually fraud. 
Drag how rare fraud is:\n\nNow meet the **\"lazy model\"** that just predicts *\"never fraud\"* for everything (it does zero real work):\n\nThis is the whole reason precision, recall & F1 exist: they ignore the easy majority class and ask *\"did you catch the thing that matters?\"* A model can have **98% accuracy and be useless**: when only 2% of transactions are fraud, the lazy model scores 98% while catching no fraud at all.\n\n**Precision:** \"Of my catch, how much is fish?\"\n\n**Use when** false alarms are costly. *Spam filter*: blocking real email is worse than missing some spam.\n\n**Recall:** \"Of all the fish, how many did I catch?\"\n\n**Use when** misses are costly. *Cancer screening*: a false alarm beats missing a sick patient.\n\n**F1:** the harmonic mean of precision and recall; it dies if *either* is low.\n\n**Use when** you need balance and don't want a high score from maximizing just one side.\n\nWhen a model *generates* text (summaries, translations), the \"fish in the lake\" become the **words in the reference**. ROUGE asks: how much of the reference did the generated text catch?\n\nTake Reference `dog bites man` and Generated `man bites dog`.\nROUGE-1 says perfect (same 3 words!), but the meaning is reversed. ROUGE-2 (word pairs) and ROUGE-L (order) catch what ROUGE-1 misses.\nROUGE and BLEU both count word overlap; they just lead with different anxieties from the fishing net:\n\n**ROUGE (recall first):** \"Did I cover everything in the reference?\"\n\nStandard for **summarization**: a summary that drops key points is the failure you fear. Misses hurt.\n\n**BLEU (precision first):** \"Is everything I generated actually correct?\"\n\nStandard for **translation**: inventing words that don't belong is the failure you fear. It adds a \"brevity penalty\" so you can't get a high score by outputting just one perfect word.\n\nROUGE and BLEU count **surface overlap**. To them, `\"the film was great\"` and `\"the movie was excellent\"` share only the filler words `the` and `was` and not a single word pair, so the score is low despite identical meaning. For finetuning a chatbot, that makes them *weak* judges.\n\nCompares **embeddings**, not exact words. \"film/movie\", \"great/excellent\" score as near-matches. 
The meaning-aware fix for ROUGE/BLEU's blindness.\n\nAsk a strong model (e.g. Claude) to *score* your finetuned model's answers for helpfulness, correctness, tone. The dominant method today for instruction-tuned models.\n\n\"Is answer A better than B?\" across many prompts. Pairwise preference is what RLHF and Chatbot-Arena rankings use. Humans remain the gold standard for subjective quality.", "url": "https://wpnews.pro/news/show-hn-ai-metrics-visually", "canonical_source": "https://barvhaim.github.io/visual-ai-metrics/", "published_at": "2026-06-15 14:57:16+00:00", "updated_at": "2026-06-15 15:09:04.416708+00:00", "lang": "en", "topics": ["machine-learning", "artificial-intelligence", "ai-tools", "developer-tools"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/show-hn-ai-metrics-visually", "markdown": "https://wpnews.pro/news/show-hn-ai-metrics-visually.md", "text": "https://wpnews.pro/news/show-hn-ai-metrics-visually.txt", "jsonld": "https://wpnews.pro/news/show-hn-ai-metrics-visually.jsonld"}}