{"slug": "how-we-actually-measure-whether-an-llm-s-output-is-good-bleu-comet-and-bleurt", "title": "How We Actually Measure Whether an LLM's Output Is Good - BLEU, COMET and BLEURT", "summary": "Shrijith Venkatramana, building git-lrc, explains the evolution of LLM evaluation metrics from BLEU to BLEURT and COMET. BLEU, introduced in 2002, measures n-gram overlap and correlates with human judgment but fails with creative outputs. BLEURT and COMET use neural networks to assess meaning, better matching human evaluation for modern LLMs.", "body_md": "*Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.*\n\n*An AI model writes a paragraph. It sounds fluent. It looks convincing. But how do you know whether it's actually good?*\n\nThis deceptively simple question has occupied researchers for more than two decades.\n\nLong before ChatGPT, machine translation researchers faced exactly the same problem. Human evaluation was expensive, inconsistent, and painfully slow. If every new model required thousands of humans to compare translations, research would crawl.\n\nThat necessity gave rise to **BLEU**, one of the most influential evaluation metrics in AI history. Years later, as language models became better at paraphrasing and reasoning, BLEU started to show its age. Researchers responded with learned metrics like **BLEURT** and **COMET**, which use neural networks to judge language much more like humans do.\n\nInterestingly, this mirrors software engineering itself. We first wrote simple unit tests, then integration tests, and today we increasingly rely on sophisticated observability systems. 
Evaluation metrics for LLMs have undergone a similar evolution.\n\nLet's see why.\n\nImagine you're building Google Translate in 2001.\n\nEvery time your team improves the model, someone has to read thousands of translated sentences and score them.\n\nSuppose a single sentence pair takes only 20 seconds to judge.\n\nEvaluating 50,000 sentences would require nearly **280 human-hours**.\n\nNow imagine dozens of experiments every week.\n\nEvaluation—not training—quickly becomes the bottleneck.\n\nResearchers at IBM, led by **Kishore Papineni**, introduced **BLEU (Bilingual Evaluation Understudy)** in 2002 to automate this process.\n\nTheir idea was surprisingly simple:\n\nIf a machine translation resembles what professional translators write, it's probably good.\n\nThis became one of the most cited papers in natural language processing.\n\nBLEU compares a model's output against one or more human reference translations.\n\nSuppose the reference is:\n\nThe cat is sitting on the mat.\n\nThe model produces:\n\nThe cat sat on the mat.\n\nMany words and short phrases overlap.\n\nNow consider:\n\nA feline rested indoors.\n\nA human recognizes this as a perfectly reasonable translation.\n\nBLEU mostly doesn't.\n\nWhy?\n\nBecause BLEU isn't measuring meaning.\n\nIt measures **shared n-grams**—contiguous sequences of words.\n\nThe score combines:\n\nIt also penalizes outputs that are suspiciously short.\n\nHigh-level intuition:\n\nThis simple idea turned out to correlate surprisingly well with human judgments across large datasets.\n\nWhen the famous **Transformer** paper *Attention Is All You Need* reported **28.4 BLEU** on the WMT English-German benchmark, that represented roughly a **2 BLEU improvement** over previous systems—a significant jump that helped establish the Transformer as the new state of the art.\n\nBLEU assumes that good translations look similar.\n\nModern LLMs don't.\n\nConsider these summaries.\n\nReference:\n\nThe meeting was postponed because the client 
requested additional documentation.\n\nOutput A:\n\nThe meeting was delayed after the client asked for more documents.\n\nOutput B:\n\nClient requested more paperwork, so the meeting moved.\n\nHumans would probably give both excellent scores.\n\nBLEU prefers whichever shares more exact phrases.\n\nNow imagine asking ChatGPT:\n\nExplain recursion like I'm five.\n\nThere are hundreds of excellent answers.\n\nBLEU expects one.\n\nThis becomes even worse for:\n\nAs models became more creative, exact word overlap became a poor proxy for quality.\n\nResearchers needed evaluation metrics that understood meaning rather than wording.\n\nGoogle Research introduced **BLEURT** in 2020.\n\nInstead of counting words, BLEURT fine-tunes a pretrained Transformer to predict human evaluation scores.\n\nThink of it as hiring a reviewer instead of using a spell checker.\n\nDuring training, BLEURT sees:\n\nAfter millions of examples, it learns patterns humans value:\n\nAn interesting engineering trick made BLEURT particularly effective.\n\nHuman-scored datasets are relatively small.\n\nThe researchers first generated large amounts of **synthetically corrupted text**—introducing deletions, substitutions, and paraphrases—to pretrain the evaluator before fine-tuning on expensive human judgments.\n\nThis significantly reduced the amount of labeled data needed.\n\nAround the same time, researchers at **Unbabel** developed **COMET**.\n\nLike BLEURT, COMET uses a neural network.\n\nBut it has access to something BLEURT often doesn't:\n\nThat additional context matters.\n\nSuppose the French sentence is:\n\nIl fait froid.\n\nOne candidate says:\n\nIt is cold.\n\nAnother says:\n\nIt is freezing.\n\nWithout seeing the original sentence, both might seem acceptable.\n\nWith the source available, COMET can better judge whether meaning has shifted.\n\nModern COMET models consistently show stronger correlation with professional human evaluators than BLEU across many translation benchmarks.\n\nToday, 
COMET is frequently reported alongside BLEU in machine translation research.\n\nTraining frontier models can cost millions of dollars.\n\nEvaluation, surprisingly, can become expensive too.\n\nImagine comparing three model versions.\n\nEach produces answers for:\n\nThat's **150,000 outputs**.\n\nIf humans spend only 15 seconds per evaluation:\n\n150,000 × 15 seconds ≈ **625 hours**\n\nAt $40/hour for expert annotators, that's **$25,000** for a single evaluation round.\n\nAnd that's before measuring agreement between multiple reviewers.\n\nAutomatic metrics dramatically reduce this cost.\n\nA common workflow today looks like:\n\nThe automatic metric acts as a high-quality filter rather than a replacement for humans.\n\nNo single metric captures quality perfectly.\n\nBLEU remains valuable because:\n\nBLEURT improves semantic understanding by learning from human judgments.\n\nCOMET goes even further by incorporating the original source sentence and demonstrating stronger agreement with professional evaluators.\n\nFor frontier LLMs, evaluation has become even broader.\n\nResearchers increasingly combine:\n\nThe lesson is larger than machine translation.\n\nAs AI systems become more capable, **evaluating them becomes an AI problem itself.**\n\nThe future of language models may depend just as much on better judges as on better generators.\n\n*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.\n\ngit-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*\n\nAny feedback or contributors are welcome! 
It's online, source-available, and ready for anyone to use.\n\nGenAI today is a **race car without brakes**. It accelerates fast -- you describe something, and large blocks of code appear instantly. But AI agents *silently break things*: they remove logic, relax constraints, introduce expensive cloud calls, leak credentials, and change behavior -- without telling you. 
You often find out in production.\n\n**git-lrc is your braking system.** It hooks into `git commit` and runs an AI review on every diff. In short, git-lrc helps **Prevent Outages, Breaches, and Technical Debt Before They Happen**.\n\n**At a glance:** [10 risk categories](https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for) · [100+ failure patterns tracked](https://github.com/HexmosTech/git-lrc#what-git-lrc-checks-for) · every commit…", "url": "https://wpnews.pro/news/how-we-actually-measure-whether-an-llm-s-output-is-good-bleu-comet-and-bleurt", "canonical_source": "https://dev.to/shrsv/how-we-actually-measure-whether-an-llms-output-is-good-bleu-comet-and-bleurt-3c0f", "published_at": "2026-06-26 18:36:27+00:00", "updated_at": "2026-06-26 19:04:09.930879+00:00", "lang": "en", "topics": ["large-language-models", "natural-language-processing", "ai-research", "machine-learning"], "entities": ["Shrijith Venkatramana", "git-lrc", "IBM", "Kishore Papineni", "Google Research", "BLEU", "BLEURT", "COMET"], "alternates": {"html": "https://wpnews.pro/news/how-we-actually-measure-whether-an-llm-s-output-is-good-bleu-comet-and-bleurt", "markdown": "https://wpnews.pro/news/how-we-actually-measure-whether-an-llm-s-output-is-good-bleu-comet-and-bleurt.md", "text": "https://wpnews.pro/news/how-we-actually-measure-whether-an-llm-s-output-is-good-bleu-comet-and-bleurt.txt", "jsonld": "https://wpnews.pro/news/how-we-actually-measure-whether-an-llm-s-output-is-good-bleu-comet-and-bleurt.jsonld"}}