How We Actually Measure Whether an LLM's Output Is Good - BLEU, COMET and BLEURT

Shrijith Venkatramana, building git-lrc, explains the evolution of LLM evaluation metrics from BLEU to BLEURT and COMET. BLEU, introduced in 2002, measures n-gram overlap and correlates with human judgment but fails with creative outputs. BLEURT and COMET use neural networks to assess meaning, better matching human evaluation for modern LLMs.

Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star us to help devs discover the project. Do give it a try and share your feedback for improving the product.

An AI model writes a paragraph. It sounds fluent. It looks convincing. But how do you know whether it's actually good?

This deceptively simple question has occupied researchers for more than two decades. Long before ChatGPT, machine translation researchers faced exactly the same problem. Human evaluation was expensive, inconsistent, and painfully slow. If every new model required thousands of humans to compare translations, research would crawl.

That necessity gave rise to BLEU, one of the most influential evaluation metrics in AI history. Years later, as language models became better at paraphrasing and reasoning, BLEU started to show its age. Researchers responded with learned metrics like BLEURT and COMET, which use neural networks to judge language much more like humans do.

Interestingly, this mirrors software engineering itself. We first wrote simple unit tests, then integration tests, and today we increasingly rely on sophisticated observability systems. Evaluation metrics for LLMs have undergone a similar evolution. Let's see why.

Imagine you're building Google Translate in 2001. Every time your team improves the model, someone has to read thousands of translated sentences and score them. Suppose a single sentence pair takes only 20 seconds to judge. Evaluating 50,000 sentences would require nearly 280 human-hours. Now imagine dozens of experiments every week. Evaluation, not training, quickly becomes the bottleneck.

Researchers at IBM, led by Kishore Papineni, introduced BLEU (Bilingual Evaluation Understudy) in 2002 to automate this process. Their idea was surprisingly simple: if a machine translation resembles what professional translators write, it's probably good. This became one of the most cited papers in natural language processing.
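The back-of-the-envelope estimate for manual evaluation is easy to reproduce. Here is a minimal sketch; the function name `human_eval_hours` and the figure of 24 experiments per week are illustrative assumptions standing in for "dozens", not numbers from any real team.

```python
def human_eval_hours(num_items: int, seconds_per_item: float) -> float:
    """Total human time, in hours, to judge num_items one at a time."""
    return num_items * seconds_per_item / 3600


# Figures from the text: 50,000 sentence pairs at 20 seconds each.
hours_per_round = human_eval_hours(50_000, 20)
print(f"{hours_per_round:.1f} hours per evaluation round")  # 277.8 hours per evaluation round

# Hypothetical load: 24 experiments per week (an assumed stand-in for "dozens").
experiments_per_week = 24
print(f"{hours_per_round * experiments_per_week:.1f} hours per week")  # 6666.7 hours per week
```

Even with generous assumptions, a week of experiments would take thousands of hours of human reading time, which is the bottleneck an automatic metric removes.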
BLEU compares a model's output against one or more human reference translations.

Suppose the reference is:

The cat is sitting on the mat.

The model produces:

The cat sat on the mat.

Many words and short phrases overlap. Now consider:

A feline rested indoors.

A human recognizes this as a perfectly reasonable translation. BLEU mostly doesn't. Why? Because BLEU isn't measuring meaning. It measures shared n-grams: contiguous sequences of words.

The score combines modified n-gram precision at several lengths (typically 1-grams through 4-grams). It also penalizes outputs that are suspiciously short.

High-level intuition: the more word sequences a translation shares with the reference, the higher it scores.

This simple idea turned out to correlate surprisingly well with human judgments across large datasets. When the famous Transformer paper Attention Is All You Need reported 28.4 BLEU on the WMT English-German benchmark, that represented roughly a 2 BLEU improvement over previous systems, a significant jump that helped establish the Transformer as the new state of the art.

BLEU assumes that good translations look similar. Modern LLMs don't. Consider these summaries.

Reference: The meeting was postponed because the client requested additional documentation.

Output A: The meeting was delayed after the client asked for more documents.

Output B: Client requested more paperwork, so the meeting moved.

Humans would probably give both excellent scores. BLEU prefers whichever shares more exact phrases.

Now imagine asking ChatGPT:

Explain recursion like I'm five.

There are hundreds of excellent answers. BLEU expects one. This becomes even worse for open-ended, creative tasks.

As models became more creative, exact word overlap became a poor proxy for quality. Researchers needed evaluation metrics that understood meaning rather than wording.

Google Research introduced BLEURT in 2020. Instead of counting words, BLEURT fine-tunes a pretrained Transformer to predict human evaluation scores. Think of it as hiring a reviewer instead of using a spell checker.
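To see what the "spell checker" side of that comparison looks like in practice, here is a minimal, single-sentence sketch of BLEU-style scoring applied to the cat example above. It is not the reference implementation: the function names and the naive tokenizer are illustrative assumptions, and it only uses n-grams up to length 2 (standard BLEU typically goes up to 4 and is computed over a whole corpus) so that a short sentence does not collapse to zero.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped precision: a candidate n-gram is credited at most as many
    times as it occurs in the reference."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def brevity_penalty(candidate, reference):
    """Scale the score down when the candidate is shorter than the reference."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1 - r / c)


def simple_bleu(candidate, reference, max_n=2):
    """Brevity penalty times the geometric mean of the 1..max_n precisions."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no overlap at some n-gram size: the geometric mean is zero
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, reference) * math.exp(log_mean)


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase, drop periods, split on spaces."""
    return text.lower().replace(".", "").split()


reference = tokenize("The cat is sitting on the mat.")
close = tokenize("The cat sat on the mat.")
paraphrase = tokenize("A feline rested indoors.")

print(f"{simple_bleu(close, reference):.2f}")       # 0.60
print(f"{simple_bleu(paraphrase, reference):.2f}")  # 0.00
```

The near-match scores about 0.6, while the paraphrase scores exactly 0 because it shares no words with the reference, even though a human might accept both. For real evaluations, an established implementation such as sacreBLEU is the usual choice over a hand-rolled one.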
During training, BLEURT sees a reference sentence, a candidate sentence, and the score human raters gave that candidate. After millions of examples, it learns the patterns humans value.

An interesting engineering trick made BLEURT particularly effective. Human-scored datasets are relatively small. The researchers first generated large amounts of synthetically corrupted text (introducing deletions, substitutions, and paraphrases) to pretrain the evaluator before fine-tuning on expensive human judgments. This significantly reduced the amount of labeled data needed.

Around the same time, researchers at Unbabel developed COMET. Like BLEURT, COMET uses a neural network. But it has access to something BLEURT often doesn't: the original source sentence.

That additional context matters. Suppose the French sentence is:

Il fait froid.

One candidate says:

It is cold.

Another says:

It is freezing.

Without seeing the original sentence, both might seem acceptable. With the source available, COMET can better judge whether meaning has shifted.

Modern COMET models consistently show stronger correlation with professional human evaluators than BLEU across many translation benchmarks. Today, COMET is frequently reported alongside BLEU in machine translation research.

Training frontier models can cost millions of dollars. Evaluation, surprisingly, can become expensive too.

Imagine comparing three model versions. Each produces answers for 50,000 prompts. That's 150,000 outputs. If humans spend only 15 seconds per evaluation:

150,000 × 15 seconds = 625 hours

At $40/hour for expert annotators, that's $25,000 for a single evaluation round. And that's before measuring agreement between multiple reviewers.

Automatic metrics dramatically reduce this cost. A common workflow today uses an automatic metric to screen outputs first and sends a smaller subset to human reviewers. The automatic metric acts as a high-quality filter rather than a replacement for humans.

No single metric captures quality perfectly. BLEU remains valuable because it is cheap to compute and still widely reported. BLEURT improves semantic understanding by learning from human judgments.
COMET goes even further by incorporating the original source sentence and demonstrating stronger agreement with professional evaluators.

For frontier LLMs, evaluation has become even broader. Researchers increasingly combine several kinds of evaluation rather than relying on a single score.

The lesson is larger than machine translation. As AI systems become more capable, evaluating them becomes an AI problem itself. The future of language models may depend just as much on better judges as on better generators.

AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs, without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.

Any feedback or contributors are welcome. It's online, source-available, and ready for anyone to use.

GenAI today is a race car without brakes. It accelerates fast: you describe something, and large blocks of code appear instantly. But AI agents silently break things: they remove logic, relax constraints, introduce expensive cloud calls, leak credentials, and change behavior, without telling you. You often find out in production.

git-lrc is your braking system.
It hooks into git commit and runs an AI review on every diff. In short, git-lrc helps prevent outages, breaches, and technical debt before they happen.

At a glance: 10 risk categories https://github.com/HexmosTech/git-lrc what-git-lrc-checks-for · 100+ failure patterns tracked https://github.com/HexmosTech/git-lrc what-git-lrc-checks-for · every commit…