How I use an LLM as a translation judge

The article explains how the author uses GEMBA-MQM v2, an LLM-based system, to automate translation quality evaluation by classifying errors by type and severity according to the MQM industry standard. The author notes that while this method ranked first at WMT24 for correlating with human judgments, individual LLM evaluations are noisy, with scores varying widely across multiple passes on the same translation. To address this, the author recommends running 10 passes per segment, removing outliers beyond two standard deviations, and using rank-reciprocal weighted averaging to produce a stable aggregate score.

I use GEMBA-MQM v2 to evaluate translation quality in my live speech-to-speech translation pipeline.

MQM (Multidimensional Quality Metrics) is an open industry standard for grading translations. Instead of a single score, it classifies every error by type (mistranslation, omission, hallucination, grammar, etc.) and severity (critical, major, minor). It's what professional linguists use when they review translations manually.

GEMBA makes an LLM do this same annotation process. It prompts the model to read the source and translation side by side, find the errors, and tag each one with an MQM type and severity. So you get the same structured error breakdown you'd get from a human reviewer, but automated.

It ranked #1 on WMT24 by correlation with human MQM annotations.

The catch: LLM judges are noisy. On one English-to-German clip, 10 passes gave me scores from -29 to -109. Same translation, same model.

The fix is straightforward. Run 10 passes per segment, drop outliers beyond 2 standard deviations, aggregate with rank-reciprocal weighted averaging so the harshest outlier doesn't dominate. That same clip settles at -41.9 across 10 passes.

If you're using LLM-as-judge for anything, try running multiple passes. The variance will surprise you.

Full methodology: LLMs as translation judges: Inside GEMBA-MQM v2
Code: VoiceFrom/live-s2st-eval
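
The aggregation step described above (drop outliers beyond 2 standard deviations, then rank-reciprocal weighted averaging) could look roughly like this in Python. This is a minimal sketch, not the code from the linked repository. The function names `drop_outliers`, `rank_reciprocal_mean`, and `aggregate` are hypothetical, and so are the sample scores. Two details are assumptions, since the post doesn't spell them out: the standard deviation is the population one, and ranks run from the mildest score (rank 1, weight 1) to the harshest (rank n, weight 1/n), so the harshest pass gets the smallest weight.

```python
import statistics


def drop_outliers(scores, k=2.0):
    """Keep scores within k standard deviations of the mean.

    Assumption: population standard deviation (pstdev).
    """
    if len(scores) < 2:
        return list(scores)
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        return list(scores)  # all passes agree, nothing to drop
    return [s for s in scores if abs(s - mean) <= k * sd]


def rank_reciprocal_mean(scores):
    """Weighted mean where the score at rank r gets weight 1/r.

    Assumption: rank 1 is the mildest (highest) score, so the
    harshest score ends up with the smallest weight.
    """
    if not scores:
        raise ValueError("no scores to aggregate")
    ranked = sorted(scores, reverse=True)  # mildest first
    weights = [1.0 / rank for rank in range(1, len(ranked) + 1)]
    weighted_sum = sum(w * s for w, s in zip(weights, ranked))
    return weighted_sum / sum(weights)


def aggregate(scores, k=2.0):
    """Drop outliers first, then take the rank-reciprocal weighted mean."""
    return rank_reciprocal_mean(drop_outliers(scores, k))


# Made-up MQM scores from 10 passes on one segment (illustration only,
# not the actual English-to-German clip from the post).
passes = [-29, -34, -35, -38, -40, -41, -45, -52, -60, -109]
print(round(aggregate(passes), 1))
```

With these made-up scores the -109 pass sits more than 2 standard deviations from the mean and is dropped before weighting. The remaining passes are weighted so the milder ones count more, which pulls the aggregate above the plain average.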