{"slug": "how-i-improved-my-fact-checker-from-f1-0-655-0-813-what-actually-changed", "title": "How I improved my fact-checker from F1 0.655 to 0.813: what actually changed", "summary": "A developer improved a multilingual fact-checker's F1 score from 0.655 to 0.813 by fixing a fundamental input error: the model was trained on claims alone instead of claim-evidence pairs. The XLM-RoBERTa model fine-tuned on the FEVER dataset was originally trained without evidence, causing it to misfire on obvious facts. After concatenating claims with gold evidence sentences, the F1 score rose about 24% in relative terms. The developer also implemented a retrieval pipeline using BGE embeddings and FAISS to handle real-world claims without pre-labeled evidence.", "body_md": "I built a multilingual fact-checker using XLM-RoBERTa fine-tuned on the FEVER dataset. The first version hit F1 0.655. Not bad, but it kept misfiring on obvious real-world claims. The claim that Earth is the third planet from the Sun returned FALSE at 76% confidence. Something was fundamentally wrong.\n\nA commenter identified the issue immediately: I was training the model on claims alone, with no evidence. FEVER is not a claim classification task. It's a Natural Language Inference task: the model is supposed to verify a claim against evidence, not guess from the claim text alone. I had been training it wrong from the start.\n\nFEVER (Fact Extraction and VERification) contains 228,000+ Wikipedia claim-evidence pairs. Each claim is annotated as SUPPORTS, REFUTES, or NOT ENOUGH INFO based on retrieved Wikipedia sentences. The whole point is that the model sees both the claim and the evidence together and decides if the evidence supports or contradicts the claim.\n\nTraining on claims alone strips out all that signal. The model has nothing to reason about; it just memorizes surface patterns in the claim text.\n\nThe fix was straightforward: concatenate the claim with its gold evidence sentences before passing them to the model. 
XLM-RoBERTa uses `</s>` as a sentence separator, so the format becomes `[claim] </s> [evidence]`. I fine-tuned for one epoch on the full FEVER training set, starting from the existing checkpoint, and F1 jumped from 0.655 to 0.813.\n\nThe improvement wasn't from a better architecture, more data, or longer training. It was purely from feeding the model what it was designed to receive.\n\nThe retrained model was great on FEVER benchmarks but useless for real-world claims, because real-world claims don't come with pre-labeled Wikipedia evidence attached. You need to retrieve the evidence yourself.\n\nFor this, I used BGE (BAAI/bge-base-en-v1.5), a retrieval-optimized embedding model from the Beijing Academy of Artificial Intelligence (BAAI). I call the approach Reverse HyDE: instead of generating a hypothetical document for the query, you embed the claim itself as the retrieval query and find the most semantically similar evidence passages. The FEVER passages are indexed in a FAISS flat inner product index, which is equivalent to cosine similarity when the vectors are L2-normalized.\n\nAt inference time: embed the claim, retrieve the top 3 most relevant FEVER passages, concatenate them with the claim, and pass the result to XLM-RoBERTa. The whole pipeline takes under a second.\n\nThe combined system, retrieval-augmented XLM-RoBERTa, handled real-world claims correctly where the v1 model failed. Claims about historical facts, scientific facts, and geography all returned sensible verdicts with high confidence.\n\nXLM-RoBERTa is pretrained on CommonCrawl data across 100 languages. This means the fine-tuned model inherits multilingual capability without any additional training. You can submit claims in Hindi, Spanish, Tamil, Arabic, or Chinese and the model understands them. The retrieved evidence is always English (since FEVER is English), but XLM-RoBERTa handles cross-lingual NLI reasonably well, so the claim and evidence don't need to be in the same language.\n\nThe architecture did not change. The dataset did not change. 
The training duration did not change. What changed was understanding what the task actually requires and formatting the input accordingly. A 24% relative F1 improvement from fixing the input format is a good reminder to read the dataset paper before training.\n\nhuggingface.co/ashg2099/xlm-roberta-factchecker", "url": "https://wpnews.pro/news/how-i-improved-my-fact-checker-from-f1-0-655-0-813-what-actually-changed", "canonical_source": "https://dev.to/ashg2099/how-i-improved-my-fact-checker-from-f1-0655-0813-what-actually-changed-455a", "published_at": "2026-06-21 01:31:47+00:00", "updated_at": "2026-06-21 02:36:50.154149+00:00", "lang": "en", "topics": ["natural-language-processing", "machine-learning", "large-language-models", "ai-research", "developer-tools"], "entities": ["XLM-RoBERTa", "FEVER", "BGE", "FAISS", "Beijing Academy of AI", "CommonCrawl"], "alternates": {"html": "https://wpnews.pro/news/how-i-improved-my-fact-checker-from-f1-0-655-0-813-what-actually-changed", "markdown": "https://wpnews.pro/news/how-i-improved-my-fact-checker-from-f1-0-655-0-813-what-actually-changed.md", "text": "https://wpnews.pro/news/how-i-improved-my-fact-checker-from-f1-0-655-0-813-what-actually-changed.txt", "jsonld": "https://wpnews.pro/news/how-i-improved-my-fact-checker-from-f1-0-655-0-813-what-actually-changed.jsonld"}}