# How I improved my fact-checker from F1 0.655 to 0.813: what actually changed

> Source: <https://dev.to/ashg2099/how-i-improved-my-fact-checker-from-f1-0655-0813-what-actually-changed-455a>
> Published: 2026-06-21 01:31:47+00:00

I built a multilingual fact-checker using XLM-RoBERTa fine-tuned on the FEVER dataset. The first version hit F1 0.655. Not bad, but it kept misfiring on obvious real-world claims. The claim "Earth is the third planet from the Sun" returned FALSE at 76% confidence. Something was fundamentally wrong.

A commenter identified the issue immediately: I was training the model on claims alone, with no evidence. FEVER is not a claim classification task. It's a Natural Language Inference task — the model is supposed to verify a claim against evidence, not guess from the claim text alone. I had been training it wrong from the start.

FEVER (Fact Extraction and VERification) contains 228,000+ Wikipedia claim-evidence pairs. Each claim is annotated as SUPPORTS, REFUTES, or NOT ENOUGH INFO based on retrieved Wikipedia sentences. The whole point is that the model sees both the claim and the evidence together and decides if the evidence supports or contradicts the claim.

Training on claims alone strips out all that signal. The model has nothing to reason about; it just memorizes surface patterns in the claim text.

The fix was straightforward: concatenate the claim with its gold evidence sentences before passing it to the model. XLM-RoBERTa uses `</s>` as a sentence separator, so the format becomes `[claim] </s> [evidence]`. I fine-tuned for one epoch on the full FEVER training set, starting from the existing checkpoint. F1 jumped from 0.655 to 0.813.
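As a minimal sketch of that input format (not the actual training code), the helper below joins a claim and its evidence sentences with XLM-RoBERTa's `</s>` separator. The function name `build_nli_input`, the example evidence sentence, and the claim-only fallback for examples without evidence are assumptions made for illustration.

```python
from typing import Sequence

SEP = "</s>"  # XLM-RoBERTa's separator token


def build_nli_input(claim: str, evidence: Sequence[str], sep: str = SEP) -> str:
    """Join a claim and its evidence sentences into one claim-evidence string."""
    # Drop empty sentences and merge the rest into a single evidence passage.
    evidence_text = " ".join(s.strip() for s in evidence if s.strip())
    if not evidence_text:
        # Assumption: with no evidence available, fall back to the claim alone.
        return claim.strip()
    return f"{claim.strip()} {sep} {evidence_text}"


# Illustrative evidence sentence, not copied from FEVER.
example = build_nli_input(
    "Earth is the third planet from the Sun.",
    ["Earth is the third planet from the Sun and the only known planet to harbor life."],
)
print(example)
```

In practice, Hugging Face tokenizers accept the claim and evidence as a text pair, `tokenizer(claim, evidence, truncation=True)`, and insert the special tokens themselves, which avoids writing the separator by hand.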

The improvement wasn't from a better architecture, more data, or longer training. It was purely from feeding the model what it was designed to receive.

The retrained model was great on FEVER benchmarks but useless for real-world claims, because real-world claims don't come with pre-labeled Wikipedia evidence attached. You need to retrieve the evidence yourself.

For this, I used BGE (BAAI/bge-base-en-v1.5), a retrieval-optimized embedding model from the Beijing Academy of Artificial Intelligence (BAAI). The approach is called Reverse HyDE: instead of generating a hypothetical document for the query, you embed the claim itself as the retrieval query and find the most semantically similar evidence passages. The FEVER passages are indexed in a FAISS flat inner product index, which gives cosine similarity because the vectors are L2-normalized.
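To see why an inner product index yields cosine similarity, here is a small pure-Python sketch of an exhaustive ("flat") search over L2-normalized vectors. It stands in for what a FAISS `IndexFlatIP` computes and is not the pipeline's real code. The helper names and the two-dimensional toy vectors are made up for illustration; real BGE embeddings are much higher-dimensional.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def top_k_inner_product(
    query: Sequence[float], passages: Sequence[Sequence[float]], k: int = 3
) -> List[Tuple[int, float]]:
    """Return (passage_index, score) for the k highest inner products."""
    q = l2_normalize(query)
    scored = []
    for idx, passage in enumerate(passages):
        p = l2_normalize(passage)
        # For unit vectors, the inner product equals the cosine similarity.
        scored.append((idx, sum(a * b for a, b in zip(q, p))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D "embeddings" (hypothetical values).
toy_passages = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = top_k_inner_product([2.0, 0.0], toy_passages, k=2)
print(hits)
```

Vector length drops out after normalization, so passage 0 (same direction as the query) scores 1.0 and passage 2 (45 degrees away) scores about 0.707, regardless of their original magnitudes.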

At inference time: embed the claim, retrieve the top 3 most relevant FEVER passages, concatenate them with the claim, and pass to XLM-RoBERTa. The whole pipeline takes under a second.
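The inference steps can be sketched as a small function that takes the three stages as pluggable callables. This is an outline under assumptions, not the deployed code: `verify_claim`, the stub functions, and the toy passages are hypothetical, and the stub classifier returns a placeholder instead of a real verdict. In the real system these roles would be filled by the BGE encoder, the FAISS index, and the fine-tuned XLM-RoBERTa model.

```python
from typing import Callable, List, Sequence, Tuple

Verdict = Tuple[str, float]  # (label, confidence)


def verify_claim(
    claim: str,
    passages: Sequence[str],
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], List[int]],
    classify: Callable[[str], Verdict],
    k: int = 3,
) -> Tuple[str, Verdict]:
    """Embed the claim, retrieve k passages, build the input, classify it."""
    query_vec = embed(claim)                  # 1. embed the claim
    hit_ids = search(query_vec, k)            # 2. indices of the top-k passages
    evidence = " ".join(passages[i] for i in hit_ids)
    model_input = f"{claim} </s> {evidence}"  # 3. claim + retrieved evidence
    return model_input, classify(model_input)  # 4. NLI verdict


# --- Stubs for demonstration only (all hypothetical) ---
toy_passages = [
    "Earth is the third planet from the Sun.",
    "Mars is the fourth planet from the Sun.",
    "The Moon orbits Earth.",
    "Paris is the capital of France.",
]


def stub_embed(text: str) -> List[float]:
    return [float(len(text))]  # placeholder vector, not a real embedding


def stub_search(query_vec: Sequence[float], k: int) -> List[int]:
    return list(range(k))  # pretends the first k passages are the best hits


def stub_classify(model_input: str) -> Verdict:
    return ("UNKNOWN", 0.0)  # placeholder, not a model prediction


model_input, verdict = verify_claim(
    "Earth is the third planet from the Sun.",
    toy_passages, stub_embed, stub_search, stub_classify,
)
print(model_input)
```

Passing the stages in as callables keeps the retriever and classifier swappable, which makes it easy to test the glue code without loading either model.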

The combined system, a retrieval-augmented XLM-RoBERTa, handled real-world claims correctly where the v1 model failed. Claims about historical facts, scientific facts, and geography all returned sensible verdicts with high confidence.

XLM-RoBERTa is pretrained on CommonCrawl data across 100 languages. This means the fine-tuned model inherits multilingual capability without any additional training. You can submit claims in Hindi, Spanish, Tamil, Arabic, or Chinese and the model understands them. The retrieved evidence is always English (since FEVER is English), but XLM-RoBERTa handles cross-lingual NLI reasonably well — the claim and evidence don't need to be in the same language.

The architecture did not change. The dataset did not change. The training duration did not change. What changed was understanding what the task actually requires and formatting the input accordingly. A 24% relative F1 improvement (0.655 to 0.813, or 0.158 absolute) from fixing the input format is a good reminder to read the dataset paper before training.

Model: <https://huggingface.co/ashg2099/xlm-roberta-factchecker>
