[ARTICLE · art-35265] src=dev.to ↗ pub= topic=natural-language-processing verified=true sentiment=↑ positive

How I improved my fact-checker from F1 0.655 to 0.813: what actually changed

A developer improved a multilingual fact-checker's F1 score from 0.655 to 0.813 by fixing a fundamental input error: the model was trained on claims alone instead of claim-evidence pairs. The XLM-RoBERTa model fine-tuned on the FEVER dataset was originally trained without evidence, causing it to misfire on obvious facts. After concatenating claims with gold evidence sentences, the F1 score rose by about 24% in relative terms (15.8 points absolute). The developer also implemented a retrieval pipeline using BGE embeddings and FAISS to handle real-world claims without pre-labeled evidence.

read 3 min · views 1 · published Jun 21, 2026

I built a multilingual fact-checker using XLM-RoBERTa fine-tuned on the FEVER dataset. The first version hit F1 0.655. Not bad, but it kept misfiring on obvious real-world claims. Earth being the third planet from the Sun returned FALSE at 76% confidence. Something was fundamentally wrong.

A commenter identified the issue immediately: I was training the model on claims alone, with no evidence. FEVER is not a claim classification task. It's a Natural Language Inference task — the model is supposed to verify a claim against evidence, not guess from the claim text alone. I had been training it wrong from the start.

FEVER (Fact Extraction and VERification) contains 228,000+ Wikipedia claim-evidence pairs. Each claim is annotated as SUPPORTS, REFUTES, or NOT ENOUGH INFO based on retrieved Wikipedia sentences. The whole point is that the model sees both the claim and the evidence together and decides if the evidence supports or contradicts the claim.
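As a rough illustration of what one training example carries, here is a minimal sketch of a claim-evidence record and a label mapping. The class name, field names, label order, and the example evidence sentence are assumptions made for illustration; they are not the dataset's official schema or the label ids used by the model in this article.

```python
from dataclasses import dataclass
from typing import List

# The three FEVER verdicts. The integer order is an assumption for this sketch.
LABELS = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]
LABEL_TO_ID = {name: i for i, name in enumerate(LABELS)}


@dataclass
class FeverExample:
    """Simplified, hypothetical view of one claim-evidence pair."""
    claim: str
    evidence: List[str]  # evidence sentences as plain text; may be empty
    label: str           # one of LABELS

    def label_id(self) -> int:
        # Map the string verdict to the integer a classifier head would predict.
        return LABEL_TO_ID[self.label]


# Illustrative record, not copied from the dataset.
ex = FeverExample(
    claim="Earth is the third planet from the Sun.",
    evidence=["Earth is the third planet from the Sun."],
    label="SUPPORTS",
)
print(ex.label_id())
```

The point of the structure is that the label only makes sense relative to the evidence field: the same claim paired with different evidence could deserve a different verdict.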

Training on claims alone strips out all that signal. The model has nothing to reason about; it can only memorize surface patterns in the claim text.

The fix was straightforward: concatenate the claim with its gold evidence sentences before passing them to the model. XLM-RoBERTa uses </s> as a sentence separator, so the format becomes [claim] </s> [evidence]. I fine-tuned for one epoch on the full FEVER training set, starting from the existing checkpoint. F1 jumped from 0.655 to 0.813.
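The concatenation step can be sketched in a few lines of plain Python. This is a minimal sketch, not the author's training code: the helper name build_input and the example evidence sentence are hypothetical, and it assumes multiple evidence sentences are simply joined with spaces.

```python
SEP = "</s>"  # XLM-RoBERTa's separator token


def build_input(claim, evidence_sentences, sep=SEP):
    """Join a claim and its evidence sentences into one input string.

    Hypothetical helper: evidence sentences are joined with single spaces,
    which is an assumption rather than the article's confirmed preprocessing.
    """
    evidence = " ".join(s.strip() for s in evidence_sentences if s.strip())
    return f"{claim.strip()} {sep} {evidence}"


example = build_input(
    "Earth is the third planet from the Sun.",
    ["Earth is the third planet from the Sun and the only known planet to harbor life."],
)
print(example)
```

In practice, a Hugging Face tokenizer called with the claim and the evidence as a text pair, as in tokenizer(claim, evidence), inserts the separator tokens itself, so building the string by hand is mainly useful for inspecting what the model sees.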

The improvement wasn't from a better architecture, more data, or longer training. It was purely from feeding the model what it was designed to receive.

The retrained model was great on FEVER benchmarks but useless for real-world claims, because real-world claims don't come with pre-labeled Wikipedia evidence attached. You need to retrieve the evidence yourself.

For this, I used BGE (BAAI/bge-base-en-v1.5), a retrieval-optimized embedding model from the Beijing Academy of Artificial Intelligence (BAAI). I call the approach Reverse HyDE: instead of generating a hypothetical document for the query, you embed the claim itself as a retrieval query and find the most semantically similar evidence passages. The FEVER passages are indexed in a FAISS flat inner product index, which gives cosine similarity when the vectors are normalized. At inference time: embed the claim, retrieve the top 3 most relevant FEVER passages, concatenate them with the claim, and pass the result to XLM-RoBERTa. The whole pipeline takes under a second.
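The retrieval step can be sketched without any external libraries. This sketch swaps in a toy bag-of-words embedding (toy_embed, a hypothetical stand-in for the BGE model) and an exhaustive search in pure Python (a stand-in for the FAISS flat inner product index). It only shows the mechanics: normalize the vectors, score by inner product, keep the top 3. The passages are illustrative, not taken from FEVER.

```python
import math
from collections import Counter


def toy_embed(text, vocab):
    """Hypothetical stand-in for a real embedding model: word counts over vocab."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]


def normalize(vec):
    """Scale to unit length so that inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def top_k(query_vec, passage_vecs, k=3):
    """Exhaustive inner-product search: score every passage, keep the k best."""
    scores = [
        (sum(q * p for q, p in zip(query_vec, vec)), i)
        for i, vec in enumerate(passage_vecs)
    ]
    scores.sort(key=lambda s: s[0], reverse=True)
    return scores[:k]


# Illustrative passages.
passages = [
    "earth is the third planet from the sun",
    "mars is the fourth planet from the sun",
    "paris is the capital of france",
    "the moon orbits earth",
]
claim = "earth is the third planet from the sun"

vocab = sorted({w for text in passages + [claim] for w in text.split()})
index = [normalize(toy_embed(p, vocab)) for p in passages]
query = normalize(toy_embed(claim, vocab))

hits = top_k(query, index, k=3)
for score, i in hits:
    print(f"{score:.3f}  {passages[i]}")
```

The retrieved passages would then be concatenated with the claim and passed to the classifier. A real embedding model matters because it matches on meaning rather than on shared words, which this toy version cannot do.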

The combined system, retrieval-augmented XLM-RoBERTa, handled real-world claims correctly where the v1 model failed. Claims about historical facts, scientific facts, and geography all returned sensible verdicts with high confidence.

XLM-RoBERTa is pretrained on CommonCrawl data across 100 languages. This means the fine-tuned model inherits multilingual capability without any additional training. You can submit claims in Hindi, Spanish, Tamil, Arabic, or Chinese and the model understands them. The retrieved evidence is always English (since FEVER is English), but XLM-RoBERTa handles cross-lingual NLI reasonably well — the claim and evidence don't need to be in the same language.

The architecture did not change. The dataset did not change. The training duration did not change. What changed was understanding what the task actually requires and formatting the input accordingly. A 24% relative F1 improvement (0.655 to 0.813) from fixing the input format is a good reminder to read the dataset paper before training.
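For reference, the 24% figure is a relative gain over the original score. A quick sketch of the arithmetic, using only the two F1 values reported above (the function name f1_gain is made up for this example):

```python
def f1_gain(before, after):
    """Return the (absolute, relative) improvement between two F1 scores."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative


absolute, relative = f1_gain(0.655, 0.813)
print(f"absolute: {absolute * 100:.1f} points, relative: {relative * 100:.1f}%")
# prints: absolute: 15.8 points, relative: 24.1%
```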

huggingface.co/ashg2099/xlm-roberta-factchecker
