How I improved my fact-checker from F1 0.655 to 0.813: what actually changed


[ARTICLE · art-35265] src=dev.to ↗ pub=2026-06-21T01:31Z topic=natural-language-processing verified=true sentiment=↑ positive

A developer improved a multilingual fact-checker's F1 score from 0.655 to 0.813 by fixing a fundamental input error: the model was trained on claims alone instead of claim-evidence pairs. The XLM-RoBERTa model, fine-tuned on the FEVER dataset, was originally trained without evidence, which caused it to misfire on obvious facts. After the claims were concatenated with their gold evidence sentences, the F1 score rose by about 24% in relative terms. The developer also implemented a retrieval pipeline using BGE embeddings and FAISS to handle real-world claims that have no pre-labeled evidence.

I built a multilingual fact-checker using XLM-RoBERTa fine-tuned on the FEVER dataset. The first version hit F1 0.655. Not bad, but it kept misfiring on obvious real-world claims. The claim that Earth is the third planet from the Sun returned FALSE at 76% confidence. Something was fundamentally wrong.

A commenter identified the issue immediately: I was training the model on claims alone, with no evidence. FEVER is not a claim classification task. It's a Natural Language Inference task — the model is supposed to verify a claim against evidence, not guess from the claim text alone. I had been training it wrong from the start.

FEVER (Fact Extraction and VERification) contains 228,000+ Wikipedia claim-evidence pairs. Each claim is annotated as SUPPORTS, REFUTES, or NOT ENOUGH INFO based on retrieved Wikipedia sentences. The whole point is that the model sees both the claim and the evidence together and decides if the evidence supports or contradicts the claim.
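To make the shape of the task concrete, here is a minimal sketch of one FEVER-style example as a data structure. The `FeverExample` class, its field names, and the integer label ids are illustrative assumptions, not the dataset's actual schema or the code used for this model.

```python
from dataclasses import dataclass, field
from typing import List

# The three labels named above; the integer ids are an arbitrary assumption.
LABEL2ID = {"SUPPORTS": 0, "REFUTES": 1, "NOT ENOUGH INFO": 2}
ID2LABEL = {i: label for label, i in LABEL2ID.items()}


@dataclass
class FeverExample:
    """One claim with its evidence sentences and gold label (hypothetical schema)."""
    claim: str
    label: str
    evidence: List[str] = field(default_factory=list)

    def label_id(self) -> int:
        # Map the string label to the integer class id used for training.
        return LABEL2ID[self.label]


example = FeverExample(
    claim="Earth is the third planet from the Sun.",
    label="SUPPORTS",
    # Made-up evidence sentence for illustration, not a real FEVER record.
    evidence=["Earth is the third planet from the Sun."],
)
print(example.label_id(), ID2LABEL[example.label_id()])
```

The point of the structure is that the label only makes sense relative to the `evidence` field: the same claim paired with different evidence could carry a different label.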

Training on claims alone strips out all that signal. The model has nothing to reason about, so it just memorizes surface patterns in the claim text.

The fix was straightforward: concatenate the claim with its gold evidence sentences before passing them to the model. XLM-RoBERTa uses </s> as a sentence separator, so the format becomes [claim] </s> [evidence]. I fine-tuned for one epoch on the full FEVER training set, starting from the existing checkpoint. F1 jumped from 0.655 to 0.813.
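The input formatting can be sketched in a few lines. This is an assumption about how the concatenation might look, not the original training code; the helper name `format_pair` is hypothetical, and the only fact taken from XLM-RoBERTa itself is that its separator token is `</s>`.

```python
from typing import List

SEP = "</s>"  # XLM-RoBERTa's separator token


def format_pair(claim: str, evidence: List[str], sep: str = SEP) -> str:
    """Join a claim and its evidence sentences into one input string.

    Hypothetical helper: evidence sentences are joined with spaces and
    placed after the separator. With no evidence, only the claim is
    returned, which is the claim-only setup the first version used.
    """
    evidence_text = " ".join(s.strip() for s in evidence if s.strip())
    if not evidence_text:
        return claim.strip()
    return f"{claim.strip()} {sep} {evidence_text}"


print(format_pair(
    "Earth is the third planet from the Sun.",
    ["Earth is the third planet from the Sun.", "It is the only known planet to harbor life."],
))
```

If you tokenize with Hugging Face Transformers, passing the claim and the evidence as a text pair lets the tokenizer insert the special tokens itself (for XLM-RoBERTa, `<s> A </s></s> B </s>`), which avoids writing the separator by hand.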

The improvement wasn't from a better architecture, more data, or longer training. It was purely from feeding the model what it was designed to receive.

The retrained model was great on FEVER benchmarks but useless for real-world claims, because real-world claims don't come with pre-labeled Wikipedia evidence attached. You need to retrieve the evidence yourself.

For this, I used BGE (BAAI/bge-base-en-v1.5), a retrieval-optimized embedding model from the Beijing Academy of Artificial Intelligence (BAAI). The approach is called Reverse HyDE: instead of generating a hypothetical document for the query, you embed the claim itself as the retrieval query and find the most semantically similar evidence passages. The FEVER passages are indexed in a FAISS flat inner product index, which is equivalent to cosine similarity when the vectors are L2-normalized. At inference time: embed the claim, retrieve the top 3 most relevant FEVER passages, concatenate them with the claim, and pass the result to XLM-RoBERTa. The whole pipeline takes under a second.
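The retrieval step can be illustrated without any external libraries. The sketch below is a pure-Python stand-in for a flat inner-product index, not FAISS itself, and the 3-dimensional vectors are toy values standing in for real BGE embeddings. The function names `normalize` and `top_k` are hypothetical. It shows why inner product over L2-normalized vectors gives cosine similarity, and how the top 3 passages are picked.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query: Sequence[float],
          passages: Sequence[Sequence[float]],
          k: int = 3) -> List[Tuple[int, float]]:
    """Exhaustive inner-product search over unit vectors (cosine similarity)."""
    q = normalize(query)
    scored = []
    for idx, passage in enumerate(passages):
        p = normalize(passage)
        # Inner product of two unit vectors equals their cosine similarity.
        scored.append((idx, sum(a * b for a, b in zip(q, p))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings"; in the real pipeline these would come from the BGE model.
passage_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]
claim_vec = [2.0, 0.0, 0.0]

hits = top_k(claim_vec, passage_vecs, k=3)
print([idx for idx, _ in hits])
```

The indices in `hits` would then be used to look up the passage texts, which are concatenated with the claim and sent to the classifier. A flat index scores every passage, so it is exact but scales linearly with the number of passages.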

The combined system, retrieval-augmented XLM-RoBERTa, handled real-world claims correctly where the v1 model failed. Claims about history, science, and geography all returned sensible verdicts with high confidence.

XLM-RoBERTa is pretrained on CommonCrawl data across 100 languages. This means the fine-tuned model inherits multilingual capability without any additional training. You can submit claims in Hindi, Spanish, Tamil, Arabic, or Chinese and the model understands them. The retrieved evidence is always English (since FEVER is English), but XLM-RoBERTa handles cross-lingual NLI reasonably well — the claim and evidence don't need to be in the same language.

The architecture did not change. The dataset did not change. The training duration did not change. What changed was understanding what the task actually requires and formatting the input accordingly. A 24% relative F1 improvement (0.158 absolute) from fixing the input format is a good reminder to read the dataset paper before training.
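For reference, the 24% figure is a relative gain, not a gain in F1 points. A quick check using the two scores reported above (the helper name `f1_gain` is hypothetical):

```python
from typing import Tuple


def f1_gain(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain, relative gain) between two F1 scores."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative


absolute, relative = f1_gain(0.655, 0.813)
print(f"absolute: {absolute:.3f}, relative: {relative:.1%}")
# prints "absolute: 0.158, relative: 24.1%"
```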

huggingface.co/ashg2099/xlm-roberta-factchecker

Read original on dev.to → dev.to/ashg2099/how-i-improved-my-fact-checker-f…
