I Built an Open-Source Multi-Agent Fact-Checker — Here's How It Works

A developer built Sift (Source Inspection & Fact-checking Tool), an open-source multi-agent AI pipeline that extracts factual claims from text, retrieves grounded evidence, and returns auditable verdicts of TRUE, FALSE, or UNCERTAIN with cited sources. The system uses five specialized agents orchestrated with LangGraph, including HyDE retrieval across indexed Guardian and Wikipedia chunks plus live web search, to overcome LLM hallucination and false confidence issues. Sift also includes an adversarial review agent and a correction agent that surfaces accurate information for false or uncertain claims.

We have a misinformation problem. But more specifically, we have a speed problem.

A journalist spots a suspicious claim. They search for sources. Cross-reference databases. Call experts. Write a verdict. Get it edited. Publish, maybe 6 hours later. Maybe 3 days later. Meanwhile, the original claim has been screenshotted, reposted, quoted in newsletters, and cited in arguments across five platforms.

I wanted to build something that closed that gap. Not a chatbot that guesses. A proper pipeline, one that retrieves real evidence, reasons from it, and tells you why it reached a verdict. That's what Sift is.

Sift (Source Inspection & Fact-checking Tool) is an open-source multi-agent AI pipeline that takes any text, extracts every factual claim, retrieves grounded evidence, and returns auditable verdicts (TRUE, FALSE, or UNCERTAIN) with cited sources and full reasoning chains. Paste a news article. A politician's speech. A viral statistic. A WhatsApp forward. Sift breaks it into individual claims and fact-checks each one independently.

The naive approach is to ask an LLM: "Is this claim true?" The problem: LLMs hallucinate. They have knowledge cutoffs. They're confidently wrong in ways that are hard to detect. And critically, they don't show their work.

A single LLM call can't reliably handle the full pipeline of extracting claims, retrieving evidence, reasoning to a verdict, reviewing that verdict, and surfacing a correction. Each of these is a distinct task that benefits from its own prompt and its own tools, and has its own failure modes. That's why I built five separate agents, orchestrated with LangGraph.

The first agent extracts claims. A single paragraph can contain 4-5 distinct factual claims. Generic LLMs miss them or conflate them. This agent uses LLaMA 3.3 70B via Groq with Pydantic structured output to extract every distinct verifiable claim from the input text. The output is a typed list of claims: exact text, no paraphrasing, no hallucination.

The second agent retrieves evidence. LLMs hallucinate citations. You need real, retrievable, dated evidence.
This agent runs HyDE retrieval across 4,270 indexed Guardian + Wikipedia chunks stored in pgvector, then hits Tavily live web search for recent data.

Why HyDE instead of standard RAG? Standard RAG embeds the raw claim and searches for similar text. A short factual claim like "The Fed raised rates in March 2024" has a weak semantic signal on its own. HyDE (Hypothetical Document Embeddings) generates a hypothetical document that would contain the answer, something like a news article excerpt, then embeds that. The result is a richer semantic signal and significantly better retrieval recall on short factual claims.

The third agent, the synthesis agent, reasons strictly from retrieved evidence. It returns TRUE / FALSE / UNCERTAIN with a calibrated confidence score. Critically, if evidence is thin or conflicting, it returns UNCERTAIN instead of confabulating certainty. This was one of the hardest things to get right. LLMs naturally trend toward false confidence. I had to explicitly prompt for epistemic humility and add Pydantic validators to catch zero-confidence outputs.

Synthesis agents tend toward overconfidence when evidence partially supports a claim, so the fourth agent is an adversarial check. It independently reviews every verdict. It flags unsupported reasoning, catches cases where 1.1°C vs 1.19°C is a rounding difference, not a false claim, and adjusts confidence downward when warranted. This is the step most fact-checking systems skip, and it's the one that matters most for borderline claims.

Knowing something is false isn't enough. Users need to know what IS true. The fifth agent, the correction agent, fires only on FALSE or UNCERTAIN verdicts. It runs a targeted live search to find the correct information and surfaces it with a cited source. Because it is conditional, it doesn't waste tokens on TRUE verdicts.

The pipeline isn't linear for every claim. Some claims have no evidence: they skip synthesis and go straight to the critic. Some need multiple retrieval attempts. Some claims loop.
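As a rough illustration of the branching just described, here is a minimal plain-Python sketch of the routing decisions. It is not Sift's actual code: the `ClaimState` fields, the node names, and the `MAX_RETRIEVAL_ATTEMPTS` cap are assumptions made up for this example. Only the behavior (no evidence skips synthesis, retrieval may retry, corrections run only on FALSE or UNCERTAIN) comes from the description above.

```python
from typing import List, TypedDict


class ClaimState(TypedDict):
    """Hypothetical per-claim state shared by the agents."""
    claim: str
    evidence: List[str]
    retrieval_attempts: int
    verdict: str  # "", "TRUE", "FALSE", or "UNCERTAIN"


# Assumed retry cap; the real pipeline may use a different rule.
MAX_RETRIEVAL_ATTEMPTS = 2


def route_after_retrieval(state: ClaimState) -> str:
    """Pick the next node once the retrieval agent has run."""
    if state["evidence"]:
        # Evidence found: reason over it.
        return "synthesize"
    if state["retrieval_attempts"] < MAX_RETRIEVAL_ATTEMPTS:
        # Nothing found yet: loop back for another retrieval attempt.
        return "retrieve"
    # Still nothing: skip synthesis and go straight to the critic.
    return "critic"


def needs_correction(state: ClaimState) -> bool:
    """The correction agent fires only on FALSE or UNCERTAIN verdicts."""
    return state["verdict"] in ("FALSE", "UNCERTAIN")
```

In a LangGraph graph, a function shaped like `route_after_retrieval` is the kind of callable that could be passed to a conditional edge, with its return value mapped to the next node.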
LangGraph's state machine handles conditional branching, loops, and shared state across agents cleanly. The state is typed with TypedDict: every agent reads from and writes to the same state object.

FastAPI returns a task ID immediately. Celery + Redis runs the pipeline in the background. The client polls for results. Redis cache stores results for 7 days, so the same viral claim doesn't cost tokens twice. Cache hits at the API layer return in under 1 second, before Celery even runs. LangFuse traces every LLM call (prompt, output, latency, token count) so I can debug agent failures without guessing.

LLM: LLaMA 3.3 70B via Groq API
Embeddings: all-MiniLM-L6-v2 via HuggingFace Inference API
Orchestration: LangGraph state machine
RAG: HyDE + pgvector hybrid search
Vector DB: PostgreSQL + pgvector
API: FastAPI + Pydantic
Task Queue: Celery + Redis
Evidence Sources: Tavily live + Guardian API + Wikipedia
Observability: LangFuse + Prometheus + Grafana

The project is fully open source and Dockerized. Once the repository is cloned and the API keys are set, one command (docker compose up) runs the entire stack:

    git clone https://github.com/ashg2099/Sift.git
    cd Sift
    cp .env.example .env
    # Add your API keys (Groq, Tavily, HuggingFace; all free tiers)
    docker compose up

Open http://localhost:8000 and start verifying claims.

GitHub: https://github.com/ashg2099/Sift
LinkedIn: https://www.linkedin.com/in/ashwin-gururaj-93943816a/