I Got 96% Recall on LLM Hallucination Detection With No ML Model – Just 50 Lines of Python

A developer achieved 96% recall on LLM hallucination detection using just 50 lines of Python and four statistical signals, without any machine learning model, GPU, or external API. Tested on 10,000 examples from the HaluEval dataset, the approach uses length ratio, unknown word rate, question-answer overlap, and numeric inconsistency to produce a combined score with a tunable threshold. The soft flag setting yielded precision of 0.71 and recall of 0.96, while the strict setting achieved precision of 1.00 with recall of 0.38.

Most hallucination detection approaches tell you to train another model. I did not want to do that. I used four statistical signals, a combined score, and a tunable threshold. No fine-tuning. No GPU. No external API.

Tested on 10,000 real examples from the HaluEval dataset.

Soft flag result: precision 0.71, recall 0.96.
Strict flag result: precision 1.00, recall 0.38.

Here's how it works.

Approaches like SelfCheckGPT require multiple model samples and significant compute. That adds up fast when you are scoring thousands of answers a day. You also end up with a black box sitting on top of another black box. When something goes wrong, you have no idea which layer failed. I wanted something where every flag has a reason you can actually read.

Hallucinated answers behave differently from grounded ones in ways you can measure. You do not need a model for this. You just need to look at the right things. Four signals ended up doing most of the work.

Signal 1: Length Ratio

When a model does not know the answer, it pads. It generates more text to sound convincing instead of staying close to the facts.

```python
# df is a pandas DataFrame with 'knowledge', 'question', and 'answer' text columns.
# Lengths are whitespace-separated word counts.
df['answer_len'] = df['answer'].str.split().str.len()
df['knowledge_len'] = df['knowledge'].str.split().str.len()
df['length_ratio'] = df['answer_len'] / df['knowledge_len']
```

Average length ratio: hallucinated 0.22 vs not hallucinated 0.05

Signal 2: Unknown Word Rate

Grounded answers stay close to the source. Hallucinated answers introduce words that never appeared in the reference text.

```python
def unknown_word_rate(row):
    """Share of the answer's unique words that never appear in the knowledge text."""
    knowledge_words = set(str(row['knowledge']).lower().split())
    answer_words = set(str(row['answer']).lower().split())
    if not answer_words:
        return 0
    unknown = answer_words - knowledge_words
    return len(unknown) / len(answer_words)
```

Average unknown word rate: hallucinated 0.46 vs not hallucinated 0.30

Signal 3: Question-Answer Overlap

When a model fabricates, it often just echoes the question back. Instead of pulling from the source, it repeats the question words in the answer.
```python
def question_answer_overlap(row):
    """Share of the question's unique words that reappear in the answer."""
    question_words = set(str(row['question']).lower().split())
    answer_words = set(str(row['answer']).lower().split())
    if not question_words:
        return 0
    overlap = question_words & answer_words
    return len(overlap) / len(question_words)
```

Average overlap: hallucinated 0.39 vs not hallucinated 0.02

Signal 4: Numeric Inconsistency

Numbers are where models hallucinate most confidently. The general concept might be right but the date, quantity, or statistic is just wrong.

```python
import re

def numeric_inconsistency(row):
    """Share of the numbers in the answer that do not appear in the knowledge text."""
    knowledge_nums = set(re.findall(r'\b\d+\b', str(row['knowledge'])))
    answer_nums = set(re.findall(r'\b\d+\b', str(row['answer'])))
    if not answer_nums:
        return 0
    inconsistent = answer_nums - knowledge_nums
    return len(inconsistent) / len(answer_nums)
```

Average numeric inconsistency: hallucinated 0.087 vs not hallucinated 0.0001

Each signal contributes one point if it crosses its threshold. Every answer gets a score from 0 to 4.

```python
# Assumes each signal function was already applied row-wise, for example
# df['qa_overlap'] = df.apply(question_answer_overlap, axis=1)
df['score'] = (
    (df['length_ratio'] > 0.1).astype(int)
    + (df['unknown_word_rate'] > 0.4).astype(int)
    + (df['qa_overlap'] > 0.2).astype(int)
    + (df['numeric_inconsistency'] > 0.5).astype(int)
)
```

Not hallucinated answers cluster at 0 and 1. Hallucinated answers cluster at 2, 3, and 4.

Average score: hallucinated 2.18 vs not hallucinated 0.39

Soft flag (score >= 1): precision 0.71, recall 0.96
Use this when missing a hallucination costs more than a false alarm. Think financial services, healthcare, legal.

Strict flag (score >= 3): precision 1.00, recall 0.38
Use this when your review capacity is limited and you only want the obvious cases.

You can tune the threshold without retraining anything. That matters in production.
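To see how the soft and strict settings trade off on your own labelled data, the cut-off can be swept over every possible score. The following is a minimal sketch in plain Python, not the evaluation code behind the numbers above: `precision_recall_at` is a hypothetical helper, and the `scores` and `labels` lists are made-up toy values where a label of 1 means the answer is known to be hallucinated.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every answer with score >= threshold is flagged."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
    # Guard against empty denominators (nothing flagged, or no positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data for illustration only, not taken from HaluEval.
scores = [0, 1, 1, 2, 3, 4, 0, 3]
labels = [0, 0, 1, 1, 1, 1, 0, 1]

for threshold in range(1, 5):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"score >= {threshold}: precision {p:.2f}, recall {r:.2f}")
```

On real data, `scores` would come from the combined score column and `labels` from whatever ground truth you have. Raising the threshold flags fewer answers, so recall can only fall or stay flat.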
Plugging It In

```python
import re

def score_answer(knowledge, question, answer):
    """Return a 0-4 hallucination score from the four signals."""
    knowledge_words = set(str(knowledge).lower().split())
    answer_words = set(str(answer).lower().split())
    question_words = set(str(question).lower().split())
    knowledge_nums = set(re.findall(r'\b\d+\b', str(knowledge)))
    answer_nums = set(re.findall(r'\b\d+\b', str(answer)))

    # Lengths here count unique words; fall back to 1 to avoid dividing by zero.
    answer_len = len(answer_words)
    knowledge_len = len(knowledge_words) if knowledge_words else 1

    length_ratio = answer_len / knowledge_len
    unknown_word_rate = len(answer_words - knowledge_words) / len(answer_words) if answer_words else 0
    qa_overlap = len(question_words & answer_words) / len(question_words) if question_words else 0
    numeric_inconsistency = len(answer_nums - knowledge_nums) / len(answer_nums) if answer_nums else 0

    # One point per signal that crosses its threshold.
    score = (
        int(length_ratio > 0.1)
        + int(unknown_word_rate > 0.4)
        + int(qa_overlap > 0.2)
        + int(numeric_inconsistency > 0.5)
    )
    return score

# knowledge, question, and answer are the strings for one model response.
score = score_answer(knowledge, question, answer)

if score >= 3:
    action = "block"
elif score >= 1:
    action = "flag"
else:
    action = "pass"
```

This runs in milliseconds. No model to load, no GPU, no API call. Log the score and individual signal values for every answer. Over time that becomes your calibration dataset.

Hallucinated, score 3/4
Question: What U.S. highway gives access to Zilpo Road, and is also known as Midland Trail?
Answer: It's actually Zilpo Road that is known as Midland Trail, not US 60.
The model deflected and contradicted the source instead of answering. Caught.

Hallucinated, score 3/4
Question: Dua Lipa's debut album spawned "New Rules" ... in what year was it released?
Answer: The album was released in 2018.
The correct year is 2017. Confident, wrong, numeric flag caught it.

Not hallucinated, score 0/4
Question: The Dutch-Belgian series "House of Anubis" was based on ... first aired in what year?
Answer: 2006.
Correct, grounded, one word. Score zero.

This only works if you have source knowledge to compare against. It does not apply to open-ended generation without a retrievable source.
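One way to make that limit explicit in a pipeline is to refuse to trust a score for answers that arrive without enough reference text, rather than comparing them against an empty string. A minimal sketch, assuming a hypothetical `choose_action` helper, a made-up `min_source_words` cut-off of 5, and an extra "unscored" action that is not part of the routing shown earlier:

```python
def choose_action(knowledge, score, min_source_words=5):
    """Route an answer, trusting the score only when source text exists.

    min_source_words and the "unscored" action are illustrative assumptions.
    """
    source_words = str(knowledge or "").split()
    if len(source_words) < min_source_words:
        # Nothing meaningful to compare against, so the score says little.
        return "unscored"
    if score >= 3:
        return "block"
    if score >= 1:
        return "flag"
    return "pass"


print(choose_action("", 2))  # no source text available
print(choose_action("US 60 is also known as Midland Trail and gives access to Zilpo Road", 0))
```

Answers routed as "unscored" would need some other check, since the four signals all depend on the reference text.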
Best fit is RAG pipelines and QA systems.

It uses word-level matching, not semantic understanding. A hallucination that paraphrases the source closely might slip through.

The thresholds were tuned on HaluEval, so if you are working in a specialized domain, recalibrate on your own data first.

Precision of 0.71 on the soft flag means about 3 in 10 flags are false alarms. That is a tradeoff, not a flaw. Monitor it.

AI produces what it receives. If the outputs are not being validated, you will not know what you are getting. This framework is one way to start checking without adding a lot of infrastructure.

Full code on GitHub: github.com/ritikade2/llm-hallucination-detector