{"slug": "i-got-96-recall-on-llm-hallucination-detection-with-no-ml-model-just-50-lines-of", "title": "I Got 96% Recall on LLM Hallucination Detection With No ML Model – Just 50 Lines of Python", "summary": "A developer achieved 96% recall on LLM hallucination detection using just 50 lines of Python and four statistical signals, without any machine learning model, GPU, or external API. Tested on 10,000 examples from the HaluEval dataset, the approach uses length ratio, unknown word rate, question-answer overlap, and numeric inconsistency to produce a combined score with a tunable threshold. The soft flag setting yielded precision of 0.71 and recall of 0.96, while the strict setting achieved precision of 1.00 with recall of 0.38.", "body_md": "Most hallucination detection approaches tell you to train another model. I did not want to do that. I used four statistical signals, a combined score, and a tunable threshold. No fine-tuning. No GPU. No external API. Tested on 10,000 real examples from the HaluEval dataset.\n\nSoft flag result: precision 0.71, recall 0.96.\n\nStrict flag result: precision 1.00, recall 0.38.\n\nHere’s how it works.\n\nApproaches like SelfCheckGPT require multiple model samples and significant compute. That adds up fast when you are scoring thousands of answers a day. You also end up with a black box sitting on top of another black box. When something goes wrong, you have no idea which layer failed.\n\nI wanted something where every flag has a reason you can actually read.\n\nHallucinated answers behave differently from grounded ones in ways you can measure. You do not need a model for this. You just need to look at the right things.\n\nFour signals ended up doing most of the work.\n\n**Signal 1: Length Ratio**\n\nWhen a model does not know the answer, it pads. 
It generates more text to sound convincing instead of staying close to the facts.\n\n``` python\n# Word counts of the answer and of the reference knowledge\ndf['answer_len'] = df['answer'].str.split().str.len()\ndf['knowledge_len'] = df['knowledge'].str.split().str.len()\ndf['length_ratio'] = df['answer_len'] / df['knowledge_len']\n```\n\nAverage length ratio: hallucinated 0.22 vs not hallucinated 0.05\n\n**Signal 2: Unknown Word Rate**\n\nGrounded answers stay close to the source. Hallucinated answers introduce words that never appeared in the reference text.\n\n``` python\ndef unknown_word_rate(row):\n    # Share of distinct answer words that never appear in the knowledge text\n    knowledge_words = set(str(row['knowledge']).lower().split())\n    answer_words = set(str(row['answer']).lower().split())\n    if not answer_words:\n        return 0\n    unknown = answer_words - knowledge_words\n    return len(unknown) / len(answer_words)\n```\n\nAverage unknown word rate: hallucinated 0.46 vs not hallucinated 0.30\n\n**Signal 3: Question-Answer Overlap**\n\nWhen a model fabricates, it often just echoes the question back. Instead of pulling from the source, it repeats the question words in the answer.\n\n``` python\ndef question_answer_overlap(row):\n    # Share of distinct question words that are repeated in the answer\n    question_words = set(str(row['question']).lower().split())\n    answer_words = set(str(row['answer']).lower().split())\n    if not question_words:\n        return 0\n    overlap = question_words & answer_words\n    return len(overlap) / len(question_words)\n```\n\nAverage overlap: hallucinated 0.39 vs not hallucinated 0.02\n\n**Signal 4: Numeric Inconsistency**\n\nNumbers are where models hallucinate most confidently. 
The general concept might be right, but the date, quantity, or statistic is just wrong.\n\n``` python\nimport re\n\ndef numeric_inconsistency(row):\n    # Share of distinct numbers in the answer that do not appear in the knowledge text\n    knowledge_nums = set(re.findall(r'\\b\\d+\\b', str(row['knowledge'])))\n    answer_nums = set(re.findall(r'\\b\\d+\\b', str(row['answer'])))\n    if not answer_nums:\n        return 0\n    inconsistent = answer_nums - knowledge_nums\n    return len(inconsistent) / len(answer_nums)\n```\n\nAverage numeric inconsistency: hallucinated 0.087 vs not hallucinated 0.0001\n\nEach signal contributes one point if it crosses its threshold. Every answer gets a score from 0 to 4.\n\n``` python\n# One point per signal that crosses its threshold\ndf['score'] = (\n    (df['length_ratio'] > 0.1).astype(int) +\n    (df['unknown_word_rate'] > 0.4).astype(int) +\n    (df['qa_overlap'] > 0.2).astype(int) +\n    (df['numeric_inconsistency'] > 0.5).astype(int)\n)\n```\n\nNot hallucinated answers cluster at 0 and 1. Hallucinated answers cluster at 2, 3, and 4.\n\nAverage score: hallucinated 2.18 vs not hallucinated 0.39\n\nSoft flag (score >= 1): precision 0.71, recall 0.96. Use this when missing a hallucination costs more than a false alarm. Think financial services, healthcare, legal.\n\nStrict flag (score >= 3): precision 1.00, recall 0.38. Use this when your review capacity is limited and you only want the obvious cases.\n\nYou can tune the threshold without retraining anything. 
That matters in production.\n\n**Plugging It In**\n\n``` python\nimport re\n\ndef score_answer(knowledge, question, answer):\n    # Word sets and number sets for the source, question, and answer\n    knowledge_words = set(str(knowledge).lower().split())\n    answer_words = set(str(answer).lower().split())\n    question_words = set(str(question).lower().split())\n    knowledge_nums = set(re.findall(r'\\b\\d+\\b', str(knowledge)))\n    answer_nums = set(re.findall(r'\\b\\d+\\b', str(answer)))\n\n    # Token counts rather than unique words, to match the length_ratio signal\n    answer_len = len(str(answer).split())\n    knowledge_len = len(str(knowledge).split()) or 1\n\n    length_ratio = answer_len / knowledge_len\n    unknown_word_rate = len(answer_words - knowledge_words) / len(answer_words) if answer_words else 0\n    qa_overlap = len(question_words & answer_words) / len(question_words) if question_words else 0\n    numeric_inconsistency = len(answer_nums - knowledge_nums) / len(answer_nums) if answer_nums else 0\n\n    # One point per signal that crosses its threshold\n    score = (\n        int(length_ratio > 0.1) +\n        int(unknown_word_rate > 0.4) +\n        int(qa_overlap > 0.2) +\n        int(numeric_inconsistency > 0.5)\n    )\n    return score\n\n# knowledge, question, and answer are the strings for one QA pair\nscore = score_answer(knowledge, question, answer)\nif score >= 3:\n    action = \"block\"\nelif score >= 1:\n    action = \"flag\"\nelse:\n    action = \"pass\"\n```\n\nIt runs in milliseconds. No model to load, no GPU, no API call. Log the score and individual signal values for every answer. Over time that becomes your calibration dataset.\n\n**Hallucinated, score 3/4**\n\nQuestion: What U.S. highway gives access to Zilpo Road, and is also known as Midland Trail?\n\nAnswer: It's actually Zilpo Road that is known as Midland Trail, not US 60.\n\nThe model deflected and contradicted the source instead of answering. Caught.\n\n**Hallucinated, score 3/4**\n\nQuestion: Dua Lipa's debut album spawned \"New Rules\" — in what year was it released?\n\nAnswer: The album was released in 2018.\n\nThe correct year is 2017. 
Confident and wrong, and the numeric flag caught it.\n\n**Not hallucinated, score 0/4**\n\nQuestion: The Dutch-Belgian series \"House of Anubis\" was based on — first aired in what year?\n\nAnswer: 2006.\n\nCorrect, grounded, one word. Score zero.\n\nThis only works if you have source knowledge to compare against. It does not apply to open-ended generation without a retrievable source. The best fit is RAG pipelines and QA systems.\n\nIt uses word-level matching, not semantic understanding. A hallucination that paraphrases the source closely might slip through. The thresholds were tuned on HaluEval, so if you are working in a specialized domain, recalibrate on your own data first.\n\nPrecision of 0.71 on the soft flag means about 3 in 10 flags are false alarms. That is a tradeoff, not a flaw. Monitor it.\n\nAI produces what it receives. If the outputs are not being validated, you will not know what you are getting. This framework is one way to start checking without adding a lot of infrastructure.\n\nFull code on GitHub: github.com/ritikade2/llm-hallucination-detector", "url": "https://wpnews.pro/news/i-got-96-recall-on-llm-hallucination-detection-with-no-ml-model-just-50-lines-of", "canonical_source": "https://dev.to/ritika_2603/i-got-96-recall-on-llm-hallucination-detection-with-no-ml-model-just-50-lines-of-python-39d6", "published_at": "2026-05-25 20:15:46+00:00", "updated_at": "2026-05-25 21:03:29.203903+00:00", "lang": "en", "topics": ["large-language-models", "natural-language-processing", "ai-research", "ai-tools"], "entities": ["HaluEval", "SelfCheckGPT"], "alternates": {"html": "https://wpnews.pro/news/i-got-96-recall-on-llm-hallucination-detection-with-no-ml-model-just-50-lines-of", "markdown": "https://wpnews.pro/news/i-got-96-recall-on-llm-hallucination-detection-with-no-ml-model-just-50-lines-of.md", "text": "https://wpnews.pro/news/i-got-96-recall-on-llm-hallucination-detection-with-no-ml-model-just-50-lines-of.txt", "jsonld": 
"https://wpnews.pro/news/i-got-96-recall-on-llm-hallucination-detection-with-no-ml-model-just-50-lines-of.jsonld"}}