# I Got 96% Recall on LLM Hallucination Detection With No ML Model – Just 50 Lines of Python

> Source: <https://dev.to/ritika_2603/i-got-96-recall-on-llm-hallucination-detection-with-no-ml-model-just-50-lines-of-python-39d6>
> Published: 2026-05-25 20:15:46+00:00

Most hallucination detection approaches tell you to train another model. I did not want to do that. I used four statistical signals, a combined score, and a tunable threshold. No fine-tuning. No GPU. No external API. Tested on 10,000 real examples from the HaluEval dataset.

Soft flag result: precision 0.71, recall 0.96.

Strict flag result: precision 1.00, recall 0.38.

Here’s how it works.

Approaches like SelfCheckGPT require multiple model samples and significant compute. That adds up fast when you are scoring thousands of answers a day. You also end up with a black box sitting on top of another black box. When something goes wrong, you have no idea which layer failed.

I wanted something where every flag has a reason you can actually read.

Hallucination answers behave differently from grounded ones in ways you can measure. You do not need a model for this. You just need to look at the right things.

Four signals ended up doing most of the work.

**Signal 1: Length Ratio**

When a model does not know the answer, it pads. It generates more text to sound convincing instead of staying close to the facts.

``` python
# Word counts come from whitespace tokenization of each column
df['answer_len'] = df['answer'].str.split().str.len()
df['knowledge_len'] = df['knowledge'].str.split().str.len()
df['length_ratio'] = df['answer_len'] / df['knowledge_len']
```

Average length ratio: hallucinated 0.22 vs not hallucinated 0.05

**Signal 2: Unknown Word Rate**

Grounded answers stay close to the source. Hallucinated answers introduce words that never appeared in the reference text.

``` python
def unknown_word_rate(row):
    # Lowercased, whitespace-split vocabularies of the source and the answer
    knowledge_words = set(str(row['knowledge']).lower().split())
    answer_words = set(str(row['answer']).lower().split())
    if not answer_words:
        return 0
    # Share of answer words that never appear in the source
    unknown = answer_words - knowledge_words
    return len(unknown) / len(answer_words)
```

Average unknown word rate: hallucinated 0.46 vs not hallucinated 0.30

**Signal 3: Question-Answer Overlap**

When a model fabricates, it often just echoes the question back. Instead of pulling from the source, it repeats the question words in the answer.

``` python
def question_answer_overlap(row):
    question_words = set(str(row['question']).lower().split())
    answer_words = set(str(row['answer']).lower().split())
    if not question_words:
        return 0
    # Share of question words that are repeated in the answer
    overlap = question_words & answer_words
    return len(overlap) / len(question_words)
```

Average overlap: hallucinated 0.39 vs not hallucinated 0.02

**Signal 4: Numeric Inconsistency**

Numbers are where models hallucinate most confidently. The general concept might be right but the date, quantity, or statistic is just wrong.

``` python
import re

def numeric_inconsistency(row):
    # Standalone digit runs in the source and in the answer
    knowledge_nums = set(re.findall(r'\b\d+\b', str(row['knowledge'])))
    answer_nums = set(re.findall(r'\b\d+\b', str(row['answer'])))
    if not answer_nums:
        return 0
    # Share of answer numbers that the source never mentions
    inconsistent = answer_nums - knowledge_nums
    return len(inconsistent) / len(answer_nums)
```

Average numeric inconsistency: hallucinated 0.087 vs not hallucinated 0.0001
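To make the numeric signal concrete, here is a minimal sketch that runs the same regex on plain strings instead of DataFrame rows. The `knowledge` sentence and the helper name `numeric_tokens` are made up for illustration; they are not rows or code from HaluEval or the repo. The last line also shows a property of this regex worth knowing: separators inside a number split it into several tokens.

```python
import re

def numeric_tokens(text):
    """Standalone digit runs, using the same regex as the signal above."""
    return set(re.findall(r'\b\d+\b', str(text)))

def numeric_inconsistency(knowledge, answer):
    """Share of numbers in the answer that never appear in the knowledge."""
    answer_nums = numeric_tokens(answer)
    if not answer_nums:
        return 0.0
    return len(answer_nums - numeric_tokens(knowledge)) / len(answer_nums)

# Hypothetical reference text, written for this example only
knowledge = "The debut album was released on 2 June 2017."

wrong_year = numeric_inconsistency(knowledge, "The album was released in 2018.")
right_year = numeric_inconsistency(knowledge, "The album was released in 2017.")
print(wrong_year, right_year)

# "1,200" and "3.5" are each split at the separator
print(sorted(numeric_tokens("1,200 units at 3.5%")))
```

If your source text formats numbers with commas or decimals, it may be worth normalizing them before matching; that step is not part of the original approach.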

Each signal contributes one point if it crosses its threshold. Every answer gets a score from 0 to 4.

``` python
# One point per signal that crosses its threshold
df['score'] = (
    (df['length_ratio'] > 0.1).astype(int) +
    (df['unknown_word_rate'] > 0.4).astype(int) +
    (df['qa_overlap'] > 0.2).astype(int) +
    (df['numeric_inconsistency'] > 0.5).astype(int)
)
```

Non-hallucinated answers cluster at 0 and 1. Hallucinated answers cluster at 2, 3, and 4.

Average score: hallucinated 2.18 vs not hallucinated 0.39

Soft flag (score >= 1): precision 0.71, recall 0.96. Use this when missing a hallucination costs more than a false alarm. Think financial services, healthcare, legal.

Strict flag (score >= 3): precision 1.00, recall 0.38. Use this when your review capacity is limited and you only want the obvious cases.

You can tune the threshold without retraining anything. That matters in production.
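If you want to see how the tradeoff moves as you change the cutoff, a sweep like the one below is one way to do it. This is a sketch under assumptions: `precision_recall` is a hypothetical helper, and the `scores` and `labels` lists are toy values, not HaluEval results. In practice you would feed in your own logged scores and human-verified labels.

```python
def precision_recall(scores, labels, cutoff):
    """Precision and recall when every answer with score >= cutoff is flagged.

    labels[i] is True when answer i is a known hallucination.
    """
    flagged = [s >= cutoff for s in scores]
    tp = sum(f and y for f, y in zip(flagged, labels))
    fp = sum(f and not y for f, y in zip(flagged, labels))
    fn = sum((not f) and y for f, y in zip(flagged, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy data for illustration only
scores = [0, 1, 0, 2, 3, 4, 1, 2]
labels = [False, False, False, True, True, True, True, False]

for cutoff in range(1, 5):
    p, r = precision_recall(scores, labels, cutoff)
    print(f"score >= {cutoff}: precision {p:.2f}, recall {r:.2f}")
```

Raising the cutoff trades recall for precision, which is the same pattern as the soft and strict flags above.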

**Plugging It In**

``` python
import re

def score_answer(knowledge, question, answer):
    """Return a 0-4 score: one point per signal that crosses its threshold."""
    knowledge_tokens = str(knowledge).lower().split()
    answer_tokens = str(answer).lower().split()
    knowledge_words = set(knowledge_tokens)
    answer_words = set(answer_tokens)
    question_words = set(str(question).lower().split())
    knowledge_nums = set(re.findall(r'\b\d+\b', str(knowledge)))
    answer_nums = set(re.findall(r'\b\d+\b', str(answer)))

    # Count all tokens (not unique words) so the ratio matches the
    # DataFrame version of Signal 1; guard against empty knowledge.
    answer_len = len(answer_tokens)
    knowledge_len = len(knowledge_tokens) if knowledge_tokens else 1

    length_ratio = answer_len / knowledge_len
    unknown_word_rate = len(answer_words - knowledge_words) / len(answer_words) if answer_words else 0
    qa_overlap = len(question_words & answer_words) / len(question_words) if question_words else 0
    numeric_inconsistency = len(answer_nums - knowledge_nums) / len(answer_nums) if answer_nums else 0

    score = (
        int(length_ratio > 0.1) +
        int(unknown_word_rate > 0.4) +
        int(qa_overlap > 0.2) +
        int(numeric_inconsistency > 0.5)
    )
    return score

# knowledge, question, and answer come from your own pipeline
score = score_answer(knowledge, question, answer)
if score >= 3:
    action = "block"
elif score >= 1:
    action = "flag"
else:
    action = "pass"
```

This runs in milliseconds. No model to load, no GPU, no API call. Log the score and individual signal values for every answer. Over time that becomes your calibration dataset.
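One way to capture those values is to build a small record per answer and write it out as a JSON line. The sketch below is an assumption about how that could look, not code from the repo: `THRESHOLDS`, `signal_values`, and `build_log_record` are hypothetical names, the example strings are made up, and the length ratio uses token counts as in the DataFrame version of Signal 1.

```python
import json
import re

# Thresholds copied from the scoring step in this post
THRESHOLDS = {
    "length_ratio": 0.1,
    "unknown_word_rate": 0.4,
    "qa_overlap": 0.2,
    "numeric_inconsistency": 0.5,
}

def signal_values(knowledge, question, answer):
    """Compute the four raw signal values for one answer."""
    k_tokens = str(knowledge).lower().split()
    a_tokens = str(answer).lower().split()
    q_words = set(str(question).lower().split())
    k_words, a_words = set(k_tokens), set(a_tokens)
    k_nums = set(re.findall(r'\b\d+\b', str(knowledge)))
    a_nums = set(re.findall(r'\b\d+\b', str(answer)))
    return {
        "length_ratio": len(a_tokens) / max(len(k_tokens), 1),
        "unknown_word_rate": len(a_words - k_words) / len(a_words) if a_words else 0.0,
        "qa_overlap": len(q_words & a_words) / len(q_words) if q_words else 0.0,
        "numeric_inconsistency": len(a_nums - k_nums) / len(a_nums) if a_nums else 0.0,
    }

def build_log_record(knowledge, question, answer):
    """Bundle score, action, fired signals, and raw values for logging."""
    signals = signal_values(knowledge, question, answer)
    fired = [name for name, value in signals.items() if value > THRESHOLDS[name]]
    score = len(fired)
    action = "block" if score >= 3 else "flag" if score >= 1 else "pass"
    return {"score": score, "action": action, "fired": fired, "signals": signals}

# Hypothetical inputs, written for this example only
record = build_log_record(
    knowledge="The debut album was released on 2 June 2017.",
    question="In what year was the album released?",
    answer="The album was released in 2018.",
)
print(json.dumps(record))  # one JSON line per answer, ready for a log file
```

Keeping the list of fired signals next to the raw values means each flag carries a readable reason, and the raw values let you retune thresholds later without recomputing anything.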

**Hallucinated, score 3/4**

Question: What U.S. highway gives access to Zilpo Road, and is also known as Midland Trail? Answer: It's actually Zilpo Road that is known as Midland Trail, not US 60.

The model deflected and contradicted the source instead of answering. Caught.

**Hallucinated, score 3/4**

Question: Dua Lipa's debut album spawned "New Rules" — in what year was it released? Answer: The album was released in 2018.

The correct year is 2017. Confident, wrong, numeric flag caught it.

**Not hallucinated, score 0/4**

Question: The Dutch-Belgian series "House of Anubis" was based on — first aired in what year? Answer: 2006.

Correct, grounded, one word. Score zero.

This only works if you have source knowledge to compare against. It does not apply to open-ended generation without a retrievable source. Best fit is RAG pipelines and QA systems.

It uses word-level matching, not semantic understanding. A hallucination that paraphrases the source closely might slip through. The thresholds were tuned on HaluEval so if you are working in a specialized domain, recalibrate on your own data first.
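A small sketch of what word-level matching can and cannot see. The strings are invented for illustration. The first answer is wrong but built entirely from source words, so the unknown word signal stays at zero. The second is the same answer without the trailing period, which is enough to change the value because whitespace tokenization keeps punctuation attached to words.

```python
def unknown_word_rate(knowledge, answer):
    """Share of answer words (lowercased, whitespace-split) absent from the knowledge."""
    knowledge_words = set(str(knowledge).lower().split())
    answer_words = set(str(answer).lower().split())
    if not answer_words:
        return 0.0
    return len(answer_words - knowledge_words) / len(answer_words)

# Hypothetical reference text, written for this example only
knowledge = "Paris is the capital of France. Berlin is the capital of Germany."

# Factually wrong, yet every token also occurs in the source
recombined = unknown_word_rate(knowledge, "Berlin is the capital of France.")

# Same words without the final period: "france" no longer matches "france."
no_period = unknown_word_rate(knowledge, "Berlin is the capital of France")

print(recombined, no_period)
```

The other signals may still fire on an answer like this, so treat the example as a picture of one signal's blind spot rather than of the combined score.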

Precision of 0.71 on the soft flag means about 3 in 10 flags are false alarms. That is a tradeoff, not a flaw. Monitor it.
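For planning review capacity, the arithmetic behind those numbers can be sketched as follows. The counts of 1,000 flags and 1,000 hallucinations are hypothetical volumes chosen for round numbers; only the precision and recall figures come from the soft flag results above.

```python
def expected_false_alarms(num_flags, precision):
    """Flags that turn out to be fine: num_flags * (1 - precision)."""
    return num_flags * (1 - precision)

def expected_misses(num_hallucinations, recall):
    """Hallucinations that are never flagged: num_hallucinations * (1 - recall)."""
    return num_hallucinations * (1 - recall)

# Hypothetical volumes with the soft flag's precision 0.71 and recall 0.96
false_alarms = round(expected_false_alarms(1000, 0.71))
misses = round(expected_misses(1000, 0.96))
print(false_alarms, misses)
```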

AI produces what it receives. If the outputs are not being validated, you will not know what you are getting. This framework is one way to start checking without adding a lot of infrastructure.

Full code on GitHub: github.com/ritikade2/llm-hallucination-detector