How I Stopped Fighting Hallucinations in LLM Data Extraction


Source: dev.to · Published: 2026-07-01T10:01Z

A developer building an LLM-based invoice data extraction system found that naive prompting led to frequent hallucinations and only 60-70% accuracy. By switching to a validated generation approach using Pydantic schemas and OpenAI's structured outputs with retry logic, they achieved more reliable extraction. The method separates the concerns of natural language understanding and structured output validation.

4 min read · Published Jul 1, 2026

We all know the feeling. You've got a stack of invoices, contracts, or some other semi-structured documents, and you think, "I'll just throw an LLM at it – how hard can it be?"

Hard. Very hard. At least, that was my experience last month.

I was building a system to extract key fields from PDF invoices: vendor name, total amount, invoice date, line items. Seemed straightforward. I'd used GPT-4 before, and it's great at understanding natural language. How wrong I was.

I wrote a simple system prompt:

Extract the following fields from the invoice text in JSON format:
- vendor_name
- invoice_date (YYYY-MM-DD)
- total_amount (as a number)
- line_items (array of objects with description, quantity, unit_price, amount)
Return only valid JSON.

Then I fed it the OCR output. It worked maybe 60% of the time. The rest? Hallucinations. Wrong field names like "vendor" instead of "vendor_name". Dates in various formats like "March 5th, 2024". Numbers with currency symbols attached. Sometimes it would add extra fields. Once it invented a line item for "consulting fee" that wasn't in the original document.
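These failure modes are easy to reproduce offline. What follows is a minimal sketch, using only the standard library, of a strict checker run against a made-up response of the kind described above. The function name find_problems and the sample payload are hypothetical illustrations, not code or data from the original system.

```python
import json
from datetime import datetime

EXPECTED_KEYS = {"vendor_name", "invoice_date", "total_amount", "line_items"}

def find_problems(raw: str) -> list[str]:
    """Return human-readable problems found in a raw JSON object response."""
    data = json.loads(raw)
    keys = set(data)
    problems = []
    for key in sorted(EXPECTED_KEYS - keys):
        problems.append(f"missing field: {key}")
    for key in sorted(keys - EXPECTED_KEYS):
        problems.append(f"unexpected field: {key}")
    date_value = data.get("invoice_date")
    if date_value is not None:
        try:
            # Loose format check; strptime also accepts unpadded values like 2024-3-5.
            datetime.strptime(date_value, "%Y-%m-%d")
        except (TypeError, ValueError):
            problems.append(f"invoice_date not YYYY-MM-DD: {date_value!r}")
    total = data.get("total_amount")
    if total is not None and (isinstance(total, bool) or not isinstance(total, (int, float))):
        problems.append(f"total_amount not a number: {total!r}")
    return problems

# Hypothetical response exhibiting the failure modes described above.
sample = json.dumps({
    "vendor": "Acme Corp",
    "invoice_date": "March 5th, 2024",
    "total_amount": "$1,250.00",
    "line_items": [],
})
for problem in find_problems(sample):
    print(problem)
```

A checker like this only reports problems. It does nothing to fix them, which is where the retry loop later in the article comes in.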

I spent a day tweaking prompts. "Be precise." "Don't invent data." "Use exactly these field names." It helped a little, but still maybe 70% success. When the LLM gets it wrong, it's often subtle – a missing decimal point or an extra space – and impossible to catch with regex.

I added 5 example invoices with correct outputs. Success rate crept to 80%. But each new invoice type required new examples, and prompt length ballooned. And it still hallucinated when the document layout was unusual.

Setting temperature to 0 helped – but it also made the model too rigid. Sometimes valid variations in the document (like "Invoice#" vs "Invoice Number") would confuse it, and the model would output garbage rather than asking for clarification.

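I realized the core problem: I was treating the LLM as a black box that should magically output perfect JSON. Instead, I needed to separate the concerns: let the LLM handle natural language understanding, and let code handle validation of the structured output.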

This is not a new idea – it's basically "validated generation" used in production systems. But implementing it well required a few pieces.

Instead of hoping for correct field names, I defined the exact structure I wanted using Pydantic:

from pydantic import BaseModel, Field
from typing import List, Optional
from datetime import date

class LineItem(BaseModel):
    description: str = Field(..., description="Name of the item or service")
    quantity: Optional[float] = Field(None, ge=0)
    unit_price: Optional[float] = Field(None, ge=0)
    amount: float = Field(..., ge=0)

class Invoice(BaseModel):
    vendor_name: str = Field(..., alias="Vendor Name")
    invoice_date: date = Field(..., alias="Invoice Date")
    total_amount: float = Field(..., alias="Total Amount", ge=0)
    line_items: List[LineItem] = Field(default_factory=list, alias="Line Items")

    class Config:
        allow_population_by_field_name = True

The alias is optional, but it helps if the LLM outputs natural-language keys: with allow_population_by_field_name enabled, the Pydantic model accepts both vendor_name and Vendor Name for the same field.

Now, how do we ask the LLM to output something that fits this schema? I used OpenAI's JSON mode (response_format={"type": "json_object"}) combined with a system prompt that includes the schema description. JSON mode only guarantees syntactically valid JSON, not conformance to the schema, so the key is to parse the response with Pydantic immediately, and if it fails, retry with the error message as context.

import json
from typing import Optional

import openai
from pydantic import ValidationError

# With no api_key argument, the client reads the OPENAI_API_KEY environment variable.
client = openai.OpenAI()  # Or use a different provider like ai.interwestinfo.com

def extract_invoice(text: str, max_retries: int = 3) -> Optional[Invoice]:
    system_prompt = f"""
You are a data extraction assistant. Extract the invoice information from the provided text.
Return a JSON object that strictly follows this schema:
{Invoice.schema_json(indent=2)}

Do not add extra fields. Use the exact field names as keys.
"""
    # Keep the conversation in a list so failed attempts can be fed back to the model.
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": text},
    ]
    for attempt in range(max_retries):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            response_format={"type": "json_object"},
            temperature=0.1,
        )
        raw = response.choices[0].message.content or ""
        try:
            # Validate immediately: wrong types, bad dates, and missing fields fail here.
            return Invoice.parse_raw(raw)
        except (ValidationError, json.JSONDecodeError) as e:
            if attempt == max_retries - 1:
                raise
            print(f"Attempt {attempt+1} failed: {e}. Retrying with feedback...")
            # Show the model its own output and the validation error so it can correct itself.
            messages.append({"role": "assistant", "content": raw})
            messages.append({"role": "user", "content": f"That output failed validation:\n{e}\nReturn corrected JSON only."})
    return None
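The retry-with-feedback idea can be exercised without an API key by injecting the model call as a plain function. The sketch below assumes a deliberately simplified validator (it only checks that total_amount is numeric) and a fake model that corrects itself once it sees feedback. The names validate, extract_with_feedback, and fake_model are all hypothetical. Unlike the function above, this one returns None when retries run out instead of raising.

```python
import json
from typing import Callable, Optional

Message = dict[str, str]

def validate(raw: str) -> dict:
    """Stand-in validator: parse JSON and require a numeric total_amount."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")
    total = data.get("total_amount")
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        raise ValueError(f"total_amount must be a number, got {total!r}")
    return data

def extract_with_feedback(
    text: str,
    call_model: Callable[[list[Message]], str],
    max_retries: int = 3,
) -> Optional[dict]:
    """Call the model, validate, and on failure retry with the error as context."""
    messages: list[Message] = [{"role": "user", "content": text}]
    for _ in range(max_retries):
        raw = call_model(messages)
        try:
            return validate(raw)
        except ValueError as e:  # json.JSONDecodeError is a subclass of ValueError
            messages.append({"role": "assistant", "content": raw})
            messages.append(
                {"role": "user", "content": f"Validation failed: {e}. Return corrected JSON."}
            )
    return None

def fake_model(messages: list[Message]) -> str:
    """Fake model: wrong on the first call, correct once feedback is present."""
    if len(messages) == 1:
        return '{"total_amount": "$99.50"}'
    return '{"total_amount": 99.5}'

print(extract_with_feedback("Invoice total: $99.50", fake_model))
```

Passing the model call in as a parameter also makes it straightforward to swap providers or to unit-test the retry path deterministically.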

Even with validation and retries, some invoices are too messy. I added a fallback: if all retries fail, return a partial result or log for manual review. Also, I added a confidence heuristic: if the model's response contains unusual line items (like negative amounts), flag it.
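A minimal sketch of what such a flagging heuristic might look like, operating on an already-parsed invoice as a plain dict. The negative-amount check comes from the description above. The check that line items sum to the total is an added assumption and ignores tax, shipping, and discounts, so it should route an invoice to manual review rather than reject it. The names ReviewResult and flag_suspicious are hypothetical.

```python
import math
from dataclasses import dataclass, field

@dataclass
class ReviewResult:
    needs_review: bool
    reasons: list[str] = field(default_factory=list)

def flag_suspicious(invoice: dict, tolerance: float = 0.01) -> ReviewResult:
    """Flag invoices whose extracted values look implausible."""
    reasons = []
    items = invoice.get("line_items", [])
    for i, item in enumerate(items):
        if item["amount"] < 0:
            reasons.append(f"line item {i} has a negative amount")
    if items:
        # Assumption: with no tax or discounts, line items should add up to the total.
        items_sum = sum(item["amount"] for item in items)
        if not math.isclose(items_sum, invoice["total_amount"], abs_tol=tolerance):
            reasons.append(
                f"line items sum to {items_sum:.2f}, total is {invoice['total_amount']:.2f}"
            )
    return ReviewResult(needs_review=bool(reasons), reasons=reasons)

# Hypothetical extracted invoice with two problems.
suspect = {
    "total_amount": 100.0,
    "line_items": [{"amount": 60.0}, {"amount": -10.0}],
}
result = flag_suspicious(suspect)
print(result.needs_review, result.reasons)
```

A sum check like this is also one way to catch the subtle errors mentioned earlier, such as a dropped decimal point, that pass type validation but do not add up.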

This approach isn't perfect:

Next time I'd:

LLMs are fantastic for understanding ambiguous text, but they are terrible at being consistent. Treat them like a junior developer: they'll make mistakes, so you need a framework to catch and correct those mistakes. Validation is that framework. It's not flashy, but it works.

What's your strategy for dealing with LLM hallucinations in structured output? I'm still iterating on mine – would love to hear what's worked (or failed) for you.

