# When Regex Fails: My Journey to AI-Powered Data Extraction

> Source: <https://dev.to/__c1b9e06dc90a7e0a676b/when-regex-fails-my-journey-to-ai-powered-data-extraction-1k7e>
> Published: 2026-06-15 01:10:07+00:00

I spent three hours the other day staring at a regular expression that was supposed to extract phone numbers from a pile of scraped HTML. It worked for 70% of the cases, then failed spectacularly on the rest. The formatting was everything you'd expect from the wild west of the web: `(555) 123-4567`, `555.123.4567`, `5551234567`, and the ever-popular `call me at 555-123-4567 after 5`.

Sound familiar? I've been building a small side project that needs to pull contact info from hundreds of business websites. I thought regex would be enough. I was wrong.

I started with the classic regex patterns from Stack Overflow. Something like:

``` python
import re

phone_pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
```

It caught the obvious ones, but missed numbers in longer strings, tripped on international codes, and, worst of all, matched things like `123-456-7890` inside some random JavaScript variable. False positives everywhere.
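To make those failure modes concrete, here is a small sketch that runs the same pattern over a few sample strings. The samples (the JavaScript-style line, the UK-format number, the tracking ID) are made up for illustration and are not from the actual scrape.

```python
import re

phone_pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'

# Hypothetical sample strings, invented for this illustration
samples = [
    "call me at 555-123-4567 after 5",   # a real number buried in prose
    "var orderId = 123-456-7890;",       # looks like a phone number, is not one
    "Head office: +44 20 7946 0958",     # international format
    "trackingId=15551234567890",         # long digit run
]

# Map each sample to whatever the pattern finds in it
matches = {s: re.findall(phone_pattern, s) for s in samples}
for s, found in matches.items():
    print(f"{s!r} -> {found}")
```

The pattern has no notion of context: it happily matches the order ID and the first ten digits of the tracking ID, while the international number produces no match at all.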

Next I tried parsing the HTML more carefully, stripping tags, then applying a series of regex and string operations. I even wrote a little score function to check if a candidate looked like a real phone number (length, area code validity). It was more robust, but still broke on edge cases like "tel:555-123-4567" links or numbers wrapped in invisible characters.
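A rough sketch of what that cleanup and scoring step might look like is below. This is not the original code: `clean`, `score_candidate`, the list of invisible characters, and the scoring scale are all assumptions made for illustration, and the area code check only applies the North American rule that area codes never start with 0 or 1.

```python
import re
import unicodedata

# Zero-width and soft-hyphen characters that often hide inside scraped text
# (an illustrative list, not an exhaustive one)
INVISIBLE = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff\u00ad"))

def clean(text):
    """Normalize look-alike characters and drop invisible ones."""
    # NFKC turns non-breaking spaces into plain spaces and
    # full-width digits into ASCII digits
    text = unicodedata.normalize("NFKC", text)
    return text.translate(INVISIBLE)

def score_candidate(candidate):
    """Hypothetical score: 0 = reject, 1 = right length, 2 = plausible area code."""
    digits = re.sub(r"\D", "", clean(candidate))
    # Drop a leading US/Canada country code if present
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return 0
    score = 1
    # North American area codes never start with 0 or 1
    if digits[0] in "23456789":
        score += 1
    return score

print(score_candidate("tel:555-123-4567"))   # link prefix is ignored
print(score_candidate("123-456-7890"))       # right length, bad area code
```

Stripping everything that is not a digit before scoring is what lets the same function handle `tel:` prefixes and oddly punctuated numbers without a separate rule for each.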

I tried using spaCy's named entity recognition. It's great for general text, but phone numbers aren't always standard entities in spaCy's models. I got mixed results: emails were better, but phone detection was spotty. Plus, I would have had to train a custom model to improve it, which felt like overkill for a weekend project.

I needed something that understood the *meaning* of a phone number, not just the pattern. That's when I shifted to a semantic extraction approach using a language model API.

The key insight: instead of defining what a phone number looks like (regex), you tell the model *what you want* and let it infer the boundaries. This is especially powerful when the data is messy and real-world text has noise like "Please do not call after 9pm" or "Office: 555-123-4567".

Here's the approach I settled on:

``` python
import requests
import json

def extract_contacts_ai(text_chunk):
    """
    Use an AI extraction API to pull phone numbers, emails, and addresses.
    """
    # Truncate before building the prompt to stay within API limits.
    # (A "#" comment inside the f-string would be sent to the model as text.)
    snippet = text_chunk[:2000]
    prompt = f"""Extract all phone numbers, email addresses, and physical addresses from the following text.
Return the result as a JSON object with keys: phones, emails, addresses.
Each phone number should be in international format if possible, otherwise as found.
If none found, return empty lists.

Text:
{snippet}
"""

    # Example using InterWestInfo AI (https://ai.interwestinfo.com/)
    response = requests.post(
        "https://api.interwestinfo.com/v1/extract",  # fictional endpoint
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "model": "extraction-v1",
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.1  # low for consistency
        },
        timeout=30,  # don't hang forever on a stalled connection
    )
    response.raise_for_status()
    data = response.json()
    return data.get("choices", [{}])[0].get("message", {}).get("content", "{}")

# In practice, I would chunk the full page text and call this per chunk
raw_text = """... scraped HTML as plain text ..."""
result = json.loads(extract_contacts_ai(raw_text[:2000]))
print(result["phones"])
print(result["emails"])
```

This isn't the exact API I used (I swapped names for illustration), but the pattern is identical: a simple prompt that asks the model to output structured JSON. The low temperature keeps the output consistent and makes it less likely that the model gets creative.
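The per-chunk calls mentioned in the code comment could be organized roughly like this. It is a sketch under assumptions: `chunk_text`, `merge_results`, and the 50-character overlap are illustrative choices, not what the original project used. The overlap is there so a phone number sitting on a chunk boundary still appears whole in at least one chunk.

```python
def chunk_text(text, size=2000, overlap=50):
    """Split text into overlapping chunks of at most `size` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this chunk already reaches the end of the text
        start += size - overlap
    return chunks

def merge_results(results):
    """Combine per-chunk result dicts, dropping duplicates but keeping order."""
    merged = {"phones": [], "emails": [], "addresses": []}
    for result in results:
        for key in merged:
            for item in result.get(key, []):
                if item not in merged[key]:
                    merged[key].append(item)
    return merged

# Stand-in text; each chunk would be passed to the extraction call
chunks = chunk_text("x" * 4500)
print([len(c) for c in chunks])
```

Because of the overlap, the same number can come back from two neighbouring chunks, which is why the merge step deduplicates.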

Even with AI, you want a sanity check. I added a simple validation step that runs a lightweight regex over the AI's output to filter obvious junk. For example, if the model returns a phone number like "123-456-7890" that's well-formed but suspicious, I check it against a list of known fake numbers. This hybrid approach gave me 95%+ accuracy on my test set of 200 pages.
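A minimal sketch of such a validation pass follows. The `KNOWN_FAKE` set, the `PLAUSIBLE` pattern, and the `validate_phones` helper are hypothetical stand-ins rather than the actual filter; the 15-digit upper bound comes from the E.164 numbering standard, and the 7-digit lower bound is an arbitrary choice.

```python
import re

# Hypothetical blocklist of placeholder numbers, stored as bare digits
KNOWN_FAKE = {"1234567890", "0000000000"}

# Optional leading "+", then only digits and common phone punctuation
PLAUSIBLE = re.compile(r"^\+?[\d\s().-]{7,20}$")

def validate_phones(phones):
    """Keep only model outputs that still look like phone numbers."""
    kept = []
    for phone in phones:
        digits = re.sub(r"\D", "", phone)
        if not PLAUSIBLE.match(phone.strip()):
            continue  # contains letters or other junk
        if not 7 <= len(digits) <= 15:
            continue  # too short, or longer than E.164 allows
        if digits in KNOWN_FAKE:
            continue  # well-formed but a known placeholder
        kept.append(phone)
    return kept

model_output = ["555-123-4567", "123-456-7890", "call us", "+44 20 7946 0958"]
print(validate_phones(model_output))
```

The regex here plays a different role than in the first attempt: it no longer has to find numbers in noisy text, only reject strings the model should never have returned.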

If I were starting over, I'd use a hybrid approach from day one: run regex as a fast first pass, then feed the ambiguous cases (or chunks that had no matches) to the AI model. That would reduce API calls by 60% and keep latency low.
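The routing logic could be sketched like this. `extract_phones_hybrid` is a hypothetical helper, and `ai_extract` stands for any callable that takes a text chunk and returns a list of phone strings. Here it is replaced by a fake so the example runs without an API key.

```python
import re

PHONE_RE = re.compile(r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}')

def extract_phones_hybrid(chunks, ai_extract):
    """Regex first; fall back to the model only for chunks with no match."""
    phones = []
    ai_calls = 0
    for chunk in chunks:
        matches = PHONE_RE.findall(chunk)
        if matches:
            phones.extend(matches)             # cheap path, no API call
        else:
            ai_calls += 1
            phones.extend(ai_extract(chunk))   # expensive path
    return phones, ai_calls

def fake_ai_extract(chunk):
    """Stand-in for the real model call, used only for this demo."""
    return ["+44 20 7946 0958"] if "+44" in chunk else []

chunks = [
    "Office: 555-123-4567",
    "Head office: +44 20 7946 0958",
    "No contact details on this page.",
]
phones, ai_calls = extract_phones_hybrid(chunks, fake_ai_extract)
print(phones, ai_calls)
```

Passing the model call in as a parameter keeps the routing testable offline and makes it easy to count how many API calls the regex pass actually saves on a given dataset.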

Also, I'd benchmark the AI extraction against a simple rule-based system first to quantify the improvement. Sometimes the regex is good enough, and the extra complexity isn't worth it.
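One simple way to quantify that comparison is precision and recall against a hand-labeled set. The sketch below is an assumption about how such a benchmark could be scored, not the evaluation behind the accuracy figure above; `precision_recall` and the sample data are made up for illustration.

```python
def precision_recall(predicted, expected):
    """Score one extractor's output against hand-labeled ground truth."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    # With nothing predicted (or nothing expected), treat the ratio as perfect
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(expected) if expected else 1.0
    return precision, recall

# Hypothetical labels and outputs for a single page
expected = ["555-123-4567", "+44 20 7946 0958"]
regex_output = ["555-123-4567", "123-456-7890"]
ai_output = ["555-123-4567", "+44 20 7946 0958"]

print("regex:", precision_recall(regex_output, expected))
print("ai:   ", precision_recall(ai_output, expected))
```

Running both extractors through the same scorer shows whether the model is fixing missed numbers (recall), false positives (precision), or neither, which is what decides whether the extra complexity is worth it.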

I'd love to hear your war stories on this. Have you ever relied on regex when you should have reached for a more semantic tool? Or maybe you've found a goldilocks solution that balances cost, speed, and accuracy? What's your go-to method for extracting structured data from messy text?
