When Regex Fails: My Journey to AI-Powered Data Extraction

A developer building a side project to extract contact info from hundreds of business websites found that regular expressions failed to reliably parse phone numbers from messy HTML. After trying regex, spaCy NER, and custom scoring with limited success, they shifted to a semantic extraction approach using a language model API. The AI-powered method, which uses a prompt to output structured JSON, proved more robust for real-world text with noise and varied formatting.

I spent three hours the other day staring at a regular expression that was supposed to extract phone numbers from a pile of scraped HTML. It worked for 70% of the cases, then failed spectacularly on the rest. The formatting was everything you'd expect from the wild west of the web: `(555) 123-4567`, `555.123.4567`, `5551234567`, and the ever-popular `call me at 555-123-4567 after 5`. Sound familiar?

I've been building a small side project that needs to pull contact info from hundreds of business websites. I thought regex would be enough. I was wrong.

I started with the classic regex patterns from Stack Overflow. Something like:

```python
import re

phone_pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
```

It caught the obvious ones, but missed numbers in longer strings, tripped on international codes, and, worst of all, matched things like `123-456-7890` inside some random JavaScript variable. False positives everywhere.

Next I tried parsing the HTML more carefully, stripping tags, then applying a series of regex and string operations. I even wrote a little score function to check whether a candidate looked like a real phone number (length, area code validity). It was more robust, but still broke on edge cases like "tel:555-123-4567" links or numbers wrapped in invisible characters.

I tried using spaCy's named entity recognition. It's great for general text, but phone numbers aren't a standard entity type in spaCy's pretrained models. I got mixed results: emails were better, but phone detection was spotty. Plus, I'd have had to train a custom model to improve it, which felt like overkill for a weekend project.

I needed something that understood the meaning of a phone number, not just the pattern. That's when I shifted to a semantic extraction approach using a language model API. The key insight: instead of defining what a phone number looks like (as you do with regex), you tell the model what you want and let it infer the boundaries.
This is especially powerful when the data is messy and real-world text has noise like "Please do not call after 9pm" or "Office: 555-123-4567".

Here's the approach I settled on:

```python
import requests
import json

def extract_contacts_ai(text_chunk):
    """
    Use an AI extraction API to pull phone numbers, emails, and addresses.
    """
    # Truncate the input, keeping it reasonable for API limits
    snippet = text_chunk[:2000]
    prompt = f"""Extract all phone numbers, email addresses, and physical addresses from the following text.
Return the result as a JSON object with keys: phones, emails, addresses.
Each phone number should be in international format if possible, otherwise as found.
If none found, return empty lists.

Text:
{snippet}
"""
    # Example using InterWestInfo AI (https://ai.interwestinfo.com/)
    response = requests.post(
        "https://api.interwestinfo.com/v1/extract",  # fictional endpoint
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={
            "model": "extraction-v1",
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.1,  # low for consistency
        },
        timeout=30,  # don't hang forever on a slow response
    )
    response.raise_for_status()
    data = response.json()
    # The model's reply is a JSON string; fall back to "{}" if it is missing
    return data.get("choices", [{}])[0].get("message", {}).get("content", "{}")

# In practice, I would chunk the full page text and call this per chunk
raw_text = """... scraped HTML as plain text ..."""
result = json.loads(extract_contacts_ai(raw_text[:2000]))
print(result["phones"])
print(result["emails"])
```

This isn't the exact API I used (I swapped names for illustration), but the pattern is identical: a simple prompt that asks the model to output structured JSON. The low temperature keeps the output consistent and discourages the model from getting creative.

Even with AI, you want a sanity check. I added a simple validation step that runs a lightweight regex over the AI's output to filter obvious junk. For example, if the model returns a phone number like "123-456-7890" that's technically valid but suspicious, I'd check it against a known list of fake numbers. This hybrid approach gave me 95%+ accuracy on my test set of 200 pages.
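Here's a minimal sketch of what a validation step like that could look like. It's illustrative rather than the exact code from the project: the names `filter_phones` and `normalize_digits`, the contents of `KNOWN_FAKE_NUMBERS`, and the 10-to-15-digit length rule are all assumptions made for this example.

```python
import re

# Hypothetical blocklist of placeholder numbers (an assumption for this sketch).
KNOWN_FAKE_NUMBERS = {"1234567890", "0000000000", "5555555555"}

def normalize_digits(phone):
    """Keep only the digits so differently formatted numbers compare equal."""
    return re.sub(r"\D", "", phone)

def filter_phones(candidates):
    """Drop candidates that are too short, too long, or on the blocklist."""
    kept = []
    for candidate in candidates:
        if not isinstance(candidate, str):
            continue  # the model may return null or a non-string value
        digits = normalize_digits(candidate)
        # Assumed rule: 10 digits for a national number, up to 15 with a country code.
        if not 10 <= len(digits) <= 15:
            continue
        # Compare the last 10 digits so a leading country code can't hide a fake.
        if digits[-10:] in KNOWN_FAKE_NUMBERS:
            continue
        kept.append(candidate)
    return kept

print(filter_phones(["(555) 123-4567", "123-456-7890", "42", "+1 555.123.4567"]))
# ['(555) 123-4567', '+1 555.123.4567']
```

Comparing on digits only means the filter doesn't care which formatting the model chose, and the original string is kept in the output so nothing the model returned gets rewritten silently.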
If I were starting over, I'd go with a hybrid approach from day one: use regex as a fast first pass, then feed only the ambiguous cases, or the chunks that had no matches, to the AI model. That would cut API calls by roughly 60% and keep latency low. I'd also benchmark the AI extraction against a simple rule-based system first to quantify the improvement. Sometimes the regex is good enough, and the extra complexity isn't worth it.

I'd love to hear your war stories on this. Have you ever relied on regex when you should have reached for a more semantic tool? Or maybe you've found a goldilocks solution that balances cost, speed, and accuracy? What's your go-to method for extracting structured data from messy text?