A Practical Guide To Entity Resolution in Python (No Database, No Machine Learning)

A developer achieved a 100% join rate on a 96-record Crunchbase-to-CRM dataset by replacing exact string matching with fuzzy matching using the RapidFuzz library, up from a 58% rate with normalized exact matching. The pipeline, built without databases or machine learning, used RapidFuzz's `fuzz.WRatio` function with a threshold of 90 to match company name variants like "Necker FinTech" and "Necker FinTech Holdings Inc." that exact equality checks would miss.

TL;DR: Learn a very simple way to normalize, dedupe, and fuzzy-match records that refer to the same real-world entity in Python, without a database or any ML pipelines.

I was working on a Crunchbase dataset last Friday. I joined it against our CRM and got 56 hits out of 96. The other 40 were sitting right there in both tables: Necker FinTech in the extracted data was Necker FinTech Holdings Inc. in the CRM; Investing.com in the data was Fusion Media Limited in the CRM. But `JOIN ... ON name = name` obviously doesn't care; it will shrug and return nothing. If I'd shipped that, some sales rep would have ended up cold-pitching an existing customer because of it. 😅

This is the core problem of [entity resolution](https://en.wikipedia.org/wiki/Record_linkage): the same real-world entity wearing different names in different systems. Naive text equality checks are borderline useless in the real world.

I'd been meaning to do something less embarrassing than a raw `==` for a while, so I spent the rest of the weekend on a simple pipeline: scrape company names from Crunchbase hubs via [Bright Data](https://get.brightdata.com/bd7914?utm_content=a_practical_guide_to_entity_resolution_in_python_no_database_no_machine_learning), normalize, deduplicate, and fuzzy-match against the CRM list using [RapidFuzz](https://github.com/rapidfuzz/RapidFuzz) `fuzz.WRatio`. I deliberately chose NOT to use ML, vector embeddings, or a database.
The join rate on this dataset jumped from ~58% to 100%.

| Metric | Exact normalized string | Fuzzy WRatio ≥ 90 |
|---|---|---|
| Scraped hub rows → CRM | 58.3% (56 / 96) | 100% (96 / 96) |
| CRM rows → scraped data | 34.8% (48 / 138) | 100% (138 / 138) |

The reason exact matching loses so badly is that any real CRM list you're handed will almost always have multiple legal-name variants per company; I had three different Necker spellings pointing at one hub listing alone. Fuzzy matching earns its keep by collapsing those variants back into a single canonical cluster, and that's most of what the rest of this post is about. I'll walk through it; I hope it's useful for anyone starting with fuzzy algorithms.

Entity resolution matches records that describe the same company under different surface strings. If you use exact matching, you ask: are these two strings identical? After you lowercase and strip punctuation, "Necker FinTech" and "Necker FinTech Holdings Inc." are still different strings, so a SQL `JOIN` or a Python `==` check will incorrectly say no match.

```sql
-- Exact join on raw names returns no row when spellings differ
SELECT h.company_name AS hub_name, c.company_name AS crm_name
FROM hub_scrape h
JOIN crm_accounts c ON c.company_name = h.company_name
WHERE h.company_name = 'Necker FinTech';
-- This will return 0 rows
-- Remember, CRM has "Necker FinTech Holdings Inc.", not the Crunchbase title
```

This is why you use fuzzy matching. It asks a looser question: how similar are these two strings? You get a score (usually 0 to 100) instead of true or false. Names that are clearly the same company but spelled differently (Necker FinTech vs Necker FinTech Holdings Inc.) will score high, while unrelated names will score low. You pick a threshold (we use 90): if the score is at or above it, you treat the pair as a match; otherwise you don't.
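Before moving on to fuzzy scores, it may help to see the exact-matching failure described above in Python as well. The sketch below is only an illustration: `normalize`, `exact_join_rate`, and the tiny sample lists are hypothetical names and made-up rows, not the article's code or dataset. It lowercases, strips punctuation, and then counts how many names on one side have an identical normalized key on the other.

```python
import re
from typing import Iterable


def normalize(name: str) -> str:
    """Lowercase, turn punctuation into spaces, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(no_punct.split())


def exact_join_rate(left: Iterable[str], right: Iterable[str]) -> float:
    """Share of `left` names whose normalized form also appears in `right`."""
    right_keys = {normalize(r) for r in right}
    left_names = list(left)
    if not left_names:
        return 0.0
    hits = sum(1 for name in left_names if normalize(name) in right_keys)
    return hits / len(left_names)


# Made-up sample rows for illustration only (not the article's dataset)
hub_names = ["Necker FinTech", "PointsKash", "Stripe"]
crm_names = ["Necker FinTech Holdings Inc.", "Points Kash", "STRIPE"]

if __name__ == "__main__":
    for name in hub_names + crm_names:
        print(f"{name!r:35} -> {normalize(name)!r}")
    print(f"exact join rate: {exact_join_rate(hub_names, crm_names):.1%}")
```

Only the case-only variant joins here. The legal-suffix and spacing variants survive normalization as different strings, which is the gap a similarity score is meant to close.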
```python
from rapidfuzz import fuzz

THRESHOLD = 90


def is_match(a: str, b: str) -> bool:
    return fuzz.WRatio(a, b) >= THRESHOLD


pairs = [
    ("Necker FinTech", "Necker FinTech Holdings Inc."),  # same company, legal suffix
    ("PointsKash", "Points Kash"),                        # same company, spacing
    ("Investing.com", "Fusion Media Limited"),            # brand vs legal entity
    ("Stripe", "Climate Corp"),                           # different companies
]

for a, b in pairs:
    score = fuzz.WRatio(a, b)
    print(f"{score:5.1f}  match={score >= THRESHOLD!s:5}  {a!r} vs {b!r}")
```

This is the same scoring logic we'll use for the rest of the tutorial, so `pip install rapidfuzz` is all you need to follow along. Running the demo pairs above with `fuzz.WRatio` and a threshold of 90 yields:

| Pair | WRatio | Match at ≥ 90? | Drift type |
|---|---|---|---|
| Necker FinTech vs Necker FinTech Holdings Inc. | 90.0 | Yes | Legal suffix |
| PointsKash vs Points Kash | 95.2 | Yes | Token spacing |
| Investing.com vs Fusion Media Limited | 30.0 | No | Brand vs legal entity |
| Stripe vs Climate Corp | 45.0 | No | Unrelated companies |

Think of it like a strict spell-check or a "did you mean X?" suggestion, but for whole company names. It is not machine learning; no model is trained on your data. The library compares characters and words using fixed rules: how many edits to turn one string into another, whether one name is contained in the other, whether the same words appear in a different order. That's why it's fast, easy to audit, and good enough for a large class of real-world messiness: extra words, Inc. vs LLC, odd spacing, punctuation.

💡 If two names share almost no letters (Investing.com and Fusion Media Limited, for example), the score stays low and fuzzy matching correctly refuses to merge them. Those cases need a real identifier (domain, LEI, enrichment API, some sort of ML pipeline, etc.), not smarter string math.

Here's a quick summary.

| Approach | Best when | Used in this pipeline? |
|---|---|---|
| Fuzzy matching (RapidFuzz WRatio) | Same entity, stylistic drift: legal suffixes, spacing, punctuation | Yes (primary method) |
| Lookup table / enrichment API (GLEIF, Clearbit, domain) | Brand vs legal name; names share almost no tokens | Partial (`RESEARCHED` dict in `build_sample_crm.py`) |
| ML record linkage (Dedupe, Splink) | Large-scale probabilistic linkage, many fields beyond name | No (names-only, no training step) |

Basically, choose fuzzy matching when two name strings likely describe the same company but spell it differently. Only choose a lookup or enrichment layer when the strings are related entities (brand vs operator) rather than variants of one name.

Entity resolution in this pipeline is a fetch → extract → normalize → fuzzy-cluster → join loop on `canonical_id`.

```
hub_urls.json
     │
     ▼
fetch_hubs.py ──calls──► bright_data_unlocker.py   (Bright Data POST → page body, markdown/HTML)
     │                          │
     └──calls──► parse_hubs.py ◄┘                  (regex → org slug + display name)
     │
     ▼
hub_snapshot.json + cached bodies in data/hub_responses/

extract.py   ──► raw_records.json    (flat table)
reconcile.py ──► reconciled.json     (canonical clusters + aliases)
run_fuzzy.py                         (CLI part; this just runs extract + reconcile)

── optional eval ──
post_fuzzy_eval.py                   (all done, so run a real-world test, calc metrics, then print to stdout)
```

Each stage is a pure transform: JSON in, JSON out. Nothing stateful, nothing that requires a running service, and nothing I can't `git diff` between runs.
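The fuzzy-cluster and join-on-`canonical_id` steps are only named at this point, so here is a minimal sketch of what they could look like. Everything in it is an assumption for illustration: `toy_score`, `cluster_names`, `alias_to_id`, and ids like `c1` are hypothetical, and the scorer is a standard-library stand-in built on `difflib`, not RapidFuzz's `WRatio` and not the article's `reconcile.py`.

```python
from difflib import SequenceMatcher
from typing import Any, Callable, Dict, List


def toy_score(a: str, b: str) -> float:
    """Stand-in similarity on a 0-100 scale (NOT RapidFuzz's WRatio)."""
    a_low, b_low = a.lower(), b.lower()
    full = SequenceMatcher(None, a_low, b_low).ratio() * 100
    shorter, longer = sorted((a_low, b_low), key=len)
    # Crude containment bonus so "X" vs "X Holdings Inc." can clear the threshold
    contained = 90.0 if shorter and shorter in longer else 0.0
    return max(full, contained)


def cluster_names(
    names: List[str],
    score: Callable[[str, str], float] = toy_score,
    threshold: float = 90.0,
) -> List[Dict[str, Any]]:
    """Greedy single pass: join the first cluster whose canonical name scores >= threshold."""
    clusters: List[Dict[str, Any]] = []
    for name in names:
        for cluster in clusters:
            if score(name, cluster["canonical"]) >= threshold:
                cluster["aliases"].append(name)
                break
        else:
            # No existing cluster was close enough, so this name starts a new one
            clusters.append(
                {
                    "canonical_id": f"c{len(clusters) + 1}",
                    "canonical": name,
                    "aliases": [name],
                }
            )
    return clusters


def alias_to_id(clusters: List[Dict[str, Any]]) -> Dict[str, str]:
    """Flatten clusters into an alias -> canonical_id lookup used for the join."""
    return {alias: c["canonical_id"] for c in clusters for alias in c["aliases"]}


if __name__ == "__main__":
    sample = [
        "Necker FinTech",
        "Necker FinTech Holdings Inc.",
        "PointsKash",
        "Points Kash",
        "Stripe",
    ]
    for cluster in cluster_names(sample):
        print(cluster["canonical_id"], cluster["aliases"])
```

Once both tables carry a `canonical_id`, the join is an ordinary equality join on that id instead of on the raw name. A greedy pass like this depends on input order, and `fuzz.WRatio` can be passed as `score` in place of the stand-in when RapidFuzz is installed.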
I'm scraping four Crunchbase hub leaderboard pages, defined in a `hub_urls.json`:

```json
[
  { "category": "fintech", "url": "https://www.crunchbase.com/hub/fintech-companies-seed-funding" },
  { "category": "cybersecurity", "url": "https://www.crunchbase.com/hub/cyber-security-startups" },
  { "category": "saas", "url": "https://www.crunchbase.com/hub/saas-companies-seed-funding" },
  { "category": "artificial_intelligence", "url": "https://www.crunchbase.com/hub/artificial-intelligence-companies-early-stage-venture-funding" }
]
```

Replace with your own, obviously.

Crunchbase is a JavaScript-heavy SPA; it won't respond to a plain `requests.get`. So before we fetch, I use [Bright Data's Web Unlocker](https://get.brightdata.com/bd-web-unlocker?utm_content=a_practical_guide_to_entity_resolution_in_python_no_database_no_machine_learning), which handles JS rendering and anti-bot for me. You can sign up through that same link.

I set up a reusable client for this. It is just a thin wrapper around their single POST endpoint, `https://api.brightdata.com/request`.
Make sure you've signed up, and have these set in your `.env` file first (the names must match what the client below reads):

```
BRIGHT_DATA_API_KEY=your_api_token
BRIGHT_DATA_UNLOCKER_ZONE=your_web_unlocker_zone_name
```

`bright_data_unlocker.py`

```python
"""Fetch hub/listing pages as HTML or markdown."""
from __future__ import annotations

import json
import os
import time
from typing import Any, Dict, Literal, Optional

import requests
from dotenv import load_dotenv

load_dotenv()

ContentFormat = Literal["html", "markdown"]


class BrightDataUnlockerClient:
    """POST https://api.brightdata.com/request (Web Unlocker zone)."""

    def __init__(
        self,
        api_key: Optional[str] = None,
        zone: Optional[str] = None,
        country: Optional[str] = None,
    ):
        self.api_key = api_key or os.getenv("BRIGHT_DATA_API_KEY")
        self.zone = zone or os.getenv("BRIGHT_DATA_UNLOCKER_ZONE")
        self.country = country or os.getenv("BRIGHT_DATA_COUNTRY")  # optional
        self.api_endpoint = "https://api.brightdata.com/request"

        if not self.api_key:
            raise ValueError("BRIGHT_DATA_API_KEY is required.")
        if not self.zone:
            raise ValueError(
                "BRIGHT_DATA_UNLOCKER_ZONE is required. "
                "Create a Web Unlocker API zone in Bright Data."
            )

        self.session = requests.Session()
        self.session.headers.update(
            {
                "Content-Type": "application/json",
                "Authorization": f"Bearer {self.api_key}",
            }
        )

    def fetch(
        self,
        url: str,
        *,
        content_format: ContentFormat = "markdown",
        max_retries: int = 2,
    ) -> str:
        """Fetch page body. markdown = format=raw + data_format=markdown (Bright Data)."""
        last_err: Optional[Exception] = None
        for attempt in range(max_retries + 1):
            try:
                return self._do_fetch(url, content_format=content_format)
            except Exception as e:
                last_err = e
                if attempt < max_retries:
                    # Linear backoff: 0.5s, 1.0s, ...
                    time.sleep(0.5 * (attempt + 1))
        assert last_err is not None
        raise last_err

    def fetch_markdown(self, url: str, max_retries: int = 2) -> str:
        return self.fetch(url, content_format="markdown", max_retries=max_retries)

    def fetch_html(self, url: str, max_retries: int = 2) -> str:
        return self.fetch(url, content_format="html", max_retries=max_retries)

    def _do_fetch(self, url: str, *, content_format: ContentFormat) -> str:
        payload: Dict[str, Any] = {
            "zone": self.zone,
            "url": url,
            "format": "raw",
        }
        if content_format == "markdown":
            payload["data_format"] = "markdown"
        if self.country:
            payload["country"] = self.country

        response = self.session.post(self.api_endpoint, json=payload, timeout=120)
        response.raise_for_status()

        try:
            result = response.json()
        except json.JSONDecodeError:
            # data_format=markdown often returns the page body directly, not a JSON envelope
            text = response.text
            if not text.strip():
                raise RuntimeError("Bright Data Unlocker empty response body")
            return text

        if not isinstance(result, dict):
            raise RuntimeError(f"Bright Data unexpected response type: {type(result)}")

        inner_status = result.get("status_code")
        if inner_status is not None and inner_status != 200:
            raise RuntimeError(f"Bright Data Unlocker status_code={inner_status}")

        body = result.get("body")
        if body is None:
            if "status_code" in result and result.get("status_code") == 200:
                raise RuntimeError("Bright Data Unlocker empty body")
            raise RuntimeError(f"Bright Data Unlocker missing body: {list(result.keys())}")

        if isinstance(body, str):
            # Some responses wrap the real body in a second JSON envelope
            if body.strip().startswith("{"):
                try:
                    nested = json.loads(body)
                    if isinstance(nested, dict) and "body" in nested:
                        body = nested["body"]
                except json.JSONDecodeError:
                    pass
            if not str(body).strip():
                raise RuntimeError("Bright Data Unlocker empty body string")
            return str(body)

        if isinstance(body, dict):
            return json.dumps(body)
        return str(body)
```

Note how we can request `data_format=markdown`.
Using this param, Bright Data returns a sanitized markdown rendering of the page, which is much easier to parse with regex than raw HTML.

💡 If markdown still yields zero orgs for a hub, `fetch_hubs.py --fallback-html` can fetch (or use cached) HTML and run the HTML parser instead.

With that in place, here's our actual fetch script, `fetch_hubs.py`:

`fetch_hubs.py`

```python
"""Fetch Crunchbase hub pages via Bright Data Web Unlocker; write hub_snapshot.json."""
from __future__ import annotations

import argparse
import json
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List, Optional

from dotenv import load_dotenv

from bright_data_unlocker import BrightDataUnlockerClient, ContentFormat
from parse_hubs import parse_organizations

load_dotenv()

ROOT = Path(__file__).resolve().parent
DEFAULT_RESPONSES_DIR = ROOT / "data" / "hub_responses"


def load_hub_urls(path: Path) -> List[Dict[str, str]]:
    raw = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(raw, list):
        raise ValueError("hub_urls.json must be a JSON array")
    out: List[Dict[str, str]] = []
    for item in raw:
        if not isinstance(item, dict):
            continue
        url = (item.get("url") or "").strip()
        category = (item.get("category") or "unknown").strip()
        if url:
            out.append({"category": category, "url": url})
    return out


def response_file(category: str, content_format: ContentFormat) -> str:
    ext = "md" if content_format == "markdown" else "html"
    # Keep only filesystem-safe characters in the cache file name
    safe = "".join(c if c.isalnum() or c in "-_" else "_" for c in category)
    return f"{safe}.{ext}"


def response_path(responses_dir: Path, category: str, content_format: ContentFormat) -> Path:
    return responses_dir / response_file(category, content_format)


def load_cached_body(
    responses_dir: Path, category: str, content_format: ContentFormat
) -> Optional[str]:
    path = response_path(responses_dir, category, content_format)
    if not path.is_file() or path.stat().st_size == 0:
        return None
    return path.read_text(encoding="utf-8")


def save_response_body(
    responses_dir: Path,
    category: str,
    hub_url: str,
    content_format: ContentFormat,
    body: str,
) -> Path:
    responses_dir.mkdir(parents=True, exist_ok=True)
    path = response_path(responses_dir, category, content_format)
    path.write_text(body, encoding="utf-8")
    return path


def manifest_path(responses_dir: Path) -> Path:
    return responses_dir / "manifest.json"


def load_manifest(responses_dir: Path) -> Dict[str, Any]:
    path = manifest_path(responses_dir)
    if not path.is_file():
        return {"hubs": []}
    return json.loads(path.read_text(encoding="utf-8"))


def upsert_manifest_entry(
    responses_dir: Path,
    category: str,
    hub_url: str,
    content_format: ContentFormat,
    response_path: Path,
    *,
    fetched_at: str,
) -> None:
    entry = {
        "category": category,
        "hub_url": hub_url,
        "content_format": content_format,
        "response_file": response_path.name,
        "fetched_at": fetched_at,
    }
    manifest = load_manifest(responses_dir)
    # Drop any previous entry for this category, then append the fresh one
    hubs = [h for h in manifest.get("hubs") or [] if h.get("category") != category]
    hubs.append(entry)
    manifest["hubs"] = hubs
    manifest["updated_at"] = datetime.now(timezone.utc).isoformat()
    manifest_path(responses_dir).write_text(
        json.dumps(manifest, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )


def parse_body(
    body: str,
    hub_url: str,
    content_format: ContentFormat,
    max_orgs: int,
) -> List[Dict[str, Any]]:
    return parse_organizations(body, hub_url, content_format=content_format, max_orgs=max_orgs)


def main() -> None:
    ap = argparse.ArgumentParser(
        description="Fetch Crunchbase hub pages (Web Unlocker) and extract organization URLs.",
    )
    ap.add_argument("--hubs-json", type=Path, default=ROOT / "hub_urls.json")
    ap.add_argument("--out", type=Path, default=ROOT / "data" / "hub_snapshot.json")
    ap.add_argument("--format", choices=("markdown", "html"), default="markdown")
    ap.add_argument("--max-orgs-per-hub", type=int, default=80)
    ap.add_argument("--delay", type=float, default=1.0)
    ap.add_argument(
        "--responses-dir",
        type=Path,
        default=DEFAULT_RESPONSES_DIR,
        help="Directory for cached raw hub page bodies (default: data/hub_responses).",
    )
    ap.add_argument(
        "--refetch",
        action="store_true",
        help="Call Bright Data even if a cached response file exists.",
    )
    ap.add_argument(
        "--parse-only",
        action="store_true",
        help="Parse cached responses only; never call Bright Data.",
    )
    ap.add_argument(
        "--fallback-html",
        action="store_true",
        help="If markdown parse finds 0 orgs, try cached or fetched HTML.",
    )
    args = ap.parse_args()

    responses_dir = args.responses_dir
    hubs = load_hub_urls(args.hubs_json)
    if not hubs:
        raise SystemExit("No hubs in hub_urls.json")
    args.out.parent.mkdir(parents=True, exist_ok=True)

    client: Optional[BrightDataUnlockerClient] = None
    if not args.parse_only:
        client = BrightDataUnlockerClient()

    content_format: ContentFormat = args.format
    payload: Dict[str, Any] = {
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "source": "bright_data_web_unlocker",
        "content_format": content_format,
        "responses_dir": str(responses_dir),
        "hubs": [],
    }

    n_hubs = len(hubs)
    for i, hub in enumerate(hubs, start=1):
        category = hub["category"]
        url = hub["url"]
        print(f"\n[{i}/{n_hubs}] hub ({category}): starting...", flush=True)
        block: Dict[str, Any] = {
            "category": category,
            "hub_url": url,
            "error": None,
            "organic_count": 0,
            "rows": [],
            "response_file": response_file(category, content_format),
        }
        parse_format: ContentFormat = content_format
        try:
            body: Optional[str] = None
            if not args.refetch:
                body = load_cached_body(responses_dir, category, content_format)
            if body is None:
                if args.parse_only:
                    raise FileNotFoundError(
                        f"no cached response at "
                        f"{response_path(responses_dir, category, content_format)} "
                        "(run without --parse-only to fetch)"
                    )
                print(
                    f"[{i}/{n_hubs}] hub ({category}): fetching ({content_format})...",
                    flush=True,
                )
                assert client is not None
                body = client.fetch(url, content_format=content_format)
                print(
                    f"[{i}/{n_hubs}] hub ({category}): fetch done "
                    f"({len(body):,} chars)",
                    flush=True,
                )
                saved = save_response_body(responses_dir, category, url, content_format, body)
                upsert_manifest_entry(
                    responses_dir,
                    category,
                    url,
                    content_format,
                    saved,
                    fetched_at=datetime.now(timezone.utc).isoformat(),
                )
                print(f"[{i}/{n_hubs}] hub ({category}): saved {saved}", flush=True)
            else:
                print(
                    f"[{i}/{n_hubs}] hub ({category}): using cache "
                    f"{response_path(responses_dir, category, content_format)}",
                    flush=True,
                )

            print(f"[{i}/{n_hubs}] hub ({category}): parsing...", flush=True)
            rows = parse_body(body, url, parse_format, args.max_orgs_per_hub)

            if not rows and args.fallback_html and parse_format == "markdown":
                html_body = load_cached_body(responses_dir, category, "html")
                if html_body is None and not args.parse_only:
                    print(
                        f"[{i}/{n_hubs}] hub ({category}): markdown had 0 orgs, "
                        "fetching HTML...",
                        flush=True,
                    )
                    assert client is not None
                    html_body = client.fetch(url, content_format="html")
                    print(
                        f"[{i}/{n_hubs}] hub ({category}): HTML fetch done "
                        f"({len(html_body):,} chars)",
                        flush=True,
                    )
                    saved = save_response_body(responses_dir, category, url, "html", html_body)
                    print(f"[{i}/{n_hubs}] hub ({category}): saved {saved}", flush=True)
                elif html_body is None:
                    raise FileNotFoundError(
                        f"no cached HTML at {response_path(responses_dir, category, 'html')}"
                    )
                else:
                    print(
                        f"[{i}/{n_hubs}] hub ({category}): markdown had 0 orgs, "
                        "using cached HTML...",
                        flush=True,
                    )
                print(f"[{i}/{n_hubs}] hub ({category}): parsing HTML...", flush=True)
                rows = parse_body(html_body, url, "html", args.max_orgs_per_hub)
                parse_format = "html"
                block["response_file"] = response_file(category, "html")

            block["content_format"] = parse_format
            block["organic_count"] = len(rows)
            block["rows"] = rows
            print(
                f"[{i}/{n_hubs}] hub ({category}): done - "
                f"{len(rows)} organizations",
                flush=True,
            )
        except Exception as e:
            # Record the failure on this hub and keep going with the others
            print(f"[{i}/{n_hubs}] hub ({category}): failed - {e}", flush=True)
            block["error"] = str(e)

        payload["hubs"].append(block)
        if not args.parse_only:
            time.sleep(args.delay)

    args.out.write_text(
        json.dumps(payload, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    total = sum(h.get("organic_count") or 0 for h in payload["hubs"])
    print(
        f"\nAll hubs processed. Wrote {args.out} "
        f"({total} organizations across {n_hubs} hubs).",
        flush=True,
    )


if __name__ == "__main__":
    main()
```

Note how I'm caching the raw bodies under `data/hub_responses/` so re-runs with `--parse-only` don't burn any API credits.

Our `parse_hubs.py` pulls organization slugs and display names out of the cached page bodies from the previous step. It runs three regex patterns in priority order:

parse hubs.py Priority 1: Bright Data relative markdown links Matches: /organization/slug "Display Name" ORG REL LINK = re.compile r"\ \ /organization/ a-z0-9 - + ?:\s+\" ^\" \" ?\s \ ", re.I, Priority 2: Standard absolute markdown links Matches: Company Name https://www.crunchbase.com/organization/slug ORG MD LINK = re.compile r"\ ^\ + \ \ \s