A Practical Guide To Entity Resolution in Python (No Database, No Machine Learning)

A developer achieved a 100% join rate on a 96-record Crunchbase-to-CRM dataset by replacing exact string matching with fuzzy matching using the RapidFuzz library, up from a 58% rate with normalized exact matching. The pipeline, built without databases or machine learning, used RapidFuzz's `fuzz.WRatio` function with a threshold of 90 to match company name variants like "Necker FinTech" and "Necker FinTech Holdings Inc." that exact equality checks would miss.

TL;DR: Learn a very simple way to normalize, dedupe, and fuzzy-match records that refer to the same real-world entity in Python, without a database or any ML pipelines.

I was working on a Crunchbase dataset last Friday. I joined it against our CRM, and got 56 hits out of 96. The other 40 were sitting right there in both tables (Necker FinTech in the extracted data was Necker FinTech Holdings Inc. in the CRM; Investing.com in the data was Fusion Media Limited in the CRM), but `JOIN ... ON name = name` obviously doesn't care: it will shrug and return nothing. If I'd shipped that, some sales rep would end up cold-pitching an existing customer because of it. 😅

This is the core problem of [entity resolution](https://en.wikipedia.org/wiki/Record_linkage): the same real-world entity wearing different names in different systems. Naive text equality checks are borderline useless in the real world.

I'd been meaning to do something less embarrassing than a raw `==` for a while, so I spent the rest of the weekend on a simple pipeline: scrape company names from Crunchbase hubs via [Bright Data](https://get.brightdata.com/bd7914?utm_content=a_practical_guide_to_entity_resolution_in_python_no_database_no_machine_learning), normalize, deduplicate, and fuzzy-match against the CRM list using [RapidFuzz](https://github.com/rapidfuzz/RapidFuzz)'s `fuzz.WRatio`. Deliberately choosing to NOT use ML, vector embeddings, or a database.

The join rate on this dataset jumped from ~58% to 100%.

| Metric | Exact normalized string | Fuzzy WRatio ≥ 90 |
|---|---|---|
| Scraped hub rows → CRM | 58.3% (56 / 96) | 100% (96 / 96) |
| CRM rows → scraped data | 34.8% (48 / 138) | 100% (138 / 138) |

The reason exact matching loses so badly is that any real CRM list you're handed will almost always have multiple legal-name variants per company. I had three different Necker spellings pointing at one hub listing alone.
Fuzzy matching earns its keep by collapsing those variants back into a single canonical cluster, and that's most of what the rest of this post is about. I'll walk through it; I hope it's useful for anyone starting with fuzzy algorithms.

Entity resolution matches records that describe the same company under different surface strings. If you use exact matching, you ask: are these two strings identical? After you lowercase and strip punctuation, "Necker FinTech" and "Necker FinTech Holdings Inc." are still different strings, so a SQL `JOIN` or a Python `==` check will incorrectly say no match.

```sql
-- Exact join on raw names returns no row when spellings differ
SELECT h.company_name AS hub_name, c.company_name AS crm_name
FROM hub_scrape h
JOIN crm_accounts c ON c.company_name = h.company_name
WHERE h.company_name = 'Necker FinTech';
-- This will return 0 rows
-- Remember, CRM has "Necker FinTech Holdings Inc.", not the Crunchbase title
```

This is why you use fuzzy matching. That asks a looser question: how similar are these two strings? You get a score (usually 0 to 100) instead of true or false. Names that are clearly the same company but spelled differently (Necker FinTech vs Necker FinTech Holdings Inc.) will score high, while unrelated names will score low. You pick a threshold (we use 90): if the score is at or above it, you treat the pair as a match; otherwise you don't.

```python
from rapidfuzz import fuzz

THRESHOLD = 90


def is_match(a: str, b: str) -> bool:
    return fuzz.WRatio(a, b) >= THRESHOLD


pairs = [
    ("Necker FinTech", "Necker FinTech Holdings Inc."),  # same company, legal suffix
    ("PointsKash", "Points Kash"),                       # same company, spacing
    ("Investing.com", "Fusion Media Limited"),           # brand vs legal entity
    ("Stripe", "Climate Corp"),                          # different companies
]

for a, b in pairs:
    score = fuzz.WRatio(a, b)
    print(f"{score:5.1f}  match={score >= THRESHOLD!s:5}  {a!r} vs {b!r}")
```

This is the same scoring logic we'll use for the rest of the tutorial, so `pip install rapidfuzz` is all you need to follow along.
Running the demo pairs above with `fuzz.WRatio` and a threshold of 90 yields:

| Pair | WRatio | Match at ≥ 90? | Drift type |
|---|---|---|---|
| Necker FinTech vs Necker FinTech Holdings Inc. | 90.0 | Yes | Legal suffix |
| PointsKash vs Points Kash | 95.2 | Yes | Token spacing |
| Investing.com vs Fusion Media Limited | 30.0 | No | Brand vs legal entity |
| Stripe vs Climate Corp | 45.0 | No | Unrelated companies |

Think of it like a strict spell-check or a "did you mean X?" suggestion, but for whole company names. It is not machine learning: no model is trained on your data. The library compares characters and words using fixed rules: how many edits to turn one string into another, whether one name is contained in the other, whether the same words appear in a different order. That's why it's fast, easy to audit, and good enough for a large class of real-world messiness: extra words, Inc. vs LLC, odd spacing, punctuation.

💡 If two names share almost no letters (Investing.com and Fusion Media Limited, for example), the score stays low and fuzzy matching correctly refuses to merge them. Those cases need a real identifier (domain, LEI, enrichment API, some sort of ML pipeline, etc.), not smarter string math.

Here's a quick summary.

| Approach | Best when | Used in this pipeline? |
|---|---|---|
| Fuzzy matching (RapidFuzz WRatio) | Same entity, stylistic drift: legal suffixes, spacing, punctuation | Yes, primary method |
| Lookup table / enrichment API (GLEIF, Clearbit, domain) | Brand vs legal name; names share almost no tokens | Partial: RESEARCHED dict in `build_sample_crm.py` |
| ML record linkage (Dedupe, Splink) | Large-scale probabilistic linkage, many fields beyond name | No: names-only, no training step |

Basically, choose fuzzy matching when two name strings likely describe the same company but spell it differently. Only choose a lookup or enrichment layer when the strings are related entities (brand vs operator) rather than variants of one name.
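The rule of thumb above (lookup for related entities, fuzzy for spelling variants) could be wired together roughly like this. Treat it as a sketch under assumptions, not the pipeline's code: `resolve_name`, `KNOWN_ALIASES`, and `difflib_score` are hypothetical names I made up for illustration, and the scorer is a standard-library stand-in (`difflib.SequenceMatcher`) so the snippet runs without extra installs. In the real pipeline you would pass `fuzz.WRatio` as `scorer`; the stand-in scores differently, so a threshold tuned for one will not transfer to the other.

```python
from difflib import SequenceMatcher
from typing import Callable, Mapping, Optional, Sequence

# Hypothetical lookup table for brand-vs-legal-entity pairs that share almost
# no characters. In practice this comes from research or an enrichment API.
KNOWN_ALIASES = {
    "fusion media limited": "Investing.com",
}


def difflib_score(a: str, b: str) -> float:
    """Stand-in 0-100 similarity score (NOT RapidFuzz's WRatio)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve_name(
    name: str,
    canonical_names: Sequence[str],
    aliases: Mapping[str, str] = KNOWN_ALIASES,
    scorer: Callable[[str, str], float] = difflib_score,
    threshold: float = 90.0,
) -> Optional[str]:
    """Return the canonical name for `name`, or None if nothing qualifies."""
    # 1. Lookup layer: related entities that no string metric will connect.
    looked_up = aliases.get(name.strip().lower())
    if looked_up is not None:
        return looked_up
    # 2. Fuzzy layer: spelling variants of the same name.
    best_name: Optional[str] = None
    best_score = 0.0
    for candidate in canonical_names:
        score = scorer(name, candidate)
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name if best_score >= threshold else None


canonicals = ["Investing.com", "PointsKash", "Stripe"]
print(resolve_name("Fusion Media Limited", canonicals))  # resolved by the lookup table
print(resolve_name("Points Kash", canonicals))           # resolved by the fuzzy layer
print(resolve_name("Climate Corp", canonicals))          # nothing qualifies
```

Checking the lookup first keeps the fuzzy threshold strict: you never have to lower it to catch pairs that string similarity was never going to find.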
Entity resolution in this pipeline is a fetch → extract → normalize → fuzzy-cluster → join loop on `canonical_id`.

```
hub_urls.json
      │
      ▼
fetch_hubs.py ──calls──► bright_data_unlocker.py   (Bright Data POST → page body markdown/HTML)
      │                          │
      └──calls──► parse_hubs.py ◄─┘                 (regex → org slug + display name)
      │
      ▼
hub_snapshot.json + cached bodies in data/hub_responses/

extract.py    ──► raw_records.json   (flat table)
reconcile.py  ──► reconciled.json    (canonical clusters + aliases)
run_fuzzy.py                         (CLI part. This just runs extract + reconcile)

── optional eval ──
post_fuzzy_eval.py                   (All done, so run a real-world test, calc metrics, then print to stdout)
```

Each stage is a pure transform: JSON in, JSON out. Nothing stateful, nothing that requires a running service, and nothing I can't `git diff` between runs.

I'm scraping four Crunchbase hub leaderboard pages, defined in a `hub_urls.json`:

```json
[
  { "category": "fintech", "url": "https://www.crunchbase.com/hub/fintech-companies-seed-funding" },
  { "category": "cybersecurity", "url": "https://www.crunchbase.com/hub/cyber-security-startups" },
  { "category": "saas", "url": "https://www.crunchbase.com/hub/saas-companies-seed-funding" },
  { "category": "artificial intelligence", "url": "https://www.crunchbase.com/hub/artificial-intelligence-companies-early-stage-venture-funding" }
]
```

Replace with your own, obviously.

Crunchbase is a JavaScript-heavy SPA, so it won't respond to a plain `requests.get`. So before we fetch, I use [Bright Data's Web Unlocker](https://get.brightdata.com/bd-web-unlocker?utm_content=a_practical_guide_to_entity_resolution_in_python_no_database_no_machine_learning), which handles JS rendering and anti-bot for me.
I set up a reusable client for this, and this is just a thin wrapper around their single POST endpoint, `https://api.brightdata.com/request`. Make sure you've signed up, and have these set in your `.env` file first (the names must match what the client below reads):

```
BRIGHT_DATA_API_KEY=your_api_token
BRIGHT_DATA_UNLOCKER_ZONE=your_web_unlocker_zone_name
```

`bright_data_unlocker.py`

```python
"""Fetch hub/listing pages as HTML or markdown."""

from __future__ import annotations

import json
import os
import time
from typing import Any, Dict, Literal, Optional

import requests
from dotenv import load_dotenv

load_dotenv()

ContentFormat = Literal["html", "markdown"]


class BrightDataUnlockerClient:
    """POST https://api.brightdata.com/request (Web Unlocker zone)."""

    def __init__(
        self,
        api_key: Optional[str] = None,
        zone: Optional[str] = None,
        country: Optional[str] = None,
    ):
        self.api_key = api_key or os.getenv("BRIGHT_DATA_API_KEY")
        self.zone = zone or os.getenv("BRIGHT_DATA_UNLOCKER_ZONE")
        self.country = country or os.getenv("BRIGHT_DATA_COUNTRY")  # optional
        self.api_endpoint = "https://api.brightdata.com/request"

        if not self.api_key:
            raise ValueError("BRIGHT_DATA_API_KEY is required.")
        if not self.zone:
            raise ValueError(
                "BRIGHT_DATA_UNLOCKER_ZONE is required. "
                "Create a Web Unlocker API zone in Bright Data."
            )

        self.session = requests.Session()
        self.session.headers.update(
            {
                "Content-Type": "application/json",
                "Authorization": f"Bearer {self.api_key}",
            }
        )

    def fetch(
        self,
        url: str,
        *,
        content_format: ContentFormat = "markdown",
        max_retries: int = 2,
    ) -> str:
        """Fetch page body. markdown = format=raw + data_format=markdown (Bright Data)."""
        last_err: Optional[Exception] = None
        for attempt in range(max_retries + 1):
            try:
                return self._do_fetch(url, content_format=content_format)
            except Exception as e:
                last_err = e
                if attempt < max_retries:
                    # Linear backoff: 0.5s, 1.0s, ...
                    time.sleep(0.5 * (attempt + 1))
        assert last_err is not None
        raise last_err

    def fetch_markdown(self, url: str, max_retries: int = 2) -> str:
        return self.fetch(url, content_format="markdown", max_retries=max_retries)

    def fetch_html(self, url: str, max_retries: int = 2) -> str:
        return self.fetch(url, content_format="html", max_retries=max_retries)

    def _do_fetch(self, url: str, *, content_format: ContentFormat) -> str:
        payload: Dict[str, Any] = {
            "zone": self.zone,
            "url": url,
            "format": "raw",
        }
        if content_format == "markdown":
            payload["data_format"] = "markdown"
        if self.country:
            payload["country"] = self.country

        response = self.session.post(self.api_endpoint, json=payload, timeout=120)
        response.raise_for_status()

        try:
            result = response.json()
        except json.JSONDecodeError:
            # data_format=markdown often returns the page body directly, not a JSON envelope
            text = response.text
            if not text.strip():
                raise RuntimeError("Bright Data Unlocker empty response body")
            return text

        if not isinstance(result, dict):
            raise RuntimeError(f"Bright Data unexpected response type: {type(result)}")

        inner_status = result.get("status_code")
        if inner_status is not None and inner_status != 200:
            raise RuntimeError(f"Bright Data Unlocker status_code={inner_status}")

        body = result.get("body")
        if body is None:
            if "status_code" in result and result.get("status_code") == 200:
                raise RuntimeError("Bright Data Unlocker empty body")
            raise RuntimeError(f"Bright Data Unlocker missing body: {list(result.keys())}")

        if isinstance(body, str):
            # Some responses wrap the real body in a second JSON envelope
            if body.strip().startswith("{"):
                try:
                    nested = json.loads(body)
                    if isinstance(nested, dict) and "body" in nested:
                        body = nested["body"]
                except json.JSONDecodeError:
                    pass
            if not str(body).strip():
                raise RuntimeError("Bright Data Unlocker empty body string")
            return str(body)
        if isinstance(body, dict):
            return json.dumps(body)
        return str(body)
```

Note how we can request `data_format=markdown`.
Using this param, Bright Data returns a sanitized markdown rendering of the page, which is much easier to parse with regex than raw HTML.

💡 If markdown still yields zero orgs for a hub, `fetch_hubs.py --fallback-html` can fetch (or use cached) HTML and run the HTML parser instead.

With that in place, here's our actual fetch script, `fetch_hubs.py`:

`fetch_hubs.py`

```python
"""Fetch Crunchbase hub pages via Bright Data Web Unlocker; write hub_snapshot.json."""

from __future__ import annotations

import argparse
import json
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List, Optional

from dotenv import load_dotenv

from bright_data_unlocker import BrightDataUnlockerClient, ContentFormat
from parse_hubs import parse_organizations

load_dotenv()

ROOT = Path(__file__).resolve().parent
DEFAULT_RESPONSES_DIR = ROOT / "data" / "hub_responses"


def load_hub_urls(path: Path) -> List[Dict[str, str]]:
    raw = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(raw, list):
        raise ValueError("hub_urls.json must be a JSON array")
    out: List[Dict[str, str]] = []
    for item in raw:
        if not isinstance(item, dict):
            continue
        url = (item.get("url") or "").strip()
        category = (item.get("category") or "unknown").strip()
        if url:
            out.append({"category": category, "url": url})
    return out


def response_file(category: str, content_format: ContentFormat) -> str:
    ext = "md" if content_format == "markdown" else "html"
    # Keep only filename-safe characters in the category
    safe = "".join(c if c.isalnum() or c in "-_" else "_" for c in category)
    return f"{safe}.{ext}"


def response_path(responses_dir: Path, category: str, content_format: ContentFormat) -> Path:
    return responses_dir / response_file(category, content_format)


def load_cached_body(
    responses_dir: Path, category: str, content_format: ContentFormat
) -> Optional[str]:
    path = response_path(responses_dir, category, content_format)
    if not path.is_file() or path.stat().st_size == 0:
        return None
    return path.read_text(encoding="utf-8")


def save_response_body(
    responses_dir: Path,
    category: str,
    hub_url: str,
    content_format: ContentFormat,
    body: str,
) -> Path:
    responses_dir.mkdir(parents=True, exist_ok=True)
    path = response_path(responses_dir, category, content_format)
    path.write_text(body, encoding="utf-8")
    return path


def manifest_path(responses_dir: Path) -> Path:
    return responses_dir / "manifest.json"


def load_manifest(responses_dir: Path) -> Dict[str, Any]:
    path = manifest_path(responses_dir)
    if not path.is_file():
        return {"hubs": []}
    return json.loads(path.read_text(encoding="utf-8"))


def upsert_manifest_entry(
    responses_dir: Path,
    category: str,
    hub_url: str,
    content_format: ContentFormat,
    response_path: Path,
    *,
    fetched_at: str,
) -> None:
    entry = {
        "category": category,
        "hub_url": hub_url,
        "content_format": content_format,
        "response_file": response_path.name,
        "fetched_at": fetched_at,
    }
    manifest = load_manifest(responses_dir)
    # Replace any existing entry for this category
    hubs = [h for h in manifest.get("hubs") or [] if h.get("category") != category]
    hubs.append(entry)
    manifest["hubs"] = hubs
    manifest["updated_at"] = datetime.now(timezone.utc).isoformat()
    manifest_path(responses_dir).write_text(
        json.dumps(manifest, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )


def parse_body(
    body: str,
    hub_url: str,
    content_format: ContentFormat,
    max_orgs: int,
) -> List[Dict[str, Any]]:
    return parse_organizations(
        body, hub_url, content_format=content_format, max_orgs=max_orgs
    )


def main() -> None:
    ap = argparse.ArgumentParser(
        description="Fetch Crunchbase hub pages (Web Unlocker) and extract organization URLs.",
    )
    ap.add_argument("--hubs-json", type=Path, default=ROOT / "hub_urls.json")
    ap.add_argument("--out", type=Path, default=ROOT / "data" / "hub_snapshot.json")
    ap.add_argument("--format", choices=("markdown", "html"), default="markdown")
    ap.add_argument("--max-orgs-per-hub", type=int, default=80)
    ap.add_argument("--delay", type=float, default=1.0)
    ap.add_argument(
        "--responses-dir",
        type=Path,
        default=DEFAULT_RESPONSES_DIR,
        help="Directory for cached raw hub page bodies (default: data/hub_responses).",
    )
    ap.add_argument(
        "--refetch",
        action="store_true",
        help="Call Bright Data even if a cached response file exists.",
    )
    ap.add_argument(
        "--parse-only",
        action="store_true",
        help="Parse cached responses only; never call Bright Data.",
    )
    ap.add_argument(
        "--fallback-html",
        action="store_true",
        help="If markdown parse finds 0 orgs, try cached or fetched HTML.",
    )
    args = ap.parse_args()

    responses_dir = args.responses_dir
    hubs = load_hub_urls(args.hubs_json)
    if not hubs:
        raise SystemExit("No hubs in hub_urls.json")
    args.out.parent.mkdir(parents=True, exist_ok=True)

    client: Optional[BrightDataUnlockerClient] = None
    if not args.parse_only:
        client = BrightDataUnlockerClient()

    content_format: ContentFormat = args.format
    payload: Dict[str, Any] = {
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "source": "bright_data_web_unlocker",
        "content_format": content_format,
        "responses_dir": str(responses_dir),
        "hubs": [],
    }

    n_hubs = len(hubs)
    for i, hub in enumerate(hubs, start=1):
        category = hub["category"]
        url = hub["url"]
        print(f"\n[{i}/{n_hubs}] hub ({category}): starting...", flush=True)
        block: Dict[str, Any] = {
            "category": category,
            "hub_url": url,
            "error": None,
            "organic_count": 0,
            "rows": [],
            "response_file": response_file(category, content_format),
        }
        parse_format: ContentFormat = content_format
        try:
            body: Optional[str] = None
            if not args.refetch:
                body = load_cached_body(responses_dir, category, content_format)
            if body is None:
                if args.parse_only:
                    raise FileNotFoundError(
                        f"no cached response at "
                        f"{response_path(responses_dir, category, content_format)} "
                        "(run without --parse-only to fetch)"
                    )
                print(
                    f"[{i}/{n_hubs}] hub ({category}): fetching ({content_format})...",
                    flush=True,
                )
                assert client is not None
                body = client.fetch(url, content_format=content_format)
                print(
                    f"[{i}/{n_hubs}] hub ({category}): fetch done "
                    f"({len(body):,} chars)",
                    flush=True,
                )
                saved = save_response_body(responses_dir, category, url, content_format, body)
                upsert_manifest_entry(
                    responses_dir,
                    category,
                    url,
                    content_format,
                    saved,
                    fetched_at=datetime.now(timezone.utc).isoformat(),
                )
                print(f"[{i}/{n_hubs}] hub ({category}): saved {saved}", flush=True)
            else:
                print(
                    f"[{i}/{n_hubs}] hub ({category}): using cache "
                    f"{response_path(responses_dir, category, content_format)}",
                    flush=True,
                )

            print(f"[{i}/{n_hubs}] hub ({category}): parsing...", flush=True)
            rows = parse_body(body, url, parse_format, args.max_orgs_per_hub)

            if not rows and args.fallback_html and parse_format == "markdown":
                html_body = load_cached_body(responses_dir, category, "html")
                if html_body is None and not args.parse_only:
                    print(
                        f"[{i}/{n_hubs}] hub ({category}): markdown had 0 orgs, "
                        "fetching HTML...",
                        flush=True,
                    )
                    assert client is not None
                    html_body = client.fetch(url, content_format="html")
                    print(
                        f"[{i}/{n_hubs}] hub ({category}): HTML fetch done "
                        f"({len(html_body):,} chars)",
                        flush=True,
                    )
                    saved = save_response_body(responses_dir, category, url, "html", html_body)
                    print(f"[{i}/{n_hubs}] hub ({category}): saved {saved}", flush=True)
                elif html_body is None:
                    raise FileNotFoundError(
                        f"no cached HTML at {response_path(responses_dir, category, 'html')}"
                    )
                else:
                    print(
                        f"[{i}/{n_hubs}] hub ({category}): markdown had 0 orgs, "
                        "using cached HTML...",
                        flush=True,
                    )
                print(f"[{i}/{n_hubs}] hub ({category}): parsing HTML...", flush=True)
                rows = parse_body(html_body, url, "html", args.max_orgs_per_hub)
                parse_format = "html"
                block["response_file"] = response_file(category, "html")

            block["content_format"] = parse_format
            block["organic_count"] = len(rows)
            block["rows"] = rows
            print(
                f"[{i}/{n_hubs}] hub ({category}): done - "
                f"{len(rows)} organizations",
                flush=True,
            )
        except Exception as e:
            print(f"[{i}/{n_hubs}] hub ({category}): failed - {e}", flush=True)
            block["error"] = str(e)

        payload["hubs"].append(block)
        if not args.parse_only:
            time.sleep(args.delay)

    args.out.write_text(
        json.dumps(payload, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    total = sum(h.get("organic_count") or 0 for h in payload["hubs"])
    print(
        f"\nAll hubs processed. Wrote {args.out} "
        f"({total} organizations across {n_hubs} hubs).",
        flush=True,
    )


if __name__ == "__main__":
    main()
```

Note how I'm caching the raw bodies under `data/hub_responses/` so re-runs with `--parse-only` don't burn any API credits.

Our `parse_hubs.py` pulls organization slugs and display names out of the cached page bodies from the previous step. It runs three regex patterns in priority order:

`parse_hubs.py`

```python
# Priority 1: Bright Data relative markdown links
# Matches: ](/organization/slug "Display Name")
ORG_REL_LINK = re.compile(
    r"\]\(/organization/([a-z0-9_-]+)(?:\s+\"([^\"]*)\")?\s*\)",
    re.I,
)

# Priority 2: Standard absolute markdown links
# Matches: [Company Name](https://www.crunchbase.com/organization/slug)
ORG_MD_LINK = re.compile(
    r"\[([^\]]+)\]\(\s*<?https?://[^)\s]*crunchbase\.com/organization/([a-z0-9_-]+)/?>?\s*\)",
    re.I,
)

# Fallback: bare /organization/slug anywhere in text
ORG_IN_TEXT = re.compile(
    r"(?:https?://[^/\s]*crunchbase\.com)?/organization/([a-z0-9_-]+)",
    re.I,
)
```

Each hub gets parsed into rows like:

```json
{ "url": "https://www.crunchbase.com/organization/lovable", "slug": "lovable", "title": "Lovable" }
```

Here's the full code for `parse_hubs.py`. Note that I also keep a blocklist of well-known VCs and accelerators (`y-combinator`, `techstars`, `andreessen-horowitz`, etc.) that show up on hub pages but are the investors, not the companies being listed. Without this, you get YC ranked #1 on every hub it's ever touched, which is obviously not what we want.

`parse_hubs.py`

```python
"""Parse Crunchbase hub pages (markdown or HTML) for /organization/ links."""

from __future__ import annotations

import re
from typing import Any, Dict, List, Literal, Set
from urllib.parse import urljoin, urlparse

ContentFormat = Literal["html", "markdown"]

ORG_IN_TEXT = re.compile(
    r"(?:https?://[^/\s]*crunchbase\.com)?/organization/([a-z0-9_-]+)",
    re.I,
)
# [Company Name](https://www.crunchbase.com/organization/slug)
ORG_MD_LINK = re.compile(
    r"\[([^\]]+)\]\(\s*<?https?://[^)\s]*crunchbase\.com/organization/([a-z0-9_-]+)/?>?\s*\)",
    re.I,
)
# Bright Data markdown: multi-line link ending with ](/organization/slug "Display Name")
ORG_REL_LINK = re.compile(
    r"\]\(/organization/([a-z0-9_-]+)(?:\s+\"([^\"]*)\")?\s*\)",
    re.I,
)

# Investors and accelerators that appear on hub pages but are not listed companies
ORG_BLOCKLIST = frozenset(
    {
        "y-combinator",
        "techstars",
        "national-science-foundation",
        "masschallenge",
        "easme",
        "andreessen-horowitz",
        "sequoia-capital",
        "accel",
    }
)


def slug_to_display_name(slug: str) -> str:
    return slug.replace("-", " ").title()


def append_org(
    rows: List[Dict[str, Any]],
    seen_slugs: Set[str],
    *,
    slug: str,
    title: str,
    hub_url: str,
    max_orgs: int,
) -> None:
    if len(rows) >= max_orgs:
        return
    slug = slug.lower()
    if slug in ORG_BLOCKLIST or slug in seen_slugs:
        return
    seen_slugs.add(slug)
    base = f"{urlparse(hub_url).scheme}://{urlparse(hub_url).netloc}"
    name = (title or "").strip() or slug_to_display_name(slug)
    rows.append(
        {
            "url": urljoin(base, f"/organization/{slug}"),
            "slug": slug,
            "title": name,
        }
    )


def parse_organizations_from_markdown(
    markdown: str,
    hub_url: str,
    *,
    max_orgs: int = 80,
) -> List[Dict[str, Any]]:
    """Extract orgs from markdown links; fall back to bare organization URLs."""
    seen_slugs: Set[str] = set()
    rows: List[Dict[str, Any]] = []
    for match in ORG_REL_LINK.finditer(markdown):
        slug = match.group(1)
        title = (match.group(2) or "").strip()
        append_org(rows, seen_slugs, slug=slug, title=title, hub_url=hub_url, max_orgs=max_orgs)
        if len(rows) >= max_orgs:
            return rows
    for match in ORG_MD_LINK.finditer(markdown):
        title, slug = match.group(1).strip(), match.group(2)
        append_org(rows, seen_slugs, slug=slug, title=title, hub_url=hub_url, max_orgs=max_orgs)
        if len(rows) >= max_orgs:
            return rows
    if rows:
        return rows
    for match in ORG_IN_TEXT.finditer(markdown):
        append_org(
            rows,
            seen_slugs,
            slug=match.group(1),
            title="",
            hub_url=hub_url,
            max_orgs=max_orgs,
        )
        if len(rows) >= max_orgs:
            break
    return rows


def parse_organizations_from_html(
    html: str,
    hub_url: str,
    *,
    max_orgs: int = 80,
) -> List[Dict[str, Any]]:
    """Extract unique organization rows from hub page HTML."""
    seen_slugs: Set[str] = set()
    rows: List[Dict[str, Any]] = []
    for match in ORG_IN_TEXT.finditer(html):
        append_org(
            rows,
            seen_slugs,
            slug=match.group(1),
            title="",
            hub_url=hub_url,
            max_orgs=max_orgs,
        )
        if len(rows) >= max_orgs:
            break
    return rows


def parse_organizations(
    body: str,
    hub_url: str,
    *,
    content_format: ContentFormat = "markdown",
    max_orgs: int = 80,
) -> List[Dict[str, Any]]:
    if content_format == "markdown":
        return parse_organizations_from_markdown(body, hub_url, max_orgs=max_orgs)
    return parse_organizations_from_html(body, hub_url, max_orgs=max_orgs)
```

The first-run gotcha I hit was a classic. My original parser expected absolute URLs (`https://www.crunchbase.com/organization/...`), but Bright Data's markdown renderer produces relative links (`](/organization/slug "Display Name")`) 🙃. So zero companies were extracted on the first run, simply because the regex didn't match.

So I just added `ORG_REL_LINK` to the parser and re-ran Stage 1 with `--parse-only`, fixing it at no additional API cost. This is why we cached our raw response bodies. Your parser will probably need trial-and-erroring more than once, and you don't want to actually re-fetch the data for that.

Output of this stage: a `hub_snapshot.json` with 96 organizations across 4 hubs (Fintech: 26, Cybersecurity: 24, SaaS: 22, AI: 24).

Note that these are hub leaderboard entries, not full Crunchbase exports. The full Crunchbase lists run to thousands; I'm taking the curated top slice on purpose, because the cleaner my source is, the more clearly the fuzzy lift shows up against it.

Before clustering, I flatten the nested snapshot into one uniform record per company appearance.
`extract.py` handles this:

```python
"""From hub_snapshot.json to raw_records.json with company_name per organization."""

from __future__ import annotations

import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List


def records_from_hub_snapshot(data: Dict[str, Any]) -> List[Dict[str, Any]]:
    records: List[Dict[str, Any]] = []
    for hi, block in enumerate(data.get("hubs") or []):
        if block.get("error"):
            continue  # skip hubs that failed to fetch or parse
        category = (block.get("category") or "unknown").strip()
        hub_url = block.get("hub_url") or ""
        for ri, row in enumerate(block.get("rows") or []):
            if not isinstance(row, dict):
                continue
            url = (row.get("url") or "").strip()
            if not url or "/organization/" not in url.lower():
                continue
            title = (row.get("title") or "").strip()
            slug = (row.get("slug") or "").strip()
            # Prefer the display title; fall back to a title-cased slug
            company_name = title or (slug.replace("-", " ").title() if slug else "")
            if not company_name:
                continue
            records.append(
                {
                    "id": f"hub:{hi}:{ri}",
                    "source": "crunchbase_hub",
                    "category": category,
                    "company_name": company_name,
                    "raw_name": title or company_name,
                    "url": url,
                    "domain": "www.crunchbase.com",
                    "hub_url": hub_url,
                    "position": ri + 1,
                }
            )
    return records


def build_raw_payload(snapshot_path: Path) -> Dict[str, Any]:
    raw = json.loads(snapshot_path.read_text(encoding="utf-8"))
    if not isinstance(raw.get("hubs"), list):
        raise ValueError(f"{snapshot_path}: expected hub snapshot with 'hubs' array")
    records = records_from_hub_snapshot(raw)
    return {
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        "snapshot": str(snapshot_path.name),
        "record_count": len(records),
        "records": records,
    }


def write_raw_records(snapshot_path: Path, out_path: Path) -> Dict[str, Any]:
    payload = build_raw_payload(snapshot_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(
        json.dumps(payload, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return payload
```

The `id` field (`hub:0:3`, `hub:2:11`, etc.) is our stable key that links each raw record to its canonical cluster in Stage 4.
Deterministic, derivable from position, and most importantly, easy to debug.

Output: `raw_records.json` with 96 rows, all with `source: "crunchbase_hub"`, tagged by category.

Reconciliation (Stage 4) is the entity resolution step itself: it collapses duplicate company names into canonical clusters. In this dataset, 96 scraped rows become 88 canonical companies after normalization and fuzzy clustering. Four names show up on more than one hub (Callaghan Innovation and EISMEA on all four leaderboards, PayTic and SixThirty on two), which gives 8 duplicate rows before clustering. After exact normalization there are 88 distinct normalized names, which happens to be the same count as final clusters at WRatio threshold 90, meaning no additional fuzzy merges were needed beyond collapsing the cross-hub duplicates.

I run reconciliation in two passes. See the full code for `reconcile.py` here: https://gist.github.com/sixthextinction/5c711e48353f4f7765e13cc4bb1b25de

`reconcile.py`

```python
import re

LEGAL = re.compile(
    r"\b(inc\.?|llc\.?|ltd\.?|plc\.?|corp\.?|corporation|co\.?|company|limited)\b",
    re.I,
)
NON_ALNUM = re.compile(r"[^\w\s]", re.UNICODE)


def normalize_company_name(s: str) -> str:
    s = s.lower().strip()
    s = NON_ALNUM.sub(" ", s)  # strip punctuation
    s = LEGAL.sub(" ", s)      # drop legal suffixes
    s = re.sub(r"\s+", " ", s).strip()
    return s
```

After normalization, I group records by their normalized string. "Lovable", "lovable", and "Lovable." all collapse into the same group. This removes trivial duplicates before the more expensive fuzzy-matching pass.

TL;DR: Do the cheap pass first, expensive pass second. It's the same reason you'd put a `WHERE` clause before a `JOIN`.

For this dataset that's 96 hash inserts (one `normalize_company_name` call plus one dict lookup per row), so roughly O(n). The important optimization is that normalization shrinks the search space before the quadratic fuzzy pass runs.
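The grouping step of this first pass isn't shown in the excerpt above, so here is a minimal sketch of how it could look. It is not copied from the gist: `simple_normalize` and `group_by_normalized_name` are hypothetical names, and the suffix list is deliberately shortened for illustration.

```python
import re
from collections import defaultdict
from typing import Any, Dict, List

# Assumed, shortened suffix list for illustration only.
LEGAL_SUFFIXES = re.compile(r"\b(inc|llc|ltd|corp|limited)\b")
NON_ALNUM = re.compile(r"[^\w\s]")


def simple_normalize(name: str) -> str:
    """Lowercase, strip punctuation and legal suffixes, collapse whitespace."""
    s = NON_ALNUM.sub(" ", name.lower())
    s = LEGAL_SUFFIXES.sub(" ", s)
    return re.sub(r"\s+", " ", s).strip()


def group_by_normalized_name(records: List[Dict[str, Any]]) -> List[List[Dict[str, Any]]]:
    """One dict insert per record, so the pass is linear in the record count."""
    groups: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for record in records:
        key = simple_normalize(record.get("company_name") or "")
        if key:  # skip rows whose name normalizes to nothing
            groups[key].append(record)
    return list(groups.values())


records = [
    {"id": "hub:0:0", "company_name": "Lovable"},
    {"id": "hub:1:4", "company_name": "lovable"},
    {"id": "hub:2:7", "company_name": "Lovable."},
    {"id": "hub:0:1", "company_name": "Necker FinTech"},
]
groups = group_by_normalized_name(records)
print([[r["id"] for r in g] for g in groups])
```

Each group then goes into the fuzzy pass as a single unit, so the quadratic step only ever sees one representative per normalized name.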
Without Pass 1, naïve all-pairs fuzzy matching over n = 10,000 unique names would require:

n(n−1)/2 ≈ 50 million comparisons

It takes my laptop ~1.3 µs per RapidFuzz WRatio call on ~30-character names, so that pushes our runtime to about a minute (50 million × 1.3 µs ≈ 65 s) instead of milliseconds. Not ideal, which is exactly why Pass 1 exists: to reduce n before the O(n²) step becomes expensive.

I then compare each exact group against existing clusters using WRatio from RapidFuzz.

`reconcile.py`

```python
from typing import Any, Dict, List

from rapidfuzz import fuzz

FUZZY_SCORER = fuzz.WRatio

# Excerpt: Cluster, source_rank, pick_canonical_name and make_canonical_id
# are defined elsewhere in reconcile.py (see the gist).


def fuzzy_merge_groups(
    groups: List[List[Dict[str, Any]]],
    threshold: float,  # default: 90.0
) -> List[Cluster]:
    clusters: List[Cluster] = []
    # Trusted sources first, then larger groups first
    for group in sorted(
        groups,
        key=lambda g: (
            min(source_rank(m.get("source") or "") for m in g),
            -len(g),
        ),
    ):
        rep = pick_canonical_name(group)
        placed = False
        for cluster in clusters:
            if FUZZY_SCORER(rep, cluster.canonical_name) >= threshold:
                cluster.members.extend(group)
                cluster.canonical_name = pick_canonical_name(cluster.members)
                cluster.canonical_id = make_canonical_id(cluster.canonical_name)
                placed = True
                break
        if not placed:
            clusters.append(
                Cluster(
                    canonical_id=make_canonical_id(rep),
                    canonical_name=rep,
                    members=list(group),
                )
            )
    return clusters
```

Here, we have to compare each group's representative against existing cluster canonicals. So the worst case with g = 88 exact groups would be:

0 + 1 + 2 + … + 87 = 3,828 comparisons

That's roughly O(g²).

RapidFuzz ships several scorers; see the [rapidfuzz.fuzz docs](https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html) for the full list. We use [fuzz.WRatio](https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html#wratio) (weighted ratio; same algorithm family as [FuzzyWuzzy's WRatio](https://github.com/seatgeek/fuzzywuzzy)) because company names drift in different ways and no single metric covers all of them.
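To see why no single metric covers every kind of drift, here is a toy illustration built only on the standard library. It is not RapidFuzz's implementation and does not reproduce WRatio's weights or scores; `plain_ratio`, `containment_ratio`, `sorted_token_ratio`, and `best_of` are hypothetical helpers that each mimic one comparison strategy, with `best_of` simply keeping the highest result.

```python
from difflib import SequenceMatcher


def plain_ratio(a: str, b: str) -> float:
    """Whole-string similarity; extra words are penalized heavily."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def containment_ratio(a: str, b: str) -> float:
    """Best alignment of the shorter string inside the longer one."""
    short, long_ = sorted((a, b), key=len)
    if not short:
        return 0.0
    window = len(short)
    return max(
        plain_ratio(short, long_[i:i + window])
        for i in range(len(long_) - window + 1)
    )


def sorted_token_ratio(a: str, b: str) -> float:
    """Order-insensitive comparison: sort the words, then compare."""
    return plain_ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))


def best_of(a: str, b: str) -> float:
    """Toy meta-scorer: lowercase both names and keep the highest score."""
    a, b = a.lower(), b.lower()
    return max(plain_ratio(a, b), containment_ratio(a, b), sorted_token_ratio(a, b))


for left, right in [
    ("Necker FinTech", "Necker FinTech Holdings Inc."),  # suffix appended
    ("Labs Acme", "Acme Labs"),                          # words reordered
    ("Stripe", "Climate Corp"),                          # unrelated
]:
    whole = plain_ratio(left.lower(), right.lower())
    print(f"{left!r} vs {right!r}: plain={whole:.1f} best={best_of(left, right):.1f}")
```

The appended-suffix pair is rescued by the containment strategy and the reordered pair by the token strategy, while the unrelated pair stays low under all three. Note that this toy version is more generous than WRatio, which down-weights partial and token matches rather than taking a raw maximum.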
[WRatio](https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html#wratio) is a meta-scorer: for each pair of strings it runs several ratio algorithms internally with length-based weighting and returns the best score. It combines:

- a plain `ratio` over the whole strings, under which Necker FinTech vs Necker FinTech Holdings Inc. looks like a poor match;
- `partial_ratio`, which scores the best alignment of the shorter name inside the longer one;
- token-based ratios such as `token_set_ratio`, which compare words regardless of order and tolerate extra tokens like Inc. or Group.

You rarely know in advance which kind of drift a CRM row will have: suffix appended, spacing changed, words reordered. WRatio picks the strategy that scores highest for that specific pair, which is exactly what you want for entity resolution on names alone.

We default to threshold 90: strict enough that unrelated pairs (Stripe vs Climate Corp) stay out, loose enough that real variants (PointsKash vs Points Kash) merge. Tune it on your data.

On this dataset specifically, WRatio handles the drift patterns we actually see in company names (or historically have, anyway):

| Hub / scraped name | CRM variant in `sample_crm.json` | Drift type |
|---|---|---|
| Necker FinTech | Necker FinTech Holdings Inc. | Legal suffix + spacing (Fin Tech vs FinTech) |
| PANTA | PANTA Group | Type descriptor appended |
| Physical Intelligence | Physical Intelligence (Pi), Inc. | Parenthetical + legal suffix |
| PointsKash | Points Kash | Token spacing |
| qBotica | q Botica | Token spacing |

Pure `ratio` (edit distance) would heavily penalize Necker FinTech vs Necker FinTech Holdings Inc. because the two extra words add significant distance. `partial_ratio` handles that containment case, and `token_set_ratio` handles reordering.

Display names with no shared tokens to the legal entity (e.g. hub title Investing.com vs operator Fusion Media Limited, or Lyrie.ai vs OTT Cybersecurity Inc.) stay below threshold. WRatio correctly refuses to merge them.
That's a good thing: those belong in a lookup table or enrichment API, not in a string-similarity pass (see Caveats).

One last thing before we move on to the demo: every time a cluster gains new members, I re-evaluate its canonical name. The source ranking (`crunchbase_hub` = 0, anything else = 99) ensures that short, clean display names win over longer legal variants:

```python
def pick_canonical_name(members: Sequence[Dict[str, Any]]) -> str:
    def sort_key(m):
        name = (m.get("company_name") or "").strip()
        # Trusted source first, then shortest name, then alphabetical.
        return (source_rank(m.get("source") or ""), len(name), name.lower())

    return min(members, key=sort_key)["company_name"].strip()
```

A Crunchbase display name like "Lovable" will always beat "Lovable Technologies Inc." as the canonical: it's from a trusted source and it's shorter. The legal variant ends up as an alias, which is exactly the right relationship.

Output of this stage: `reconciled.json`, containing 88 canonical clusters, alias mappings with WRatio scores, and CRM join metrics.

That's it, we're all done with the fuzzy pipeline. Let's see if that improved things.

Our `sample_crm.json` simulates the data you'd get from a real CRM. I simply researched legal names and known alternate spellings online for the companies I had, and put them in a JSON file. This gave me 138 rows representing the same 88 canonical companies. Some companies had one exact-match entry; these are easy for us to handle. Others had three or four variants that I'd name like this:

```json
{ "id": "crm:necker_fintech_0", "company_name": "Necker Fin Tech" },
{ "id": "crm:necker_fintech_1", "company_name": "Necker FinTech Group" },
{ "id": "crm:necker_fintech_2", "company_name": "Necker FinTech Holdings Inc." }
```

Our join logic in the `post_fuzzy_eval.py` demo runs exact normalization first, then falls back to fuzzy. Note how this is the same "cheap pass first" pattern as the cluster builder:

```python
# post_fuzzy_eval.py
"""Optional CRM join evaluation: exact vs fuzzy match rates (not part of core reconcile)."""
from __future__ import annotations

import json
from pathlib import Path
from typing import Any, Dict, List, Optional, Sequence

from rapidfuzz import fuzz

from reconcile import (
    Cluster,
    DEFAULT_THRESHOLD,
    exact_groups,
    normalize_company_name,
)

FUZZY_SCORER = fuzz.WRatio


def load_crm(path: Path) -> List[Dict[str, Any]]:
    raw = json.loads(path.read_text(encoding="utf-8"))
    if isinstance(raw, list):
        rows = raw
    elif isinstance(raw, dict) and "companies" in raw:
        rows = raw["companies"]
    else:
        raise ValueError(f"{path}: expected list or {{'companies': ... }}")
    out: List[Dict[str, Any]] = []
    for i, row in enumerate(rows):
        if not isinstance(row, dict):
            continue
        name = (row.get("company_name") or "").strip()
        if not name:
            continue
        out.append(
            {
                "id": row.get("id") or f"crm:{i}",
                "company_name": name,
            }
        )
    return out


def record_to_cluster_map(clusters: Sequence[Cluster]) -> Dict[str, str]:
    out: Dict[str, str] = {}
    for cluster in clusters:
        for m in cluster.members:
            out[m["id"]] = cluster.canonical_id
    return out


def crm_to_canonical(
    crm_rows: Sequence[Dict[str, Any]],
    clusters: Sequence[Cluster],
    threshold: float,
) -> Dict[str, Optional[str]]:
    out: Dict[str, Optional[str]] = {}
    for row in crm_rows:
        key = str(row.get("id") or row.get("company_name"))
        name = (row.get("company_name") or "").strip()
        if not name:
            out[key] = None
            continue
        norm = normalize_company_name(name)
        matched: Optional[str] = None
        # Cheap pass first: exact match on the normalized name.
        for cluster in clusters:
            if any(
                normalize_company_name(m.get("company_name") or "") == norm
                for m in cluster.members
            ):
                matched = cluster.canonical_id
                break
        # Fallback: best WRatio score against each cluster's canonical name.
        if not matched:
            best_score = 0.0
            best_id: Optional[str] = None
            for cluster in clusters:
                score = FUZZY_SCORER(name, cluster.canonical_name)
                if score > best_score:
                    best_score = score
                    best_id = cluster.canonical_id
            matched = best_id if best_score >= threshold else None
        out[key] = matched
    return out


def join_metrics(
    records: Sequence[Dict[str, Any]],
    crm_rows: Sequence[Dict[str, Any]],
    clusters: Sequence[Cluster],
    threshold: float,
) -> Dict[str, Any]:
    record_to_cid = record_to_cluster_map(clusters)
    crm_to_cid = crm_to_canonical(crm_rows, clusters, threshold)

    crm_norms = {
        normalize_company_name(r.get("company_name") or "")
        for r in crm_rows
        if normalize_company_name(r.get("company_name") or "")
    }
    crm_mapped_cids = {v for v in crm_to_cid.values() if v}

    # Direction 1: scraped rows -> CRM
    scraped_exact = 0
    scraped_fuzzy = 0
    for r in records:
        norm = normalize_company_name(r.get("company_name") or "")
        if norm in crm_norms:
            scraped_exact += 1
        cid = record_to_cid.get(r["id"])
        if cid and cid in crm_mapped_cids:
            scraped_fuzzy += 1

    # Direction 2: CRM rows -> scraped data
    crm_exact = 0
    crm_fuzzy = 0
    scraped_norms = {
        normalize_company_name(r.get("company_name") or "") for r in records
    }
    scraped_cids = set(record_to_cid.values())
    for row in crm_rows:
        norm = normalize_company_name(row.get("company_name") or "")
        if norm in scraped_norms:
            crm_exact += 1
        cid_key = str(row.get("id") or row.get("company_name"))
        cid = crm_to_cid.get(cid_key)
        if cid and cid in scraped_cids:
            crm_fuzzy += 1

    # "or 1" avoids division by zero on empty inputs.
    n_scraped = len(records) or 1
    n_crm = len(crm_rows) or 1
    return {
        "scraped_rows": len(records),
        "crm_rows": len(crm_rows),
        "canonical_clusters": len(clusters),
        "exact_normalized_unique": len(exact_groups(records)),
        "scraped_exact_join_pct": round(100.0 * scraped_exact / n_scraped, 1),
        "scraped_fuzzy_join_pct": round(100.0 * scraped_fuzzy / n_scraped, 1),
        "crm_exact_join_pct": round(100.0 * crm_exact / n_crm, 1),
        "crm_fuzzy_join_pct": round(100.0 * crm_fuzzy / n_crm, 1),
    }


def eval_crm_join(
    records: Sequence[Dict[str, Any]],
    clusters: Sequence[Cluster],
    crm_path: Path,
    threshold: float = DEFAULT_THRESHOLD,
) -> Dict[str, Any]:
    """Load CRM file and compute join metrics against existing clusters."""
    crm_rows = load_crm(crm_path)
    return join_metrics(records, crm_rows, clusters, threshold)
```

Here's how we measure this JOIN operation (`join_metrics` in `post_fuzzy_eval.py`):

- Exact: a row counts as matched when its normalized `company_name` appears in the set of normalized CRM names (and the mirror image for CRM rows against scraped names).
- Fuzzy: a row counts as matched when its cluster's `canonical_id` is one that the other side also resolved to via `crm_to_canonical`.

So how did we do?

| Question | Exact match | Fuzzy WRatio ≥ 90 |
|---|---|---|
| Of 96 scraped rows, how many link to a CRM row? | 58.3% (56 rows) | 100% (96 rows) |
| Of 138 CRM rows, how many link back to scraped data? | 34.8% (48 rows) | 100% (138 rows) |

The 58.3% exact baseline isn't actually bad: over half of raw hub titles normalize to a CRM string exactly. The other 41.7%, however, absolutely need fuzzy matching via `WRatio`, because the CRM holds legal or alternate spellings (`Necker FinTech Holdings Inc.` vs hub `Necker FinTech`, etc.) that no amount of lowercasing or other normalization will save you from.

The fuzzy pass closes the gap on this dataset at WRatio threshold 90. `WRatio` is strict enough to avoid merging unrelated names while still picking up suffix and token drift, which is just what we want.

Commands below assume Python 3.10+ and a venv. All of this runs locally; the only network calls are to Bright Data during the initial fetch.

```bash
# Install deps
pip install rapidfuzz requests python-dotenv

# Fetch all 4 hubs (costs API credits)
python fetch_hubs.py

# Already have cached responses? Re-parse for free
python fetch_hubs.py --parse-only

# Extract + reconcile + print CRM metrics (default: both stages)
python run_fuzzy.py

# Regenerate sample_crm.json from raw records (optional)
python build_sample_crm.py

# Tune the threshold (try 85 for more aggressive merging)
python run_fuzzy.py --threshold 85

# Run individual stages
python run_fuzzy.py --extract
python run_fuzzy.py --reconcile
```

Sample CLI output after a full run:

```
wrote data/raw_records.json
records: 96
  category artificial intelligence: 24
  category cybersecurity: 24
  category fintech: 26
  category saas: 22
wrote data/reconciled.json

-- join metrics (CRM) --
scraped rows: 96 | exact-normalized unique: 88 | canonical clusters: 88
scraped -> CRM  exact: 58.3% | fuzzy: 100.0%
CRM -> scraped  exact: 34.8% | fuzzy: 100.0%

-- top 10 canonicals by alias count --
  Callaghan Innovation (4 aliases, sources: crunchbase_hub)
  EISMEA (4 aliases, sources: crunchbase_hub)
  PayTic (2 aliases, sources: crunchbase_hub)
  SixThirty (2 aliases, sources: crunchbase_hub)
  ...more
```

I've also added a diagnostic queue into the pipeline for low-confidence alias assignments: records whose WRatio against their cluster's canonical falls below the threshold. This will show us merges that look suspicious and deserve a human eye:

```python
# reconcile.py
def review_queue(
    records: Sequence[Dict[str, Any]],
    clusters: Sequence[Cluster],
    threshold: float,
    limit: int = 8,
) -> List[Tuple[float, str, str, str]]:
    rid_to_cluster = {m["id"]: c for c in clusters for m in c.members}
    lows = []
    for r in records:
        c = rid_to_cluster.get(r["id"])
        if c is None:
            continue  # record was never clustered; nothing to score against
        name = r.get("company_name") or ""
        score = FUZZY_SCORER(name, c.canonical_name)
        if score < threshold:
            lows.append((score, name, c.canonical_name, c.canonical_id))
    lows.sort(key=lambda x: x[0])  # lowest-confidence first
    return lows[:limit]
```

In production this would feed a human-review UI or write to a `needs_review` table. Here it just prints to stdout, but my point stands: fuzzy matching isn't a black box. You can always surface the borderline decisions and let a human confirm them.
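One low-effort way to hand those borderline rows to a reviewer is a CSV file. What follows is a minimal sketch, assuming rows shaped like the `(score, name, canonical_name, canonical_id)` tuples that `review_queue` returns. `write_review_csv`, the column names, and the sample rows are hypothetical and not part of the pipeline.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple

# (score, name, canonical_name, canonical_id), as returned by review_queue()
ReviewRow = Tuple[float, str, str, str]


def write_review_csv(rows: Iterable[ReviewRow], out: TextIO) -> int:
    """Write low-confidence assignments to `out`, lowest score first.

    Returns the number of data rows written. The `decision` column is left
    blank for a human to fill in (for example "merge" or "split").
    """
    writer = csv.writer(out)
    writer.writerow(["score", "name", "canonical_name", "canonical_id", "decision"])
    count = 0
    for score, name, canonical_name, canonical_id in sorted(rows, key=lambda r: r[0]):
        writer.writerow([f"{score:.1f}", name, canonical_name, canonical_id, ""])
        count += 1
    return count


# Made-up rows for illustration; real ones would come from review_queue().
sample = [
    (88.5, "Example Labs Intl", "Example Labs", "c_example_labs"),
    (72.0, "Acme Hldgs", "Acme", "c_acme"),
]
buffer = io.StringIO()
written = write_review_csv(sample, buffer)
print(buffer.getvalue())
```

Swapping the `StringIO` buffer for an open file gives a spreadsheet a reviewer can annotate, and the filled-in `decision` column can later seed a lookup table of confirmed aliases.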
That's everything, thanks for reading!

Q: Do you need ML or vector embeddings for company name matching?
A: No, not for stylistic drift (legal suffixes, spacing, punctuation). Our pipeline uses RapidFuzz [`fuzz.WRatio`](https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html#wratio), which is rule-based string similarity, not a trained model.

Q: What similarity threshold should you use with WRatio?
A: Start at WRatio threshold 90. At 90, unrelated pairs like `Stripe` vs `Climate Corp` score 45.0 and stay out, while suffix/spacing variants like `Necker FinTech` vs `Necker FinTech Holdings Inc.` score 90.0+ and merge. See the [`score_cutoff`](https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html#wratio) parameter in the docs if you want early-exit optimization.

Q: When does fuzzy matching fail for company names?
A: When names share almost no tokens, e.g. brand `Investing.com` vs legal entity `Fusion Media Limited` (WRatio 30.0). Use a lookup table, domain, LEI, or enrichment API instead.

Q: Why not join on company name in SQL?
A: Because raw name joins will often miss legal variants. Resolve each row to a `canonical_id` in Python, load clusters into Postgres, and only then can you safely do a `JOIN ... USING (canonical_id)`.

I should clear some things up about this tutorial.

- The sample CRM is cleaner than the real thing. Its variants are researched legal names and alternate spellings such as `ARYZE ApS`, `Count Finance LTD`, and `PANTA Group`. A real CRM would actually be dirtier: misspellings, stale names, entries from multiple import sources with inconsistent formatting. In practice the fuzzy pass may not hit 100%, but it'll still get you much closer than exact matching does.
- Brand vs legal-entity pairs are out of scope. `Investing.com` / `Fusion Media Limited` or `Lyrie.ai` / `OTT Cybersecurity Inc.` share almost no tokens, so WRatio stays low, and that's correct: those pairs need a slug → legal name map. Fuzzy matching handles stylistic drift only.

The normalize → exact-group → fuzzy-cluster → CRM join pattern I've described here applies directly to jobs like catching `Acme Corp`, `Acme Corporation`, and `ACME` before they become three separate accounts in your sales pipeline.
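The brand-versus-legal-entity caveat above can be handled with an explicit alias table that is consulted before any fuzzy scoring. Here is a minimal sketch, assuming a hand-maintained dict. `KNOWN_ALIASES`, `resolve`, the canonical ids, and the injected `fuzzy_match` callable are all hypothetical names, and the fuzzy step is passed in as a function so the sketch runs without RapidFuzz installed.

```python
from typing import Callable, Mapping, Optional

# Hand-maintained legal name -> canonical id overrides (ids are made up).
KNOWN_ALIASES: Mapping[str, str] = {
    "fusion media limited": "c_investing_com",
    "ott cybersecurity inc.": "c_lyrie_ai",
}


def resolve(
    name: str,
    fuzzy_match: Callable[[str], Optional[str]],
    aliases: Mapping[str, str] = KNOWN_ALIASES,
) -> Optional[str]:
    """Return a canonical id: lookup table first, fuzzy matcher second."""
    key = name.strip().lower()
    if key in aliases:
        return aliases[key]
    return fuzzy_match(name)


def no_match(_name: str) -> Optional[str]:
    # Stand-in for the real fuzzy pass; always reports "below threshold".
    return None


print(resolve("Fusion Media Limited", no_match))  # resolved by the lookup table
print(resolve("Totally Unknown Co", no_match))    # falls through to the fuzzy step
```

Keeping the overrides in data rather than lowering the threshold means the string-similarity pass never has to be loosened to catch pairs it was not designed for.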
That WRatio threshold is something you should play around with. At WRatio threshold 90 (the default in this pipeline), clearly unrelated pairs stay out (`Stripe` vs `Climate Corp` scores 45.0) while suffix and spacing drift gets in. Drop to 80 and you'll catch more variants but start seeing false positives. This will differ based on your dataset, obviously, and the review queue is your safety net either way.

Next step in production: load `reconciled.json` into Postgres, resolve each CRM row to a `canonical_id` (same logic as `crm_to_canonical` in Python), then join on that key instead of `company_name`.

```sql
-- Tables loaded from pipeline output (reconciled.json + raw records + sample CRM)
CREATE TABLE canonicals (
    canonical_id   TEXT PRIMARY KEY,
    canonical_name TEXT NOT NULL
);

CREATE TABLE entity_aliases (
    canonical_id TEXT NOT NULL REFERENCES canonicals (canonical_id),
    alias_name   TEXT NOT NULL,
    source       TEXT,
    match_score  NUMERIC,
    PRIMARY KEY (canonical_id, alias_name)
);

CREATE TABLE hub_scrape (
    id           TEXT PRIMARY KEY,
    company_name TEXT NOT NULL,
    canonical_id TEXT REFERENCES canonicals (canonical_id),
    category     TEXT,
    url          TEXT
);

CREATE TABLE crm_accounts (
    id           TEXT PRIMARY KEY,
    company_name TEXT NOT NULL,
    canonical_id TEXT REFERENCES canonicals (canonical_id)  -- from Python CRM mapping
);

-- Broken: join on raw company name
SELECT COUNT(*) AS matched_rows
FROM hub_scrape h
JOIN crm_accounts c ON c.company_name = h.company_name;
-- 56 / 96 (~58%) on this dataset

-- Fixed: join on canonical_id (assigned during ETL from reconciled.json)
SELECT h.company_name AS hub_name,
       c.company_name AS crm_name,
       h.canonical_id
FROM hub_scrape h
JOIN crm_accounts c USING (canonical_id)
WHERE h.company_name = 'Necker FinTech';
-- hub_name:     Necker FinTech
-- crm_name:     Necker FinTech Holdings Inc. (or Necker Fin Tech, etc.)
-- canonical_id: c_necker_fintech
```

The load order:

1. Load `canonicals` and `entity_aliases` from `reconciled.json`.
2. Set `hub_scrape.canonical_id` from the aliases array (id → `canonical_id`).
3. Set `crm_accounts.canonical_id` with the same `crm_to_canonical` logic you already run in Python (exact norm match, then WRatio ≥ 90).

After that, SQL stays a plain equi-join: fuzzy matching happens once upstream, and not inside the database. I won't cover that though; the pattern is the point, not the warehouse you choose to use.

None of this is new. Entity resolution is a well-studied problem with industrial-strength tools (Dedupe, [Splink](https://moj-analytical-services.github.io/splink/demos/examples/duckdb/deterministic_dedupe.html), various record linkage toolkits) when you need them. But for the common case of "I have two lists of company names and I need to join them," you really don't. A normalization pass and a WRatio threshold get you most of the way there in an afternoon, in pure Python, with zero infrastructure.