{"slug": "a-practical-guide-to-entity-resolution-in-python-no-database-no-machine-learning", "title": "A Practical Guide To Entity Resolution in Python (No Database, No Machine Learning)", "summary": "A developer achieved a 100% join rate on a 96-record Crunchbase-to-CRM dataset by replacing exact string matching with fuzzy matching using the RapidFuzz library, up from a 58% rate with normalized exact matching. The pipeline, built without databases or machine learning, used RapidFuzz's `fuzz.WRatio` function with a threshold of 90 to match company name variants like \"Necker FinTech\" and \"Necker FinTech Holdings Inc.\" that exact equality checks would miss.", "body_md": "**TL;DR:** Learn a very simple way to normalize, dedupe, and fuzzy-match records that refer to the same real-world entity in Python, without a database or any ML pipelines.\n\nI was working on a Crunchbase dataset last Friday. I joined it against our CRM, and got 56 hits out of 96. The other 40 were sitting *right there* in both tables: `Necker FinTech` in the extracted data was `Necker FinTech Holdings Inc.` in the CRM, and `Investing.com` in the data was `Fusion Media Limited` in the CRM. But `JOIN ... ON name = name` obviously doesn't care; it will shrug and return nothing. If I'd shipped that, some sales rep would end up cold-pitching an existing customer because of it. 😅\n\n[This is the core problem of entity resolution](https://en.wikipedia.org/wiki/Record_linkage): the same real-world entity wearing different names in different systems. Naive text equality checks are borderline useless in the real world. 
I’d been meaning to do something less embarrassing than a raw `==` for a while, so I spent the rest of the weekend on a simple pipeline: scrape company names from Crunchbase hubs via [Bright Data](https://get.brightdata.com/bd7914?utm_content=a_practical_guide_to_entity_resolution_in_python_no_database_no_machine_learning), normalize, deduplicate, and fuzzy-match against the CRM list using [RapidFuzz](https://github.com/rapidfuzz/RapidFuzz) (`fuzz.WRatio`). **I deliberately chose NOT to use ML, vector embeddings, or a database.**\n\nThe join rate on this dataset jumped from **~58% to 100%**.\n\n| Metric | Exact (normalized string) | Fuzzy (WRatio ≥ 90) |\n|---|---|---|\n| Scraped hub rows → CRM | 58.3% (56 / 96) | 100% (96 / 96) |\n| CRM rows → scraped data | 34.8% (48 / 138) | 100% (138 / 138) |\n\nThe reason exact matching loses so badly is that *any* real CRM list you’re handed will almost always have multiple legal-name variants per company. I had three different *Necker* spellings pointing at one hub listing alone. 
Fuzzy matching earns its keep by collapsing those variants back into a single canonical cluster, and that’s most of what the rest of this post is about.\n\nI’ll walk through it; I hope it’s useful for anyone starting with fuzzy algorithms!\n\n**Entity resolution** matches records that describe the same company under different surface strings.\n\n**If you use exact matching, you ask**: *are these two strings identical?* After you lowercase and strip punctuation, `\"Necker FinTech\"` and `\"Necker FinTech Holdings Inc.\"` are still different strings, so a SQL `JOIN` or a Python `==` check will report no match even though both names refer to the same company.\n\n```\n-- Exact join on raw names returns no row when spellings differ\nSELECT h.company_name AS hub_name, c.company_name AS crm_name\nFROM   hub_scrape h\nJOIN   crm_accounts c ON c.company_name = h.company_name\nWHERE  h.company_name = 'Necker FinTech';\n-- This will return 0 rows\n-- Remember, CRM has \"Necker FinTech Holdings Inc.\", not the Crunchbase title\n```\n\n**This is why you use fuzzy matching.** It asks a looser question: *how similar are these two strings?* You get a score (usually 0 to 100) instead of `true` or `false`. Names that are clearly the same company but spelled differently (`Necker FinTech` vs `Necker FinTech Holdings Inc.`) will score high, while unrelated names will score low. 
You pick a **threshold** (we use 90): if the score is at or above it, you treat the pair as a match; otherwise you don't.\n\n``` python\nfrom rapidfuzz import fuzz\n\nTHRESHOLD = 90  # minimum WRatio score (0-100) to treat two names as a match\n\n\ndef is_match(a: str, b: str) -> bool:\n    return fuzz.WRatio(a, b) >= THRESHOLD\n\n\npairs = [\n    (\"Necker FinTech\", \"Necker FinTech Holdings Inc.\"),   # same company, legal suffix\n    (\"PointsKash\", \"Points Kash\"),                        # same company, spacing\n    (\"Investing.com\", \"Fusion Media Limited\"),            # brand vs legal entity\n    (\"Stripe\", \"Climate Corp\"),                           # different companies\n]\n\n# Print the raw score next to the thresholded decision for each pair\nfor a, b in pairs:\n    score = fuzz.WRatio(a, b)\n    print(f\"{score:5.1f}  match={score >= THRESHOLD!s:5}  {a!r}  vs  {b!r}\")\n```\n\nThis is the same scoring logic we’ll use for the rest of the tutorial, so `pip install rapidfuzz` is all you need to follow along.\n\nRunning the demo pairs above with `fuzz.WRatio` and a **threshold of 90** yields:\n\n| Pair | WRatio | Match at ≥ 90? | Drift type |\n|---|---|---|---|\n| `Necker FinTech` vs `Necker FinTech Holdings Inc.` | 90.0 | Yes | Legal suffix |\n| `PointsKash` vs `Points Kash` | 95.2 | Yes | Token spacing |\n| `Investing.com` vs `Fusion Media Limited` | 30.0 | No | Brand vs legal entity |\n| `Stripe` vs `Climate Corp` | 45.0 | No | Unrelated companies |\n\nThink of it like a strict spell-check or a “did you mean X?” suggestion, but for whole company names. **It is not machine learning: no model is trained on your data.** The library compares characters and words using fixed rules: how many edits to turn one string into another, whether one name is contained in the other, whether the same words appear in a different order. 
That’s why it’s fast, easy to audit, and good enough for a large class of real-world messiness: extra words, `Inc.` vs `LLC`, odd spacing, punctuation.\n\n💡 If two names share almost no letters (`Investing.com` and `Fusion Media Limited`, for example), the score stays low and fuzzy matching correctly refuses to merge them. Those cases need a real identifier (domain, LEI, enrichment API, some sort of ML pipeline, etc.), not smarter string math.\n\nHere’s a quick summary.\n\n| Approach | Best when | Used in this pipeline? |\n|---|---|---|\n| Fuzzy matching (RapidFuzz WRatio) | Same entity, stylistic drift: legal suffixes, spacing, punctuation | Yes, primary method |\n| Lookup table / enrichment API (GLEIF, Clearbit, domain) | Brand vs legal name; names share almost no tokens | Partial: `RESEARCHED` dict in `build_sample_crm.py` |\n| ML record linkage (Dedupe, Splink) | Large-scale probabilistic linkage, many fields beyond name | No: names-only, no training step |\n\nBasically, choose fuzzy matching when two name strings likely describe the same company but spell it differently.\n\nChoose a lookup or enrichment layer only when the strings are *related* entities (brand vs operator) rather than variants of one name.\n\n**Entity resolution in this pipeline** is a fetch → extract → normalize → fuzzy-cluster → join sequence, with the final join keyed on `canonical_id`.\n\n```\nhub_urls.json  \n      │  \n      ▼  \nfetch_hubs.py ──calls──► bright_data_unlocker.py     Bright Data POST → page body (markdown/HTML)  \n      │                           │  \n      └──calls──► parse_hubs.py ◄─┘                  regex → org slug + display name  \n      │  \n      ▼  \nhub_snapshot.json                                    (+ cached bodies in data/hub_responses/)  \n\nextract.py ──► raw_records.json                      flat table  \n\nreconcile.py ──► reconciled.json                     canonical clusters + aliases  \n\nrun_fuzzy.py                                         CLI part. 
This just runs extract + reconcile   \n\n── optional eval ──  \npost_fuzzy_eval.py                                   All done, so run a real-world test, calc metrics, then print to stdout\n```\n\n**Each stage is a pure transform: JSON in, JSON out.** Nothing stateful, nothing that requires a running service, and nothing I can't `git diff` between runs.\n\nI’m scraping four Crunchbase hub leaderboard pages, defined in `hub_urls.json`:\n\n```\n[\n  { \"category\": \"fintech\",                \"url\": \"https://www.crunchbase.com/hub/fintech-companies-seed-funding\" },\n  { \"category\": \"cybersecurity\",          \"url\": \"https://www.crunchbase.com/hub/cyber-security-startups\" },\n  { \"category\": \"saas\",                   \"url\": \"https://www.crunchbase.com/hub/saas-companies-seed-funding\" },\n  { \"category\": \"artificial_intelligence\",\"url\": \"https://www.crunchbase.com/hub/artificial-intelligence-companies-early-stage-venture-funding\" }\n]\n```\n\nReplace with your own, obviously.\n\nCrunchbase is a JavaScript-heavy SPA, so a plain `requests.get` won’t get you the rendered listing. For the fetch step I use [Bright Data's Web Unlocker](https://get.brightdata.com/bd-web-unlocker?utm_content=a_practical_guide_to_entity_resolution_in_python_no_database_no_machine_learning), which handles JS rendering and anti-bot for me.\n\nI set up a reusable client for this; it is just a thin wrapper around their single `POST` endpoint, `https://api.brightdata.com/request`. 
Make sure you’ve signed up and have these set in your `.env` file first (the names must match what the client below reads via `os.getenv`):\n\n```\nBRIGHT_DATA_API_KEY=your_api_token\nBRIGHT_DATA_UNLOCKER_ZONE=your_web_unlocker_zone_name\n```\n\n**bright_data_unlocker.py**\n\n``` python\n\"\"\"Fetch hub/listing pages as HTML or markdown.\"\"\"\nfrom __future__ import annotations\n\nimport json\nimport os\nimport time\nfrom typing import Any, Dict, Literal, Optional\n\nimport requests\nfrom dotenv import load_dotenv\n\nload_dotenv()\n\nContentFormat = Literal[\"html\", \"markdown\"]\n\nclass BrightDataUnlockerClient:\n    \"\"\"POST https://api.brightdata.com/request (Web Unlocker zone).\"\"\"\n\n    def __init__(\n        self,\n        api_key: Optional[str] = None,\n        zone: Optional[str] = None,\n        country: Optional[str] = None,\n    ):\n        self.api_key = api_key or os.getenv(\"BRIGHT_DATA_API_KEY\")\n        self.zone = zone or os.getenv(\"BRIGHT_DATA_UNLOCKER_ZONE\")\n        self.country = country or os.getenv(\"BRIGHT_DATA_COUNTRY\")  # optional\n        self.api_endpoint = \"https://api.brightdata.com/request\"\n\n        if not self.api_key:\n            raise ValueError(\"BRIGHT_DATA_API_KEY is required.\")\n        if not self.zone:\n            raise ValueError(\n                \"BRIGHT_DATA_UNLOCKER_ZONE is required. \"\n                \"Create a Web Unlocker API zone in Bright Data.\"\n            )\n\n        self.session = requests.Session()\n        self.session.headers.update(\n            {\n                \"Content-Type\": \"application/json\",\n                \"Authorization\": f\"Bearer {self.api_key}\",\n            }\n        )\n\n    def fetch(\n        self,\n        url: str,\n        *,\n        content_format: ContentFormat = \"markdown\",\n        max_retries: int = 2,\n    ) -> str:\n        \"\"\"Fetch page body. 
markdown => format=raw + data_format=markdown (Bright Data).\"\"\"\n        last_err: Optional[Exception] = None\n        for attempt in range(max_retries + 1):\n            try:\n                return self._do_fetch(url, content_format=content_format)\n            except Exception as e:\n                last_err = e\n                if attempt < max_retries:\n                    time.sleep(0.5 * (attempt + 1))\n        assert last_err is not None\n        raise last_err\n\n    def fetch_markdown(self, url: str, max_retries: int = 2) -> str:\n        return self.fetch(url, content_format=\"markdown\", max_retries=max_retries)\n\n    def fetch_html(self, url: str, max_retries: int = 2) -> str:\n        return self.fetch(url, content_format=\"html\", max_retries=max_retries)\n\n    def _do_fetch(self, url: str, *, content_format: ContentFormat) -> str:\n        payload: Dict[str, Any] = {\n            \"zone\": self.zone,\n            \"url\": url,\n            \"format\": \"raw\",\n        }\n        if content_format == \"markdown\":\n            payload[\"data_format\"] = \"markdown\"\n        if self.country:\n            payload[\"country\"] = self.country\n\n        response = self.session.post(self.api_endpoint, json=payload, timeout=120)\n        response.raise_for_status()\n\n        try:\n            result = response.json()\n        except json.JSONDecodeError:\n            # data_format=markdown often returns the page body directly, not a JSON envelope\n            text = response.text\n            if not text.strip():\n                raise RuntimeError(\"Bright Data Unlocker empty response body\")\n            return text\n\n        if not isinstance(result, dict):\n            raise RuntimeError(f\"Bright Data unexpected response type: {type(result)}\")\n\n        inner_status = result.get(\"status_code\")\n        if inner_status is not None and inner_status != 200:\n            raise RuntimeError(f\"Bright Data Unlocker 
status_code={inner_status}\")\n\n        body = result.get(\"body\")\n        if body is None:\n            if \"status_code\" in result and result.get(\"status_code\") == 200:\n                raise RuntimeError(\"Bright Data Unlocker empty body\")\n            raise RuntimeError(f\"Bright Data Unlocker missing body: {list(result.keys())}\")\n\n        if isinstance(body, str):\n            if body.strip().startswith(\"{\"):\n                try:\n                    nested = json.loads(body)\n                    if isinstance(nested, dict) and \"body\" in nested:\n                        body = nested[\"body\"]\n                except json.JSONDecodeError:\n                    pass\n            if not str(body).strip():\n                raise RuntimeError(\"Bright Data Unlocker empty body string\")\n            return str(body)\n        if isinstance(body, dict):\n            return json.dumps(body)\n        return str(body)\n```\n\nNote how we can request `data_format=markdown`. With this parameter, Bright Data returns a sanitized markdown rendering of the page, which is *much* easier to parse with regex than raw HTML.\n\n💡 If markdown still yields zero orgs for a hub, `fetch_hubs.py --fallback-html` can fetch (or reuse cached) HTML and run the HTML parser instead.\n\nWith that in place, here’s our actual fetch script, `fetch_hubs.py`:\n\n**fetch_hubs.py**\n\n``` python\n\"\"\"Fetch Crunchbase hub pages via Bright Data Web Unlocker; write hub_snapshot.json.\"\"\"\n\nfrom __future__ import annotations\n\nimport argparse\nimport json\nimport time\nfrom datetime import datetime, timezone\nfrom pathlib import Path\nfrom typing import Any, Dict, List, Optional\n\nfrom dotenv import load_dotenv\n\nfrom bright_data_unlocker import BrightDataUnlockerClient, ContentFormat\nfrom parse_hubs import parse_organizations\n\nload_dotenv()\n\n_ROOT = Path(__file__).resolve().parent\n_DEFAULT_RESPONSES_DIR = _ROOT / \"data\" / \"hub_responses\"\n\ndef load_hub_urls(path: Path) -> 
List[Dict[str, str]]:\n    raw = json.loads(path.read_text(encoding=\"utf-8\"))\n    if not isinstance(raw, list):\n        raise ValueError(\"hub_urls.json must be a JSON array\")\n    out: List[Dict[str, str]] = []\n    for item in raw:\n        if not isinstance(item, dict):\n            continue\n        url = (item.get(\"url\") or \"\").strip()\n        category = (item.get(\"category\") or \"unknown\").strip()\n        if url:\n            out.append({\"category\": category, \"url\": url})\n    return out\n\ndef _response_file(category: str, content_format: ContentFormat) -> str:\n    ext = \"md\" if content_format == \"markdown\" else \"html\"\n    safe = \"\".join(c if c.isalnum() or c in \"-_\" else \"_\" for c in category)\n    return f\"{safe}.{ext}\"\n\ndef _response_path(\n    responses_dir: Path, category: str, content_format: ContentFormat\n) -> Path:\n    return responses_dir / _response_file(category, content_format)\n\ndef load_cached_body(\n    responses_dir: Path, category: str, content_format: ContentFormat\n) -> Optional[str]:\n    path = _response_path(responses_dir, category, content_format)\n    if not path.is_file() or path.stat().st_size == 0:\n        return None\n    return path.read_text(encoding=\"utf-8\")\n\ndef save_response_body(\n    responses_dir: Path,\n    category: str,\n    hub_url: str,\n    content_format: ContentFormat,\n    body: str,\n) -> Path:\n    responses_dir.mkdir(parents=True, exist_ok=True)\n    path = _response_path(responses_dir, category, content_format)\n    path.write_text(body, encoding=\"utf-8\")\n    return path\n\ndef _manifest_path(responses_dir: Path) -> Path:\n    return responses_dir / \"manifest.json\"\n\ndef _load_manifest(responses_dir: Path) -> Dict[str, Any]:\n    path = _manifest_path(responses_dir)\n    if not path.is_file():\n        return {\"hubs\": []}\n    return json.loads(path.read_text(encoding=\"utf-8\"))\n\ndef _upsert_manifest_entry(\n    responses_dir: Path,\n    category: str,\n   
 hub_url: str,\n    content_format: ContentFormat,\n    response_path: Path,\n    *,\n    fetched_at: str,\n) -> None:\n    entry = {\n        \"category\": category,\n        \"hub_url\": hub_url,\n        \"content_format\": content_format,\n        \"response_file\": response_path.name,\n        \"fetched_at\": fetched_at,\n    }\n    manifest = _load_manifest(responses_dir)\n    hubs = [h for h in manifest.get(\"hubs\") or [] if h.get(\"category\") != category]\n    hubs.append(entry)\n    manifest[\"hubs\"] = hubs\n    manifest[\"updated_at\"] = datetime.now(timezone.utc).isoformat()\n    _manifest_path(responses_dir).write_text(\n        json.dumps(manifest, indent=2, ensure_ascii=False) + \"\\n\",\n        encoding=\"utf-8\",\n    )\n\ndef _parse_body(\n    body: str,\n    hub_url: str,\n    content_format: ContentFormat,\n    max_orgs: int,\n) -> List[Dict[str, Any]]:\n    return parse_organizations(body, hub_url, content_format=content_format, max_orgs=max_orgs)\n\ndef main() -> None:\n    ap = argparse.ArgumentParser(\n        description=\"Fetch Crunchbase hub pages (Web Unlocker) and extract organization URLs.\",\n    )\n    ap.add_argument(\"--hubs-json\", type=Path, default=_ROOT / \"hub_urls.json\")\n    ap.add_argument(\"--out\", type=Path, default=_ROOT / \"data\" / \"hub_snapshot.json\")\n    ap.add_argument(\n        \"--format\",\n        choices=(\"markdown\", \"html\"),\n        default=\"markdown\",\n    )\n    ap.add_argument(\"--max-orgs-per-hub\", type=int, default=80)\n    ap.add_argument(\"--delay\", type=float, default=1.0)\n    ap.add_argument(\n        \"--responses-dir\",\n        type=Path,\n        default=_DEFAULT_RESPONSES_DIR,\n        help=\"Directory for cached raw hub page bodies (default: data/hub_responses).\",\n    )\n    ap.add_argument(\n        \"--refetch\",\n        action=\"store_true\",\n        help=\"Call Bright Data even if a cached response file exists.\",\n    )\n    ap.add_argument(\n        
\"--parse-only\",\n        action=\"store_true\",\n        help=\"Parse cached responses only; never call Bright Data.\",\n    )\n    ap.add_argument(\n        \"--fallback-html\",\n        action=\"store_true\",\n        help=\"If markdown parse finds 0 orgs, try cached or fetched HTML.\",\n    )\n    args = ap.parse_args()\n\n    responses_dir = args.responses_dir\n\n    hubs = load_hub_urls(args.hubs_json)\n    if not hubs:\n        raise SystemExit(\"No hubs in hub_urls.json\")\n\n    args.out.parent.mkdir(parents=True, exist_ok=True)\n    client: Optional[BrightDataUnlockerClient] = None\n    if not args.parse_only:\n        client = BrightDataUnlockerClient()\n\n    content_format: ContentFormat = args.format\n\n    payload: Dict[str, Any] = {\n        \"fetched_at\": datetime.now(timezone.utc).isoformat(),\n        \"source\": \"bright_data_web_unlocker\",\n        \"content_format\": content_format,\n        \"responses_dir\": str(responses_dir),\n        \"hubs\": [],\n    }\n\n    n_hubs = len(hubs)\n    for i, hub in enumerate(hubs, start=1):\n        category = hub[\"category\"]\n        url = hub[\"url\"]\n        print(f\"\\n[{i}/{n_hubs}] hub [{category}]: starting...\", flush=True)\n        block: Dict[str, Any] = {\n            \"category\": category,\n            \"hub_url\": url,\n            \"error\": None,\n            \"organic_count\": 0,\n            \"rows\": [],\n            \"response_file\": _response_file(category, content_format),\n        }\n        parse_format: ContentFormat = content_format\n\n        try:\n            body: Optional[str] = None\n            if not args.refetch:\n                body = load_cached_body(responses_dir, category, content_format)\n\n            if body is None:\n                if args.parse_only:\n                    raise FileNotFoundError(\n                        f\"no cached response at {_response_path(responses_dir, category, content_format)} \"\n                        \"(run without 
--parse-only to fetch)\"\n                    )\n                print(\n                    f\"[{i}/{n_hubs}] hub [{category}]: fetching ({content_format})...\",\n                    flush=True,\n                )\n                assert client is not None\n                body = client.fetch(url, content_format=content_format)\n                print(\n                    f\"[{i}/{n_hubs}] hub [{category}]: fetch done \"\n                    f\"({len(body):,} chars)\",\n                    flush=True,\n                )\n                saved = save_response_body(\n                    responses_dir, category, url, content_format, body\n                )\n                _upsert_manifest_entry(\n                    responses_dir,\n                    category,\n                    url,\n                    content_format,\n                    saved,\n                    fetched_at=datetime.now(timezone.utc).isoformat(),\n                )\n                print(f\"[{i}/{n_hubs}] hub [{category}]: saved {saved}\", flush=True)\n            else:\n                print(\n                    f\"[{i}/{n_hubs}] hub [{category}]: using cache \"\n                    f\"{_response_path(responses_dir, category, content_format)}\",\n                    flush=True,\n                )\n\n            print(f\"[{i}/{n_hubs}] hub [{category}]: parsing...\", flush=True)\n            rows = _parse_body(body, url, parse_format, args.max_orgs_per_hub)\n\n            if not rows and args.fallback_html and parse_format == \"markdown\":\n                html_body = load_cached_body(responses_dir, category, \"html\")\n                if html_body is None and not args.parse_only:\n                    print(\n                        f\"[{i}/{n_hubs}] hub [{category}]: markdown had 0 orgs, \"\n                        \"fetching HTML...\",\n                        flush=True,\n                    )\n                    assert client is not None\n                    html_body = 
client.fetch(url, content_format=\"html\")\n                    print(\n                        f\"[{i}/{n_hubs}] hub [{category}]: HTML fetch done \"\n                        f\"({len(html_body):,} chars)\",\n                        flush=True,\n                    )\n                    saved = save_response_body(\n                        responses_dir, category, url, \"html\", html_body\n                    )\n                    print(f\"[{i}/{n_hubs}] hub [{category}]: saved {saved}\", flush=True)\n                elif html_body is None:\n                    raise FileNotFoundError(\n                        f\"no cached HTML at {_response_path(responses_dir, category, 'html')}\"\n                    )\n                else:\n                    print(\n                        f\"[{i}/{n_hubs}] hub [{category}]: markdown had 0 orgs, \"\n                        \"using cached HTML...\",\n                        flush=True,\n                    )\n                print(f\"[{i}/{n_hubs}] hub [{category}]: parsing HTML...\", flush=True)\n                rows = _parse_body(html_body, url, \"html\", args.max_orgs_per_hub)\n                parse_format = \"html\"\n                block[\"response_file\"] = _response_file(category, \"html\")\n\n            block[\"content_format\"] = parse_format\n            block[\"organic_count\"] = len(rows)\n            block[\"rows\"] = rows\n            print(\n                f\"[{i}/{n_hubs}] hub [{category}]: done - \"\n                f\"{len(rows)} organizations\",\n                flush=True,\n            )\n\n        except Exception as e:\n            print(f\"[{i}/{n_hubs}] hub [{category}]: failed - {e}\", flush=True)\n            block[\"error\"] = str(e)\n\n        payload[\"hubs\"].append(block)\n        if not args.parse_only:\n            time.sleep(args.delay)\n\n    args.out.write_text(\n        json.dumps(payload, indent=2, ensure_ascii=False) + \"\\n\",\n        encoding=\"utf-8\",\n    )\n    total = 
sum(h.get(\"organic_count\") or 0 for h in payload[\"hubs\"])\n    print(\n        f\"\\nAll hubs processed. Wrote {args.out} \"\n        f\"({total} organizations across {n_hubs} hubs).\",\n        flush=True,\n    )\n\nif __name__ == \"__main__\":\n    main()\n```\n\nNote how I’m caching the raw bodies under `data/hub_responses/` so re-runs with `--parse-only` don't burn any API credits.\n\nOur `parse_hubs.py` pulls organization slugs and display names out of the cached page bodies from the previous step. It uses three regex patterns in priority order: relative and absolute markdown links are both collected first, and the bare-URL pattern is a fallback that only runs when neither of them finds anything.\n\n```\n# parse_hubs.py\nimport re\n\n# Priority 1: Bright Data relative markdown links\n# Matches: ](/organization/slug \"Display Name\")\n_ORG_REL_LINK = re.compile(\n    r\"\\]\\(/organization/([a-z0-9_-]+)(?:\\s+\\\"([^\\\"]*)\\\")?\\s*\\)\",\n    re.I,\n)\n\n# Priority 2: Standard absolute markdown links\n# Matches: [Company Name](https://www.crunchbase.com/organization/slug)\n_ORG_MD_LINK = re.compile(\n    r\"\\[([^\\]]+)\\]\\(\\s*<?https?://[^>\\s)]*crunchbase\\.com/organization/([a-z0-9_-]+)/?>?\\s*\\)\",\n    re.I,\n)\n\n# Fallback: bare /organization/slug anywhere in text\n_ORG_IN_TEXT = re.compile(\n    r\"(?:https?://[^/\\s]*crunchbase\\.com)?/organization/([a-z0-9_-]+)\",\n    re.I,\n)\n```\n\nEach hub gets parsed into rows like:\n\n```\n{ \"url\": \"https://www.crunchbase.com/organization/lovable\", \"slug\": \"lovable\", \"title\": \"Lovable\" }\n```\n\nHere’s the full code for `parse_hubs.py`. Note that I also keep a blocklist of well-known VCs and accelerators (`y-combinator`, `techstars`, `andreessen-horowitz`, etc.) that show up on hub pages but are the *investors*, not the companies being listed. 
Without this, you get YC ranked #1 on every hub it's ever touched, which is obviously not what we want.\n\n**parse_hubs.py**\n\n```\n\"\"\"Parse Crunchbase hub pages (markdown or HTML) for /organization/ links.\"\"\"\n\nfrom __future__ import annotations\n\nimport re\nfrom typing import Any, Dict, List, Literal, Set\nfrom urllib.parse import urljoin, urlparse\n\nContentFormat = Literal[\"html\", \"markdown\"]\n\n_ORG_IN_TEXT = re.compile(\n    r\"(?:https?://[^/\\s]*crunchbase\\.com)?/organization/([a-z0-9_-]+)\",\n    re.I,\n)\n# [Company Name](https://www.crunchbase.com/organization/slug)\n_ORG_MD_LINK = re.compile(\n    r\"\\[([^\\]]+)\\]\\(\\s*<?https?://[^>\\s)]*crunchbase\\.com/organization/([a-z0-9_-]+)/?>?\\s*\\)\",\n    re.I,\n)\n# Bright Data markdown: multi-line link ending with ](/organization/slug \"Display Name\")\n_ORG_REL_LINK = re.compile(\n    r\"\\]\\(/organization/([a-z0-9_-]+)(?:\\s+\\\"([^\\\"]*)\\\")?\\s*\\)\",\n    re.I,\n)\n_ORG_BLOCKLIST = frozenset(\n    {\n        \"y-combinator\",\n        \"techstars\",\n        \"national-science-foundation\",\n        \"masschallenge\",\n        \"easme\",\n        \"andreessen-horowitz\",\n        \"sequoia-capital\",\n        \"accel\",\n    }\n)\n\ndef slug_to_display_name(slug: str) -> str:\n    return slug.replace(\"-\", \" \").title()\n\ndef _append_org(\n    rows: List[Dict[str, Any]],\n    seen_slugs: Set[str],\n    *,\n    slug: str,\n    title: str,\n    hub_url: str,\n    max_orgs: int,\n) -> None:\n    if len(rows) >= max_orgs:\n        return\n    slug = slug.lower()\n    if slug in _ORG_BLOCKLIST or slug in seen_slugs:\n        return\n    seen_slugs.add(slug)\n    base = f\"{urlparse(hub_url).scheme}://{urlparse(hub_url).netloc}\"\n    name = (title or \"\").strip() or slug_to_display_name(slug)\n    rows.append(\n        {\n            \"url\": urljoin(base, f\"/organization/{slug}\"),\n            \"slug\": slug,\n            \"title\": name,\n        }\n    )\n\ndef 
parse_organizations_from_markdown(\n    markdown: str,\n    hub_url: str,\n    *,\n    max_orgs: int = 80,\n) -> List[Dict[str, Any]]:\n    \"\"\"Extract orgs from markdown links; fall back to bare organization URLs.\"\"\"\n    seen_slugs: Set[str] = set()\n    rows: List[Dict[str, Any]] = []\n\n    for match in _ORG_REL_LINK.finditer(markdown):\n        slug = match.group(1)\n        title = (match.group(2) or \"\").strip()\n        _append_org(rows, seen_slugs, slug=slug, title=title, hub_url=hub_url, max_orgs=max_orgs)\n        if len(rows) >= max_orgs:\n            return rows\n\n    for match in _ORG_MD_LINK.finditer(markdown):\n        title, slug = match.group(1).strip(), match.group(2)\n        _append_org(rows, seen_slugs, slug=slug, title=title, hub_url=hub_url, max_orgs=max_orgs)\n        if len(rows) >= max_orgs:\n            return rows\n\n    if rows:\n        return rows\n\n    for match in _ORG_IN_TEXT.finditer(markdown):\n        _append_org(\n            rows,\n            seen_slugs,\n            slug=match.group(1),\n            title=\"\",\n            hub_url=hub_url,\n            max_orgs=max_orgs,\n        )\n        if len(rows) >= max_orgs:\n            break\n    return rows\n\ndef parse_organizations_from_html(\n    html: str,\n    hub_url: str,\n    *,\n    max_orgs: int = 80,\n) -> List[Dict[str, Any]]:\n    \"\"\"Extract unique organization rows from hub page HTML.\"\"\"\n    seen_slugs: Set[str] = set()\n    rows: List[Dict[str, Any]] = []\n\n    for match in _ORG_IN_TEXT.finditer(html):\n        _append_org(\n            rows,\n            seen_slugs,\n            slug=match.group(1),\n            title=\"\",\n            hub_url=hub_url,\n            max_orgs=max_orgs,\n        )\n        if len(rows) >= max_orgs:\n            break\n    return rows\n\ndef parse_organizations(\n    body: str,\n    hub_url: str,\n    *,\n    content_format: ContentFormat = \"markdown\",\n    max_orgs: int = 80,\n) -> List[Dict[str, Any]]:\n    if 
content_format == \"markdown\":\n        return parse_organizations_from_markdown(body, hub_url, max_orgs=max_orgs)\n    return parse_organizations_from_html(body, hub_url, max_orgs=max_orgs)\n```\n\n**The first-run gotcha I hit was a classic.** My original parser expected absolute URLs (`https://www.crunchbase.com/organization/...`), but Bright Data's markdown renderer produces *relative* links (`/organization/slug \"Display Name\"`) 🙃. So zero companies were extracted on the first run, *simply because the regex didn't match*.\n\nSo I just added `_ORG_REL_LINK` to the parser and re-ran Stage 1 with `--parse-only`, fixing it at no additional API cost. **This is why we cached our raw response bodies.** Your parser will probably need more than one round of trial and error, and you don’t want to re-fetch the data each time.\n\n**Output of this stage:** a `hub_snapshot.json` with 96 organizations across 4 hubs (Fintech: 26, Cybersecurity: 24, SaaS: 22, AI: 24). Note that these are *hub leaderboard* entries, not full Crunchbase exports.\n\nThe full Crunchbase lists run to *thousands*; I'm taking the curated top slice on purpose, because the cleaner my source is, the more clearly the fuzzy lift shows up against it.\n\nBefore clustering, I flatten the nested snapshot into one uniform record per company appearance. 
`extract.py`\n\nhandles this:\n\n```\n\"\"\"From hub_snapshot.json to raw_records.json with company_name per organization.\"\"\"    \n\nfrom __future__ import annotations    \n\nimport json    \nfrom datetime import datetime, timezone    \nfrom pathlib import Path    \nfrom typing import Any, Dict, List    \n\ndef records_from_hub_snapshot(data: Dict[str, Any]) -> List[Dict[str, Any]]:    \n    records: List[Dict[str, Any]] = []    \n    for hi, block in enumerate(data.get(\"hubs\") or []):    \n        if block.get(\"error\"):    \n            continue    \n        category = (block.get(\"category\") or \"unknown\").strip()    \n        hub_url = block.get(\"hub_url\") or \"\"    \n        for ri, row in enumerate(block.get(\"rows\") or []):    \n            if not isinstance(row, dict):    \n                continue    \n            url = (row.get(\"url\") or \"\").strip()    \n            if not url or \"/organization/\" not in url.lower():    \n                continue    \n            title = (row.get(\"title\") or \"\").strip()    \n            slug = (row.get(\"slug\") or \"\").strip()    \n            company_name = title or (slug.replace(\"-\", \" \").title() if slug else \"\")    \n            if not company_name:    \n                continue    \n            records.append(    \n                {    \n                    \"id\": f\"hub:{hi}:{ri}\",    \n                    \"source\": \"crunchbase_hub\",    \n                    \"category\": category,    \n                    \"company_name\": company_name,    \n                    \"raw_name\": title or company_name,    \n                    \"url\": url,    \n                    \"domain\": \"www.crunchbase.com\",    \n                    \"hub_url\": hub_url,    \n                    \"position\": ri + 1,    \n                }    \n            )    \n    return records    \n\ndef build_raw_payload(snapshot_path: Path) -> Dict[str, Any]:    \n    raw = 
json.loads(snapshot_path.read_text(encoding=\"utf-8\"))    \n    if not isinstance(raw.get(\"hubs\"), list):    \n        raise ValueError(f\"{snapshot_path}: expected hub snapshot with 'hubs' array\")    \n    records = records_from_hub_snapshot(raw)    \n    return {    \n        \"extracted_at\": datetime.now(timezone.utc).isoformat(),    \n        \"snapshot\": str(snapshot_path.name),    \n        \"record_count\": len(records),    \n        \"records\": records,    \n    }    \n\ndef write_raw_records(snapshot_path: Path, out_path: Path) -> Dict[str, Any]:    \n    payload = build_raw_payload(snapshot_path)    \n    out_path.parent.mkdir(parents=True, exist_ok=True)    \n    out_path.write_text(    \n        json.dumps(payload, indent=2, ensure_ascii=False) + \"\\n\",    \n        encoding=\"utf-8\",    \n    )    \n    return payload\n```\n\nThe `id` field (`hub:0:3`, `hub:2:11`, etc.) is our stable key that links each raw record to its canonical cluster in Stage 4. It is deterministic, derivable from position, and most importantly, easy to debug.\n\n**Output:** `raw_records.json`, with 96 rows, all carrying `source: \"crunchbase_hub\"` and tagged by category.\n\n**Entity resolution reconciliation** (Stage 4) collapses duplicate company names into canonical clusters. In this dataset, 96 scraped rows become **88 canonical companies** after normalization and fuzzy clustering. 
Four names show up on more than one hub — **Callaghan Innovation** and **EISMEA** on all four leaderboards, **PayTic** and **SixThirty** on two — which gives duplicate rows before clustering.\n\nAfter exact normalization there are **88** distinct normalized names, which happens to be the same count as final clusters at WRatio threshold 90 — meaning *no additional fuzzy merges were needed beyond collapsing the cross-hub duplicates*.\n\nI run reconciliation in two passes.\n\nThe full code for `reconcile.py` is in [this gist](https://gist.github.com/sixthextinction/5c711e48353f4f7765e13cc4bb1b25de).\n\n```\n# reconcile.py  \nimport re  \n\n_LEGAL     = re.compile(  \n    r\"\\b(inc\\.?|llc\\.?|ltd\\.?|plc\\.?|corp\\.?|corporation|co\\.?|company|limited)\\b\",  \n    re.I,  \n)  \n_NON_ALNUM = re.compile(r\"[^\\w\\s]\", re.UNICODE)  \n\ndef normalize_company_name(s: str) -> str:  \n    s = s.lower().strip()  \n    s = _NON_ALNUM.sub(\" \", s)   # strip punctuation  \n    s = _LEGAL.sub(\" \", s)       # drop legal suffixes  \n    s = re.sub(r\"\\s+\", \" \", s).strip()  # collapse runs of whitespace  \n    return s\n```\n\nAfter normalization, I group records by their normalized string. `\"Lovable\"`, `\"lovable\"`, and `\"Lovable.\"` all collapse into the same group. This removes trivial duplicates *before* the more expensive fuzzy-matching pass. **TL;DR: Do the cheap pass first, expensive pass second**, for the same reason you’d filter rows with a `WHERE` clause *before* a `JOIN`.\n\nFor this dataset that’s **96 hash inserts** (one `normalize_company_name()` call plus one dict lookup per row), roughly **O(n)**.\n\nThe important optimization is that normalization shrinks the search space *before* the quadratic fuzzy pass runs. Without Pass 1, naïve all-pairs fuzzy matching over `n = 10,000` unique names would require:\n\nn(n−1)/2 ≈ 50 million comparisons\n\nIt takes my laptop ~**1.3 µs** per RapidFuzz `WRatio` call on ~30-character names, so that pushes our runtime toward ~60 seconds instead of milliseconds. 
Not ideal, which is exactly why Pass 1 exists: to reduce `n` before the O(n²) step becomes expensive.\n\nI then compare each exact group against existing clusters using **WRatio** from RapidFuzz.\n\n``` python\n# reconcile.py  \nfrom rapidfuzz import fuzz  \n_FUZZY_SCORER = fuzz.WRatio  \n\ndef _fuzzy_merge_groups(  \n    groups: List[List[Dict[str, Any]]],  \n    threshold: float,           # default: 90.0  \n) -> List[Cluster]:  \n    clusters: List[Cluster] = []  \n    for group in sorted(groups, key=lambda g: (  \n        min(_source_rank(m.get(\"source\") or \"\") for m in g),  \n        -len(g),  \n    )):  \n        rep = pick_canonical_name(group)  \n        placed = False  \n        for cluster in clusters:  \n            if _FUZZY_SCORER(rep, cluster.canonical_name) >= threshold:  \n                cluster.members.extend(group)  \n                cluster.canonical_name = pick_canonical_name(cluster.members)  \n                cluster.canonical_id   = make_canonical_id(cluster.canonical_name)  \n                placed = True  \n                break  \n        if not placed:  \n            clusters.append(Cluster(  \n                canonical_id=make_canonical_id(rep),  \n                canonical_name=rep,  \n                members=list(group),  \n            ))  \n    return clusters\n```\n\nHere, we have to compare each group’s representative against existing cluster canonicals. So the worst case with `g = 88` exact groups would be\n\n0 + 1 + 2 + … + 87 = 3,828 comparisons\n\nThat’s roughly O(g²).\n\nRapidFuzz ships several scorers — see the [rapidfuzz.fuzz docs](https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html) for the full list. 
We use [fuzz.WRatio](https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html#wratio) (weighted ratio; same algorithm family as [FuzzyWuzzy’s WRatio](https://github.com/seatgeek/fuzzywuzzy)) because company names drift in different ways and no single metric covers all of them.\n\n[WRatio](https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html#wratio) is a **meta-scorer**: for each pair of strings it runs several ratio algorithms internally (with length-based weighting) and returns the best score. It combines:\n\n- Plain `ratio`: edit-distance similarity over the whole strings (on its own, `Necker FinTech` vs `Necker FinTech Holdings Inc.` looks like a poor match).\n- `partial_ratio`: best-substring matching, which tolerates an appended `Inc.` or `Group`.\n- Token-based ratios such as `token_set_ratio`: word-level comparison that tolerates reordered words.\n\nYou rarely know in advance *which* kind of drift a CRM row will have — suffix appended, spacing changed, words reordered. **WRatio picks the strategy that scores highest for that specific pair**, which is exactly what you want for entity resolution on names alone.\n\nWe default to **threshold 90**: strict enough that unrelated pairs (`Stripe` vs `Climate Corp`) stay out, loose enough that real variants (`PointsKash` vs `Points Kash`) merge. Tune it on your data.\n\nOn this dataset specifically, WRatio handles the drift patterns we actually see in company names:\n\n| Hub / scraped name | CRM variant (in `sample_crm.json`) | Drift type |\n|---|---|---|\n| `Necker FinTech` | `Necker FinTech Holdings Inc.` | Legal suffix + spacing (`Fin Tech` vs `FinTech`) |\n| `PANTA` | `PANTA Group` | Type descriptor appended |\n| `Physical Intelligence` | `Physical Intelligence (Pi), Inc.` | Parenthetical + legal suffix |\n| `PointsKash` | `Points Kash` | Token spacing |\n| `qBotica` | `q Botica` | Token spacing |\n\nPure `ratio` (edit distance) would heavily penalize `Necker FinTech` vs `Necker FinTech Holdings Inc.` because the two extra words add significant distance. So `partial_ratio` handles containment and `token_set_ratio` handles reordering. 
**WRatio picks the strategy that produces the best score for each specific pair**, which is exactly the behavior you want when you don't know in advance *how* a name is going to drift.\n\nDisplay names that share **no tokens** with the legal entity (e.g. hub title `Investing.com` vs operator `Fusion Media Limited`, or `Lyrie.ai` vs `OTT Cybersecurity Inc.`) stay below threshold. WRatio correctly refuses to merge them. That’s a good thing: those belong in a lookup table or enrichment API, not in a string-similarity pass (see Caveats).\n\nOne last thing before we move on to the demo: every time a cluster gains new members, I re-evaluate its canonical name. The source ranking (`crunchbase_hub` = 0, anything else = 99) is compared first and name length second, so short, clean display names win over longer legal variants:\n\n``` python\ndef pick_canonical_name(members: Sequence[Dict[str, Any]]) -> str:  \n    def sort_key(m):  \n        name = (m.get(\"company_name\") or \"\").strip()  \n        return (_source_rank(m.get(\"source\") or \"\"), len(name), name.lower())  \n    return min(members, key=sort_key)[\"company_name\"].strip()\n```\n\nA Crunchbase display name like `\"Lovable\"` will always beat `\"Lovable Technologies Inc.\"` as the canonical: it's from a trusted source *and* it's shorter. The legal variant ends up as an alias, which is exactly the right relationship.\n\n**Output of this stage:** `reconciled.json`, with 88 canonical clusters, alias mappings with WRatio scores, and CRM join metrics.\n\nThat’s it, we’re all done with the fuzzy pipeline. Let’s see if that improved things.\n\nOur `sample_crm.json` simulates the data you’d get from a real CRM. I simply researched legal names and known alternate spellings online for the companies I had, and put them in a JSON file. **This gave me 138 rows representing the same 88 canonical companies.**\n\nSome companies had one exact-match entry — these are easy for us to handle. 
Others had three or four variants that I’d name like this:\n\n```\n{ \"id\": \"crm:necker_fintech_0\", \"company_name\": \"Necker Fin Tech\" },  \n{ \"id\": \"crm:necker_fintech_1\", \"company_name\": \"Necker FinTech Group\" },  \n{ \"id\": \"crm:necker_fintech_2\", \"company_name\": \"Necker FinTech Holdings Inc.\" }\n```\n\nOur join logic in the`post_fuzzy_eval.py`\n\ndemo runs exact normalization first, then falls back to fuzzy — note how this is the same “cheap pass first” pattern as the cluster builder:\n\n**post_fuzzy_eval.py**\n\n```\n\"\"\"Optional CRM join evaluation — exact vs fuzzy match rates (not part of core reconcile).\"\"\"\n\nfrom __future__ import annotations\n\nimport json\nfrom pathlib import Path\nfrom typing import Any, Dict, List, Optional, Sequence\n\nfrom rapidfuzz import fuzz\n\nfrom reconcile import (\n    Cluster,\n    DEFAULT_THRESHOLD,\n    _exact_groups,\n    normalize_company_name,\n)\n\n_FUZZY_SCORER = fuzz.WRatio\n\ndef load_crm(path: Path) -> List[Dict[str, Any]]:\n    raw = json.loads(path.read_text(encoding=\"utf-8\"))\n    if isinstance(raw, list):\n        rows = raw\n    elif isinstance(raw, dict) and \"companies\" in raw:\n        rows = raw[\"companies\"]\n    else:\n        raise ValueError(f\"{path}: expected list or {{'companies': [...]}}\")\n    out: List[Dict[str, Any]] = []\n    for i, row in enumerate(rows):\n        if not isinstance(row, dict):\n            continue\n        name = (row.get(\"company_name\") or \"\").strip()\n        if not name:\n            continue\n        out.append(\n            {\n                \"id\": row.get(\"id\") or f\"crm:{i}\",\n                \"company_name\": name,\n            }\n        )\n    return out\n\ndef _record_to_cluster_map(clusters: Sequence[Cluster]) -> Dict[str, str]:\n    out: Dict[str, str] = {}\n    for cluster in clusters:\n        for m in cluster.members:\n            out[m[\"id\"]] = cluster.canonical_id\n    return out\n\ndef crm_to_canonical(\n    
crm_rows: Sequence[Dict[str, Any]],\n    clusters: Sequence[Cluster],\n    threshold: float,\n) -> Dict[str, Optional[str]]:\n    out: Dict[str, Optional[str]] = {}\n    for row in crm_rows:\n        key = str(row.get(\"id\") or row.get(\"company_name\"))\n        name = (row.get(\"company_name\") or \"\").strip()\n        if not name:\n            out[key] = None\n            continue\n        norm = normalize_company_name(name)\n        matched: Optional[str] = None\n        for cluster in clusters:\n            if any(\n                normalize_company_name(m.get(\"company_name\") or \"\") == norm\n                for m in cluster.members\n            ):\n                matched = cluster.canonical_id\n                break\n        if not matched:\n            best_score = 0.0\n            best_id: Optional[str] = None\n            for cluster in clusters:\n                score = _FUZZY_SCORER(name, cluster.canonical_name)\n                if score > best_score:\n                    best_score = score\n                    best_id = cluster.canonical_id\n            matched = best_id if best_score >= threshold else None\n        out[key] = matched\n    return out\n\ndef join_metrics(\n    records: Sequence[Dict[str, Any]],\n    crm_rows: Sequence[Dict[str, Any]],\n    clusters: Sequence[Cluster],\n    threshold: float,\n) -> Dict[str, Any]:\n    record_to_cid = _record_to_cluster_map(clusters)\n    crm_to_cid = crm_to_canonical(crm_rows, clusters, threshold)\n\n    crm_norms = {\n        normalize_company_name((r.get(\"company_name\") or \"\"))\n        for r in crm_rows\n        if normalize_company_name(r.get(\"company_name\") or \"\")\n    }\n    crm_mapped_cids = {v for v in crm_to_cid.values() if v}\n\n    scraped_exact = 0\n    scraped_fuzzy = 0\n    for r in records:\n        norm = normalize_company_name(r.get(\"company_name\") or \"\")\n        if norm in crm_norms:\n            scraped_exact += 1\n        cid = record_to_cid.get(r[\"id\"])\n        
if cid and cid in crm_mapped_cids:\n            scraped_fuzzy += 1\n\n    crm_exact = 0\n    crm_fuzzy = 0\n    scraped_norms = {\n        normalize_company_name(r.get(\"company_name\") or \"\") for r in records\n    }\n    scraped_cids = set(record_to_cid.values())\n    for row in crm_rows:\n        norm = normalize_company_name(row.get(\"company_name\") or \"\")\n        if norm in scraped_norms:\n            crm_exact += 1\n        cid_key = str(row.get(\"id\") or row.get(\"company_name\"))\n        cid = crm_to_cid.get(cid_key)\n        if cid and cid in scraped_cids:\n            crm_fuzzy += 1\n\n    n_scraped = len(records) or 1\n    n_crm = len(crm_rows) or 1\n    return {\n        \"scraped_rows\": len(records),\n        \"crm_rows\": len(crm_rows),\n        \"canonical_clusters\": len(clusters),\n        \"exact_normalized_unique\": len(_exact_groups(records)),\n        \"scraped_exact_join_pct\": round(100.0 * scraped_exact / n_scraped, 1),\n        \"scraped_fuzzy_join_pct\": round(100.0 * scraped_fuzzy / n_scraped, 1),\n        \"crm_exact_join_pct\": round(100.0 * crm_exact / n_crm, 1),\n        \"crm_fuzzy_join_pct\": round(100.0 * crm_fuzzy / n_crm, 1),\n    }\n\ndef eval_crm_join(\n    records: Sequence[Dict[str, Any]],\n    clusters: Sequence[Cluster],\n    crm_path: Path,\n    threshold: float = DEFAULT_THRESHOLD,\n) -> Dict[str, Any]:\n    \"\"\"Load CRM file and compute join metrics against existing clusters.\"\"\"\n    crm_rows = load_crm(crm_path)\n    return join_metrics(records, crm_rows, clusters, threshold)\n```\n\n**Here’s how we measure this JOIN operation** (`join_metrics` in `post_fuzzy_eval.py`):\n\n- **Exact:** a scraped row counts as matched if its normalized `company_name` appears in the set of normalized CRM names.\n- **Fuzzy:** a scraped row counts as matched if its cluster's `canonical_id` is one that at least one CRM row resolves to (exact normalized match first, then WRatio at or above the threshold).\n\nThe CRM → scraped direction applies the same two tests in reverse.\n\nSo how did we do?\n\n| Question | Exact match | Fuzzy (WRatio ≥ 90) |\n|---|---|---|\n| Of 96 scraped rows, how many link to a CRM row? | 58.3% (56 rows) | 100% (96 rows) |\n| Of 138 CRM rows, how many link back to scraped data? | 34.8% (48 rows) | 100% (138 rows) |\n\nThe 58.3% exact baseline isn’t actually bad: over half of raw hub titles normalize to a CRM string exactly. The other 41.7%, however, need fuzzy matching via WRatio because the CRM holds legal or alternate spellings (`Necker FinTech Holdings Inc.` vs hub `Necker FinTech`, etc.) that no amount of lowercasing or other normalization will save you from.\n\nThe fuzzy pass closes the gap on this dataset at WRatio threshold 90. **WRatio is strict enough to avoid merging unrelated names while still picking up suffix and token drift**, which is just what we want!\n\nCommands below assume Python 3.10+ and a venv. All of this runs locally; the only network calls are to Bright Data during the initial fetch.\n\n```\n# Install deps  \npip install rapidfuzz requests python-dotenv  \n\n# Fetch all 4 hubs (costs API credits)  \npython fetch_hubs.py  \n\n# Already have cached responses? Re-parse for free  \npython fetch_hubs.py --parse-only  \n\n# Extract + reconcile + print CRM metrics (default: both stages)  \npython run_fuzzy.py  \n\n# Regenerate sample_crm.json from raw_records (optional)  \npython build_sample_crm.py  \n\n# Tune the threshold (try 85 for more aggressive merging)  \npython run_fuzzy.py --threshold 85  \n\n# Run individual stages  \npython run_fuzzy.py --extract  \npython run_fuzzy.py --reconcile\n```\n\nSample CLI output after a full run:\n\n```\nwrote data/raw_records.json  \n  records: 96  \n  category artificial_intelligence: 24  \n  category cybersecurity: 24  \n  category fintech: 26  \n  category saas: 22  \n\nwrote data/reconciled.json  \n\n-- join metrics (CRM) --  \n  scraped rows: 96 | exact-normalized unique: 88 | canonical clusters: 88  \n  scraped -> CRM  exact: 58.3% | fuzzy: 100.0%  \n  CRM -> scraped   exact: 34.8% | fuzzy: 100.0%  \n\n-- top 10 canonicals (by alias count) --  \n  Callaghan Innovation  (4 aliases, sources: crunchbase_hub)  \n  EISMEA  (4 aliases, sources: 
crunchbase_hub)  \n  PayTic  (2 aliases, sources: crunchbase_hub)  \n  SixThirty  (2 aliases, sources: crunchbase_hub)  \n  ...more\n```\n\nI’ve also added a diagnostic queue into the pipeline for low-confidence alias assignments: records whose WRatio against their cluster’s canonical falls *below* the threshold. This will show us merges that look suspicious and deserve a human eye:\n\n``` python\n# reconcile.py  \ndef review_queue(  \n    records: Sequence[Dict[str, Any]],  \n    clusters: Sequence[Cluster],  \n    threshold: float,  \n    limit: int = 8,  \n) -> List[Tuple[float, str, str, str]]:  \n    rid_to_cluster = {m[\"id\"]: c for c in clusters for m in c.members}  \n    lows = []  \n    for r in records:  \n        c     = rid_to_cluster.get(r[\"id\"])  \n        if c is None:  # record was never clustered; nothing to score against  \n            continue  \n        name  = r.get(\"company_name\") or \"\"  \n        score = _FUZZY_SCORER(name, c.canonical_name)  \n        if score < threshold:  \n            lows.append((score, name, c.canonical_name, c.canonical_id))  \n    lows.sort(key=lambda x: x[0])  # lowest scores first  \n    return lows[:limit]\n```\n\nIn production this would feed a human-review UI or write to a `needs_review` table. Here it just prints to stdout, but my point stands: **fuzzy matching isn't a black box. You can always surface the borderline decisions and let a human confirm them.**\n\nThat’s everything, thanks for reading!\n\n**Q: Do you need ML or vector embeddings for company name matching?**\n\n**A:** No, not for stylistic drift (legal suffixes, spacing, punctuation). Our pipeline uses RapidFuzz [fuzz.WRatio](https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html#wratio), which is a rule-based string similarity measure, not a trained model.\n\n**Q: What similarity threshold should you use with WRatio?**\n\n**A:** Start at **WRatio threshold 90**. At 90, unrelated pairs like `Stripe` vs `Climate Corp` score 45.0 and stay out, while suffix/spacing variants like `Necker FinTech` vs `Necker FinTech Holdings Inc.` score 90.0+ and merge. 
See the [score_cutoff](https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html#wratio) parameter in the docs if you want early-exit optimization.\n\n**Q: When does fuzzy matching fail for company names?**\n\n**A:** When names share almost no tokens, e.g. brand `Investing.com` vs legal entity `Fusion Media Limited` (WRatio 30.0). Use a lookup table, domain, LEI, or enrichment API instead.\n\n**Q: Why not join on `company_name` in SQL?**\n\n**A:** Because raw name joins will often miss legal variants. Resolve each row to a `canonical_id` in Python, load clusters into Postgres, and only then can you safely do a `JOIN ... USING (canonical_id)`.\n\nI should clear some things up about this tutorial.\n\n- The CRM list is synthetic: researched legal names and alternate spellings such as `ARYZE ApS`, `Count Finance LTD`, `PANTA Group`. A real CRM would actually be dirtier: misspellings, stale names, entries from multiple import sources with inconsistent formatting. In practice the fuzzy pass may not hit 100%, but it'll still get you much closer than exact matching does.\n- Brand-versus-legal-entity pairs like `Investing.com` / `Fusion Media Limited` or `Lyrie.ai` / `OTT Cybersecurity Inc.` share almost no tokens, so WRatio stays low, and that's the correct outcome. Those pairs need a lookup such as a `slug → legal_name` map. Fuzzy matching handles stylistic drift on the name itself, nothing more.\n\nThe normalize → exact-group → fuzzy-cluster → CRM join pattern I’ve described here applies directly to tasks like merging `Acme Corp`, `Acme Corporation`, and `ACME` before they become three separate accounts in your sales pipeline.\n\n**That WRatio threshold is something you should play around with.** At WRatio threshold 90 (the default in this pipeline), clearly unrelated pairs stay out (`Stripe` vs `Climate Corp` scores 45.0) while suffix and spacing drift gets in. Drop to 80 and you'll catch more variants but start seeing false positives. 
This will differ based on your dataset, obviously, and the review queue is your safety net either way.\n\n**Next step in production:** load `reconciled.json`\n\ninto Postgres, resolve each CRM row to a `canonical_id`\n\n(same logic as `_crm_to_canonical`\n\nin Python), then join on that key instead of `company_name`\n\n.\n\n```\n-- Tables loaded from pipeline output (reconciled.json + raw_records + sample_crm)  \nCREATE TABLE canonicals (  \n  canonical_id   TEXT PRIMARY KEY,  \n  canonical_name TEXT NOT NULL  \n);  \n\nCREATE TABLE entity_aliases (  \n  canonical_id TEXT NOT NULL REFERENCES canonicals (canonical_id),  \n  alias_name   TEXT NOT NULL,  \n  source       TEXT,  \n  match_score  NUMERIC,  \n  PRIMARY KEY (canonical_id, alias_name)  \n);  \n\nCREATE TABLE hub_scrape (  \n  id            TEXT PRIMARY KEY,  \n  company_name  TEXT NOT NULL,  \n  canonical_id  TEXT REFERENCES canonicals (canonical_id),  \n  category      TEXT,  \n  url           TEXT  \n);  \n\nCREATE TABLE crm_accounts (  \n  id            TEXT PRIMARY KEY,  \n  company_name  TEXT NOT NULL,  \n  canonical_id  TEXT REFERENCES canonicals (canonical_id)  -- from Python CRM mapping  \n);  \n\n-- Broken: join on raw company_name  \nSELECT COUNT(*) AS matched_rows  \nFROM   hub_scrape h  \nJOIN   crm_accounts c ON c.company_name = h.company_name;  \n-- 56 / 96 (~58%) on this dataset  \n\n-- Fixed: join on canonical_id (assigned during ETL from reconciled.json)  \nSELECT h.company_name AS hub_name,  \n       c.company_name AS crm_name,  \n       h.canonical_id  \nFROM   hub_scrape h  \nJOIN   crm_accounts c USING (canonical_id)  \nWHERE  h.company_name = 'Necker FinTech';  \n-- hub_name: Necker FinTech  \n-- crm_name: Necker FinTech Holdings Inc.  (or Necker Fin Tech, etc.)  
\n-- canonical_id: c_necker_fintech\n```\n\n1. Load `canonicals` and `entity_aliases` from `reconciled.json`.\n2. Fill `hub_scrape.canonical_id` from the `aliases` array (`id` → `canonical_id`).\n3. Fill `crm_accounts.canonical_id` with the same `_crm_to_canonical` logic you already run in Python (exact norm match, then WRatio ≥ 90).\n\nAfter that, SQL stays a plain equi-join: **fuzzy matching happens once upstream, and not inside the database.** I won’t cover the loading step here though; the *pattern* is the point, not the warehouse you choose to use.\n\nNone of this is new. Entity resolution is a well-studied problem with industrial-strength tools (Dedupe, [Splink](https://moj-analytical-services.github.io/splink/demos/examples/duckdb/deterministic_dedupe.html), various record linkage toolkits) when you need them. **But for the common case of “I have two lists of company names and I need to join them,” you really don’t.** A normalization pass and a WRatio threshold get you most of the way there in an afternoon, in pure Python, with zero infrastructure.", "url": "https://wpnews.pro/news/a-practical-guide-to-entity-resolution-in-python-no-database-no-machine-learning", "canonical_source": "https://dev.to/prithwish_nath/a-practical-guide-to-entity-resolution-in-python-no-database-no-machine-learning-3pnl", "published_at": "2026-05-26 07:36:44+00:00", "updated_at": "2026-05-26 08:04:35.059193+00:00", "lang": "en", "topics": ["machine-learning", "natural-language-processing", "ai-tools"], "entities": ["Crunchbase", "RapidFuzz", "Bright Data", "Necker FinTech", "Necker FinTech Holdings Inc.", "Investing.com", "Fusion Media Limited"], "alternates": {"html": "https://wpnews.pro/news/a-practical-guide-to-entity-resolution-in-python-no-database-no-machine-learning", "markdown": "https://wpnews.pro/news/a-practical-guide-to-entity-resolution-in-python-no-database-no-machine-learning.md", "text": 
"https://wpnews.pro/news/a-practical-guide-to-entity-resolution-in-python-no-database-no-machine-learning.txt", "jsonld": "https://wpnews.pro/news/a-practical-guide-to-entity-resolution-in-python-no-database-no-machine-learning.jsonld"}}