{"slug": "find-dup-defs-find-duplicated-python-code-at-the-speed-of-light", "title": "Find-dup-defs – find duplicated Python code at the speed of light", "summary": "A new open-source tool, find-dup-defs, identifies duplicated and near-duplicated Python code using structural AST analysis and a normalized \"Thickness\" score to rank refactoring urgency. The tool claims to be 2-12 times faster than existing solutions like PMD CPD and jscpd, and it auto-calibrates project-specific noise filters on its first run. This release follows a GitClear report analyzing 211 million lines of code, which found that copy-pasted lines grew from 8.3% to 12.3% of all changes from 2021 to 2024, while refactored lines dropped from 25% to under 10%, marking the first time copy/paste exceeded code reuse.", "body_md": "**Find the copy-pasted code your linter can't — and learn which copies to refactor first.**\n\nDuplicate & near-duplicate definitions — functions, **methods**, classes, constants, `type` aliases —\nclustered by structural AST canonicalization, ranked by a normalized **Thickness** score, graded\nERROR / WARNING / INFO, with **auto-suggested project-specific noise filters** out of the box.\n\n**2-12× faster than PMD CPD / jscpd. Calibrates itself on first run.**\n\n[GitClear's 2025 report](https://www.gitclear.com/ai_assistant_code_quality_2025_research) (211M lines\nof code analyzed): **copy-pasted lines grew from 8.3% to 12.3%** of all changes 2021→2024, while\nrefactored lines **dropped from 25% to under 10%**. For the first time in measurable history,\ncopy/paste exceeded code reuse. 
AI assistants don't know your project's `_helper.py` — they emit the\ncopy.\n\n`find-dup-defs` is the gate.\n\n``` bash\n$ find-dup-defs ./src --calibrate\n=== thickness calibration (ERROR): 76 clusters analyzed ===\n\ndistribution (each ▇ ≈ one ERROR cluster, scaled to fit):\n  T [0.0, 0.1)\n  T [0.1, 0.2)  ▇▇▇▇ 2\n  T [0.2, 0.3)  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 25\n  T [0.3, 0.4)  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 27\n  T [0.4, 0.5)  ▇▇▇▇▇▇▇▇▇ 8\n  T [0.5, 0.6)  ▇▇▇▇▇▇▇▇▇ 8\n  T [0.6, 0.7)  ▇▇▇▇▇ 4\n  T [0.7, 0.8)  ▇▇▇ 2\n\nsuggested thresholds (p50/p75/p90):\n\n  permissive   --error-thickness 0.27  →  38 ERROR remain  (median dup: 8 loc, 2 args)\n    e.g. ASTConverter.visit_Try/ASTConverter.visit_TryStar  [T=0.27, loc=16, args=2]\n         mypy/fastparse.py:1367, :1384\n    ┌──\n    │ def visit_Try(self, n: ast3.Try) -> TryStmt:\n    │     vs = [self.set_line(NameExpr(h.name), h) if h.name is not None else None\n    │           for h in n.handlers]\n    │     types = [self.visit(h.type) for h in n.handlers]\n    │     handlers = [self.as_required_block(h.body) for h in n.handlers]\n    │     node = TryStmt(...)\n    │     return self.set_line(node, n)\n    └──\n\n=== inferred directives (auto-detected noise patterns) ===\n\n  → -D 'de-escalate:*:*@*tests/*=test parametrize/fixture candidates'\n    rationale: 21 clusters live entirely in test paths\n    affects: 21 total (10 ERROR, 11 WARNING, 0 INFO)\n```\n\nOne calibrate call gives you:\n\n- Histogram of finding \"thickness\" (refactor urgency)\n- Three suggested thresholds (`permissive` / `balanced` / `strict`) with **a code sample at each level**\n- Auto-detected noise patterns as ready-to-paste `-D` directives\n\nThen your CI:\n\n```\nfind-dup-defs ./src \\\n  --error-thickness 0.5 --warning-thickness 0.4 --escalate-thickness 0.55 \\\n  -D 'de-escalate:*:*@*tests/*=test fixtures' \\\n  --errors-only --json\n```\n\n…exits non-zero only on actionable refactor candidates.\n\n```\ncargo install 
find-dup-defs\n```\n\n…or grab a prebuilt binary from the [Releases page](https://github.com/prostomarkeloff/find-dup-defs/releases/latest).\n\nEvery `.py` file is parsed **once** (Ruff parser, PEP 695 / 701 ready), each callable yielded as top-level functions **and class methods** (`Foo.bar`, `Foo.Inner.baz`), then clustered in three passes:\n\n- **name-gated** — same-`(kind, name)` defs clustered by exact Ratcliff–Obershelp similarity on the `ast.dump`-shape canonical.\n- **cross-name** — renamed copy-paste: alpha-renamed canonical bucketed, ≥2 distinct names across ≥2 sites.\n- **Type-3** (ECScan) — IDF-weighted cosine over name-agnostic lines; catches edited renamed copies the exact pass misses.\n\nBuilt-in noise handling:\n\n- `@overload` / `@abstractmethod` / Protocol stubs — bodies of `...` / `pass` / docstring filtered at extraction\n- `raise NotImplementedError` ABC declarations\n- `return False / None / 0 / \"x\"` dispatch overrides (huge cross-name FP source — gone)\n- `@property` + `@x.setter` / `.deleter` — accessor role baked into the name\n- `self` / `cls` receivers — stripped so methods can match equivalent free functions\n\nThree tiers, controllable via three thresholds:\n\n```\nERROR  ←→  WARNING  ←→  INFO\n  ↑           ↑          ↑\ngate       review     hidden by default (JSON-only, --show-info to display)\n```\n\n- `--error-thickness X` — ERROR demotes to WARNING if T < X\n- `--warning-thickness X` — WARNING demotes to INFO if T < X\n- `--escalate-thickness X` — any cluster with T ≥ X is forced to ERROR (catches fat multi-copy patterns the name-gated heuristics demote by sim alone)\n\nThickness is a normalized [0, 1] \"GET ME REFACTORED\" score combining:\n\n- **Dedup volume** = `(n_members - 1) × loc` — how many lines you'd actually delete (dominant signal)\n- **Args** — wide signatures register as architecturally chunkier\n- **Similarity** — higher-confidence dups score higher\n\n```\nT = 0.7 · sat(volume, 30) + 0.1 · sat(args, 5) + 0.2 · sim\nsat(x, k) = 1 − exp(−x/k)\n```\n\nSort findings by T → biggest 
refactor wins first.\n\nDirectives are user-authored overrides for repo-specific intentional duplication.\n\n```\nACTION : [KIND:] NAME [@PATH] [=NOTE]\n```\n\n| Action | Effect | Severity |\n|---|---|---|\n| `suppress` | Drop entirely | gone |\n| `de-escalate` | One step down | ERROR→WARNING→INFO |\n| `escalate` | One step up | INFO→WARNING→ERROR |\n| `note` | Annotate only | unchanged |\n\n```\n# Plugin no-op API: intentional, don't gate\n-D 'de-escalate:METHOD:Plugin.get_*_hook=intentional plugin no-op API'\n\n# Bootstrap copy that can't be deduplicated\n-D 'suppress:FUNCTION:spawn@*mypyc/lib-rt/*=bootstrap copy: lib-rt cannot import from mypyc'\n\n# Architectural blocker — escalate to ERROR even if thickness is mid-range\n-D 'escalate:METHOD:Lock.*@*/storage/*=Lock/LockExtend must share impl before v1.0'\n\n# Just leave a note\n-D 'note:METHOD:For*.begin_body=v2 refactor target (see issue #42)'\n```\n\nNotes show up inline:\n\n```\nDUPLICATE METHOD [ERROR]: Lock.hold/LockExtend.hold  [normalized-exact, T=0.67, n=2, loc=28]\n  # Lock/LockExtend must share impl before v1.0\n```\n\nThe `--calibrate` step pattern-matches across findings and surfaces ready-to-paste directives for repeating noise patterns. 
No manual config; just paste the suggestions you agree with into CI.\n\n| Pattern | Suggestion |\n|---|---|\n| ≥5 CONSTANT clusters where ≥80% members are in `*/locale*` | `suppress:CONSTANT:*@*locale*` |\n| ≥10 clusters where all members live in test paths | `de-escalate:*:*@*tests/*` |\n| ≥3 clusters touching `*_pb2*` / `*_grpc*` files | `suppress:*:*@*_pb2*` |\n| ≥3 clusters all under `*/migrations/*` / `*/alembic/versions/*` | `suppress:*:*@*migrations/*` |\n| ≥5 clusters all under `*/docs_src/*` / `*/examples/*` / `*/tutorial/*` | `de-escalate:*:*@*docs_src/*` |\n\nVerified on a real benchmark: **69% noise reduction** across 14 repositories ≥150K Python SLOC each.\n\nMeasured with `hyperfine --warmup 1 --runs 3` on macOS arm64 (M-series), against [jscpd@4](https://github.com/kucherenko/jscpd) and [PMD CPD 7.24](https://pmd.github.io/), both in Python mode on the same target tree:\n\n| Repo (Python files) | find-dup-defs | PMD CPD | jscpd |\n|---|---|---|---|\n| `pip` (633) | 0.18 s | 0.87 s (4.9×) | 3.21 s (18.2×) |\n| `mypy/mypy` (155) | 0.18 s | 0.81 s (4.6×) | 1.47 s (8.4×) |\n| `sympy` (1 589) | 1.22 s | 4.29 s (3.5×) | 15.18 s (12.4×) |\n| `django` (2 910) | 1.01 s | 2.08 s (2.1×) | 9.67 s (9.6×) |\n\nPMD ran with `--minimum-tokens=100`; jscpd with defaults (min-lines=5, min-tokens=50). 
`find-dup-defs` does **more semantic work** (alpha-renamed canonical, IDF cosine, severity grading, calibration) and is still **2-12× faster** end-to-end — Rust + rayon-parallel extraction, single-parse Ruff frontend, no JVM/Node startup tax.\n\nOn `django` (426K SLOC, 2 910 files):\n\n- `find-dup-defs`: **~422K SLOC/sec**\n- PMD CPD: ~205K SLOC/sec\n- jscpd: ~44K SLOC/sec\n\nAcross 14 repos with ≥150K SLOC + 14 with 50K-150K each (≈8M SLOC total), the directives auto-applied from `find-dup-defs --calibrate` reduce raw ERROR count by **67% on average**:\n\n| Repo | Raw ERROR | After CI flags + auto-inferred | %cut | Top remaining cluster |\n|---|---|---|---|---|\n| django/django | 559 | 71 | 87% | `TupleGreaterThan.get_fallback_sql` (n=4 SQL ops) |\n| wagtail/wagtail | 496 | 65 | 86% | `set_privacy` (n=2) |\n| apache/airflow | 2203 | 337 | 84% | `CloudComposerGetEnvironmentOperator` (n=18) |\n| home-assistant/core | 4475 | 850 | 81% | `ConfigFlow.async_step_*` (n=178) |\n| pandas-dev/pandas | 406 | 78 | 80% | `read_csv/read_table` (n=2) |\n| scipy/scipy | 492 | 140 | 71% | `dct/dst/idct/idst` (n=4) |\n| numpy/numpy | 316 | 96 | 69% | `std/var` (n=2) |\n\nTop findings on this corpus are textbook PR candidates:\n\n- `pip` Version `__lt__/__le__/__eq__/__ge__/__gt__` × 6 — minus 130 lines via one `_compare` helper\n- `scipy` `dct/dst/idct/idst` × 4 — minus ~330 lines via factory generator\n- `django` `TupleGreaterThan/...` × 4 — minus ~75 lines via a `TupleLookupMixin` method\n- `scikit-learn` `BaseSGDClassifier._fit / BaseSGDRegressor._fit` — classic dupe between sibling estimators\n\nOne-shot workflow for autonomous refactor agents:\n\n```\n# 1. Calibrate (1-30s)\nfind-dup-defs ./repo --calibrate --json > calib.json\n\n# 2. Agent reads calib.json:\n#    - inferred_directives[]            → -D flags ready to paste\n#    - error.suggestions[].error_thickness → starting threshold\n#    - warning.suggestions[].error_thickness → WARNING threshold\n\n# 3. 
Full scan with agent's chosen tuning\n# (bash) pass each inferred directive as its own -D argument so notes containing spaces survive word splitting\nDIRECTIVES=()\nwhile IFS= read -r d; do DIRECTIVES+=(-D \"$d\"); done \\\n  < <(jq -r '.inferred_directives[].directive' calib.json)\nfind-dup-defs ./repo \\\n  --error-thickness <calib> --warning-thickness <calib> \\\n  \"${DIRECTIVES[@]}\" \\\n  --errors-only --json > findings.json\n\n# 4. For each finding, agent has everything to write a PR:\n#    - groups[].snippet     — full source of one representative member\n#    - groups[].members[]   — every duplicate location (file:line)\n#    - groups[].thickness   — refactor priority\n#    - groups[].notes[]     — directive-attached annotations\n```\n\nNo file-system roundtrips needed — the snippet ships in JSON.\n\n``` bash\n$ find-dup-defs ./mypy\n--- duplicate functions (cross-file, AST sim warn=0.5 error=0.85) ---\nDUPLICATE FUNCTION [ERROR]: generate_hash_wrapper/generate_len_wrapper\n  [normalized-exact, T=0.63, n=2, loc=24, args=3]\n  mypyc/codegen/emitwrapper.py:546, :573\n\n--- duplicate methods (cross-file, ...) ---\nDUPLICATE METHOD [ERROR]: CallableType.formal_arguments/Parameters.formal_arguments\n  [normalized-exact, T=0.56, n=2, loc=19, args=2]\n  mypy/types.py:1992, :2341\n\n--- duplicate methods (cross-name, exact AST-normalized) ---\nDUPLICATE METHOD [ERROR]: For{Async,,Native}Iterable.begin_body\n  [normalized-exact, T=0.53, n=3, loc=9, args=1]\n  mypyc/irbuild/for_helpers.py:716, :778, :848\n\n# summary: 14 ERROR, 47 WARNING groups\n```\n\n```\nUSAGE:\n  find-dup-defs [OPTIONS] <PATHS>...\n\nTHICKNESS LADDER:\n  --error-thickness <F>      Demote ERROR → WARNING if T < this (default 0.0 = off)\n  --warning-thickness <F>    Demote WARNING → INFO  if T < this (default 0.0 = off)\n  --escalate-thickness <F>   Promote anything → ERROR if T ≥ this (default 0.0 = off)\n\nSIMILARITY (name-gated):\n  -t, --threshold <F>        Cluster floor (default 0.5)\n  -e, --error-threshold <F>  ERROR floor   (default 0.85)\n  --type3-theta <F>          Type-3 cosine threshold (default 0.7)\n\nFILTERS:\n  -D, --directive <S>        
ACTION:[KIND:]NAME[@PATH][=NOTE], repeatable\n  --kinds <K1,K2,...>        functions,methods,classes,constants,type-aliases\n  --min-size <N>             Only clusters with ≥ N members (default 2)\n  --errors-only              Filter output to ERROR severity\n  --show-info                Include INFO in human report (default hidden)\n\nMODES:\n  --calibrate                Print histogram + suggestions + inferred directives\n  --json                     Machine-readable output\n  --repo-root <PATH>         Path-prefix for short paths in report (default .)\n\nOUTPUT:\n  --no-cross-name            Skip pass 2\n  --no-type3                 Skip pass 3\n```\n\nTwo crates, each useful on its own:\n\n| crate | role |\n|---|---|\n| `find-dup-defs` | |\n| `py-canon` | **Python frontend** — Ruff parse → `ast.dump`-shape canonical + def scan |\n\nThe name is **language-agnostic on purpose**: Python is the current frontend; other languages can be added later as further frontend crates feeding the same CLI. 
The similarity engine is the exact Ratcliff–Obershelp port [difflib-fast](https://github.com/prostomarkeloff/difflib-fast).\n\n- **Python only** today (frontend crate per language; PRs welcome)\n- **Type 4 (semantic equivalence, different syntax → same logic)** — not done; neural-network research territory\n- **Token-level fine-grained duplication** (a 30-token sub-expression copy-pasted around) — out of scope; use jscpd / PMD CPD alongside if you need that\n- **Calibration is heuristic** — formula constants (loc=20, args=5, weight 0.7/0.1/0.2) were tuned on the 28-repo benchmark; your codebase may want different values\n\n**Copy-paste has nowhere to hide.**\n\nMade with ⚡ by [@prostomarkeloff](https://github.com/prostomarkeloff)", "url": "https://wpnews.pro/news/find-dup-defs-find-duplicated-python-code-at-the-speed-of-light", "canonical_source": "https://github.com/prostomarkeloff/find-dup-defs", "published_at": "2026-05-26 09:57:22+00:00", "updated_at": "2026-05-26 10:10:35.058987+00:00", "lang": "en", "topics": ["ai-tools"], "entities": ["GitClear"], "alternates": {"html": "https://wpnews.pro/news/find-dup-defs-find-duplicated-python-code-at-the-speed-of-light", "markdown": "https://wpnews.pro/news/find-dup-defs-find-duplicated-python-code-at-the-speed-of-light.md", "text": "https://wpnews.pro/news/find-dup-defs-find-duplicated-python-code-at-the-speed-of-light.txt", "jsonld": "https://wpnews.pro/news/find-dup-defs-find-duplicated-python-code-at-the-speed-of-light.jsonld"}}