Find-dup-defs – find duplicated Python code at the speed of light

A new open-source tool, find-dup-defs, identifies duplicated and near-duplicated Python code using structural AST analysis and a normalized "Thickness" score to rank refactoring urgency. The tool claims to be 2-12 times faster than existing solutions like PMD CPD and jscpd, and it auto-calibrates project-specific noise filters on its first run. This release follows a GitClear report analyzing 211 million lines of code, which found that copy-pasted lines grew from 8.3% to 12.3% of all changes from 2021 to 2024, while refactored lines dropped from 25% to under 10%, marking the first time copy/paste exceeded code reuse.

Finds the copy-pasted code your linter can't, and tells you which copies to refactor first.

Duplicate and near-duplicate definitions (functions, methods, classes, constants, type aliases) are clustered by structural AST canonicalization, ranked by a normalized Thickness score, and graded ERROR / WARNING / INFO, with auto-suggested project-specific noise filters out of the box. 2-12× faster than PMD CPD / jscpd. Calibrates itself on first run.

[GitClear's 2025 report](https://www.gitclear.com/ai_assistant_code_quality_2025_research) (211M lines of code analyzed): copy-pasted lines grew from 8.3% to 12.3% of all changes 2021→2024, while refactored lines dropped from 25% to under 10%. For the first time in measurable history, copy/paste exceeded code reuse. AI assistants don't know your project's `helper.py`, so they emit the copy. find-dup-defs is the gate.

```bash
$ find-dup-defs ./src --calibrate

=== thickness calibration (ERROR): 76 clusters analyzed ===

distribution (each ▇ ≈ one ERROR cluster, scaled to fit):
  T [0.0, 0.1)
  T [0.1, 0.2)  ▇▇▇▇ 2
  T [0.2, 0.3)  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 25
  T [0.3, 0.4)  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 27
  T [0.4, 0.5)  ▇▇▇▇▇▇▇▇▇ 8
  T [0.5, 0.6)  ▇▇▇▇▇▇▇▇▇ 8
  T [0.6, 0.7)  ▇▇▇▇▇ 4
  T [0.7, 0.8)  ▇▇▇ 2

suggested thresholds (p50/p75/p90):

  permissive  --error-thickness 0.27  → 38 ERROR remain (median dup: 8 loc, 2 args)
    e.g. ASTConverter.visit_Try/ASTConverter.visit_TryStar  (T=0.27, loc=16, args=2)
         mypy/fastparse.py:1367, :1384
         ┌──
         │ def visit_Try(self, n: ast3.Try) -> TryStmt:
         │     vs = [self.set_line(NameExpr(h.name), h) if h.name is not None else None
         │           for h in n.handlers]
         │     types = [self.visit(h.type) for h in n.handlers]
         │     handlers = [self.as_required_block(h.body) for h in n.handlers]
         │     node = TryStmt(...)
         │     return self.set_line(node, n)
         └──

=== inferred directives (auto-detected noise patterns) ===

  → -D 'de-escalate:*:*@*tests/*=test parametrize/fixture candidates'
    rationale: 21 clusters live entirely in test paths
    affects:   21 total (10 ERROR, 11 WARNING, 0 INFO)
```

One calibrate call gives you:

- Histogram of finding "thickness" (refactor urgency)
- Three suggested thresholds (permissive / balanced / strict) with a code sample at each level
- Auto-detected noise patterns as ready-to-paste `-D` directives

Then your CI:

```bash
find-dup-defs ./src \
    --error-thickness 0.5 --warning-thickness 0.4 --escalate-thickness 0.55 \
    -D 'de-escalate:*:*@*tests/*=test fixtures' \
    --errors-only --json
```

…exits non-zero only on actionable refactor candidates.

```bash
cargo install find-dup-defs
```

…or grab a prebuilt binary from the [Releases page](https://github.com/prostomarkeloff/find-dup-defs/releases/latest).

Every `.py` file is parsed once (Ruff parser, PEP 695 / 701 ready), and each callable is yielded as a top-level function or class method (`Foo.bar`, `Foo.Inner.baz`). Three passes then run:

1. **name-gated**: definitions with the same (kind, name) are clustered by exact Ratcliff–Obershelp similarity on the `ast.dump`-shape canonical.
2. **cross-name**: renamed copy-paste. The alpha-renamed canonical is bucketed; a bucket needs ≥2 distinct names across ≥2 sites.
3. **Type-3 (ECScan)**: IDF-weighted cosine over name-agnostic lines; catches edited renamed copies the exact pass misses.

Filtered or normalized at extraction:

- `@overload` / `@abstractmethod` / Protocol stubs: bodies of `...` / `pass` / docstring are filtered at extraction
- `raise NotImplementedError` ABC declarations
- `return False` / `None` / `0` / `"x"` dispatch overrides (a huge cross-name false-positive source): gone
- `@property` + `@x.setter` / `.deleter`: accessor role baked into the name
- `self` / `cls` receivers: stripped so methods can match equivalent free functions

Three tiers, controllable via three thresholds:

```
ERROR  ←→  WARNING  ←→  INFO
  ↑           ↑           ↑
 gate       review      hidden by default (JSON-only, --show-info to display)
```

- `--error-thickness X`: ERROR demotes to WARNING if T < X
- `--warning-thickness X`: WARNING demotes to INFO if T < X
- `--escalate-thickness X`: any cluster with T ≥ X is forced to ERROR (catches fat multi-copy patterns the name-gated heuristics demote by sim alone)

Thickness is a normalized 0-to-1 "GET ME REFACTORED" score combining:

- **Dedup volume** = (n_members - 1) × loc: how many lines you'd actually delete (dominant signal)
- **Args**: wide signatures register as architecturally chunkier
- **Similarity**: higher-confidence dups score higher

```
T = 0.7 · sat(volume, 30) + 0.1 · sat(args, 5) + 0.2 · sim
sat(x, k) = 1 − exp(−x/k)
```

Sort findings by T → biggest refactor wins first.

Directives (`-D`) are user-authored overrides for repo-specific intentional duplication, written as `ACTION:[KIND:]NAME[@PATH][=NOTE]`:

| Action | Effect | Severity |
|---|---|---|
| `suppress` | Drop entirely | gone |
| `de-escalate` | One step down | ERROR→WARNING→INFO |
| `escalate` | One step up | INFO→WARNING→ERROR |
| `note` | Annotate only | unchanged |

```bash
# Plugin no-op API: intentional, don't gate
-D 'de-escalate:METHOD:Plugin.get_*_hook=intentional plugin no-op API'

# Bootstrap copy that can't be deduplicated
-D 'suppress:FUNCTION:spawn@*mypyc/lib-rt/*=bootstrap copy: lib-rt cannot import from mypyc'

# Architectural blocker: escalate to ERROR even if thickness is mid-range
-D 'escalate:METHOD:Lock.*@*/storage/*=Lock/LockExtend must share impl before v1.0'

# Just leave a note
-D 'note:METHOD:For*.begin_body=v2 refactor target (see issue #42)'
```

Notes show up inline:

```
DUPLICATE METHOD ERROR: Lock.hold/LockExtend.hold  (normalized-exact, T=0.67, n=2, loc=28)
    Lock/LockExtend must share impl before v1.0
```

The `--calibrate` step pattern-matches across findings and surfaces ready-to-paste directives for repeating noise patterns. No manual config; just paste the suggestions you agree with into CI.

| Pattern | Suggestion |
|---|---|
| ≥5 CONSTANT clusters where ≥80% of members are in `/locale` | `suppress:CONSTANT:*@*locale*` |
| ≥10 clusters where all members live in test paths | `de-escalate:*:*@*tests/*` |
| ≥3 clusters touching `_pb2` / `_grpc` files | `suppress:*:*@*_pb2*` |
| ≥3 clusters all under `/migrations/` / `/alembic/versions/` | `suppress:*:*@*migrations/*` |
| ≥5 clusters all under `/docs_src/` / `/examples/` / `/tutorial/` | `de-escalate:*:*@*docs_src/*` |

Verified on a real benchmark: 69% noise reduction across 14 repositories (≥150K Python SLOC each).

`hyperfine --warmup 1 --runs 3` on macOS arm64 (M-series), against [jscpd@4](https://github.com/kucherenko/jscpd) and [PMD CPD 7.24](https://pmd.github.io/), both in Python mode on the same target tree:

| Repo (Python files) | find-dup-defs | PMD CPD | jscpd |
|---|---|---|---|
| pip (633) | 0.18 s | 0.87 s (4.9×) | 3.21 s (18.2×) |
| mypy/mypy (155) | 0.18 s | 0.81 s (4.6×) | 1.47 s (8.4×) |
| sympy (1,589) | 1.22 s | 4.29 s (3.5×) | 15.18 s (12.4×) |
| django (2,910) | 1.01 s | 2.08 s (2.1×) | 9.67 s (9.6×) |

PMD ran with `--minimum-tokens=100`; jscpd ran with defaults (`min-lines=5`, `min-tokens=50`). find-dup-defs does more semantic work (alpha-renamed canonical, IDF cosine, severity grading, calibration) and is still 3-12× faster end-to-end: Rust + rayon-parallel extraction, single-parse Ruff frontend, no JVM/Node startup tax.
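To make the "alpha-renamed canonical" idea mentioned above more concrete, here is a rough Python sketch built only on the standard library. It is not the tool's Rust implementation: `AlphaRenamer`, `canonical`, and `similarity` are hypothetical names, every identifier (including builtins) is renamed, attribute names are left alone, and `difflib.SequenceMatcher` stands in for the Ratcliff–Obershelp engine the README describes.

```python
import ast
import difflib


class AlphaRenamer(ast.NodeTransformer):
    """Replace identifiers with positional placeholders (v0, v1, ...)."""

    def __init__(self) -> None:
        self.names: dict[str, str] = {}

    def _placeholder(self, name: str) -> str:
        # The first distinct name seen becomes v0, the second v1, and so on.
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        node.name = "_"  # the definition's own name is ignored
        self.generic_visit(node)
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self._placeholder(node.arg)
        self.generic_visit(node)
        return node

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self._placeholder(node.id)
        return node


def canonical(source: str) -> str:
    """Name-agnostic ast.dump of a snippet (line numbers are not included)."""
    return ast.dump(AlphaRenamer().visit(ast.parse(source)))


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two canonical forms."""
    # autojunk=False: the popularity heuristic distorts ratios on long dumps.
    matcher = difflib.SequenceMatcher(None, canonical(a), canonical(b), autojunk=False)
    return matcher.ratio()


SRC_A = "def total(xs):\n    acc = 0\n    for x in xs:\n        acc += x\n    return acc\n"
SRC_B = "def sum_all(items):\n    result = 0\n    for item in items:\n        result += item\n    return result\n"
SRC_C = "def sum_doubled(items):\n    result = 0\n    for item in items:\n        result += item\n    return result * 2\n"

print(canonical(SRC_A) == canonical(SRC_B))  # True: a pure rename
print(round(similarity(SRC_A, SRC_C), 2))    # below 1.0: a renamed and edited copy
```

Exact equality of the canonical form corresponds to what the cross-name pass buckets on, while the ratio illustrates the kind of score a similarity threshold would be applied to.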
On django (426K SLOC, 2,910 files):

- find-dup-defs: ~422K SLOC/sec
- PMD CPD: ~205K SLOC/sec
- jscpd: ~44K SLOC/sec

Across 14 repos with ≥150K SLOC plus 14 with 50K-150K each (≈8M SLOC total), the directives auto-applied from `find-dup-defs --calibrate` reduce the raw ERROR count by 67% on average:

| Repo | Raw ERROR | After CI flags + auto-inferred | %cut | Top remaining cluster |
|---|---|---|---|---|
| django/django | 559 | 71 | 87% | `TupleGreaterThan.get_fallback_sql` (n=4 SQL ops) |
| wagtail/wagtail | 496 | 65 | 86% | `set_privacy` (n=2) |
| apache/airflow | 2203 | 337 | 84% | `CloudComposerGetEnvironmentOperator` (n=18) |
| home-assistant/core | 4475 | 850 | 81% | `ConfigFlow.async_step` (n=178) |
| pandas-dev/pandas | 406 | 78 | 80% | `read_csv/read_table` (n=2) |
| scipy/scipy | 492 | 140 | 71% | `dct/dst/idct/idst` (n=4) |
| numpy/numpy | 316 | 96 | 69% | `std/var` (n=2) |

Top findings on this corpus are textbook PR candidates:

- `Version` (pip) `__lt__` / `__le__` / `__eq__` / `__ge__` / `__gt__` × 6: minus 130 lines via one compare helper
- scipy `dct/dst/idct/idst` × 4: minus ~330 lines via a factory generator
- django `TupleGreaterThan/...` × 4: minus ~75 lines via a `TupleLookupMixin` method
- scikit-learn `BaseSGDClassifier._fit` / `BaseSGDRegressor._fit`: classic dupe between sibling estimators

One-shot workflow for autonomous refactor agents:

```bash
# 1. Calibrate (1-30s)
find-dup-defs ./repo --calibrate --json > calib.json

# 2. Agent reads calib.json:
#    - inferred_directives[]                  → -D flags ready to paste
#    - error.suggestions[].error_thickness    → starting threshold
#    - warning.suggestions[].error_thickness  → WARNING threshold

# 3. Full scan with agent's chosen tuning
find-dup-defs ./repo \
    --error-thickness <calib> --warning-thickness <calib> \
    $(jq -r '.inferred_directives[].directive | "-D \"" + . + "\""' calib.json) \
    --errors-only --json > findings.json

# 4. For each finding, agent has everything to write a PR:
#    - groups[].snippet    : full source of one representative member
#    - groups[].members    : every duplicate location (file:line)
#    - groups[].thickness  : refactor priority
#    - groups[].notes      : directive-attached annotations
```

No file-system roundtrips needed: the snippet ships in JSON.

```bash
$ find-dup-defs ./mypy

--- duplicate functions (cross-file, AST sim warn=0.5 error=0.85) ---
DUPLICATE FUNCTION ERROR: generate_hash_wrapper/generate_len_wrapper
    (normalized-exact, T=0.63, n=2, loc=24, args=3)
    mypyc/codegen/emitwrapper.py:546, :573

--- duplicate methods (cross-file, ...) ---
DUPLICATE METHOD ERROR: CallableType.formal_arguments/Parameters.formal_arguments
    (normalized-exact, T=0.56, n=2, loc=19, args=2)
    mypy/types.py:1992, :2341

--- duplicate methods (cross-name, exact AST-normalized) ---
DUPLICATE METHOD ERROR: For{Async,,Native}Iterable.begin_body
    (normalized-exact, T=0.53, n=3, loc=9, args=1)
    mypyc/irbuild/for_helpers.py:716, :778, :848

summary: 14 ERROR, 47 WARNING groups
```

```
USAGE: find-dup-defs [OPTIONS] <PATHS>...

THICKNESS LADDER:
      --error-thickness <F>     Demote ERROR → WARNING if T < this (default 0.0 = off)
      --warning-thickness <F>   Demote WARNING → INFO if T < this (default 0.0 = off)
      --escalate-thickness <F>  Promote anything → ERROR if T ≥ this (default 0.0 = off)

SIMILARITY (name-gated):
  -t, --threshold <F>           Cluster floor (default 0.5)
  -e, --error-threshold <F>     ERROR floor (default 0.85)
      --type3-theta <F>         Type-3 cosine threshold (default 0.7)

FILTERS:
  -D, --directive <S>           ACTION:[KIND:]NAME[@PATH][=NOTE], repeatable
      --kinds <K1,K2,...>       functions,methods,classes,constants,type-aliases
      --min-size <N>            Only clusters with ≥ N members (default 2)
      --errors-only             Filter output to ERROR severity
      --show-info               Include INFO in human report (default hidden)

MODES:
      --calibrate               Print histogram + suggestions + inferred directives
      --json                    Machine-readable output
      --repo-root <PATH>        Path-prefix for short paths in report (default .)

OUTPUT:
      --no-cross-name           Skip pass 2
      --no-type3                Skip pass 3
```

Two crates, each useful on its own: `find-dup-defs` and `py-canon`, the Python frontend (Ruff parse → `ast.dump`-shape canonical + def scan).

The name is language-agnostic on purpose: Python is the current frontend; other languages can be added later as further frontend crates feeding the same CLI. The similarity engine is the exact Ratcliff–Obershelp port [difflib-fast](https://github.com/prostomarkeloff/difflib-fast).

Known limits:

- Python only today (one frontend crate per language; PRs welcome)
- Type 4 (semantic equivalence: different syntax → same logic): not done; neural-network research territory
- Token-level fine-grained duplication (a 30-token sub-expression copy-pasted around): out of scope; use jscpd / PMD CPD alongside if you need that
- Calibration is heuristic: the formula constants (loc=20, args=5, weight 0.7/0.1/0.2) were tuned on the 28-repo benchmark; your codebase may want different values

Copy-paste has nowhere to hide.

Made with ⚡ by [@prostomarkeloff](https://github.com/prostomarkeloff)
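Since the Thickness constants are described as heuristic, it can help to see how they interact. The following is a minimal Python sketch of the formula as stated in this README, not the tool's Rust code: `ThicknessParams`, `sat`, and `thickness` are hypothetical names, and treating `normalized-exact` matches as similarity 1.0 is an assumption. Under that assumption, the three clusters from the sample mypy report come out at the T values printed there.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ThicknessParams:
    """Constants from the README formula; adjust to experiment."""
    volume_k: float = 30.0  # saturation scale for dedup volume
    args_k: float = 5.0     # saturation scale for argument count
    w_volume: float = 0.7
    w_args: float = 0.1
    w_sim: float = 0.2


def sat(x: float, k: float) -> float:
    """Saturating curve: 0 at x=0, approaching 1 as x grows well past k."""
    return 1.0 - math.exp(-x / k)


def thickness(n_members: int, loc: int, args: int, sim: float,
              p: ThicknessParams = ThicknessParams()) -> float:
    """T = w_volume*sat(volume) + w_args*sat(args) + w_sim*sim."""
    volume = (n_members - 1) * loc  # lines that would actually be deleted
    return (p.w_volume * sat(volume, p.volume_k)
            + p.w_args * sat(args, p.args_k)
            + p.w_sim * sim)


# (name, n_members, loc, args) taken from the sample mypy report.
clusters = [
    ("For{Async,,Native}Iterable.begin_body", 3, 9, 1),
    ("generate_hash_wrapper/generate_len_wrapper", 2, 24, 3),
    ("CallableType.formal_arguments/Parameters.formal_arguments", 2, 19, 2),
]

# Assumption: a normalized-exact match has similarity 1.0.
ranked = sorted(((thickness(n, loc, a, 1.0), name) for name, n, loc, a in clusters),
                reverse=True)
for t, name in ranked:
    print(f"T={t:.2f}  {name}")
# T=0.63  generate_hash_wrapper/generate_len_wrapper
# T=0.56  CallableType.formal_arguments/Parameters.formal_arguments
# T=0.53  For{Async,,Native}Iterable.begin_body
```

Passing a different `ThicknessParams` instance is one way to explore how a change in the weights or saturation scales would reorder findings on your own codebase.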