Find-dup-defs – find duplicated Python code at the speed of light

wpnews.pro

Find the copy-pasted code your linter can't — and tell you which copies to refactor first.

Duplicate & near-duplicate definitions — functions, methods, classes, constants, type

aliases — clustered by structural AST canonicalization, ranked by a normalized Thickness score, graded ERROR / WARNING / INFO, with auto-suggested project-specific noise filters out of the box.

2-12× faster than PMD CPD / jscpd. Calibrates itself on first run.

GitClear's 2025 report (211M lines of code analyzed): copy-pasted lines grew from 8.3% to 12.3% of all changes 2021→2024, while refactored lines dropped from 25% to under 10%. For the first time in measurable history, copy/paste exceeded code reuse. AI assistants don't know your project's _helper.py

— they emit the copy.

find-dup-defs

is the gate.

$ find-dup-defs ./src --calibrate
=== thickness calibration (ERROR): 76 clusters analyzed ===

distribution (each ▇ ≈ one ERROR cluster, scaled to fit):
  T [0.0, 0.1)
  T [0.1, 0.2)  ▇▇▇▇ 2
  T [0.2, 0.3)  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 25
  T [0.3, 0.4)  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 27
  T [0.4, 0.5)  ▇▇▇▇▇▇▇▇▇ 8
  T [0.5, 0.6)  ▇▇▇▇▇▇▇▇▇ 8
  T [0.6, 0.7)  ▇▇▇▇▇ 4
  T [0.7, 0.8)  ▇▇▇ 2

suggested thresholds (p50/p75/p90):

  permissive   --error-thickness 0.27  →  38 ERROR remain  (median dup: 8 loc, 2 args)
    e.g. ASTConverter.visit_Try/ASTConverter.visit_TryStar  [T=0.27, loc=16, args=2]
         mypy/fastparse.py:1367, :1384
    ┌──
    │ def visit_Try(self, n: ast3.Try) -> TryStmt:
    │     vs = [self.set_line(NameExpr(h.name), h) if h.name is not None else None
    │           for h in n.handlers]
    │     types = [self.visit(h.type) for h in n.handlers]
    │     handlers = [self.as_required_block(h.body) for h in n.handlers]
    │     node = TryStmt(...)
    │     return self.set_line(node, n)
    └──

=== inferred directives (auto-detected noise patterns) ===

  → -D 'de-escalate:*:*@*tests/*=test parametrize/fixture candidates'
    rationale: 21 clusters live entirely in test paths
    affects: 21 total (10 ERROR, 11 WARNING, 0 INFO)

One calibrate call gives you:

Histogram of finding "thickness" (refactor urgency)
Three suggested thresholds ( permissive

/balanced

/strict

) witha code sample at each level - Auto-detected noise patterns as ready-to-paste -D

directives

Then your CI:

find-dup-defs ./src \
  --error-thickness 0.5 --warning-thickness 0.4 --escalate-thickness 0.55 \
  -D 'de-escalate:*:*@*tests/*=test fixtures' \
  --errors-only --json

…exits non-zero only on actionable refactor candidates.

cargo install find-dup-defs

…or grab a prebuilt binary from the Releases page.

Every .py

file is parsed once (Ruff parser, PEP 695 / 701 ready), each callable yielded as top-level functions and class methods (Foo.bar

, Foo.Inner.baz

):

name-gated— same-(kind, name)

defs clustered by exact Ratcliff–Obershelp similarity on theast.dump

-shape canonical.cross-name— renamed copy-paste: alpha-renamed canonical bucketed, ≥2 distinct names across ≥2 sites.** Type-3**(ECScan) — IDF-weighted cosine over name-agnostic lines; catches edited renamed copies the exact pass misses.

@overload

/@abstractmethod

/ Protocol stubs — bodies of...

/pass

/ docstring filtered at extractionraise NotImplementedError

ABC declarationsreturn False / None / 0 / "x"

dispatch overrides (huge cross-name FP source — gone)@property

+@x.setter

/.deleter

— accessor role baked into the nameself

/cls

receivers — stripped so methods can match equivalent free functions

Three tiers, controllable via three thresholds:

ERROR  ←→  WARNING  ←→  INFO
  ↑           ↑          ↑
gate       review     hidden by default (JSON-only, --show-info to display)

--error-thickness X

— ERROR demotes to WARNING if T < X--warning-thickness X

— WARNING demotes to INFO if T < X--escalate-thickness X

— any cluster with T ≥ X is forced to ERROR (catches fat multi-copy patterns the name-gated heuristics demote by sim alone)

A normalized [0, 1] "GET ME REFACTORED" score combining:

Dedup volume=(n_members - 1) × loc

— how many lines you'd actually delete (dominant signal)Args— wide signatures register as architecturally chunkier** Similarity**— higher confidence dups score higher

T = 0.7 · sat(volume, 30) + 0.1 · sat(args, 5) + 0.2 · sim
sat(x, k) = 1 − exp(−x/k)

Sort findings by T → biggest refactor wins first.

User-authored overrides for repo-specific intentional duplication.

ACTION : [KIND:] NAME [@PATH] [=NOTE]

Action	Effect	Severity
`suppress`
Drop entirely	gone
`de-escalate`
One step down	ERROR→WARNING→INFO
`escalate`
One step up	INFO→WARNING→ERROR
`note`
Annotate only	unchanged

-D 'de-escalate:METHOD:Plugin.get_*_hook=intentional plugin no-op API'

-D 'suppress:FUNCTION:spawn@*mypyc/lib-rt/*=bootstrap copy: lib-rt cannot import from mypyc'

-D 'escalate:METHOD:Lock.*@*/storage/*=Lock/LockExtend must share impl before v1.0'

-D 'note:METHOD:For*.begin_body=v2 refactor target (see issue #42)'

Notes show up inline:

DUPLICATE METHOD [ERROR]: Lock.hold/LockExtend.hold  [normalized-exact, T=0.67, n=2, loc=28]

The --calibrate

step pattern-matches across findings and surfaces ready-to-paste directives for repeating noise patterns. No manual config; just paste suggestions you agree with into CI.

Pattern	Suggestion
≥5 CONSTANT clusters where ≥80% members are in `/locale`
`suppress:CONSTANT:@locale*`
≥10 clusters where all members live in test paths	`de-escalate::@tests/`
≥3 clusters touching `_pb2` / `_grpc` files
`suppress::@_pb2`
≥3 clusters all under `/migrations/` / `/alembic/versions/`
`suppress::@migrations/`
≥5 clusters all under `/docs_src/` / `/examples/` / `/tutorial/`
`de-escalate::@docs_src/`

Verified on real benchmark: 69% noise reduction across 14 repositories ≥150K Python SLOC each.

hyperfine --warmup 1 --runs 3

on macOS arm64 (M-series), against jscpd@4 and

PMD CPD 7.24— both Python-mode, same target tree:

Repo (Python files)	find-dup-defs	PMD CPD	jscpd
`pip` (633)
0.18 s
0.87 s (4.9×)	3.21 s (18.2×)
`mypy/mypy` (155)
0.18 s
0.81 s (4.6×)	1.47 s (8.4×)
`sympy` (1 589)
1.22 s
4.29 s (3.5×)	15.18 s (12.4×)
`django` (2 910)
1.01 s
2.08 s (2.1×)	9.67 s (9.6×)

PMD ran with --minimum-tokens=100

; jscpd with defaults (min-lines=5, min-tokens=50). find-dup-defs

does more semantic work (alpha-renamed canonical, IDF cosine, severity grading, calibration) and is still 3-12× faster end-to-end — Rust + rayon-parallel extraction, single-parse Ruff frontend, no JVM/Node startup tax.

On django

(426K SLOC, 2 910 files):

find-dup-defs

:~422K SLOC/sec- PMD CPD: ~205K SLOC/sec

jscpd: ~44K SLOC/sec

Across 14 repos with ≥150K SLOC + 14 with 50K-150K each (≈8M SLOC total), find-dup-defs --calibrate

auto-applied directives reduce raw ERROR count by 67% on average:

Repo	Raw ERROR	After CI flags + auto-inferred	%cut	Top remaining cluster
django/django	559	71	87%	`TupleGreaterThan.get_fallback_sql` (n=4 SQL ops)
wagtail/wagtail	496	65	86%	`set_privacy` (n=2)
apache/airflow	2203	337	84%	`CloudComposerGetEnvironmentOperator` (n=18)
home-assistant/core	4475	850	81%	`ConfigFlow.async_step_*` (n=178)
pandas-dev/pandas	406	78	80%	`read_csv/read_table` (n=2)
scipy/scipy	492	140	71%	`dct/dst/idct/idst` (n=4)
numpy/numpy	316	96	69%	`std/var` (n=2)

Top findings on this corpus are textbook PR candidates:

Versionpip

__lt__/__le__/__eq__/__ge__/__gt__

× 6 — minus 130 lines via one_compare

helperscipy

dct/dst/idct/idst

× 4 — minus ~330 lines via factory generatordjango

TupleGreaterThan/...

× 4 — minus ~75 lines via aTupleLookupMixin

methodscikit-learn

BaseSGDClassifier._fit / BaseSGDRegressor._fit

— classic dupe between sibling estimators

One-shot workflow for autonomous refactor agents:

find-dup-defs ./repo --calibrate --json > calib.json


find-dup-defs ./repo \
  --error-thickness <calib> --warning-thickness <calib> \
  $(jq -r '.inferred_directives[].directive | "-D \"" + . + "\""' calib.json) \
  --errors-only --json > findings.json

No file-system roundtrips needed — the snippet ships in JSON.

$ find-dup-defs ./mypy
--- duplicate functions (cross-file, AST sim warn=0.5 error=0.85) ---
DUPLICATE FUNCTION [ERROR]: generate_hash_wrapper/generate_len_wrapper
  [normalized-exact, T=0.63, n=2, loc=24, args=3]
  mypyc/codegen/emitwrapper.py:546, :573

--- duplicate methods (cross-file, ...) ---
DUPLICATE METHOD [ERROR]: CallableType.formal_arguments/Parameters.formal_arguments
  [normalized-exact, T=0.56, n=2, loc=19, args=2]
  mypy/types.py:1992, :2341

--- duplicate methods (cross-name, exact AST-normalized) ---
DUPLICATE METHOD [ERROR]: For{Async,,Native}Iterable.begin_body
  [normalized-exact, T=0.53, n=3, loc=9, args=1]
  mypyc/irbuild/for_helpers.py:716, :778, :848

USAGE:
  find-dup-defs [OPTIONS] <PATHS>...

THICKNESS LADDER:
  --error-thickness <F>      Demote ERROR → WARNING if T < this (default 0.0 = off)
  --warning-thickness <F>    Demote WARNING → INFO  if T < this (default 0.0 = off)
  --escalate-thickness <F>   Promote anything → ERROR if T ≥ this (default 0.0 = off)

SIMILARITY (name-gated):
  -t, --threshold <F>        Cluster floor (default 0.5)
  -e, --error-threshold <F>  ERROR floor   (default 0.85)
  --type3-theta <F>          Type-3 cosine threshold (default 0.7)

FILTERS:
  -D, --directive <S>        ACTION:[KIND:]NAME[@PATH][=NOTE], repeatable
  --kinds <K1,K2,...>        functions,methods,classes,constants,type-aliases
  --min-size <N>             Only clusters with ≥ N members (default 2)
  --errors-only              Filter output to ERROR severity
  --show-info                Include INFO in human report (default hidden)

MODES:
  --calibrate                Print histogram + suggestions + inferred directives
  --json                     Machine-readable output
  --repo-root <PATH>         Path-prefix for short paths in report (default .)

OUTPUT:
  --no-cross-name            Skip pass 2
  --no-type3                 Skip pass 3

Two crates, each useful on its own:

crate	role
`find-dup-defs`

py-canon

Python frontend— Ruff parse →ast.dump

-shape canonical + def scanThe name is language-agnostic on purpose: Python is the current frontend; other languages can be added later as further frontend crates feeding the same CLI. The similarity engine is the exact Ratcliff–Obershelp port difflib-fast.

Python only today (frontend crate per language; PRs welcome)Type 4 (semantic equivalence, different syntax → same logic)— not done; neural-network research territory** Token-level fine-grained duplication**(a 30-token sub-expression copy-pasted around) — out of scope; use jscpd / PMD CPD alongside if you need that** Calibration is heuristic**— formula constants (loc=20, args=5, weight 0.7/0.1/0.2) were tuned on the 28-repo benchmark; your codebase may want different

Copy-paste has nowhere to hide.

Made with ⚡ by @prostomarkeloff

source & further reading

github.com — original article

Pattern	Suggestion
≥5 CONSTANT clusters where ≥80% members are in `/locale`
`suppress:CONSTANT:@locale*`
≥10 clusters where all members live in test paths	`de-escalate::@tests/`
≥3 clusters touching `_pb2` / `_grpc` files
`suppress::@_pb2`
≥3 clusters all under `/migrations/` / `/alembic/versions/`
`suppress::@migrations/`
≥5 clusters all under `/docs_src/` / `/examples/` / `/tutorial/`
`de-escalate::@docs_src/`

Find-dup-defs – find duplicated Python code at the speed of light

Run your AI side-project on zahid.host