
Finding code duplicated by AI without AI

dupehound, a new open-source tool for detecting duplicated code in AI-generated codebases, has been released. It uses structural fingerprinting rather than text matching to identify duplicate functions, even after identifiers and literals have been renamed, and reports a "slop score": the percentage of code that is deletable duplication. dupehound runs locally without network access or machine learning and offers three commands (scan, history, and check) to detect, track, and prevent code duplication, including in CI pipelines.

4 min read · published Jun 12, 2026

dupehound is a duplicate-code detector built for codebases where agents write most of the code. It finds functions that exist more than once, even after every identifier and literal has been renamed, because it fingerprints the structure of the code instead of its text.

dupehound runs three commands: scan reports every duplicate cluster and a repo-level slop score, history charts duplication across the git log and pinpoints when it took off, and check fails CI when a change duplicates code that already exists, naming the original to reuse. Everything runs locally and deterministically: no network, no API keys, no machine learning.

Prebuilt binaries for macOS, Linux and Windows are on the releases page, or:

cargo install dupehound

history and check require git on PATH. scan works on any directory.

dupehound scan [path] ranks duplicate clusters by deletable lines:

$ dupehound scan .

  dupehound v0.1.0 — scanned 19 files · 370 lines · 27 functions in 21ms

  ╭─────────────────────────────────────────────────────────╮
  │  SLOP SCORE   36.1%   grade F                           │
  │  127 of 352 significant lines are deletable duplicates  │
  ╰─────────────────────────────────────────────────────────╯

  ● Cluster 1 ─ 4 copies · 100% similar · 42 deletable lines ─────────────
    ★ src/utils/date.ts:1        formatDate        14 lines
      src/api/timestamps.ts:1    renderTimestamp   14 lines   100% █████████
      src/jobs/report_dates.ts:1 stringifyDate     14 lines   100% █████████
      src/billing/dates.ts:1     humanizeDate      14 lines   100% █████████

  ★ = representative (kept) · dupehound scan --explain 1 shows the code

The slop score is the percentage of code you could delete if every cluster kept only one copy; the largest copy is exempt and test files are excluded by default, since table-driven tests are repetitive by design. --explain N prints a cluster's code as proof, --json emits a versioned schema, and --card writes a score card as SVG and PNG. Languages: TypeScript, TSX, JavaScript, Python, Rust, Go, Java.
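The arithmetic behind the score can be checked against the sample output. The following is a minimal sketch, not dupehound's implementation: the Cluster class and slop_score function are hypothetical names, and it assumes each cluster is described only by the significant-line count of its copies.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """One duplicate cluster: the significant-line count of each copy (hypothetical type)."""
    copy_lines: list

    def deletable_lines(self) -> int:
        # Keep the largest copy; every other copy counts as deletable.
        return sum(self.copy_lines) - max(self.copy_lines)


def slop_score(clusters: list, significant_lines: int) -> float:
    """Percentage of significant lines that are deletable duplicates."""
    if significant_lines == 0:
        return 0.0
    deletable = sum(c.deletable_lines() for c in clusters)
    return 100.0 * deletable / significant_lines


# Cluster 1 from the sample output: four copies of 14 lines each.
cluster_1 = Cluster([14, 14, 14, 14])
print(cluster_1.deletable_lines())  # 42, matching the report

# The repo-level figure from the sample output: 127 of 352 lines.
print(f"{100.0 * 127 / 352:.1f}%")  # 36.1%
```

Cluster 1 accounts for 42 of the 127 deletable lines in the sample; the remainder would come from clusters not shown in the excerpt.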

dupehound history measures the slop score at monthly snapshots, reading blobs straight from the object database (no checkouts), and reports when duplication took off:

   36.1% ┤                      ██
         ┤                  ▂▂▆▆██
         ┤              ▂▂████████
         ┤          ▁▁████████████
    0.0% ┤          ██████████████
         └────────────────────────
          2025-01          2025-12

  current slop score: 36.1% (grade F)
  duplication went from ~0 to 36.1% since 2025-05
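One way to picture the monthly sampling is sketched below. This is an illustration under stated assumptions, not how dupehound does it: the source does not say which commit represents a month or how the take-off point is chosen, so picking the last commit of each month and using a fixed floor of 0.5% are both hypothetical choices, as are the function names. No git calls are made; the commits and scores are passed in as plain data.

```python
from datetime import date
from typing import Optional


def monthly_snapshots(commits: list) -> dict:
    """Map 'YYYY-MM' to the last commit made in that month.

    `commits` is a list of (date, sha) pairs in any order.
    """
    latest = {}
    for day, sha in commits:
        month = f"{day.year:04d}-{day.month:02d}"
        if month not in latest or day >= latest[month][0]:
            latest[month] = (day, sha)
    return {month: sha for month, (_, sha) in sorted(latest.items())}


def takeoff_month(scores: dict, floor: float = 0.5) -> Optional[str]:
    """First month whose score exceeds `floor` (a hypothetical threshold)."""
    for month in sorted(scores):
        if scores[month] > floor:
            return month
    return None


# Invented data, for illustration only.
commits = [
    (date(2025, 1, 3), "a1"),
    (date(2025, 1, 28), "a2"),
    (date(2025, 2, 10), "b1"),
]
print(monthly_snapshots(commits))  # {'2025-01': 'a2', '2025-02': 'b1'}
print(takeoff_month({"2025-04": 0.0, "2025-05": 4.2, "2025-06": 12.0}))  # 2025-05
```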

dupehound check gates CI and pre-commit. It indexes the codebase at the base revision and probes only the functions a change adds or touches. Moved functions and in-place edits don't fire. Exit codes: 0 clean, 1 findings, 2 error.

$ dupehound check --diff main .
src/api/orders.ts:1 calculateOrderAmount() is a 100% duplicate of src/billing/invoice.ts:1 computeInvoiceTotal() — reuse it
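For pipelines that wrap the command in a script, the exit codes can be interpreted as in the sketch below. It assumes dupehound is on PATH and uses only the invocation shown above (check --diff with a base revision and a path); the helper names describe_exit and run_check are hypothetical.

```python
import subprocess
import sys

# Exit codes as documented above: 0 clean, 1 findings, 2 error.
MEANINGS = {0: "clean", 1: "duplicates found", 2: "error"}


def describe_exit(code: int) -> str:
    """Translate a dupehound check exit code into a short label."""
    return MEANINGS.get(code, f"unexpected exit code {code}")


def run_check(base: str = "main", path: str = ".") -> int:
    """Run `dupehound check --diff <base> <path>` and report the outcome."""
    result = subprocess.run(
        ["dupehound", "check", "--diff", base, path],
        capture_output=True,
        text=True,
    )
    if result.stdout:
        print(result.stdout, end="")  # findings, one per line
    print(f"dupehound check: {describe_exit(result.returncode)}", file=sys.stderr)
    return result.returncode
```

A hook script could end with sys.exit(run_check()) so that the gate passes the tool's verdict through unchanged.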

A GitHub Actions recipe and a pre-commit setup are in docs/ci.md. To make a coding agent reuse code instead of rewriting it, feed check back to it from CLAUDE.md or AGENTS.md; the snippet is there too.

Function bodies are parsed with tree-sitter and normalized: identifiers, strings and numbers become sentinels, comments are dropped, structure stays. The normalized tokens are split into k-grams of 10 tokens, which are rolling-hashed and selected by robust winnowing (Schleimer, Wilkerson & Aiken, SIGMOD 2003); this guarantees that any shared run of 17 normalized tokens is caught. An inverted fingerprint index generates candidate pairs, boilerplate fingerprints are culled, similarity is exact Jaccard, and union-find builds the clusters.
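The first half of that pipeline can be sketched in a few lines. This is an illustration, not dupehound's code, and it makes several substitutions that should be read as assumptions: Python's tokenize module stands in for tree-sitter, each k-gram is hashed independently with CRC32 instead of a rolling hash, and the window size w = 8 is inferred from the paper's relation w = t - k + 1 with t = 17 and k = 10 (the source states only k and the 17-token guarantee). The inverted index, boilerplate culling and union-find steps are left out.

```python
import io
import keyword
import tokenize
import zlib

SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def normalize(source: str) -> list:
    """Replace identifiers, strings and numbers with sentinels; drop comments."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")
        elif tok.type == tokenize.STRING:
            out.append("STR")
        elif tok.type == tokenize.NUMBER:
            out.append("NUM")
        else:
            out.append(tok.string)  # keywords, operators, brackets
    return out


def kgram_hashes(tokens: list, k: int = 10) -> list:
    """Hash every run of k consecutive tokens (plain hash, not rolling)."""
    return [zlib.crc32(" ".join(tokens[i:i + k]).encode())
            for i in range(len(tokens) - k + 1)]


def winnow(hashes: list, w: int = 8) -> set:
    """Robust winnowing: keep the minimum hash of every window of w hashes."""
    if not hashes:
        return set()
    if len(hashes) <= w:
        return {min(hashes)}
    picked, prev = set(), -1
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        if prev >= start and hashes[prev] == smallest:
            continue  # tie-break: the previous pick is still a minimum
        prev = start + w - 1 - window[::-1].index(smallest)  # rightmost minimum
        picked.add(hashes[prev])
    return picked


def fingerprint(source: str) -> set:
    return winnow(kgram_hashes(normalize(source)))


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0


A = '''
def format_date(value, sep="-"):
    # pad month and day to two digits
    parts = [str(value.year), str(value.month).zfill(2), str(value.day).zfill(2)]
    return sep.join(parts)
'''
B = '''
def render_timestamp(ts, joiner="/"):
    pieces = [str(ts.year), str(ts.month).zfill(2), str(ts.day).zfill(2)]
    return joiner.join(pieces)
'''
print(normalize(A) == normalize(B))            # True
print(jaccard(fingerprint(A), fingerprint(B)))  # 1.0
```

The two sample functions differ in every name, in the default string and in a comment, yet normalize to the same token stream, so their fingerprint sets are identical. Selecting one hash per window instead of keeping all of them is what keeps the index small while preserving the 17-token guarantee.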

The defaults are conservative about false positives: generated, minified and vendored files are skipped, functions under 40 normalized tokens are ignored, and every match is verifiable with --explain. Grade buckets were calibrated against express (0.0%), gin (0.2%), tokio (1.1%), fastapi (1.7%) and vscode (2.8%), all grade A. vscode, at 2.97M lines and 53k functions, scans in 3.6s on a laptop. Full design notes are in docs/design.md.

Coding agents don't know what a codebase already contains, so they re-implement it. formatDate becomes renderTimestamp, then stringifyDate: the same logic under several names, each copy aging independently. Analyses of millions of commits report duplication roughly doubled since AI assistants went mainstream.

An LLM can't do this job. Duplicate detection compares every function against every other; a model samples what fits in context, an index checks everything. A merge gate must be reproducible: same input, same verdict, an algorithm you can read. dupehound is the deterministic side of the loop: the agent writes, the index remembers.

Please file issues on the issue tracker. The most useful false-positive report is a small code pair that matches but shouldn't, plus the --explain output; these become regression fixtures directly.

PRs welcome. Adding a language is the most wanted contribution and is roughly one tree-sitter query file; see CONTRIBUTING.md.

MIT. Bundled JetBrains Mono subsets are under the SIL OFL 1.1. The diagram uses Excalidraw's Virgil font (OFL).
