Deterministic Guardrails Against AI Code Duplication


Source: technology.org · Published: 2026-07-03T20:31Z · Topic: ai-tools


GitClear's analysis of 211 million changed lines found cloned code blocks have roughly quadrupled since 2022 due to AI coding agents duplicating code. A deterministic guardrail tool called dupehound, built in Rust, uses structural fingerprinting to detect duplicates without AI, running locally at 1.5 million lines per second. It can be integrated into CI or as an MCP server to prevent AI code duplication.


AI coding agents produce code faster than it can be reviewed. A common result is AI slop, and most AI slop is not bad or broken code but duplicated code. GitClear’s analysis of 211 million changed lines found that cloned code blocks have roughly quadrupled since 2022.

Duplication happens because an AI coding agent often cannot hold the whole repository in its context window, especially in large codebases. When it needs a function that already exists, it usually does not find it and writes another copy. Asking an AI coding agent to spot the duplicated code does not work either, mostly for the same reason the copies get created: it cannot properly see the parts of the repository that are not in its context. The larger the codebase, the bigger the problem.

A deterministic guardrail check in the agent loop works better. It does not use a model, runs locally, and is fast enough to run on every change: it scans roughly 1.5 million lines per second.

Building deterministic guardrails against AI code duplication

I packaged this as dupehound, a single-binary CLI in Rust.

Dupehound is an index. But a plain text index doesn’t work here, because the copies are not textual: renamed functions share almost no tokens and almost all structure. Dupehound fingerprints the structure instead.

To fingerprint structure, I used a technique called winnowing, which was worked out in 2003 for a different scenario: students who rename variables before submitting copied homework. Stanford’s MOSS plagiarism detector is built on it (Schleimer, Wilkerson & Aiken, 2003), and it transfers to AI-renamed code almost unchanged.
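For readers who have not met winnowing before, here is a minimal Python sketch of the selection step as described in the 2003 paper. It is not dupehound's code: the function names, the BLAKE2 hash, and the window size `w` are assumptions made for illustration. Only the 10-token gram length comes from this article.

```python
import hashlib


def kgram_hashes(tokens, k=10):
    """Hash every run of k consecutive tokens (k=10 matches the article)."""
    hashes = []
    for i in range(len(tokens) - k + 1):
        gram = "\x1f".join(tokens[i:i + k])  # unit separator keeps tokens distinct
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest()
        hashes.append(int.from_bytes(digest, "big"))
    return hashes


def winnow(hashes, w):
    """Keep the minimum hash of each window of w hashes (w is a hypothetical
    parameter here). Ties go to the rightmost minimum, and a position is
    recorded only once even if it wins several windows in a row."""
    picks = []
    last_pos = -1
    window_count = max(len(hashes) - w + 1, 1) if hashes else 0
    for start in range(window_count):
        window = hashes[start:start + w]
        low = min(window)
        # Index of the rightmost occurrence of the minimum in this window.
        offset = len(window) - 1 - window[::-1].index(low)
        pos = start + offset
        if pos != last_pos:
            picks.append((low, pos))
            last_pos = pos
    return picks


# The worked example from the winnowing paper, with w = 4.
PAPER_HASHES = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
PAPER_PICKS = winnow(PAPER_HASHES, 4)
# PAPER_PICKS == [(17, 3), (17, 6), (8, 8), (39, 11), (17, 15)]
```

Because only window minima are kept, two functions that share a long enough run of normalized tokens are guaranteed to share at least one fingerprint, while the index stays far smaller than the full list of hashes.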

The pipeline has four stages:

Parse. tree-sitter splits each file into functions; the function body is the unit, so imports and signatures never cause a match.
Normalize. Identifiers, strings, and numbers become sentinels, comments are dropped, and keywords and control flow stay.
Fingerprint. 10-token windows are hashed and winnowed. The test suite checks it as a property.
Match. Shared fingerprints produce candidate pairs, so there is no all-pairs pass. Boilerplate fingerprints are dropped, similarity is exact Jaccard, and union-find groups the clusters.
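The normalize and match stages can be sketched end to end in a few dozen lines of Python. This is an illustration under stated assumptions, not dupehound's implementation: a small regex tokenizer stands in for tree-sitter, raw 10-token tuples stand in for winnowed hashes, and the `threshold` and `max_share` parameters are hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations

# A tiny keyword set for the TypeScript-like examples below (assumption).
KEYWORDS = {"function", "return", "if", "else", "for", "while", "const", "let", "of"}

TOKEN = re.compile(r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)
  | (?P<string>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<word>[A-Za-z_$][A-Za-z0-9_$]*)
  | (?P<punct>[^\sA-Za-z0-9_$])
""", re.VERBOSE | re.DOTALL)


def normalize(body):
    """Replace identifiers, strings, and numbers with sentinels; drop comments."""
    tokens = []
    for match in TOKEN.finditer(body):
        kind, text = match.lastgroup, match.group()
        if kind == "comment":
            continue
        if kind == "string":
            tokens.append("STR")
        elif kind == "number":
            tokens.append("NUM")
        elif kind == "word":
            tokens.append(text if text in KEYWORDS else "ID")
        else:
            tokens.append(text)
    return tokens


def fingerprints(tokens, k=10):
    """All k-token windows; a real tool would hash and winnow these."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 0.0


def cluster(bodies, k=10, threshold=0.8, max_share=0.5):
    """Group function bodies whose fingerprint sets overlap strongly."""
    prints = {name: fingerprints(normalize(src), k) for name, src in bodies.items()}
    index = defaultdict(set)  # fingerprint -> names of functions containing it
    for name, fps in prints.items():
        for fp in fps:
            index[fp].add(name)
    # Fingerprints shared by too many functions are treated as boilerplate.
    limit = max(2, int(max_share * len(bodies)))
    candidates = set()
    for names in index.values():
        if len(names) <= limit:
            candidates.update(combinations(sorted(names), 2))
    parent = {name: name for name in bodies}

    def find(x):  # union-find lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in candidates:
        if jaccard(prints[a], prints[b]) >= threshold:
            parent[find(a)] = find(b)
    groups = defaultdict(list)
    for name in bodies:
        groups[find(name)].append(name)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


BODIES = {
    "calculateOrderAmount": """
        // sum line totals
        let total = 0;
        for (const item of items) { total = total + item.price * item.qty; }
        return total;
    """,
    "computeInvoiceTotal": """
        let sum = 0;
        for (const line of lines) { sum = sum + line.amount * line.count; }
        return sum;
    """,
    "formatName": """
        if (user.nick) { return user.nick; }
        return user.first + " " + user.last;
    """,
}
CLUSTERS = cluster(BODIES)
```

The two summing functions share no identifier names, yet they normalize to the same token stream and land in one cluster, while `formatName` stays out. Candidate pairs come only from the inverted index, which is what avoids comparing every function against every other.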

Using dupehound to avoid duplicated code

dupehound has two main commands.

Scan: reports every duplicate cluster and a repo-level slop score
Check: fails CI when a change duplicates existing code, naming the original to reuse

Scan reports the clusters of duplicated code and a slop score. The slop score is the percentage of code you could delete if every cluster kept a single copy. The largest copy is exempt, and test files are excluded by default, since table-driven tests are repetitive by design.

Check is the part that runs in the loop. It indexes the codebase at the base revision, looks only at the functions a change touches, and exits non-zero with the location of the original:
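As a worked example of that definition, the sketch below computes a slop score from cluster sizes. It assumes sizes are measured in lines and that the denominator is the total number of scanned non-test lines; the article does not say which unit dupehound uses, and the function name and numbers are made up for illustration.

```python
def slop_score(clusters, total_lines):
    """Percentage of code that could be deleted if each cluster kept one copy.

    clusters: list of clusters, each a list of copy sizes in lines (assumed unit).
    total_lines: size of the scanned code, test files excluded.
    """
    if total_lines <= 0:
        return 0.0
    # In each cluster the largest copy is kept, so it does not count as deletable.
    deletable = sum(sum(sizes) - max(sizes) for sizes in clusters if sizes)
    return 100 * deletable / total_lines


# Two clusters in a 1,000-line codebase (hypothetical numbers):
# copies of 40, 38, and 35 lines, plus two copies of 12 lines.
EXAMPLE_SCORE = slop_score([[40, 38, 35], [12, 12]], total_lines=1000)
# (38 + 35) + 12 = 85 deletable lines, so EXAMPLE_SCORE == 8.5
```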

$ dupehound check --diff main .
src/api/orders.ts:1 calculateOrderAmount() is a 100% duplicate of
src/billing/invoice.ts:1 computeInvoiceTotal() — reuse it
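Because the check signals a duplicate through its exit code, it can be wrapped in a small gate script. The Python sketch below is one possible wrapper, not something shipped with dupehound: it uses only the `dupehound check --diff main .` invocation shown above, and the function name and the injectable `runner` parameter are hypothetical. Whether the report is written to stdout or stderr is not stated in the article, so both are captured.

```python
import subprocess


def run_check(base="main", path=".", runner=subprocess.run):
    """Run `dupehound check --diff <base> <path>` and return (passed, report).

    `runner` defaults to subprocess.run and can be swapped for a stub in tests.
    """
    result = runner(
        ["dupehound", "check", "--diff", base, path],
        capture_output=True,
        text=True,
    )
    # Capture both streams, since the article does not say which one is used.
    report = (result.stdout + result.stderr).strip()
    return result.returncode == 0, report
```

A CI step or pre-commit hook could call `run_check()`, print the report, and exit with status 1 when `passed` is false, so the message naming the original function reaches whoever (or whatever agent) wrote the duplicate.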

Moved functions and in-place edits do not fire. The one-line output exists so it can go back to the agent that wrote the duplicate. The lighter way to wire that is an instruction in CLAUDE.md or AGENTS.md:

Before committing, run `dupehound check .`. If it reports that a function
you wrote duplicates existing code, delete your version and reuse the
original at the reported location.

The tighter way is the MCP server, dupehound mcp. It exposes check_duplication and scan_duplication, so the agent can call them while it edits:

claude mcp add dupehound -- dupehound mcp

It is a local pipe with no AI in it. A model in the loop does not work. A deterministic index in the loop does, and the agent is the one calling it: it writes a function, asks whether that function already exists, and reuses the original when it does.

Evaluation: the hide-and-seek benchmark

To test dupehound against a model, I planted 39 known duplicate function pairs into real code from microsoft/vscode, a 3.3-million-line TypeScript codebase, and grew the host from 10,000 to 1,000,017 lines.

I gave each agent run a fixed budget of 150 turns (one turn is a single read or search) and 15 minutes. At this scale the budget is the limiting factor. The recall numbers below are what an agent finds under this budget.

dupehound recovers 36 of 39 at every size. The agents recover about half at 10,000 lines and fewer as the tree grows. Opus recovers none at a million lines, and both Sonnet runs hit the cap before returning a result, which is what “did not finish” means in the table. Full method, per-type and per-run tables are in the repo.
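Expressed as recall, the 36-of-39 figure works out as follows. This is plain arithmetic on the numbers quoted above; the variable names are only illustrative.

```python
planted_pairs = 39    # duplicate pairs planted in the benchmark
recovered_pairs = 36  # pairs dupehound reported at every codebase size

recall = recovered_pairs / planted_pairs
recall_text = f"{recall:.1%}"  # "92.3%"
```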

dupehound is open source under MIT, a single binary, and runs offline with no API keys and no model in the pipeline.

Author Bio: Rafael Pinheiro Costa

Rafael is a founder with 10+ years of experience building AI products. He currently leads Ottic, a digital worker that builds and runs marketing and operations workflows. Previously, he founded Clipping, an AI content-curation platform for diplomats, and has worked across Europe, LATAM, and the US building automated content systems.

source & further reading

technology.org — original article


Read original on technology.org → www.technology.org/2026/07/03/dupehound-determin…
