# How I Found Out 52% of My Knowledge Graph Was Duplicates (and What I Did About It)

> Source: <https://dev.to/ernesto_arias_148b35bc25d/-how-i-found-out-52-of-my-knowledge-graph-was-duplicates-and-what-i-did-about-it-3coh>
> Published: 2026-06-25 00:50:17+00:00

I've spent the last several months building [ANIMUS](https://github.com/ernestoariasdiaz/animus-ai), an autonomous system in Rust that gives a local LLM persistent memory. The idea is simple: a knowledge graph that grows on its own, cycle after cycle, as the system reads documents, detects gaps in its knowledge, and fills them in.

For months, the metric I watched most closely was the node count of the graph. It kept climbing. I felt good about that.

Until I ran a full audit and found out that **52% of those nodes were undetected duplicates**. Of 1,892 reported nodes, only 911 were actually unique.

ANIMUS's autonomous loop actively looks for "gaps" — holes in its knowledge that the system decides to fill on its own. The problem: an overly aggressive filter was excluding certain categories from the gap pool, which trapped the system in a loop of re-exploring the same ~40 topics for thousands of cycles. Each pass generated content that was *similar* but not identical to the last — different enough to avoid triggering any exact-duplicate check, but substantially the same information rephrased.
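To see why an exact-duplicate check misses rephrased content, here is a minimal sketch that compares nodes by word overlap (Jaccard similarity) instead of string equality. This is not ANIMUS's actual audit code: the function names, the word-set representation, and the `threshold` parameter are all assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Lowercase the text and split it into a set of alphanumeric words.
fn word_set(text: &str) -> HashSet<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

/// Jaccard similarity: |A ∩ B| / |A ∪ B|, in the range 0.0..=1.0.
fn jaccard(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    if a.is_empty() && b.is_empty() {
        return 1.0; // two empty texts are treated as identical
    }
    let intersection = a.intersection(b).count() as f64;
    let union = a.union(b).count() as f64;
    intersection / union
}

/// Returns the indices of nodes that are near-duplicates of an earlier,
/// already-kept node. `threshold` is a hypothetical tuning knob.
fn near_duplicates(nodes: &[&str], threshold: f64) -> Vec<usize> {
    let sets: Vec<HashSet<String>> = nodes.iter().map(|n| word_set(n)).collect();
    let mut kept: Vec<usize> = Vec::new();
    let mut dupes: Vec<usize> = Vec::new();
    for (i, set) in sets.iter().enumerate() {
        if kept.iter().any(|&k| jaccard(&sets[k], set) >= threshold) {
            dupes.push(i);
        } else {
            kept.push(i);
        }
    }
    dupes
}
```

With a threshold of `0.8`, "Rust uses ownership to manage memory safely" and "To manage memory safely, Rust uses ownership" are flagged as the same node even though the strings differ. The pairwise comparison is quadratic, which is tolerable for a one-off audit of a couple of thousand nodes but would likely need an index (or embeddings) on a larger graph.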

The node count kept climbing. Actual knowledge, not so much.

The fix wasn't magic, it was audit work:

`Brain::search`: it walked the graph from node 0 with `.take(2)`, which meant it almost always returned stale content from earlier versions of the system. A simple `.rev()` fixed it.

Along the way, I also migrated the inference engine: from a Python wrapper to a `llama-server.exe` launched directly from Rust, and from the original model to a quantized Gemma 4 E2B, running at ~77 tokens/second on a consumer GPU (RTX 3050, 4 GB). None of this required the cloud or paid APIs; everything runs locally.
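To make the `Brain::search` ordering bug concrete, here is a minimal sketch of the before and after. The `Node` struct, the substring match, and both function names are hypothetical stand-ins; the only details taken from the post are `.take(2)` and the `.rev()` fix. It assumes nodes are stored in insertion order, oldest first.

```rust
/// Hypothetical stand-in for a graph node; the real node type is not shown in the post.
struct Node {
    id: usize,
    text: String,
}

/// Buggy variant: scans from node 0, so the two oldest matches always win.
fn search_oldest_first<'a>(nodes: &'a [Node], query: &str) -> Vec<&'a Node> {
    nodes
        .iter()
        .filter(|n| n.text.contains(query))
        .take(2)
        .collect()
}

/// Fixed variant: `.rev()` walks from the newest node backwards,
/// so `.take(2)` keeps the two most recent matches.
fn search_newest_first<'a>(nodes: &'a [Node], query: &str) -> Vec<&'a Node> {
    nodes
        .iter()
        .rev()
        .filter(|n| n.text.contains(query))
        .take(2)
        .collect()
}
```

On a graph of five matching nodes with ids 0 through 4, the first function returns nodes 0 and 1, while the second returns nodes 4 and 3.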

The most valuable part of this whole episode wasn't fixing the bug. It was realizing that **a metric that only goes up never warns you that something is wrong**. Node count was a proxy for "the system is learning," but optimizing that one proxy, with nothing to balance it, ended up producing the opposite: inflated content, not new knowledge.

ANIMUS now runs on several cross-checked signals (verified uniqueness, recency-weighted relevance, source validation) instead of one vanity metric. If two signals start to diverge, the system stops and re-audits instead of continuing to generate.
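As a rough illustration of "stop and re-audit when signals diverge," the sketch below compares the reported node count against the verified-unique count. The struct, its field names, and the `max_duplicate_ratio` threshold are assumptions for the example, not ANIMUS's real interface, and it covers only one of the signals mentioned above.

```rust
/// Snapshot of two signals. Field names are hypothetical, not ANIMUS's API.
struct GraphSignals {
    reported_nodes: usize,
    unique_nodes: usize,
}

impl GraphSignals {
    /// Fraction of reported nodes that are duplicates (0.0 for an empty graph).
    fn duplicate_ratio(&self) -> f64 {
        if self.reported_nodes == 0 {
            return 0.0;
        }
        let dupes = self.reported_nodes.saturating_sub(self.unique_nodes);
        dupes as f64 / self.reported_nodes as f64
    }

    /// True when the two counts have drifted far enough apart that
    /// generation should pause for a re-audit.
    fn should_reaudit(&self, max_duplicate_ratio: f64) -> bool {
        self.duplicate_ratio() > max_duplicate_ratio
    }
}
```

Plugging in the audit numbers from this post (1,892 reported, 911 unique) gives a duplicate ratio of 981 / 1,892, or roughly 0.52, which would trip any reasonable threshold long before thousands of cycles had passed.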

If you're curious about the full picture (architecture, benchmarks, comparison against a simple vector RAG baseline), the technical paper is open access with a DOI: [10.5281/zenodo.20674981](https://doi.org/10.5281/zenodo.20674981). Code is on [GitHub](https://github.com/ernestoariasdiaz/animus-ai).

*ANIMUS is an independent project, developed in Santo Domingo, Dominican Republic.*
