{"slug": "beyond-embeddings-automated-document-validation-and-version-control-for-rag", "title": "Beyond Embeddings: Automated Document Validation and Version Control for RAG Knowledge Bases", "summary": "A developer has created a multi-stage document validation pipeline for RAG knowledge bases that uses deterministic UUIDv5 hashing for exact deduplication and HyperMinHash for similarity detection, addressing the unreliability of PDF metadata and the brittleness of regex or LLM-based extraction methods. The system combines MongoDB as a document lifecycle source of truth with Qdrant for vector search, enabling automated version control and data integrity in compliance-grade environments.", "body_md": "“I have been working with vector databases and RAG systems, and I have realized that identifying the same document across different versions and maintaining its data integrity is trickier than I expected. The reason for writing this article is to share how I approach the problem and what I have learnt.”\n\nVector databases store embeddings that capture semantic meaning and are widely used in Retrieval-Augmented Generation (RAG) systems, where PDFs are often the primary knowledge source.\n\nIn practice, traditional integrity checks such as file names or file sizes are insufficient. The same document may appear under different names, such as:\n\nThis leads to several issues:\n\nIn theory, PDF metadata could solve this problem by storing titles, versions, and publication dates.\n\nIn practice, it is unreliable.\n\nMetadata is often:\n\nThe following example demonstrates the failure of native PDF metadata extraction on the [Work Health and Safety (Mines and Petroleum Sites) Act 2013 No. 
54](https://legislation.nsw.gov.au/view/html/inforce/current/act-2013-054) document using Python:\n\n``` python\nimport fitz  # PyMuPDF\nfrom pypdf import PdfReader\n\ndoc = fitz.open(\"mining_act_2022.pdf\")\nmetadata = doc.metadata\n\n# Print every native metadata field with its value and Python type\nfor key, value in metadata.items():\n    print(f\"{key}: {value} -> {type(value)}\")\n```\n\n**Output:**\n\n```\nformat: PDF 1.5 -> <class 'str'>\ntitle: View - NSW legislation -> <class 'str'>\nauthor: -> <class 'str'>\nsubject: -> <class 'str'>\nkeywords: PCO, Parliamentary Counsel's Office, ...\ncreator: -> <class 'str'>\nproducer: Prince 15.1 (www.princexml.com) -> <class 'str'>\ncreationDate: -> <class 'str'>\nmodDate: -> <class 'str'>\ntrapped: -> <class 'str'>\nencryption: None -> <class 'NoneType'>\n```\n\nThe title is only a generic page name, while the author, creation date, and modification date fields are all empty.\n\nThis makes metadata-based validation insufficient for compliance-grade systems where traceability and correctness are critical, and demonstrates the need to validate documents based on their **intrinsic content** rather than external file attributes.\n\nA natural first step in extracting key document metadata (e.g. title, version, publication date) is to combine regex-based pattern matching with LLM-based extraction.\n\nThe extracted metadata is then used to assess document identity and version. However, both approaches are insufficient as primary validation mechanisms.\n\nRegex is fast and deterministic but brittle: minor formatting changes (extra spaces, line breaks, or date format shifts) can break extraction rules and require constant maintenance.\n\nLLMs, on the other hand, are flexible but probabilistic. They may hallucinate or infer incorrect metadata such as version numbers or publication dates, which introduces silent data corruption risks in compliance systems.\n\n**Conclusion:** Neither Regex nor LLMs are reliable enough to determine document identity or version equivalence. 
They are useful only as supporting signals, not decision-makers.\n\nTo address these limitations, a multi-stage validation pipeline was designed with three components:\n\nThe pipeline is backed by a knowledge base consisting of two complementary storage systems, MongoDB and Qdrant.\n\nThis hybrid architecture separates document management from vector storage. Since Qdrant stores embeddings for individual document chunks rather than complete documents, document-level operations such as duplicate detection, version replacement, and deletion are difficult to manage directly. MongoDB acts as the single source of truth by maintaining the document lifecycle and storage context, while Qdrant is dedicated to efficient semantic vector search.\n\nThe following metadata is first extracted from each processed document.\n\nDocument ID\n\n**Document ID** is a unique identifier generated by deterministic UUIDv5 hashing of the full document content. It is used for exact deduplication.\n\n```\n# Example of document ID generation\nimport uuid\nfrom pathlib import Path\n\n# Assumes pdf_path and DOCUMENT_ID_NAMESPACE (a fixed uuid.UUID) are defined elsewhere\nraw_bytes = Path(pdf_path).read_bytes()\ncontent_hex = raw_bytes.hex()\ndoc_std_id = str(uuid.uuid5(DOCUMENT_ID_NAMESPACE, content_hex))\n```\n\nDocument Similarity ID\n\nA **Document Similarity ID** is generated using HyperMinHash, an advanced extension of MinHash. MinHash compresses large sets into compact signatures that preserve similarity relationships, enabling efficient comparison of documents without requiring full content processing.\n\nThese signatures are used to estimate Jaccard similarity. Jaccard similarity is measured by dividing the size of the intersection of two sets by the size of their union. 
HyperMinHash extends this approach by incorporating principles from HyperLogLog, significantly reducing memory usage while maintaining accurate similarity estimation at scale.\n\nThis makes the approach well-suited for large-scale document indexing, enabling efficient detection of near-duplicate or highly similar documents for deduplication and retrieval optimization.\n\n```\n# Code example of generating the similarity ID with HyperMinHash\nimport base64\n\nimport pyhyperminhash\n\ndef _compute_doc_similarity_id(self, full_text: str) -> str:\n    sketch = pyhyperminhash.Sketch()\n    entry = pyhyperminhash.Entry()\n    entry.add_bytes(full_text.encode(\"utf-8\"))\n    sketch.add_entry(entry)\n    # Serialize the sketch so it can be stored as a string\n    return base64.b64encode(sketch.save()).decode()\n```\n\nDocument Fingerprint\n\n```\n# Code example of the DocumentFingerprint Pydantic model class\nfrom pydantic import BaseModel\n\nclass DocumentFingerprint(BaseModel):\n    primary_entity: str\n    entity_type: str\n    key_words: list[str]\n```\n\nA **Document Fingerprint** is a custom structured data object that stores semantic metadata extracted from a document using a Large Language Model (LLM). This typically includes themes, topics, keywords, and other high-level descriptors that characterize the document’s content.\n\nWhile HyperMinHash is effective at identifying documents with similar token distributions or structural patterns, it may produce false positives when documents share formatting characteristics but differ in subject matter.\n\nThe Document Fingerprint enables the system to distinguish between structurally similar but semantically unrelated documents. 
This improves the accuracy of deduplication, clustering, and similarity-based retrieval.\n\nThe following example illustrates the structure of a DocumentFingerprint object for [Work Health and Safety (Mines and Petroleum Sites) Act 2013 No 54](https://legislation.nsw.gov.au/view/html/inforce/current/act-2013-054):\n\nPrimary Entity: Work Health and Safety (Mines and Petroleum Sites) Act 2013\n\nEntity Type: Regulation\n\nKeywords: PCO, Parliamentary Counsel’s Office, QLD PCO, QLD, Legislation, Bills of Parliament, Act, amendment, law, legal advice, legislation, Parliament\n\nDescription\n\nAn LLM-extracted description of the document.\n\nDocument Title / Version / Date\n\nThe LLM-extracted title, version, and publication date of the document.\n\nContent validation determines whether an incoming document is a duplicate, a newer version, or entirely new before it is indexed. Rather than relying on filenames or metadata, the pipeline validates the document’s intrinsic content, preventing duplicate embeddings, avoiding version conflicts, and ensuring that only high-quality, consistent information is stored in the vector database.\n\nExact Match Search\n\nThe first validation step checks whether the document has already been ingested by looking up its Document ID among previously stored documents.\n\nMinHash Similarity Check\n\n```\n# Code example for Jaccard similarity computation\nimport base64\n\nimport pyhyperminhash\n\n# Decode both stored similarity IDs back into sketches\nnew_sketch_bytes = base64.b64decode(new_doc_similarity_id)\nnew_sketch = pyhyperminhash.Sketch.load(new_sketch_bytes)\ncand_sketch_bytes = base64.b64decode(candidate_similarity_id)\ncand_sketch = pyhyperminhash.Sketch.load(cand_sketch_bytes)\n\njaccard_sim = new_sketch.similarity(cand_sketch)\njaccard_threshold = SIMILARITY_CONFIG[\"jaccard_threshold\"]\n\nif jaccard_sim < jaccard_threshold:\n    print(\"New document detected\")\n    \"\"\"\n    Upload as a new document to vector database\n    \"\"\"\nelse:\n    print(\"Similar document detected\")\n    \"\"\"\n    Subsequent identity score calculation...\n    
\"\"\"\n```\n\nIdentity Score Calculation\n\nIf the MinHash similarity meets or exceeds the configured threshold, a final semantic validation step determines whether the document is a duplicate.\n\nThe process is as follows:\n\nThe weighted identity score combines three similarity measures, where w1, w2, and w3 are configurable weights that determine the contribution of each measure.\n\nThe score is then compared against a threshold to determine document identity.\n\nEnsuring data integrity in a RAG knowledge base requires more than embedding generation and indexing. Without a robust validation strategy, duplicate documents and conflicting versions can accumulate, leading to redundant retrieval, inconsistent context, and degraded LLM responses.\n\nThis work presents a multi-stage validation pipeline that combines deterministic document IDs, HyperMinHash-based similarity detection, semantic identity scoring, and LLM-assisted metadata comparison. While version and publication extraction remains imperfect, this layered approach provides a scalable framework that integrates deterministic and semantic signals to improve document validation.\n\nAs RAG systems move into production use, document validation and version control become increasingly important. Ultimately, the effectiveness of a RAG system depends not only on the embedding model or vector database, but also on the cleanliness and consistency of its underlying knowledge base.\n\nThis is my first time publishing on Medium! If you find it helpful, please feel free to leave your experiences, thoughts, or comments below. If you enjoy this article, don’t forget to leave a few 👏. 
Thanks for reading!\n\n[Beyond Embeddings: Automated Document Validation and Version Control for RAG Knowledge Bases](https://pub.towardsai.net/beyond-embeddings-automated-document-validation-and-version-control-for-rag-knowledge-bases-4cd49d3b9b36) was originally published in [Towards AI](https://pub.towardsai.net) on Medium, where people are continuing the conversation by highlighting and responding to this story.", "url": "https://wpnews.pro/news/beyond-embeddings-automated-document-validation-and-version-control-for-rag", "canonical_source": "https://pub.towardsai.net/beyond-embeddings-automated-document-validation-and-version-control-for-rag-knowledge-bases-4cd49d3b9b36?source=rss----98111c9905da---4", "published_at": "2026-07-04 12:31:01+00:00", "updated_at": "2026-07-04 12:54:36.576114+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-tools", "ai-infrastructure", "developer-tools"], "entities": ["MongoDB", "Qdrant", "PyMuPDF", "HyperMinHash", "UUIDv5", "NSW legislation", "Work Health and Safety (Mines and Petroleum Sites) Act 2013"], "alternates": {"html": "https://wpnews.pro/news/beyond-embeddings-automated-document-validation-and-version-control-for-rag", "markdown": "https://wpnews.pro/news/beyond-embeddings-automated-document-validation-and-version-control-for-rag.md", "text": "https://wpnews.pro/news/beyond-embeddings-automated-document-validation-and-version-control-for-rag.txt", "jsonld": "https://wpnews.pro/news/beyond-embeddings-automated-document-validation-and-version-control-for-rag.jsonld"}}