Beyond Embeddings: Automated Document Validation and Version Control for RAG Knowledge Bases

A developer has created a multi-stage document validation pipeline for RAG knowledge bases. It uses deterministic UUIDv5 hashing for exact deduplication and HyperMinHash for similarity detection, addressing the unreliability of PDF metadata and the brittleness of regex- or LLM-based extraction methods. The system combines MongoDB as the source of truth for the document lifecycle with Qdrant for vector search, enabling automated version control and data integrity in compliance-grade environments.

“I have been working with vector databases and RAG systems, and I have realized that identifying the same document across different versions and maintaining its data integrity is trickier than I expected. The reason for writing this article is to share how I approach the problem and what I have learnt.”

Vector databases store embeddings that capture semantic meaning and are widely used in Retrieval-Augmented Generation (RAG) systems, where PDFs are often the primary knowledge source. In practice, traditional integrity checks such as file names or file sizes are insufficient. The same document may appear under different names, such as:

This leads to several issues:

In theory, PDF metadata could solve this problem by storing titles, versions, and publication dates. In practice, it is unreliable. Metadata is often:

The following example demonstrates the failure of native PDF metadata extraction on the Work Health and Safety (Mines and Petroleum Sites) Act 2013 No 54 (https://legislation.nsw.gov.au/view/html/inforce/current/act-2013-054) document using Python:

```python
import fitz  # PyMuPDF
from pypdf import PdfReader

# Open the PDF and read its built-in metadata dictionary
doc = fitz.open("mining act 2022.pdf")
metadata = doc.metadata

# Print each metadata field with its value and Python type
for key, value in metadata.items():
    print(f"{key}: {value} - {type(value)}")
```

Output:

format: PDF 1.5 -