{"slug": "help-with-a-local-document-rag-system-storage-ingestion-query-highlighting", "title": "Help with a Local Document RAG System (Storage + Ingestion + Query + Highlighting)", "summary": "A developer is seeking advice on building a local, offline document retrieval and LLM pipeline for RAG systems, focusing on storage, ingestion, querying, and highlighting. The system aims to support PDF, DOCX, XLSX, CSV, and image formats with local LLM answer generation and citation tracking. Key questions involve vector DB vs pgvector, offline GraphRAG feasibility, and implementing document highlighting with citation preview.", "body_md": "Hey folks,\n\nI’m working on designing a local, offline document retrieval + LLM pipeline and would love your input on the architecture. Here’s what I’m aiming for:\n\nSTORAGE\n\n- Upload PDF, DOCX, XLSX, CSV, tables\n- All data stored locally (no cloud)\n\nDOCUMENT INGESTION\n\n- Watch folder (e.g., Watchdog) → auto-ingest on file add/modify/delete\n- Nested folder structure → auto-tagging\n- Supported formats: PDF, scanned PDF, DOCX, XLSX, CSV, JPG/PNG\n- Version control on re-upload\n\nQUERY & RETRIEVAL\n\n- Restrict queries to a single client’s documents (no cross-client leakage)\n- Structured queries (e.g., “Show invoices > ₹1 lakh”)\n- Comparative queries (e.g., “Compare FY23 vs FY24 gross profit”)\n- Keyword fallback\n\nHIGHLIGHTING & RENDERING\n\n- Annotated PDF served to frontend\n- XLSX → colored cell export\n- Jump directly to highlighted page\n- Multi-document highlights in one response\n\nANSWER GENERATION\n\n- Local LLM only\n- Every claim cited with doc + page reference\n\nMY QUESTIONS\n\n- Parsing: I’m considering LlamaIndex LiteParse.\n\n  → Should I store document IDs + chunk IDs for PDFs to enable highlighting?\n\n- Vector DB:\n\n  - Do I need one (e.g., Qdrant)?\n  - If yes, how do I store doc IDs + chunk IDs alongside embeddings for highlighting?\n  - Would pgvector in Postgres be sufficient?\n\n- GraphRAGs:\n\n  - How 
effective are systems like Neo4j or Microsoft GraphRAG?\n  - Can they run locally/offline, or are they too computationally heavy?\n  - Is this GraphRAG pipeline from LlamaIndex a good starting point?\n\n- Highlighting UX:\n\n  - I want something like Turnitin/iThenticate reports → exact sentence highlighted + citation.\n  - Any open-source projects that already do this?\n  - I found Kotaemon and AnythingLLM, which are close but don’t highlight documents.\n\nTL;DR\n\nTrying to build a local RAG system with:\n\n- Storage + ingestion + tagging\n- Query + retrieval + highlighting\n- Local LLM answer generation with citations\n\nLooking for advice on:\n\n- Vector DB vs pgvector\n- GraphRAG feasibility offline\n- Best way to implement document highlighting + citation preview\n\nWould love to hear from anyone who’s built something similar or explored these tools.", "url": "https://wpnews.pro/news/help-with-a-local-document-rag-system-storage-ingestion-query-highlighting", "canonical_source": "https://discuss.huggingface.co/t/help-with-a-local-document-rag-system-storage-ingestion-query-highlighting/176993#post_1", "published_at": "2026-06-20 08:44:01+00:00", "updated_at": "2026-06-20 09:12:43.937982+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "ai-infrastructure", "ai-agents", "developer-tools"], "entities": ["LlamaIndex", "Qdrant", "pgvector", "Postgres", "Neo4j", "Microsoft GraphRAG", "Kotaemon", "AnythingLLM"], "alternates": {"html": "https://wpnews.pro/news/help-with-a-local-document-rag-system-storage-ingestion-query-highlighting", "markdown": "https://wpnews.pro/news/help-with-a-local-document-rag-system-storage-ingestion-query-highlighting.md", "text": "https://wpnews.pro/news/help-with-a-local-document-rag-system-storage-ingestion-query-highlighting.txt", "jsonld": "https://wpnews.pro/news/help-with-a-local-document-rag-system-storage-ingestion-query-highlighting.jsonld"}}