# Help with a Local Document RAG System (Storage + Ingestion + Query + Highlighting)

> Source: <https://discuss.huggingface.co/t/help-with-a-local-document-rag-system-storage-ingestion-query-highlighting/176993#post_1>
> Published: 2026-06-20 08:44:01+00:00

Hey folks,

I’m working on designing a local, offline document retrieval + LLM pipeline and would love your input on the architecture. Here’s what I’m aiming for:

STORAGE

- Upload PDF, DOCX, XLSX, CSV, tables
- All data stored locally (no cloud)

DOCUMENT INGESTION

- Watch folder (e.g., Watchdog) → auto-ingest on file add/modify/delete
- Nested folder structure → auto-tagging
- Supported formats: PDF, scanned PDF, DOCX, XLSX, CSV, JPG/PNG
- Version control on re-upload
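
To make the tagging and versioning bullets concrete, here’s roughly what I have in mind. This is only a sketch: `tags_from_path`, `content_hash`, and `register_upload` are names I made up, the in-memory `versions` dict stands in for a real table, and the idea of treating a re-upload as a new version only when the SHA-256 changes is my assumption, not something any of the tools above prescribe.

```python
import hashlib
from pathlib import Path


def tags_from_path(watch_root: Path, file_path: Path) -> list[str]:
    """Use each folder between the watch root and the file as a tag."""
    return list(file_path.relative_to(watch_root).parent.parts)


def content_hash(file_path: Path) -> str:
    """SHA-256 of the file bytes, read in 1 MiB blocks so large PDFs are fine."""
    digest = hashlib.sha256()
    with file_path.open("rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def register_upload(versions: dict[str, list[str]], watch_root: Path, file_path: Path) -> int:
    """Record a new version only if the content changed; return the version number."""
    key = file_path.relative_to(watch_root).as_posix()
    history = versions.setdefault(key, [])  # hypothetical stand-in for a DB table
    digest = content_hash(file_path)
    if not history or history[-1] != digest:
        history.append(digest)
    return len(history)
```

The watcher callback would just call `register_upload` on every add/modify event, so a file that is touched but not changed doesn’t trigger re-ingestion.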

QUERY & RETRIEVAL

- Restrict queries to a single client’s documents (no cross-client leakage)
- Structured queries (e.g., “Show invoices > ₹1 lakh”)
- Comparative queries (e.g., “Compare FY23 vs FY24 gross profit”)
- Keyword fallback
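
For the client restriction and keyword fallback, this is the behaviour I’m after, written as a toy in-memory version. Everything here is hypothetical (the `Chunk` record, the `search` function, the `min_score` cutoff of 0.3); in a real setup the client filter would presumably be a metadata filter in the vector store rather than a Python list comprehension.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    client_id: str
    doc_id: str
    text: str
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(chunks, client_id, query_vec, query_text, top_k=3, min_score=0.3):
    """Vector search inside one client's chunks, with a keyword fallback."""
    # Filter by client BEFORE scoring, so other clients' chunks are never ranked.
    scoped = [c for c in chunks if c.client_id == client_id]
    scored = sorted(
        ((cosine(query_vec, c.embedding), c) for c in scoped),
        key=lambda pair: pair[0],
        reverse=True,
    )
    hits = [c for score, c in scored if score >= min_score][:top_k]
    if hits:
        return hits
    # Fallback: plain substring match on the query terms (assumed behaviour).
    terms = query_text.lower().split()
    return [c for c in scoped if any(t in c.text.lower() for t in terms)][:top_k]
```

The point of filtering first is that a bug in ranking can’t leak another client’s text; whether that is enough isolation, or whether each client needs a separate collection, is part of what I’m asking.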

HIGHLIGHTING & RENDERING

- Annotated PDF served to frontend
- XLSX → colored cell export
- Jump directly to highlighted page
- Multi-document highlights in one response

ANSWER GENERATION

- Local LLM only
- Every claim cited with doc + page reference
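
One way I could imagine enforcing the citation requirement is a post-check on the LLM output. The sketch below assumes a citation format I invented, `[doc_id:pN]` placed before the sentence-ending punctuation, and two hypothetical helpers; it only checks that markers exist and point at retrieved chunks, not that the claim is actually supported.

```python
import re

# Assumed marker format: [doc_id:p<page>], e.g. [fy24_pl:p3]
CITATION = re.compile(r"\[([\w.-]+):p(\d+)\]")


def uncited_sentences(answer: str) -> list[str]:
    """Return sentences that carry no citation marker."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [s for s in sentences if not CITATION.search(s)]


def unknown_citations(answer: str, retrieved: set[tuple[str, int]]) -> set[tuple[str, int]]:
    """Return cited (doc_id, page) pairs that were not among the retrieved chunks."""
    cited = {(doc, int(page)) for doc, page in CITATION.findall(answer)}
    return cited - retrieved
```

If either function returns something, the answer could be regenerated or the offending sentence flagged in the UI.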

MY QUESTIONS

- Parsing: I’m considering LlamaIndex LiteParse.
  → Should I store document IDs + chunk IDs for PDFs to enable highlighting?

- Vector DB:
  - Do I need one (e.g., Qdrant)?
  - If yes, how do I store doc IDs + chunk IDs alongside embeddings for highlighting?
  - Would pgvector in Postgres be sufficient?

- GraphRAG:
  - How effective are systems like Neo4j or Microsoft GraphRAG?
  - Can they run locally/offline, or are they too computationally heavy?
  - Is this GraphRAG pipeline from LlamaIndex a good starting point?

- Highlighting UX:
  - I want something like Turnitin/iThenticate reports → exact sentence highlighted + citation.
  - Any open-source projects that already do this?
  - I found Kotaemon and AnythingLLM, which are close but don’t highlight documents.
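
For the doc ID + chunk ID question, this is the kind of record I’m picturing next to each embedding, either as the payload in a vector DB or as extra columns beside the vector column in Postgres. It’s a sketch under assumptions: `ChunkPayload` and `highlights_by_doc` are names I made up, and I’m assuming the parser can give me a page number and a bounding box per chunk, which I haven’t verified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkPayload:
    doc_id: str
    chunk_id: str
    client_id: str
    page: int  # 1-based page number (assumed to come from the parser)
    bbox: tuple[float, float, float, float]  # x0, y0, x1, y1 on that page (assumed)
    text: str


def highlights_by_doc(hits: list[ChunkPayload]) -> dict[str, list[dict]]:
    """Group retrieved chunks per document so the frontend can draw highlights."""
    grouped: dict[str, list[dict]] = {}
    for hit in hits:
        grouped.setdefault(hit.doc_id, []).append(
            {"chunk_id": hit.chunk_id, "page": hit.page, "bbox": hit.bbox}
        )
    return grouped
```

With something like this, one response could carry highlights for several documents, and the viewer could jump to the first `page` in each list. Does that look like the right shape, or is there a better convention?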

TL;DR

Trying to build a local RAG system with:

- Storage + ingestion + tagging
- Query + retrieval + highlighting
- Local LLM answer generation with citations

Looking for advice on:

- Vector DB vs pgvector
- GraphRAG feasibility offline
- Best way to implement document highlighting + citation preview

Would love to hear from anyone who’s built something similar or explored these tools.