Hey folks,
I’m working on designing a local, offline document retrieval + LLM pipeline and would love your input on the architecture. Here’s what I’m aiming for:
STORAGE
- Upload PDF, DOCX, XLSX, CSV, tables
- All data stored locally (no cloud)
DOCUMENT INGESTION
- Watch folder (e.g., Watchdog) → auto-ingest on file add/modify/delete
- Nested folder structure → auto-tagging
- Supported formats: PDF, scanned PDF, DOCX, XLSX, CSV, JPG/PNG
- Version control on re-upload
QUERY & RETRIEVAL
- Restrict queries to a single client’s documents (no cross-client leakage)
- Structured queries (e.g., “Show invoices > ₹1 lakh”)
- Comparative queries (e.g., “Compare FY23 vs FY24 gross profit”)
- Keyword fallback
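To make the structured-query and client-isolation requirements concrete, here is a rough sketch of what I have in mind. It assumes invoice amounts have already been extracted into a table at ingestion time; the `invoices` table, its columns, and the sample rows are all made up for illustration, and SQLite is just a stand-in for whatever local store ends up being used. The point is that the client filter lives in the query itself, so a structured question never touches another client's rows.

```python
import sqlite3

ONE_LAKH = 100_000  # 1 lakh = 100,000


def invoices_over(conn, client_id, min_amount):
    """Return (doc_id, page, amount) rows above min_amount for ONE client only."""
    rows = conn.execute(
        "SELECT doc_id, page, amount FROM invoices "
        "WHERE client_id = ? AND amount > ? ORDER BY amount DESC",
        (client_id, min_amount),
    )
    return rows.fetchall()


# Hypothetical schema and toy rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (client_id TEXT, doc_id TEXT, page INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?, ?)",
    [
        ("client_a", "inv_001.pdf", 1, 250000.0),
        ("client_a", "inv_002.pdf", 2, 40000.0),
        ("client_b", "inv_900.pdf", 1, 999999.0),
    ],
)
conn.commit()

# "Show invoices > ₹1 lakh", scoped to client_a.
results = invoices_over(conn, "client_a", ONE_LAKH)
```

Each row carries `doc_id` and `page`, which is what the citation and jump-to-page features would need downstream.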
HIGHLIGHTING & RENDERING
- Annotated PDF served to frontend
- XLSX → colored cell export
- Jump directly to highlighted page
- Multi-document highlights in one response
ANSWER GENERATION
- Local LLM only
- Every claim cited with doc + page reference
MY QUESTIONS
Parsing: I’m considering LlamaIndex LiteParse.
→ Should I store document IDs + chunk IDs for PDFs to enable highlighting?
Vector DB:
- Do I need one (e.g., Qdrant)?
- If yes, how do I store doc IDs + chunk IDs alongside embeddings for highlighting?
- Would pgvector in Postgres be sufficient?
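For reference, this is roughly the record shape I imagine storing next to each embedding, whichever backend it ends up in. It is only an in-memory sketch with brute-force cosine search: the `Chunk` fields, the chunk ID format, the bounding box convention, and the toy vectors are my own assumptions, not anything a particular library prescribes. My understanding is that the non-vector fields would become payload/metadata in a vector DB, or ordinary columns next to a vector column in Postgres.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str       # which file the chunk came from
    chunk_id: str     # stable ID so a highlight can be traced back
    page: int         # 1-based page number for "jump to page"
    bbox: tuple       # assumed (x0, y0, x1, y1) in page coordinates
    text: str
    embedding: list


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(chunks, query_vec, top_k=1):
    """Brute-force nearest-neighbour search; returns whole Chunk records."""
    ranked = sorted(chunks, key=lambda c: cosine(c.embedding, query_vec), reverse=True)
    return ranked[:top_k]


# Toy data: placeholder text and 3-dimensional vectors, for illustration only.
chunks = [
    Chunk("fy24_report.pdf", "fy24_report.pdf#p12-c3", 12,
          (72.0, 410.5, 523.0, 440.0), "placeholder FY24 text", [0.9, 0.1, 0.0]),
    Chunk("fy23_report.pdf", "fy23_report.pdf#p10-c1", 10,
          (72.0, 300.0, 523.0, 330.0), "placeholder FY23 text", [0.1, 0.9, 0.0]),
]

hit = search(chunks, [1.0, 0.0, 0.0])[0]
citation = (hit.doc_id, hit.page, hit.bbox)  # enough to cite and to draw a highlight
```

Because the search returns the full record rather than just the text, the answer step gets the doc ID, page, and region for free.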
GraphRAG:
- How effective are systems like Neo4j or Microsoft GraphRAG?
- Can they run locally/offline, or are they too computationally heavy?
- Is this GraphRAG pipeline from LlamaIndex a good starting point?
Highlighting UX:
- I want something like Turnitin/iThenticate reports → exact sentence highlighted + citation.
- Any open-source projects that already do this?
- I found Kotaemon and AnythingLLM, which are close but don’t highlight documents.
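For the "exact sentence" part, my current idea is to locate the cited sentence inside the extracted page text and hand the character offsets to whatever draws the highlight. Below is a minimal sketch of just the matching step, assuming the parser gives me plain text per page; `find_sentence_span` is a hypothetical helper of mine, and mapping offsets to PDF coordinates is not shown. It tolerates the line breaks and extra spaces that PDF extraction tends to introduce.

```python
import re


def find_sentence_span(page_text, sentence):
    """Return (start, end) character offsets of `sentence` in `page_text`, or None.

    Words must appear in order; any run of whitespace between them is accepted.
    """
    words = sentence.split()
    if not words:
        return None
    pattern = r"\s+".join(re.escape(w) for w in words)
    match = re.search(pattern, page_text, flags=re.IGNORECASE)
    return match.span() if match else None


# Toy page text with a line break and a double space inside the target sentence.
page_text = (
    "Revenue grew during the year.\n"
    "Gross profit  improved\nover the prior year. Costs were flat."
)
sentence = "Gross profit improved over the prior year."
span = find_sentence_span(page_text, sentence)
```

If the LLM paraphrases instead of quoting, an exact match like this will fail, so I would probably need a fuzzy fallback or a prompt that forces verbatim quotes.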
TL;DR
Trying to build a local RAG system with:
- Storage + ingestion + tagging
- Query + retrieval + highlighting
- Local LLM answer generation with citations
Looking for advice on:
- Dedicated vector DB (e.g., Qdrant) vs pgvector
- GraphRAG feasibility offline
- Best way to implement document highlighting + citation preview
Would love to hear from anyone who’s built something similar or explored these tools.