Source: discuss.huggingface.co

Help with a Local Document RAG System (Storage + Ingestion + Query + Highlighting)

A developer is seeking advice on building a local, offline document retrieval and LLM pipeline for RAG systems, focusing on storage, ingestion, querying, and highlighting. The system aims to support PDF, DOCX, XLSX, CSV, and image formats with local LLM answer generation and citation tracking. Key questions involve vector DB vs pgvector, offline GraphRAG feasibility, and implementing document highlighting with citation preview.

Published Jun 20, 2026

Hey folks,

I’m working on designing a local, offline document retrieval + LLM pipeline and would love your input on the architecture. Here’s what I’m aiming for:

STORAGE

  • Upload PDF, DOCX, XLSX, CSV, tables

  • All data stored locally (no cloud)

DOCUMENT INGESTION

  • Watch folder (e.g., Watchdog) → auto-ingest on file add/modify/delete

  • Nested folder structure → auto-tagging

  • Supported formats: PDF, scanned PDF, DOCX, XLSX, CSV, JPG/PNG

  • Version control on re-upload

QUERY & RETRIEVAL

  • Restrict queries to a single client’s documents (no cross-client leakage)
  • Structured queries (e.g., “Show invoices > ₹1 lakh”)
  • Comparative queries (e.g., “Compare FY23 vs FY24 gross profit”)
  • Keyword fallback
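
To make the client restriction and keyword fallback concrete, here is a minimal in-memory sketch. The `Chunk` fields, the `retrieve` function, and the `min_score` threshold are all hypothetical names chosen for illustration; a real setup would push the client filter into the database query rather than filtering in Python.

```python
from dataclasses import dataclass
import math


@dataclass
class Chunk:
    client_id: str          # hypothetical field: owner of the document
    doc_id: str
    text: str
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(chunks, client_id, query_vec, query_text, top_k=3, min_score=0.3):
    # Filter by client first, so other clients' chunks are never even scored.
    scoped = [c for c in chunks if c.client_id == client_id]

    # Rank the remaining chunks by cosine similarity to the query vector.
    scored = sorted(
        ((cosine(query_vec, c.embedding), c) for c in scoped),
        key=lambda pair: pair[0],
        reverse=True,
    )
    hits = [c for score, c in scored if score >= min_score][:top_k]
    if hits:
        return hits

    # Keyword fallback: plain substring match, still limited to this client.
    terms = query_text.lower().split()
    return [c for c in scoped if any(t in c.text.lower() for t in terms)][:top_k]
```

Filtering before scoring is the important part here: the leakage guarantee then does not depend on how the ranking behaves.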

HIGHLIGHTING & RENDERING

  • Annotated PDF served to frontend
  • XLSX → colored cell export
  • Jump directly to highlighted page
  • Multi-document highlights in one response

ANSWER GENERATION

  • Local LLM only
  • Every claim cited with doc + page reference
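
One way the "every claim cited" requirement could be enforced is to number the retrieved sources in the prompt and check the `[n]` markers in the model's answer afterwards. This is only a sketch under assumptions: the source dict keys (`doc`, `page`, `text`), the prompt wording, and the three function names are made up for illustration, and the sentence splitter is deliberately naive.

```python
import re


def build_prompt(question: str, sources: list[dict]) -> str:
    """Number each source so the local LLM can cite it as [n]."""
    lines = ["Answer using only the sources below. Cite every claim as [n].", ""]
    for i, s in enumerate(sources, start=1):
        lines.append(f"[{i}] {s['doc']} p.{s['page']}: {s['text']}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


def resolve_citations(answer: str, sources: list[dict]):
    """Map [n] markers back to (doc, page); collect markers with no source."""
    cited, unknown = [], []
    for n in (int(m) for m in re.findall(r"\[(\d+)\]", answer)):
        if 1 <= n <= len(sources):
            ref = (sources[n - 1]["doc"], sources[n - 1]["page"])
            if ref not in cited:
                cited.append(ref)
        elif n not in unknown:
            unknown.append(n)
    return cited, unknown


def uncited_sentences(answer: str) -> list[str]:
    """Return sentences that carry no [n] marker (naive split on . ! ?)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [s for s in sentences if not re.search(r"\[\d+\]", s)]
```

Anything returned by `uncited_sentences`, or any marker in `unknown`, could be dropped or sent back to the model for a retry.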

MY QUESTIONS

Parsing: I’m considering LlamaIndex LiteParse.

→ Should I store document IDs + chunk IDs for PDFs to enable highlighting?

Vector DB:

  • Do I need one (e.g., Qdrant)?

  • If yes, how do I store doc IDs + chunk IDs alongside embeddings for highlighting?

  • Would pgvector in Postgres be sufficient?
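
For reference, this is roughly the record shape in question: each embedding travels with the IDs and location data needed to highlight it later. The field names, the `bbox` convention, and `to_row` are assumptions for illustration, not any library's API. As far as I understand, the metadata part would map to a point payload in Qdrant, or to ordinary columns next to a vector column with pgvector.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class ChunkRecord:
    doc_id: str
    chunk_id: str
    page: int                                # 1-based page number
    bbox: tuple[float, float, float, float]  # x0, y0, x1, y1 (assumed page coords)
    text: str
    embedding: tuple[float, ...]


def to_row(rec: ChunkRecord) -> dict:
    """Split a record into an id, a vector, and a metadata payload."""
    payload = asdict(rec)
    vector = list(payload.pop("embedding"))  # keep the vector out of the payload
    return {"id": f"{rec.doc_id}:{rec.chunk_id}", "vector": vector, "payload": payload}


rec = ChunkRecord("inv-001", "c7", 4, (72.0, 540.5, 523.0, 560.0),
                  "Total payable: 1,20,000", (0.1, 0.2))
row = to_row(rec)
print(json.dumps(row["payload"]))  # doc_id, chunk_id, page, bbox, text
```

A search hit then carries `doc_id`, `page`, and `bbox` back to the frontend, which is enough to jump to the page and draw the highlight.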

GraphRAG:

  • How effective are systems like Neo4j or Microsoft GraphRAG?

  • Can they run locally/offline, or are they too computationally heavy?

  • Is this GraphRAG pipeline from LlamaIndex a good starting point?

Highlighting UX:

  • I want something like Turnitin/iThenticate reports → exact sentence highlighted + citation.
  • Any open-source projects that already do this?
  • I found Kotaemon and AnythingLLM, which are close but don’t highlight documents.
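
For the "exact sentence highlighted" part, the first step is probably locating the cited sentence inside the extracted page text, even when line breaks differ between the chunk and the page. A minimal sketch, assuming the page text is already extracted as a string (`find_sentence_span` is a hypothetical helper, and mapping the character span to rectangles on the page is left to whatever PDF library is used):

```python
import re


def find_sentence_span(page_text: str, sentence: str):
    """Return (start, end) character offsets of `sentence` in `page_text`.

    Matching ignores case and treats any run of whitespace (including
    line breaks from PDF extraction) as equivalent. Returns None if absent.
    """
    words = sentence.split()
    if not words:
        return None
    pattern = r"\s+".join(re.escape(w) for w in words)
    match = re.search(pattern, page_text, flags=re.IGNORECASE)
    return match.span() if match else None


page = "Revenue grew.\nGross profit for FY24\nwas 4.2 crore. Costs fell."
span = find_sentence_span(page, "Gross profit for FY24 was 4.2 crore.")
if span:
    start, end = span
    print(repr(page[start:end]))
```

This only handles whitespace and case differences; hyphenation or OCR errors in scanned PDFs would need fuzzy matching on top.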

TL;DR

Trying to build a local RAG system with:

  • Storage + ingestion + tagging
  • Query + retrieval + highlighting
  • Local LLM answer generation with citations

Looking for advice on:

  • Vector DB vs pgvector
  • GraphRAG feasibility offline
  • Best way to implement document highlighting + citation preview

Would love to hear from anyone who’s built something similar or explored these tools.
