{"slug": "i-exposed-my-local-rag-as-mcp-tools-in-cursor-now-i-query-my-private-pdfs-the", "title": "I Exposed My Local RAG as MCP Tools in Cursor — Now I Query My Private PDFs Without Leaving the IDE", "summary": "A developer built a local RAG pipeline using Ollama, ChromaDB, and TinyLlama for private PDF Q&A, then exposed it as an MCP server so Cursor IDE can query documents directly. The project uses sentence-transformers all-MiniLM-L6-v2 for embeddings and llama3 for answers, enabling search and ask tools without leaving the editor.", "body_md": "Across my first two articles, I built a fully local RAG pipeline on my laptop — **Ollama**, **ChromaDB**, and **TinyLlama** for private PDF Q&A — then improved how documents are split by moving from fixed-size chunking to recursive chunking with overlap.\n\nBoth projects worked well from the terminal. I could ingest a document and ask questions. But I kept running into the same friction:\n\nI had to leave my editor, open a terminal, copy questions back and forth, and remember which commands to run.\n\nThat led me to my next experiment: expose the same RAG pipeline as an **MCP server** so Cursor can query my private documents as tools — while also trying a different local embedding model, **all-MiniLM-L6-v2**, instead of Ollama for search vectors.\n\nThis article is about what I built, how MCP fits in, and what changed when embeddings moved from Ollama to sentence-transformers.\n\n**MCP** stands for Model Context Protocol.\n\nThink of it like a standard plug for AI tools. Instead of every app inventing its own way to call your code, MCP gives assistants like Cursor a consistent way to discover and invoke **tools** you define — search a database, read a file, call your RAG pipeline.\n\nFor this project, MCP is the bridge between:\n\nYou still own the data. It still runs on your machine. Cursor just gets a clean way to call it.\n\nMy earlier RAG demos were CLI-first. That is great for learning. 
It is less great when you are already inside an IDE and want answers from an internal PDF without context switching.\n\nI wanted three things:\n\nSo I built a third project that adds:\n\nSame recursive chunking. Same ChromaDB. Same llama3 for answers. New interfaces and a new embedding path.\n\n**Important:** do not reuse an old .chroma/ folder when you switch embedding models. Different models can produce vectors of different sizes, and vectors from different models are not comparable even when the sizes match. Mixing them breaks search.\n\nA common point of confusion in RAG: you usually need **more than one model**.\n\nThis model does not write answers. It converts text into numbers (vectors) that capture meaning.\n\nIn my first two projects, I pulled embeddings through Ollama. Here I use **sentence-transformers** instead:\n\nWhen you ask *“What is the MFA policy?”* and your document says *“Multi-Factor Authentication must be configured before access is granted”*, the embedding model helps the system understand those are related — even when the words differ.\n\nllama3 is ~8B parameters. It is not GPT-4. For learning and simple Q&A on your own docs, it is enough.\n\nIt is used only when you want a **written answer** — CLI ask or MCP ask_documents.\n\nInstead of typing terminal commands, Cursor can call:\n\nThat split matters. **Search** is fast and great for debugging retrieval. **Ask** adds the LLM on top.\n\nSuppose I indexed an internal monthly expense PDF and asked:\n\n*“What is highest expense in month of March-2026?”*\n\n**Step 1 — search_documents (retrieval only)**\n\nThe tool returns the top chunks with source file, chunk index, and a short excerpt. I can verify retrieval found the right section — without trusting an LLM summary.\n\n**Step 2 — ask_documents (full RAG)**\n\nThe same question goes through retrieval, then llama3 writes an answer grounded in those chunks. 
I also get source excerpts so I can spot when the small model adds extra detail not in the text.\n\nThat workflow taught me something my earlier CLI-only projects did not surface as clearly: **always look at retrieved chunks before blaming the model.**\n\nI kept the pipeline small and added a shared service layer plus MCP on top.\n\nDocument (.txt or .pdf) → extract text → recursive split with overlap → MiniLM embed → store in ChromaDB\n\nI use ~500-character chunks with 50-character overlap — same tuning as my chunking article. First ingest downloads MiniLM automatically.\n\nOllama is **not** required for this step.\n\n**Preview:** see how a file splits before embedding anything\n\n```\npython main.py preview data/monthly-expense-data.pdf\n```\n\n**Search:** test retrieval without calling llama3.\n\n```\npython main.py search \"What is a total expense in a month of March- 2026?\"\n```\n\nFrom Cursor, the same idea is:\n\n```\nUse search_documents \"Give me highest expense in month of January?\"\n```\n\nQuestion → MiniLM embed → top-K similarity search → prompt llama3 → answer + sources\n\nCLI:\n\n```\npython main.py ask \"What MCP tools are available?\" --show-sources\n```\n\nCursor:\n\n```\nUse ask_documents: \"What MCP tools are available?\"\n```\n\nThe main embedding change is in rag/embedder.py:\n\n``` python\nfrom sentence_transformers import SentenceTransformer\n\nfrom rag.config import EMBED_MODEL\n\n_model = None\n\n\ndef _get_model() -> SentenceTransformer:\n    global _model\n    if _model is None:\n        _model = SentenceTransformer(EMBED_MODEL)\n    return _model\n\n\ndef embed_texts(texts: list[str]) -> list[list[float]]:\n    if not texts:\n        return []\n    model = _get_model()\n    vectors = model.encode(texts, convert_to_numpy=True, show_progress_bar=False)\n    return vectors.tolist()\n```\n\nModel name and chunk settings live in one place — rag/config.py:\n\n```\nEMBED_MODEL = \"all-MiniLM-L6-v2\"\nLLM_MODEL = \"llama3\"\nCHUNK_SIZE = 500\nCHUNK_OVERLAP = 50\nTOP_K = 4\n```\n\nThe 
MCP server reuses the same service functions as the CLI:\n\n``` python\nfrom mcp.server.fastmcp import FastMCP\n\nfrom rag import service\n\nmcp = FastMCP(\"local-rag-mcp-minilm\")\n\n\n@mcp.tool()\ndef search_documents(question: str, top_k: int = 4) -> dict:\n    \"\"\"Search indexed documents and return matching chunks (no LLM answer).\"\"\"\n    result = service.search_documents(question, top_k=top_k)\n    return {\"ok\": True, **result}\n\n\n@mcp.tool()\ndef ask_documents(question: str, top_k: int = 4) -> dict:\n    \"\"\"Ask a question and get an answer grounded in indexed documents.\"\"\"\n    result = service.ask_documents(question, top_k=top_k)\n    return {\"ok\": True, **result}\n\n\n@mcp.tool()\ndef rag_status() -> dict:\n    \"\"\"Return how many chunks are indexed and which source files exist.\"\"\"\n    result = service.get_index_status()\n    return {\"ok\": True, **result}\n```\n\nRoughly **50 lines** of MCP wiring on top of RAG code I already understood. That was the surprise — MCP felt abstract until I mapped three CLI operations to three tools.\n\nAdd this to ~/.cursor/mcp.json (adjust paths for your machine):\n\n```\n{\n  \"mcpServers\": {\n    \"local-rag-mcp-minilm\": {\n      \"command\": \"C:\\\\path\\\\to\\\\local-rag-mcp-minilm\\\\.venv\\\\Scripts\\\\python.exe\",\n      \"args\": [\"C:\\\\path\\\\to\\\\local-rag-mcp-minilm\\\\mcp_server.py\"]\n    }\n  }\n}\n```\n\nRestart Cursor. 
The tools show up in chat.\n\nThe app is still command-line driven for ingest and testing:\n\n```\n# Preview how a document is split (no model download needed)\npython main.py preview data/monthly-expense-data.txt\n\n# Index a text or PDF file (downloads MiniLM on first run)\npython main.py ingest data/monthly-expense-data.txt\n\n# Search without calling the LLM\npython main.py search \"Summarize expense for April month?\"\n\n# Ask a question\npython main.py ask \"Summarize expense for April month?\"\n\n# Show retrieved sources with the answer\npython main.py ask \"What is recursive chunking?\" --show-sources\n\n# Check status\npython main.py status\n\n# Clear the vector store\npython main.py reset\n```\n\nThat is it. No web server. No cloud API. MCP is an extra entry point — not a replacement for the CLI while learning.\n\nI organized the code into small, focused modules:\n\n```\nlocal-rag-mcp-minilm/\n├── main.py                      # CLI: ingest, preview, search, ask, status, reset\n├── mcp_server.py                # FastMCP tools for Cursor\n├── requirements.txt\n├── README.md\n├── docs/\n│   └── architecture.md          # Diagrams and pipeline breakdown\n├── rag/\n│   ├── config.py                # models, chunk size, overlap, top_k\n│   ├── chunker.py               # RecursiveCharacterTextSplitter\n│   ├── document_loader.py       # .txt and .pdf support\n│   ├── embedder.py              # all-MiniLM-L6-v2\n│   ├── vector_store.py          # ChromaDB\n│   ├── query.py                 # Retrieve + generate + source excerpts\n│   └── service.py               # Shared logic for CLI + MCP\n├── examples/\n│   └── cursor-mcp.json          # MCP config snippet\n└── data/\n    └── monthly-expense-data.pdf # Demo document\n```\n\nEach file has one responsibility. The **service.py** layer is the key design choice — CLI and MCP both call the same functions, so behavior stays in sync.\n\nThis kind of project is great if you are:\n\nYou do not need a GPU farm. 
You need Python, Ollama (for ask only), and at least one document to index.\n\nFrom here, the natural extensions in this series are:\n\nBut even in its current form, this project already shows the core lesson: **your RAG pipeline can be a tool, not just a script.**\n\nMy first RAG article taught me the loop — ingest, embed, retrieve, generate.\n\nMy second taught me that retrieval quality starts with chunking.\n\nThis one taught me that once the loop works, the next step is making it **callable** — from your IDE, through MCP, on your own machine, on your own documents.\n\nIf you are starting with MCP and RAG together, my advice is the same as before: build the smallest version first. Ingest one PDF. Call rag_status. Run search_documents on one question. Then try ask_documents. Compare the chunks to the answer.\n\nThat one end-to-end pass teaches more than reading ten protocol diagrams.\n\nYou can download the complete source code from GitHub: [https://github.com/parivshah/local-rag-mcp-minilm](https://github.com/parivshah/local-rag-mcp-minilm)\n\nThe repository includes setup instructions, architecture diagrams, and an MCP config example. Please let me know if you liked this article or have any questions, feedback, or suggestions. 
You can connect with me on [LinkedIn](https://www.linkedin.com/in/parivshah).\n\n[I Exposed My Local RAG as MCP Tools in Cursor — Now I Query My Private PDFs Without Leaving the IDE](https://pub.towardsai.net/i-exposed-my-local-rag-as-mcp-tools-in-cursor-now-i-query-my-private-pdfs-without-leaving-the-ide-41af4bbd0f91) was originally published in [Towards AI](https://pub.towardsai.net) on Medium, where people are continuing the conversation by highlighting and responding to this story.", "url": "https://wpnews.pro/news/i-exposed-my-local-rag-as-mcp-tools-in-cursor-now-i-query-my-private-pdfs-the", "canonical_source": "https://pub.towardsai.net/i-exposed-my-local-rag-as-mcp-tools-in-cursor-now-i-query-my-private-pdfs-without-leaving-the-ide-41af4bbd0f91?source=rss----98111c9905da---4", "published_at": "2026-06-16 07:52:30+00:00", "updated_at": "2026-06-16 08:25:17.488732+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "ai-infrastructure", "developer-tools", "natural-language-processing"], "entities": ["Ollama", "ChromaDB", "TinyLlama", "Cursor", "sentence-transformers", "all-MiniLM-L6-v2", "llama3", "MCP"], "alternates": {"html": "https://wpnews.pro/news/i-exposed-my-local-rag-as-mcp-tools-in-cursor-now-i-query-my-private-pdfs-the", "markdown": "https://wpnews.pro/news/i-exposed-my-local-rag-as-mcp-tools-in-cursor-now-i-query-my-private-pdfs-the.md", "text": "https://wpnews.pro/news/i-exposed-my-local-rag-as-mcp-tools-in-cursor-now-i-query-my-private-pdfs-the.txt", "jsonld": "https://wpnews.pro/news/i-exposed-my-local-rag-as-mcp-tools-in-cursor-now-i-query-my-private-pdfs-the.jsonld"}}