I Exposed My Local RAG as MCP Tools in Cursor — Now I Query My Private PDFs Without Leaving the IDE

A developer built a local RAG pipeline using Ollama, ChromaDB, and TinyLlama for private PDF Q&A, then exposed it as an MCP server so Cursor IDE can query documents directly. The project uses sentence-transformers all-MiniLM-L6-v2 for embeddings and llama3 for answers, enabling search and ask tools without leaving the editor.

Across my first two articles, I built a fully local RAG pipeline on my laptop (Ollama, ChromaDB, and TinyLlama for private PDF Q&A), then improved how documents are split by moving from fixed-size chunking to recursive chunking with overlap. Both projects worked well from the terminal. I could ingest a document and ask questions. But I kept running into the same friction: I had to leave my editor, open a terminal, copy questions back and forth, and remember which commands to run.

That led me to my next experiment: expose the same RAG pipeline as an MCP server so Cursor can query my private documents as tools, while also trying a different local embedding model, all-MiniLM-L6-v2, instead of Ollama for search vectors. This article is about what I built, how MCP fits in, and what changed when embeddings moved from Ollama to sentence-transformers.

MCP stands for Model Context Protocol. Think of it like a standard plug for AI tools. Instead of every app inventing its own way to call your code, MCP gives assistants like Cursor a consistent way to discover and invoke tools you define: search a database, read a file, call your RAG pipeline. For this project, MCP is the bridge between Cursor and my local RAG pipeline. You still own the data. It still runs on your machine. Cursor just gets a clean way to call it.

My earlier RAG demos were CLI-first. That is great for learning. It is less great when you are already inside an IDE and want answers from an internal PDF without context switching.

I wanted three things:

So I built a third project that adds MCP tools for Cursor, a shared service layer, and MiniLM embeddings through sentence-transformers. Same recursive chunking. Same ChromaDB. Same llama3 for answers. New interfaces and a new embedding path.

Important: do not reuse an old .chroma/ folder when you switch embedding models. Different models produce different vector sizes. Mixing them breaks search.

A common point of confusion in RAG: you usually need more than one model. The embedding model, all-MiniLM-L6-v2, does not write answers. It converts text into numbers (vectors) that capture meaning.
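To make the warning about switching embedding models concrete, here is a minimal sketch of a guard that records which model built an index and refuses to reuse the index with a different one. The sidecar file name and both helper functions are hypothetical and are not part of the project; the only outside fact used is that all-MiniLM-L6-v2 produces 384-dimensional vectors.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sidecar file; the project itself does not describe one.
META_FILE = "embedding_meta.json"


def save_index_meta(store_dir: Path, model_name: str, dim: int) -> None:
    """Record which embedding model (and vector size) built this index."""
    store_dir.mkdir(parents=True, exist_ok=True)
    meta = {"model": model_name, "dim": dim}
    (store_dir / META_FILE).write_text(json.dumps(meta))


def check_index_meta(store_dir: Path, model_name: str, dim: int) -> None:
    """Raise ValueError if the index was built with a different model or size."""
    meta_path = store_dir / META_FILE
    if not meta_path.exists():
        return  # fresh store: nothing to compare against
    meta = json.loads(meta_path.read_text())
    if meta["model"] != model_name or meta["dim"] != dim:
        raise ValueError(
            f"Index was built with {meta['model']} ({meta['dim']} dims), "
            f"but the current model is {model_name} ({dim} dims). "
            "Reset the store and re-ingest."
        )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp) / ".chroma"
        save_index_meta(store, "all-MiniLM-L6-v2", 384)
        check_index_meta(store, "all-MiniLM-L6-v2", 384)  # same model: passes
        try:
            # "some-other-model" is a placeholder name, not a real model.
            check_index_meta(store, "some-other-model", 768)
        except ValueError as err:
            print(err)
```

A check like this would run once before ingest and once before search, so a mismatch surfaces as a clear message instead of a confusing retrieval failure.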
In my first two projects, I pulled embeddings through Ollama. Here I use sentence-transformers instead, with all-MiniLM-L6-v2. When you ask “What is the MFA policy?” and your document says “Multi-Factor Authentication must be configured before access is granted”, the embedding model helps the system understand those are related, even when the words differ.

The answer model is llama3, at ~8B parameters. It is not GPT-4. For learning and simple Q&A on your own docs, it is enough. It is used only when you want a written answer: CLI ask or MCP ask_documents.

Instead of typing terminal commands, Cursor can call three tools: search_documents (retrieval only), ask_documents (retrieval plus a llama3 answer), and rag_status (index status). That split matters. Search is fast and great for debugging retrieval. Ask adds the LLM on top.

Suppose I indexed an internal monthly expense PDF and asked: “What is the highest expense in the month of March-2026?”

Step 1: search_documents (retrieval only). The tool returns the top chunks with source file, chunk index, and a short excerpt. I can verify retrieval found the right section without trusting an LLM summary.

Step 2: ask_documents (full RAG). The same question goes through retrieval, then llama3 writes an answer grounded in those chunks. I also get source excerpts so I can spot when the small model adds extra detail not in the text.

That workflow taught me something my earlier CLI-only projects did not surface as clearly: always look at retrieved chunks before blaming the model.

I kept the pipeline small and added a shared service layer plus MCP on top. The ingest flow is:

Document (.txt or .pdf) → extract text → recursive split with overlap → MiniLM embed → store in ChromaDB

I use ~500-character chunks with 50-character overlap, the same tuning as in my chunking article. The first ingest downloads MiniLM automatically. Ollama is not required for this step.

Preview: see how a file splits before embedding anything.

```
python main.py preview data/monthly-expense-data.pdf
```

Search: test retrieval without calling llama3.

```
python main.py search "What is the total expense in the month of March-2026?"
```
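What the search step does conceptually can be sketched without any model or database. The example below ranks a few chunks against a query vector and returns the top K with source file, chunk index, and excerpt. The three-dimensional vectors and chunk texts are made up for illustration, and cosine similarity is an assumption here; the distance metric the vector store actually uses may differ.

```python
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunks: list[dict], k: int = 4) -> list[dict]:
    """Return the k chunks most similar to the query, best first."""
    scored = ((cosine(query_vec, c["vector"]), c) for c in chunks)
    best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [
        {
            "score": round(score, 3),
            "source": c["source"],
            "chunk_index": c["index"],
            "excerpt": c["text"][:80],
        }
        for score, c in best
    ]


# Toy index: real MiniLM vectors have 384 dimensions, not 3.
CHUNKS = [
    {"source": "expenses.pdf", "index": 0, "text": "March travel costs", "vector": [0.9, 0.1, 0.0]},
    {"source": "expenses.pdf", "index": 1, "text": "Office lease terms", "vector": [0.0, 1.0, 0.0]},
    {"source": "expenses.pdf", "index": 2, "text": "March and April summary", "vector": [0.5, 0.5, 0.0]},
]

if __name__ == "__main__":
    for hit in top_k([1.0, 0.0, 0.0], CHUNKS, k=2):
        print(hit)
```

Returning the excerpt and chunk index alongside the score is what makes it possible to check retrieval by eye before any LLM is involved.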
From Cursor, the same idea is:

Use search_documents: "Give me highest expense in month of January?"

The query flow is:

Question → MiniLM embed → top-K similarity search → prompt llama3 → answer + sources

CLI:

```
python main.py ask "What MCP tools are available?" --show-sources
```

Cursor:

Use ask_documents: "What MCP tools are available?"

The main embedding change is in rag/embedder.py:

```python
from sentence_transformers import SentenceTransformer

from rag.config import EMBED_MODEL

# Cache the model so it is loaded only once per process.
_model = None


def get_model() -> SentenceTransformer:
    global _model
    if _model is None:
        _model = SentenceTransformer(EMBED_MODEL)
    return _model


def embed_texts(texts: list[str]) -> list[list[float]]:
    if not texts:
        return []
    model = get_model()
    vectors = model.encode(texts, convert_to_numpy=True, show_progress_bar=False)
    return vectors.tolist()
```

Model name and chunk settings live in one place, rag/config.py:

```python
EMBED_MODEL = "all-MiniLM-L6-v2"
LLM_MODEL = "llama3"
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
TOP_K = 4
```

The MCP server reuses the same service functions as the CLI:

```python
from mcp.server.fastmcp import FastMCP

from rag import service

mcp = FastMCP("local-rag-mcp-minilm")


@mcp.tool()
def search_documents(question: str, top_k: int = 4) -> dict:
    """Search indexed documents and return matching chunks (no LLM answer)."""
    result = service.search_documents(question, top_k=top_k)
    return {"ok": True, **result}


@mcp.tool()
def ask_documents(question: str, top_k: int = 4) -> dict:
    """Ask a question and get an answer grounded in indexed documents."""
    result = service.ask_documents(question, top_k=top_k)
    return {"ok": True, **result}


@mcp.tool()
def rag_status() -> dict:
    """Return how many chunks are indexed and which source files exist."""
    result = service.get_index_status()
    return {"ok": True, **result}
```

Roughly 50 lines of MCP wiring on top of RAG code I already understood. That was the surprise: MCP felt abstract until I mapped three CLI operations to three tools.
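The pattern of mapping service functions to tools with a uniform response shape can be sketched on its own, without the MCP SDK. The `as_tool` wrapper and both stand-in service functions below are hypothetical. Returning `{"ok": False, ...}` on errors is also an assumption, since the original code only shows the success case.

```python
from typing import Any, Callable


def as_tool(service_fn: Callable[..., dict]) -> Callable[..., dict]:
    """Wrap a service function so every caller gets the same response shape."""

    def wrapper(*args: Any, **kwargs: Any) -> dict:
        try:
            return {"ok": True, **service_fn(*args, **kwargs)}
        except Exception as err:
            # Assumption: report errors as data instead of raising to the caller.
            return {"ok": False, "error": str(err)}

    return wrapper


# Stand-ins for service-layer functions (no real index behind them).
def get_index_status() -> dict:
    return {"chunks": 12, "sources": ["monthly-expense-data.pdf"]}


def search_documents(question: str, top_k: int = 4) -> dict:
    if not question.strip():
        raise ValueError("question must not be empty")
    return {"question": question, "top_k": top_k, "matches": []}


rag_status_tool = as_tool(get_index_status)
search_tool = as_tool(search_documents)

if __name__ == "__main__":
    print(rag_status_tool())
    print(search_tool("What is the MFA policy?"))
    print(search_tool("   "))
```

Because the CLI could call the plain service functions and the tool layer the wrapped ones, both entry points would share one implementation and differ only in how results are presented.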
Add this to ~/.cursor/mcp.json (adjust paths for your machine):

```json
{
  "mcpServers": {
    "local-rag-mcp-minilm": {
      "command": "C:\\path\\to\\local-rag-mcp-minilm\\.venv\\Scripts\\python.exe",
      "args": ["C:\\path\\to\\local-rag-mcp-minilm\\mcp_server.py"]
    }
  }
}
```

Restart Cursor. The tools show up in chat.

The app is still command-line driven for ingest and testing:

```
# Preview how a document is split (no model download needed)
python main.py preview data/monthly-expense-data.txt

# Index a text or PDF file (downloads MiniLM on first run)
python main.py ingest data/monthly-expense-data.txt

# Search without calling the LLM
python main.py search "Summarize expense for April month?"

# Ask a question
python main.py ask "Summarize expense for April month?"

# Show retrieved sources with the answer
python main.py ask "What is recursive chunking?" --show-sources

# Check status
python main.py status

# Clear the vector store
python main.py reset
```

That is it. No web server. No cloud API. MCP is an extra entry point, not a replacement for the CLI while learning.

I organized the code into small, focused modules:

```
local-rag-mcp-minilm/
├── main.py                        # CLI: ingest, preview, search, ask, status, reset
├── mcp_server.py                  # FastMCP tools for Cursor
├── requirements.txt
├── README.md
├── docs/
│   └── architecture.md            # Diagrams and pipeline breakdown
├── rag/
│   ├── config.py                  # models, chunk size, overlap, top_k
│   ├── chunker.py                 # RecursiveCharacterTextSplitter
│   ├── document_loader.py         # .txt and .pdf support
│   ├── embedder.py                # all-MiniLM-L6-v2
│   ├── vector_store.py            # ChromaDB
│   ├── query.py                   # Retrieve + generate + source excerpts
│   └── service.py                 # Shared logic for CLI + MCP
├── examples/
│   └── cursor-mcp.json            # MCP config snippet
└── data/
    └── monthly-expense-data.pdf   # Demo document
```

Each file has one responsibility. The service.py layer is the key design choice: CLI and MCP both call the same functions, so behavior stays in sync.

This kind of project is great if you are:

You do not need a GPU farm.
You need Python, Ollama (for ask only), and at least one document to index.

From here, the natural extensions in this series are:

But even in its current form, this project already shows the core lesson: your RAG pipeline can be a tool, not just a script.

My first RAG article taught me the loop: ingest, embed, retrieve, generate. My second taught me that retrieval quality starts with chunking. This one taught me that once the loop works, the next step is making it callable: from your IDE, through MCP, on your own machine, on your own documents.

If you are starting with MCP and RAG together, my advice is the same as before: build the smallest version first. Ingest one PDF. Call rag_status. Run search_documents on one question. Then try ask_documents. Compare the chunks to the answer. That one end-to-end pass teaches more than reading ten protocol diagrams.

You can download the complete source code from GitHub: https://github.com/parivshah/local-rag-mcp-minilm

The repository includes setup instructions, architecture diagrams, and an MCP config example.

Please let me know if you liked this article or have any questions, feedback, or suggestions. You can connect with me on LinkedIn: https://www.linkedin.com/in/parivshah

This article was originally published in Towards AI (https://pub.towardsai.net) on Medium: https://pub.towardsai.net/i-exposed-my-local-rag-as-mcp-tools-in-cursor-now-i-query-my-private-pdfs-without-leaving-the-ide-41af4bbd0f91