# I Exposed My Local RAG as MCP Tools in Cursor — Now I Query My Private PDFs Without Leaving the IDE

> Source: <https://pub.towardsai.net/i-exposed-my-local-rag-as-mcp-tools-in-cursor-now-i-query-my-private-pdfs-without-leaving-the-ide-41af4bbd0f91?source=rss----98111c9905da---4>
> Published: 2026-06-16 07:52:30+00:00

Across my first two articles, I built a fully local RAG pipeline on my laptop (**Ollama**, **ChromaDB**, and **TinyLlama** for private PDF Q&A), then improved how documents are split by moving from fixed-size chunking to recursive chunking with overlap.

Both projects worked well from the terminal. I could ingest a document and ask questions. But I kept running into the same friction:

I had to leave my editor, open a terminal, copy questions back and forth, and remember which commands to run.

That led me to my next experiment: expose the same RAG pipeline as an **MCP server** so Cursor can query my private documents as tools — while also trying a different local embedding model, **all-MiniLM-L6-v2**, instead of Ollama for search vectors.

This article is about what I built, how MCP fits in, and what changed when embeddings moved from Ollama to sentence-transformers.

**MCP** stands for Model Context Protocol.

Think of it like a standard plug for AI tools. Instead of every app inventing its own way to call your code, MCP gives assistants like Cursor a consistent way to discover and invoke **tools** you define — search a database, read a file, call your RAG pipeline.

For this project, MCP is the bridge between Cursor's assistant and the local RAG pipeline on my laptop.

You still own the data. It still runs on your machine. Cursor just gets a clean way to call it.

My earlier RAG demos were CLI-first. That is great for learning. It is less great when you are already inside an IDE and want answers from an internal PDF without context switching.

I wanted three things:

So I built a third project that adds:

Same recursive chunking. Same ChromaDB. Same llama3 for answers. New interfaces and a new embedding path.

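**Important:** do not reuse an old .chroma/ folder when you switch embedding models. Different models can produce different vector sizes (all-MiniLM-L6-v2 outputs 384-dimensional vectors), and even same-sized vectors from different models are not comparable. Mixing them breaks search.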
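One way to catch a stale store early is to compare vector lengths before writing anything. The sketch below is only an illustration: `check_dimension` is a hypothetical helper, not part of the project, and it assumes you record the dimension used at the first ingest somewhere you can read back.

```python
from typing import Optional


def check_dimension(stored_dim: Optional[int], new_vectors: list[list[float]]) -> int:
    """Return the dimension to record, or raise if the batch does not fit the store.

    Hypothetical helper: `stored_dim` is the vector size recorded at the first
    ingest, or None when the store is still empty.
    """
    if not new_vectors:
        raise ValueError("no vectors to check")

    # Every vector in one batch should come from the same model.
    dims = {len(vector) for vector in new_vectors}
    if len(dims) != 1:
        raise ValueError(f"inconsistent vector sizes in batch: {sorted(dims)}")

    new_dim = dims.pop()
    if stored_dim is not None and stored_dim != new_dim:
        raise ValueError(
            f"store holds {stored_dim}-dim vectors but the model produced "
            f"{new_dim}-dim vectors; reset the store and re-ingest"
        )
    return new_dim
```

A size check only catches models with different output sizes. Two models with the same size would pass it and still give poor matches, so resetting the store on every model change remains the safer habit.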

A common point of confusion in RAG: you usually need **more than one model**.

The embedding model does not write answers. It converts text into numbers (vectors) that capture meaning.

In my first two projects, I pulled embeddings through Ollama. Here I use **sentence-transformers** with all-MiniLM-L6-v2 instead.

When you ask *“What is the MFA policy?”* and your document says *“Multi-Factor Authentication must be configured before access is granted”*, the embedding model helps the system understand those are related — even when the words differ.
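To make "related even when the words differ" concrete, here is a small sketch of cosine similarity, one common way to compare embedding vectors. The three-dimensional vectors are invented for illustration (real MiniLM vectors have 384 dimensions), and which distance metric ChromaDB actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


# Toy vectors, made up for this example; they are not real model output.
question = [0.9, 0.1, 0.0]   # stands in for "What is the MFA policy?"
related = [0.8, 0.2, 0.1]    # stands in for the Multi-Factor Authentication sentence
unrelated = [0.0, 0.1, 0.9]  # stands in for some unrelated sentence

# The related sentence scores higher than the unrelated one.
print(cosine_similarity(question, related) > cosine_similarity(question, unrelated))
```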

The second model is the LLM that writes the answer. llama3 is ~8B parameters. It is not GPT-4. For learning and simple Q&A on your own docs, it is enough.

It is used only when you want a **written answer** — CLI ask or MCP ask_documents.

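Instead of typing terminal commands, Cursor can call three tools:

- search_documents: retrieval only; returns matching chunks with no LLM answer
- ask_documents: retrieval plus a llama3 answer grounded in those chunks
- rag_status: how many chunks are indexed and which source files exist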

That split matters. **Search** is fast and great for debugging retrieval. **Ask** adds the LLM on top.

Suppose I indexed an internal monthly expense PDF and asked:

*“What is highest expense in month of March-2026?”*

**Step 1: search_documents (retrieval only)**

The tool returns the top chunks with source file, chunk index, and a short excerpt. I can verify retrieval found the right section — without trusting an LLM summary.

**Step 2: ask_documents (full RAG)**

The same question goes through retrieval, then llama3 writes an answer grounded in those chunks. I also get source excerpts so I can spot when the small model adds extra detail not in the text.

That workflow taught me something my earlier CLI-only projects did not surface as clearly: **always look at retrieved chunks before blaming the model.**
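For expense-style questions, one cheap way to compare chunks against the answer is to look for numbers the answer mentions that no retrieved chunk contains. This is a rough heuristic sketch, not something from the project: `ungrounded_numbers` is a hypothetical helper and the sample text is invented.

```python
import re

# Matches integers and simple formatted numbers such as 1,200 or 45.50.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def ungrounded_numbers(answer: str, chunks: list[str]) -> list[str]:
    """Return numbers that appear in the answer but in none of the chunks."""
    seen: set[str] = set()
    for chunk in chunks:
        seen.update(NUMBER.findall(chunk))
    return [number for number in NUMBER.findall(answer) if number not in seen]


# Invented sample data, only to show the idea.
chunks = ["March 2026: rent 1,200; groceries 450"]
answer = "The highest expense in March 2026 was rent at 1,200, up from 1,100."

print(ungrounded_numbers(answer, chunks))  # ['1,100']
```

An empty list does not prove the answer is right. A non-empty one is a good prompt to reread the chunks before trusting the summary.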

I kept the pipeline small and added a shared service layer plus MCP on top.

Document (.txt or .pdf) → extract text → recursive split with overlap → MiniLM embed → store in ChromaDB

I use ~500-character chunks with 50-character overlap — same tuning as my chunking article. First ingest downloads MiniLM automatically.

Ollama is **not** required for this step.
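The 500/50 numbers are easier to reason about with the arithmetic written out. The sketch below is a deliberately simplified fixed-window splitter, not the recursive splitter the project uses (which prefers to break on separators such as paragraphs and sentences). It only shows how size and overlap interact; `chunk_with_overlap` is a hypothetical name.

```python
def chunk_with_overlap(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")

    step = size - overlap  # each new chunk starts this many characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1200))
pieces = chunk_with_overlap(sample)
print(len(pieces), [len(piece) for piece in pieces])  # 3 [500, 500, 300]
```

With these settings a 1,200-character text produces chunks starting at positions 0, 450, and 900, so the last 50 characters of one chunk repeat at the start of the next.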

**Preview:** see how a file splits before embedding anything.

```
python main.py preview data/monthly-expense-data.pdf
```

**Search:** test retrieval without calling llama3.

```
python main.py search "What is a total expense in a month of March- 2026?"
```

From Cursor, the same idea is:

```
Use search_documents "Give me highest expense in month of January?"
```

Question → MiniLM embed → top-K similarity search → prompt llama3 → answer + sources

CLI:

```
python main.py ask "What MCP tools are available?" --show-sources
```

Cursor:

```
Use ask_documents: "What MCP tools are available?"
```
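The last two stages of the ask pipeline (keep the top-K chunks, then prompt the LLM) can be sketched without any model at all. This is an assumption-laden illustration: `top_k` and `build_prompt` are hypothetical helpers, the prompt wording is mine, and the project's real query.py may look quite different.

```python
def top_k(scored_chunks: list[tuple[float, str]], k: int = 4) -> list[str]:
    """Keep the k highest-scoring chunks (here a higher score means more similar)."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    return [text for _, text in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Number the chunks and ask the model to stay inside them."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Invented scores and chunk texts, only to show the flow.
scored = [(0.2, "chunk about travel"), (0.9, "chunk about rent"), (0.5, "chunk about food")]
selected = top_k(scored, k=2)
print(build_prompt("What is the highest expense?", selected))
```

Numbering the chunks in the prompt makes it easier to line up the answer with the source excerpts afterwards.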

The main embedding change is in rag/embedder.py:

``` python
from sentence_transformers import SentenceTransformer

from rag.config import EMBED_MODEL

_model = None


def _get_model() -> SentenceTransformer:
    global _model
    if _model is None:
        _model = SentenceTransformer(EMBED_MODEL)
    return _model


def embed_texts(texts: list[str]) -> list[list[float]]:
    if not texts:
        return []
    model = _get_model()
    vectors = model.encode(texts, convert_to_numpy=True, show_progress_bar=False)
    return vectors.tolist()
```

Model name and chunk settings live in one place — rag/config.py:

```
EMBED_MODEL = "all-MiniLM-L6-v2"
LLM_MODEL = "llama3"
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
TOP_K = 4
```

The MCP server reuses the same service functions as the CLI:

``` python
from mcp.server.fastmcp import FastMCP

from rag import service

mcp = FastMCP("local-rag-mcp-minilm")


@mcp.tool()
def search_documents(question: str, top_k: int = 4) -> dict:
    """Search indexed documents and return matching chunks (no LLM answer)."""
    result = service.search_documents(question, top_k=top_k)
    return {"ok": True, **result}


@mcp.tool()
def ask_documents(question: str, top_k: int = 4) -> dict:
    """Ask a question and get an answer grounded in indexed documents."""
    result = service.ask_documents(question, top_k=top_k)
    return {"ok": True, **result}


@mcp.tool()
def rag_status() -> dict:
    """Return how many chunks are indexed and which source files exist."""
    result = service.get_index_status()
    return {"ok": True, **result}
```

Roughly **50 lines** of MCP wiring on top of RAG code I already understood. That was the surprise — MCP felt abstract until I mapped three CLI operations to three tools.

Add this to ~/.cursor/mcp.json (adjust paths for your machine):

```
{
  "mcpServers": {
    "local-rag-mcp-minilm": {
      "command": "C:\\path\\to\\local-rag-mcp-minilm\\.venv\\Scripts\\python.exe",
      "args": ["C:\\path\\to\\local-rag-mcp-minilm\\mcp_server.py"]
    }
  }
}
```

Restart Cursor. The tools show up in chat.
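Windows paths in JSON need every backslash doubled, which is easy to get wrong by hand. If that trips you up, a few lines of Python can print the snippet for you. This is just a convenience sketch: `cursor_mcp_config` is a hypothetical helper, and the path is the same placeholder as above, so swap in your own project folder.

```python
import json
from pathlib import PureWindowsPath


def cursor_mcp_config(project_dir: str, server_name: str = "local-rag-mcp-minilm") -> str:
    """Build the mcpServers entry as JSON text, with backslashes escaped by json.dumps."""
    root = PureWindowsPath(project_dir)
    config = {
        "mcpServers": {
            server_name: {
                # Interpreter inside the project's virtual environment.
                "command": str(root / ".venv" / "Scripts" / "python.exe"),
                "args": [str(root / "mcp_server.py")],
            }
        }
    }
    return json.dumps(config, indent=2)


# Placeholder path; replace it with the real location on your machine.
print(cursor_mcp_config(r"C:\path\to\local-rag-mcp-minilm"))
```

Paste the printed output into the config file, or merge the server entry into an existing `mcpServers` object if you already have other servers configured.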

The app is still command-line driven for ingest and testing:

```
# Preview how a document is split (no model download needed)
python main.py preview data/monthly-expense-data.txt

# Index a text or PDF file (downloads MiniLM on first run)
python main.py ingest data/monthly-expense-data.txt

# Search without calling the LLM
python main.py search "Summarize expense for April month?"

# Ask a question
python main.py ask "Summarize expense for April month?"

# Show retrieved sources with the answer
python main.py ask "What is recursive chunking?" --show-sources

# Check status
python main.py status

# Clear the vector store
python main.py reset
```

That is it. No web server. No cloud API. MCP is an extra entry point — not a replacement for the CLI while learning.
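If you want to build a similar CLI yourself, the subcommands above map naturally onto argparse subparsers. The sketch below is my guess at a minimal layout that accepts the same commands; it is not the project's actual main.py, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser that accepts the subcommands shown above (a sketch, not the real CLI)."""
    parser = argparse.ArgumentParser(prog="main.py", description="Local RAG CLI (sketch)")
    sub = parser.add_subparsers(dest="command", required=True)

    # preview and ingest both take a path to a .txt or .pdf file.
    for name in ("preview", "ingest"):
        file_cmd = sub.add_parser(name)
        file_cmd.add_argument("path")

    search = sub.add_parser("search")
    search.add_argument("question")

    ask = sub.add_parser("ask")
    ask.add_argument("question")
    ask.add_argument("--show-sources", action="store_true")

    # status and reset take no arguments.
    sub.add_parser("status")
    sub.add_parser("reset")
    return parser


args = build_parser().parse_args(["ask", "What is recursive chunking?", "--show-sources"])
print(args.command, args.show_sources)  # ask True
```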

I organized the code into small, focused modules:

```
local-rag-mcp-minilm/
├── main.py                      # CLI: ingest, preview, search, ask, status, reset
├── mcp_server.py                # FastMCP tools for Cursor
├── requirements.txt
├── README.md
├── docs/
│   └── architecture.md          # Diagrams and pipeline breakdown
├── rag/
│   ├── config.py                # models, chunk size, overlap, top_k
│   ├── chunker.py               # RecursiveCharacterTextSplitter
│   ├── document_loader.py       # .txt and .pdf support
│   ├── embedder.py              # all-MiniLM-L6-v2
│   ├── vector_store.py          # ChromaDB
│   ├── query.py                 # Retrieve + generate + source excerpts
│   └── service.py               # Shared logic for CLI + MCP
├── examples/
│   └── cursor-mcp.json          # MCP config snippet
└── data/
    └── monthly-expense-data.pdf # Demo document
```

Each file has one responsibility. The **service.py** layer is the key design choice — CLI and MCP both call the same functions, so behavior stays in sync.
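The shared-service idea can be shown in miniature with stand-in functions. In this sketch, `make_service`, `fake_retrieve`, and `fake_generate` are all hypothetical; the real service.py talks to ChromaDB and llama3. What it illustrates is the shape: ask is search plus a generation step, and both entry points would call the same two functions.

```python
from typing import Callable


def make_service(
    retrieve: Callable[[str, int], list[dict]],
    generate: Callable[[str, list[dict]], str],
):
    """Build search/ask functions that a CLI and an MCP server could both import."""

    def search_documents(question: str, top_k: int = 4) -> dict:
        # Retrieval only: no LLM involved.
        return {"question": question, "chunks": retrieve(question, top_k)}

    def ask_documents(question: str, top_k: int = 4) -> dict:
        # Full RAG: reuse the exact same retrieval, then add an answer.
        found = search_documents(question, top_k)
        return {**found, "answer": generate(question, found["chunks"])}

    return search_documents, ask_documents


# Stand-ins for the vector store and the LLM, invented for this example.
def fake_retrieve(question: str, top_k: int) -> list[dict]:
    return [{"source": "demo.txt", "chunk_index": i, "text": f"chunk {i}"} for i in range(top_k)]


def fake_generate(question: str, chunks: list[dict]) -> str:
    return f"answer built from {len(chunks)} chunks"


search_documents, ask_documents = make_service(fake_retrieve, fake_generate)
print(ask_documents("What is the MFA policy?", top_k=2)["answer"])
```

Because ask reuses search internally, the chunks you inspect with search are the same ones the answer is built from.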

This kind of project is great if you are:

You do not need a GPU farm. You need Python, Ollama (for ask only), and at least one document to index.

From here, the natural extensions in this series are:

But even in its current form, this project already shows the core lesson: **your RAG pipeline can be a tool, not just a script.**

My first RAG article taught me the loop — ingest, embed, retrieve, generate.

My second taught me that retrieval quality starts with chunking.

This one taught me that once the loop works, the next step is making it **callable** — from your IDE, through MCP, on your own machine, on your own documents.

If you are starting with MCP and RAG together, my advice is the same as before: build the smallest version first. Ingest one PDF. Call rag_status. Run search_documents on one question. Then try ask_documents. Compare the chunks to the answer.

That one end-to-end pass teaches more than reading ten protocol diagrams.

You can download the complete source code from GitHub: [https://github.com/parivshah/local-rag-mcp-minilm](https://github.com/parivshah/local-rag-mcp-minilm)

The repository includes setup instructions, architecture diagrams, and an MCP config example. Please let me know if you liked this article or have any questions, feedback, or suggestions. You can connect with me on [LinkedIn](https://www.linkedin.com/in/parivshah).

[I Exposed My Local RAG as MCP Tools in Cursor — Now I Query My Private PDFs Without Leaving the IDE](https://pub.towardsai.net/i-exposed-my-local-rag-as-mcp-tools-in-cursor-now-i-query-my-private-pdfs-without-leaving-the-ide-41af4bbd0f91) was originally published in [Towards AI](https://pub.towardsai.net) on Medium.
