Running a Whole RAG Agent Offline: LangGraph + Ollama + Embedded Qdrant (Zero API Keys)

A developer demonstrates running a complete RAG agent offline using LangGraph, Ollama, and an embedded Qdrant instance, requiring zero API keys or Docker. The system uses a provider-swap design that allows switching between local Ollama and remote OpenAI via configuration, with a probe trick to automatically detect embedding dimensions. The implementation successfully ingests documents and performs retrieval-augmented generation entirely on a laptop.

Most RAG tutorials open with "set your `OPENAI_API_KEY`." This one doesn't need it.

In [Part 1](https://dev.to/javaking1129/running-a-langgraph-react-agent-in-production-openai-compatible-api-multi-model-gateway--emi) I claimed the LLM and embeddings are behind a swappable boundary: "switch providers via config, not code." Part 3 is me cashing that claim: running the entire RAG agent (ingestion, retrieval, the ReAct loop, source citations) on a laptop with zero API keys and no Docker, just Ollama and an embedded Qdrant.

Everything below is real output from an actual run. Including the one thing that broke.

Three pieces, all local:

```
ollama pull qwen3.5:9b   # chat / reasoning
ollama pull bge-m3       # embeddings (1024-dim, multilingual)

CHAT_PROVIDER=ollama
```

That's it. No `OPENAI_API_KEY`, no `docker compose up`. The reason this is a flip and not a rewrite is the provider-swap design from Part 1. Let's look at the three factories that make it work.
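Each factory starts by calling `get_settings()`, and the settings module itself isn't shown in this post. As a rough idea of what could sit behind it, here is a minimal stdlib-only sketch. Treat it as an illustration under assumptions: only `CHAT_PROVIDER` and `QDRANT_URL` appear in the article as environment variables, so the other variable names, the defaults, and the `load_settings` helper are hypothetical.

```python
import os
from dataclasses import dataclass
from functools import lru_cache
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Illustrative subset of the fields the factories read (assumed shape)."""
    chat_provider: str
    embedding_provider: str
    embedding_model: str
    ollama_url: str
    qdrant_url: Optional[str]
    qdrant_path: str


def load_settings(env: Mapping[str, str]) -> Settings:
    """Build Settings from a mapping of environment variables.

    Variable names other than CHAT_PROVIDER and QDRANT_URL are hypothetical.
    """
    return Settings(
        chat_provider=env.get("CHAT_PROVIDER", "ollama"),
        embedding_provider=env.get("EMBEDDING_PROVIDER", "ollama"),
        embedding_model=env.get("EMBEDDING_MODEL", "bge-m3"),
        # 11434 is Ollama's default port.
        ollama_url=env.get("OLLAMA_URL", "http://localhost:11434"),
        # Empty or missing means "use the embedded client".
        qdrant_url=env.get("QDRANT_URL") or None,
        qdrant_path=env.get("QDRANT_PATH", "./qdrant_data"),
    )


@lru_cache
def get_settings() -> Settings:
    """Read the process environment once and reuse the result."""
    return load_settings(os.environ)
```

Keeping `load_settings` separate from the cached `get_settings` makes the mapping easy to exercise with a plain dict, without touching the real environment.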
```python
# app/llm/embeddings.py
from functools import lru_cache

from langchain_core.embeddings import Embeddings

# get_settings comes from the app's config module (import not shown here)


@lru_cache
def get_embeddings() -> Embeddings:
    s = get_settings()
    provider = s.embedding_provider.lower()
    if provider == "ollama":
        from langchain_ollama import OllamaEmbeddings
        return OllamaEmbeddings(model=s.embedding_model, base_url=s.ollama_url)
    if provider == "openai":
        from langchain_openai import OpenAIEmbeddings
        return OpenAIEmbeddings(
            base_url=f"{s.litellm_url}/v1",
            api_key=s.litellm_key,
            model=s.embedding_model,
        )
    raise ValueError(f"unknown embedding provider: {s.embedding_provider!r}")
```

Both branches return the same LangChain `Embeddings` interface, so the ingestion and retrieval code never knows which one it got. Local dev → Ollama (offline). Production → OpenAI via the gateway.

One caveat that matters later: the two providers produce different vector dimensions, so you can't mix vectors ingested with one and queried with the other. More on that in the gotchas.

```python
# app/rag/store.py
from functools import lru_cache

from qdrant_client import QdrantClient


@lru_cache
def get_client() -> QdrantClient:
    s = get_settings()
    if s.qdrant_url:
        return QdrantClient(url=s.qdrant_url, api_key=s.qdrant_api_key)  # remote (prod)
    return QdrantClient(path=s.qdrant_path)  # embedded (local)
```

No `QDRANT_URL`? You get an embedded client that persists to `s.qdrant_path`, a plain directory. Set `QDRANT_URL` in prod and the same code talks to a real Qdrant service. The trade-off of embedded mode: it locks the directory to a single process, which becomes gotcha #2.

The ingest script is the whole pipeline in ~30 lines: load files, split them, probe the embedding dimension, create the collection, upsert.
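Of those steps, the probe is the one that keeps a provider swap from writing vectors of the wrong size into an existing collection. In isolation it could look like the sketch below. `FakeEmbedder`, the dict standing in for Qdrant, and this three-argument `ensure_collection` are hypothetical stand-ins, not the project's code, and raising on a mismatch is an assumed behavior rather than something the article confirms.

```python
from typing import Dict, List


class FakeEmbedder:
    """Hypothetical stand-in for a LangChain Embeddings object."""

    def __init__(self, dim: int) -> None:
        self.dim = dim

    def embed_query(self, text: str) -> List[float]:
        # A real embedder would call the model; this returns a fixed-size zero vector.
        return [0.0] * self.dim


def probe_dim(embedder: FakeEmbedder) -> int:
    """Ask the active embedder for one vector and measure its length."""
    return len(embedder.embed_query("probe"))


def ensure_collection(collections: Dict[str, int], name: str, dim: int) -> None:
    """Create the collection if missing; refuse to reuse one with another size.

    `collections` maps collection name -> vector size and stands in for Qdrant.
    """
    existing = collections.get(name)
    if existing is None:
        collections[name] = dim
    elif existing != dim:
        raise ValueError(
            f"collection {name!r} has dim {existing}, embedder produces {dim}; "
            "re-ingest after switching providers"
        )


collections: Dict[str, int] = {}
ensure_collection(collections, "docs", probe_dim(FakeEmbedder(1024)))
```

Calling `ensure_collection` again with an embedder of a different size raises instead of quietly mixing vectors, which is the failure mode the dimension caveat warns about.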
```python
# scripts/ingest.py (trimmed)
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_documents(documents)

# probe the embedding dimension so the collection matches the provider
dim = len(get_embeddings().embed_query("probe"))
ensure_collection(dim)
get_vector_store().add_documents(chunks)
```

The `embed_query("probe")` trick is worth pausing on: instead of hard-coding 1024 (for bge-m3) or 1536 (for OpenAI), it asks the active embedder for one vector and measures it. Swap the provider and the collection is created with the right size automatically.

Running it for real:

```bash
$ python scripts/ingest.py --reset
ingest source=docs collection=docs embed=ollama:bge-m3
ingest 5 documents → 53 chunks
ingest embedding dim = 1024
ingest done — 53 points in collection
```

Five markdown files, 53 chunks, 1024-dim vectors from bge-m3, written to the local Qdrant directory. No network calls left the machine.

You can hit the FastAPI endpoint, but to watch the graph think you can also invoke it directly. Here's a real run, asking about something that lives in the docs:

```python
res = await graph.ainvoke({"messages": [HumanMessage(content=
    "How is short-term vs long-term memory implemented in this project?")]})
print([type(m).__name__ for m in res["messages"]])
# ['HumanMessage', 'AIMessage', 'ToolMessage', 'AIMessage']
```

That message sequence is the ReAct loop, visible in the state:

1. `HumanMessage`: the question
2. `AIMessage` with `tool_calls=[search_docs(...)]`: the model decides to retrieve
3. `ToolMessage`: the retrieved chunks come back
4. `AIMessage`: the final synthesized answer

And the answer itself, generated entirely by a 9B model on the laptop:

> Short-term memory: PostgreSQL (`PostgresSaver`) stores per-thread conversation state; swappable to Redis (`RedisSaver`) if needed.
>
> Long-term memory: Zep manages the user's persistent knowledge, recalled by the app on later turns.
>
> Sources: