{"slug": "running-a-whole-rag-agent-offline-langgraph-ollama-embedded-qdrant-zero-api-keys", "title": "Running a Whole RAG Agent Offline: LangGraph + Ollama + Embedded Qdrant (Zero API Keys)", "summary": "A developer demonstrates running a complete RAG agent offline using LangGraph, Ollama, and an embedded Qdrant instance, requiring zero API keys or Docker. The system uses a provider-swap design that allows switching between local Ollama and remote OpenAI via configuration, with a probe trick to automatically detect embedding dimensions. The implementation successfully ingests documents and performs retrieval-augmented generation entirely on a laptop.", "body_md": "Most RAG tutorials open with \"set your `OPENAI_API_KEY`.\" This one doesn't need it. In [Part 1](https://dev.to/javaking1129/running-a-langgraph-react-agent-in-production-openai-compatible-api-multi-model-gateway--emi) I claimed the LLM and embeddings are behind a swappable boundary — \"switch providers via config, not code.\" Part 3 is me *cashing that claim*: running the entire RAG agent — ingestion, retrieval, the ReAct loop, source citations — on a laptop with **zero API keys and no Docker**, just Ollama and an embedded Qdrant.\n\nEverything below is real output from an actual run. Including the one thing that broke.\n\nThree pieces, all local:\n\n```\n  ollama pull qwen3.5:9b   # chat / reasoning\n  ollama pull bge-m3       # embeddings (1024-dim, multilingual)\nCHAT_PROVIDER=ollama\n```\n\nThat's it. No `OPENAI_API_KEY`, no `docker compose up`. 
The reason this is a *flip* and not a rewrite is the provider-swap design from Part 1 — let's look at the three factories that make it work.\n\n``` python\n# app/llm/embeddings.py\nfrom functools import lru_cache\n\nfrom langchain_core.embeddings import Embeddings\n\n# get_settings() is the project's own settings accessor (import omitted here)\n@lru_cache\ndef get_embeddings() -> Embeddings:\n    s = get_settings()\n    provider = s.embedding_provider.lower()\n\n    if provider == \"ollama\":\n        from langchain_ollama import OllamaEmbeddings\n        return OllamaEmbeddings(model=s.embedding_model, base_url=s.ollama_url)\n\n    if provider == \"openai\":\n        from langchain_openai import OpenAIEmbeddings\n        return OpenAIEmbeddings(base_url=f\"{s.litellm_url}/v1\",\n                                api_key=s.litellm_key, model=s.embedding_model)\n\n    raise ValueError(f\"unknown embedding_provider: {s.embedding_provider!r}\")\n```\n\nBoth branches return the same LangChain `Embeddings` interface, so the ingestion and retrieval code never knows which one it got. Local dev → Ollama (offline). Production → OpenAI via the gateway. **One caveat that matters later:** the two providers produce *different vector dimensions*, so you can't mix vectors ingested with one and queried with the other. More on that in the gotchas.\n\n``` python\n# app/rag/store.py\nfrom functools import lru_cache\n\nfrom qdrant_client import QdrantClient\n\n@lru_cache\ndef get_client() -> QdrantClient:\n    s = get_settings()\n    if s.qdrant_url:\n        return QdrantClient(url=s.qdrant_url, api_key=s.qdrant_api_key)  # remote (prod)\n    return QdrantClient(path=s.qdrant_path)                             # embedded (local)\n```\n\nNo `QDRANT_URL`? You get an embedded client that persists to `s.qdrant_path` — a plain directory. Set `QDRANT_URL` in prod and the *same code* talks to a real Qdrant service. 
The trade-off of embedded mode: it **locks the directory to a single process**, which becomes gotcha #2.\n\nThe ingest script is the whole pipeline in ~30 lines: load files, split them, probe the embedding dimension, create the collection, upsert.\n\n```\n# scripts/ingest.py (trimmed)\nsplitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)\nchunks = splitter.split_documents(documents)\n\n# probe the embedding dimension so the collection matches the provider\ndim = len(get_embeddings().embed_query(\"probe\"))\nensure_collection(dim)\nget_vector_store().add_documents(chunks)\n```\n\nThe `embed_query(\"probe\")` trick is worth pausing on: instead of hard-coding `1024` for bge-m3 (or `1536` for OpenAI), it asks the active embedder for one vector and measures it. Swap the provider and the collection is created with the right size automatically.\n\nRunning it for real:\n\n``` bash\n$ python scripts/ingest.py --reset\n[ingest] source=docs  collection=docs  embed=ollama:bge-m3\n[ingest] 5 documents → 53 chunks\n[ingest] embedding dim = 1024\n[ingest] done — 53 points in collection\n```\n\nFive markdown files, 53 chunks, 1024-dim vectors from bge-m3, written to the local Qdrant directory. No network calls left the machine.\n\nYou can hit the FastAPI endpoint, but to *see the graph think* you can also invoke it directly. 
Here's a real run, asking about something that lives in the docs:\n\n```\nres = await graph.ainvoke({\"messages\": [HumanMessage(content=\n    \"How is short-term vs long-term memory implemented in this project?\")]})\n\nprint([type(m).__name__ for m in res[\"messages\"]])\n# ['HumanMessage', 'AIMessage', 'ToolMessage', 'AIMessage']\n```\n\nThat message sequence *is* the ReAct loop, visible in the state:\n\n1. `HumanMessage` — the question\n2. `AIMessage` with `tool_calls=[search_docs(...)]` — the model decides to retrieve\n3. `ToolMessage` — the retrieved chunks come back\n4. `AIMessage` — the final synthesized answer\n\nAnd the answer itself, generated entirely by a 9B model on the laptop:\n\n```\nShort-term memory: PostgreSQL (PostgresSaver) stores per-thread\n  conversation state; swappable to Redis (RedisSaver) if needed.\nLong-term memory: Zep manages the user's persistent knowledge,\n  recalled by the app on later turns.\n\nSources: <doc-a>.md, <doc-b>.md\n```\n\nGrounded in the actual docs, with source attribution, zero API keys. That's the win. Now the part the tutorials skip.\n\nOn one run, the *exact same question* produced this:\n\n```\n[1] AIMessage   content=''   tool_calls=[search_docs(...)]   finish_reason='tool_calls'\n[2] ToolMessage content='[1] (source: ...) ## memory layers ...'   ← retrieval worked\n[3] AIMessage   content=''   tool_calls=[]   finish_reason='stop'  ← empty answer\n```\n\nRetrieval succeeded. The chunks were right there in step 2. But step 3 — the model's job to *read the chunks and answer* — came back **empty**. `finish_reason='stop'`, no tokens, no error. Re-running the same question gave a perfectly good 280-character answer with citations. So it's **intermittent**: a small local model occasionally produces an empty turn after a tool call.\n\nTwo things to take away:\n\n`saw_token` fallback from `ainvoke` when no tokens stream, but here `ainvoke`\n\nEmbedded mode keeps the store in one process. 
Run the ingest script while the server is up and you'll get a lock error. Order matters: **ingest first → let it exit → then start the server.** The ingest script even closes the client explicitly to avoid a noisy shutdown traceback.\n\nbge-m3 is 1024-dim; OpenAI's `text-embedding-3-small` is 1536. If you ingest with one provider and query with another, the dimensions don't line up and search breaks. Switching `embedding_provider` means **re-ingesting** (`--reset`). The `embed_query(\"probe\")` dimension check is exactly what keeps the collection honest per provider.\n\nOllama loads the model into memory on first use. The first request eats that cost; subsequent ones are fast. Don't benchmark the cold start.\n\nYou can build, debug, and demo the *entire* RAG agent — graph, retrieval, citations — on a plane with no wifi. Then, for production, you flip two config values (`CHAT_PROVIDER`, `QDRANT_URL`) and the same code talks to a hosted model and a real Qdrant cluster. Part 1 *claimed* the provider boundary; Part 3 *ran on both sides of it*.\n\nThe flip side is honesty about local models: retrieval is rock-solid, but a 9B model's synthesis step is the weak link, and it'll occasionally hand you an empty answer. Know that going in.\n\nNext: persisting conversation threads with a checkpointer — so the agent remembers across requests — and what that adds to the message log you just saw.\n\n*Part 3 of a series on running LangGraph in production. 
Part 1 · Part 2.*", "url": "https://wpnews.pro/news/running-a-whole-rag-agent-offline-langgraph-ollama-embedded-qdrant-zero-api-keys", "canonical_source": "https://dev.to/javaking1129/running-a-whole-rag-agent-offline-langgraph-ollama-embedded-qdrant-zero-api-keys-2hfd", "published_at": "2026-06-29 01:22:31+00:00", "updated_at": "2026-06-29 02:27:14.223013+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "ai-agents", "natural-language-processing", "machine-learning"], "entities": ["LangGraph", "Ollama", "Qdrant", "LangChain", "OpenAI", "bge-m3", "qwen3.5", "FastAPI"], "alternates": {"html": "https://wpnews.pro/news/running-a-whole-rag-agent-offline-langgraph-ollama-embedded-qdrant-zero-api-keys", "markdown": "https://wpnews.pro/news/running-a-whole-rag-agent-offline-langgraph-ollama-embedded-qdrant-zero-api-keys.md", "text": "https://wpnews.pro/news/running-a-whole-rag-agent-offline-langgraph-ollama-embedded-qdrant-zero-api-keys.txt", "jsonld": "https://wpnews.pro/news/running-a-whole-rag-agent-offline-langgraph-ollama-embedded-qdrant-zero-api-keys.jsonld"}}