# Running a Whole RAG Agent Offline: LangGraph + Ollama + Embedded Qdrant (Zero API Keys)

> Source: <https://dev.to/javaking1129/running-a-whole-rag-agent-offline-langgraph-ollama-embedded-qdrant-zero-api-keys-2hfd>
> Published: 2026-06-29 01:22:31+00:00

Most RAG tutorials open with "set your `OPENAI_API_KEY`." This one doesn't need it. In [Part 1](https://dev.to/javaking1129/running-a-langgraph-react-agent-in-production-openai-compatible-api-multi-model-gateway--emi) I claimed the LLM and embeddings are behind a swappable boundary: "switch providers via config, not code." Part 3 is me *cashing that claim*: running the entire RAG agent (ingestion, retrieval, the ReAct loop, source citations) on a laptop with **zero API keys and no Docker**, just Ollama and an embedded Qdrant.

Everything below is real output from an actual run. Including the one thing that broke.

Three pieces, all local:

```
ollama pull qwen3.5:9b   # chat / reasoning
ollama pull bge-m3       # embeddings (1024-dim, multilingual)
CHAT_PROVIDER=ollama
```

That's it. No `OPENAI_API_KEY`, no `docker compose up`. The reason this is a *flip* and not a rewrite is the provider-swap design from Part 1. Let's look at the three factories that make it work.

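``` python
# app/llm/embeddings.py
from functools import lru_cache

from langchain_core.embeddings import Embeddings

# get_settings() is the project's own settings loader (its import is omitted in the post)


@lru_cache
def get_embeddings() -> Embeddings:
    """Return the embedder selected by config; cached so it is built only once."""
    s = get_settings()
    provider = s.embedding_provider.lower()

    if provider == "ollama":
        # imported inside the branch, so only the selected provider's package is loaded
        from langchain_ollama import OllamaEmbeddings
        return OllamaEmbeddings(model=s.embedding_model, base_url=s.ollama_url)

    if provider == "openai":
        from langchain_openai import OpenAIEmbeddings
        return OpenAIEmbeddings(base_url=f"{s.litellm_url}/v1",
                                api_key=s.litellm_key, model=s.embedding_model)

    raise ValueError(f"unknown embedding_provider: {s.embedding_provider!r}")
```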

Both branches return the same LangChain `Embeddings` interface, so the ingestion and retrieval code never knows which one it got. Local dev → Ollama (offline). Production → OpenAI via the gateway. **One caveat that matters later:** the two providers produce *different vector dimensions*, so you can't mix vectors ingested with one and queried with the other. More on that in the gotchas.
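To make that caveat concrete, here is a stdlib-only sketch of the same factory shape with two stand-in embedders. `Embedder`, `FakeLocalEmbedder`, `FakeHostedEmbedder`, and `make_embedder` are hypothetical names for illustration; they are not part of the project or of LangChain. The sizes 1024 and 1536 are the ones this post cites for bge-m3 and `text-embedding-3-small`.

```python
from typing import Protocol


class Embedder(Protocol):
    """The only surface the calling code depends on."""

    def embed_query(self, text: str) -> list[float]: ...


class FakeLocalEmbedder:
    """Stand-in for a local model: returns 1024-dim vectors."""

    def embed_query(self, text: str) -> list[float]:
        return [0.0] * 1024


class FakeHostedEmbedder:
    """Stand-in for a hosted model: returns 1536-dim vectors."""

    def embed_query(self, text: str) -> list[float]:
        return [0.0] * 1536


_PROVIDERS = {"ollama": FakeLocalEmbedder, "openai": FakeHostedEmbedder}


def make_embedder(provider: str) -> Embedder:
    """Pick an implementation by config value, case-insensitively."""
    try:
        return _PROVIDERS[provider.lower()]()
    except KeyError:
        raise ValueError(f"unknown embedding_provider: {provider!r}") from None


local_dim = len(make_embedder("ollama").embed_query("hello"))
hosted_dim = len(make_embedder("OpenAI").embed_query("hello"))
print(local_dim, hosted_dim)
```

The caller sees one interface in both cases, yet the vectors differ in length, which is why a collection filled by one provider cannot be searched with the other.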

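``` python
# app/rag/store.py
from functools import lru_cache

from qdrant_client import QdrantClient

# get_settings() is the project's own settings loader (its import is omitted in the post)


@lru_cache
def get_client() -> QdrantClient:
    s = get_settings()
    if s.qdrant_url:
        # remote (prod): talk to a Qdrant service over the network
        return QdrantClient(url=s.qdrant_url, api_key=s.qdrant_api_key)
    # embedded (local): persist to a plain directory, no server needed
    return QdrantClient(path=s.qdrant_path)
```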

No `QDRANT_URL`? You get an embedded client that persists to `s.qdrant_path`, a plain directory. Set `QDRANT_URL` in prod and the *same code* talks to a real Qdrant service. The trade-off of embedded mode: it **locks the directory to a single process**, which becomes gotcha #2.
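The settings object behind `get_settings()` is not shown in the post, so the following is only a sketch of how the remote-versus-embedded decision could be driven by the environment. `QDRANT_URL` comes from the text above; the `QDRANT_API_KEY` and `QDRANT_PATH` variable names, the `./qdrant_data` default, and the example URL are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    qdrant_url: Optional[str]
    qdrant_api_key: Optional[str]
    qdrant_path: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the three values; an empty string is treated the same as unset."""
    return Settings(
        qdrant_url=env.get("QDRANT_URL") or None,
        qdrant_api_key=env.get("QDRANT_API_KEY") or None,   # hypothetical variable name
        qdrant_path=env.get("QDRANT_PATH", "./qdrant_data"),  # hypothetical name and default
    )


def client_kwargs(s: Settings) -> dict:
    """Keyword arguments one would pass to the client constructor."""
    if s.qdrant_url:
        return {"url": s.qdrant_url, "api_key": s.qdrant_api_key}  # remote service
    return {"path": s.qdrant_path}                                  # embedded directory


local = client_kwargs(load_settings({}))
prod = client_kwargs(load_settings({"QDRANT_URL": "http://qdrant.internal:6333"}))
print(local)
print(prod)
```

Passing the environment in as a mapping keeps the decision testable without touching the real process environment.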

The ingest script is the whole pipeline in ~30 lines: load files, split them, probe the embedding dimension, create the collection, upsert.

```
# scripts/ingest.py (trimmed)
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_documents(documents)

# probe the embedding dimension so the collection matches the provider
dim = len(get_embeddings().embed_query("probe"))
ensure_collection(dim)
get_vector_store().add_documents(chunks)
```

The `embed_query("probe")` trick is worth pausing on: instead of hard-coding `1024` for bge-m3 (or `1536` for OpenAI), it asks the active embedder for one vector and measures it. Swap the provider and the collection is created with the right size automatically.
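The project's real `ensure_collection` is not shown, so here is a rough stdlib-only sketch of the idea: measure one vector, then create the collection with that size, or refuse to reuse an existing collection of a different size. The in-memory `_collections` dict, `probe_dim`, and `ensure_collection_sketch` are hypothetical stand-ins, not Qdrant or project APIs.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory stand-in for the vector store's collection metadata.
_collections: Dict[str, int] = {}


def probe_dim(embed_query: Callable[[str], List[float]]) -> int:
    """Embed one throwaway string and measure the vector length."""
    return len(embed_query("probe"))


def ensure_collection_sketch(name: str, dim: int) -> str:
    """Create the collection if missing; reject a size mismatch with an existing one."""
    existing = _collections.get(name)
    if existing is None:
        _collections[name] = dim
        return "created"
    if existing != dim:
        raise ValueError(
            f"collection {name!r} holds {existing}-dim vectors but the active "
            f"embedder produces {dim}-dim vectors; re-ingest after switching providers"
        )
    return "exists"


def embed_1024(text: str) -> List[float]:
    return [0.0] * 1024


def embed_1536(text: str) -> List[float]:
    return [0.0] * 1536


first = ensure_collection_sketch("docs", probe_dim(embed_1024))
second = ensure_collection_sketch("docs", probe_dim(embed_1024))
try:
    ensure_collection_sketch("docs", probe_dim(embed_1536))
    mismatch = None
except ValueError as exc:
    mismatch = str(exc)
print(first, second, mismatch)
```

Failing loudly on a mismatch is a design choice in this sketch; the post's script handles the same situation by re-ingesting with `--reset`.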

Running it for real:

``` bash
$ python scripts/ingest.py --reset
[ingest] source=docs  collection=docs  embed=ollama:bge-m3
[ingest] 5 documents → 53 chunks
[ingest] embedding dim = 1024
[ingest] done — 53 points in collection
```

Five markdown files, 53 chunks, 1024-dim vectors from bge-m3, written to the local Qdrant directory. No network calls left the machine.
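For intuition about how `chunk_size=1000` and `chunk_overlap=150` turn a file into chunks, here is a deliberately simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break on separators such as paragraph and line boundaries), so its chunk counts will differ from the real run; `split_fixed` is a hypothetical helper.

```python
def split_fixed(text: str, chunk_size: int = 1000, chunk_overlap: int = 150) -> list[str]:
    """Slide a window of chunk_size characters, stepping by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


text = "".join(chr(97 + i % 26) for i in range(2500))  # 2500 characters of filler
chunks = split_fixed(text)
print([len(c) for c in chunks])
```

Each chunk repeats the last 150 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.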

You can hit the FastAPI endpoint, but to *see the graph think* you can also invoke it directly. Here's a real run, asking about something that lives in the docs:

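```
from langchain_core.messages import HumanMessage

# `graph` is the compiled agent; run this inside an async function or an async-aware REPL
res = await graph.ainvoke({"messages": [HumanMessage(content=
    "How is short-term vs long-term memory implemented in this project?")]})

print([type(m).__name__ for m in res["messages"]])
# ['HumanMessage', 'AIMessage', 'ToolMessage', 'AIMessage']
```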

That message sequence *is* the ReAct loop, visible in the state:

1. `HumanMessage`: the question
2. `AIMessage` with `tool_calls=[search_docs(...)]`: the model decides to retrieve
3. `ToolMessage`: the retrieved chunks come back
4. `AIMessage`: the final synthesized answer

And the answer itself, generated entirely by a 9B model on the laptop:

```
Short-term memory: PostgreSQL (PostgresSaver) stores per-thread
  conversation state; swappable to Redis (RedisSaver) if needed.
Long-term memory: Zep manages the user's persistent knowledge,
  recalled by the app on later turns.

Sources: <doc-a>.md, <doc-b>.md
```

Grounded in the actual docs, with source attribution, zero API keys. That's the win. Now the part the tutorials skip.

On one run, the *exact same question* produced this:

```
[1] AIMessage   content=''   tool_calls=[search_docs(...)]   finish_reason='tool_calls'
[2] ToolMessage content='[1] (source: ...) ## memory layers ...'   ← retrieval worked
[3] AIMessage   content=''   tool_calls=[]   finish_reason='stop'  ← empty answer
```

Retrieval succeeded. The chunks were right there in step 2. But step 3, the model's job to *read the chunks and answer*, came back **empty**: `finish_reason='stop'`, no tokens, no error. Re-running the same question gave a perfectly good 280-character answer with citations. So it's **intermittent**: a small local model occasionally produces an empty turn after a tool call.

Two things to take away:

- … `saw_token` fallback from `ainvoke` when no tokens stream, but here `ainvoke` …

Embedded mode locks the storage directory to a single process. Run the ingest script while the server is up and you'll get a lock error. Order matters: **ingest first → let it exit → then start the server.** The ingest script even closes the client explicitly to avoid a noisy shutdown traceback.
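To see why the order matters, here is a toy model of a single-owner directory lock. It illustrates the general idea only and is not how Qdrant implements its lock: `DirLock` is a hypothetical class that uses an exclusive lock file, and unlike an OS-level lock it would leave a stale file behind if the process crashed.

```python
import os
import tempfile


class DirLock:
    """Toy single-owner guard: an exclusive lock file inside the data directory."""

    def __init__(self, directory: str) -> None:
        self.path = os.path.join(directory, ".lock")
        self._fd = None

    def acquire(self) -> None:
        try:
            # O_EXCL makes creation fail if the lock file already exists
            self._fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            raise RuntimeError(f"{self.path} is held by another owner") from None

    def release(self) -> None:
        if self._fd is not None:
            os.close(self._fd)
            os.remove(self.path)
            self._fd = None


with tempfile.TemporaryDirectory() as data_dir:
    ingest = DirLock(data_dir)
    server = DirLock(data_dir)

    ingest.acquire()                  # the ingest script opens the store
    try:
        server.acquire()              # starting the server now fails
        server_started_early = True
    except RuntimeError:
        server_started_early = False

    ingest.release()                  # "let it exit"
    server.acquire()                  # now the server can take the directory
    server_started_late = server._fd is not None
    server.release()

print(server_started_early, server_started_late)
```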

bge-m3 is 1024-dim; OpenAI's `text-embedding-3-small` is 1536. If you ingest with one provider and query with another, the dimensions don't line up and search breaks. Switching `embedding_provider` means **re-ingesting** (`--reset`). The `embed_query("probe")` dimension check is exactly what keeps the collection honest per provider.

Ollama loads the model into memory on first use. The first request eats that cost; subsequent ones are fast. Don't benchmark the cold start.

You can build, debug, and demo the *entire* RAG agent (graph, retrieval, citations) on a plane with no wifi. Then, for production, you flip two config values (`CHAT_PROVIDER`, `QDRANT_URL`) and the same code talks to a hosted model and a real Qdrant cluster. Part 1 *claimed* the provider boundary; Part 3 *ran on both sides of it*.

The flip side is honesty about local models: retrieval is rock-solid, but a 9B model's synthesis step is the weak link, and it'll occasionally hand you an empty answer. Know that going in.

Next: persisting conversation threads with a checkpointer — so the agent remembers across requests — and what that adds to the message log you just saw.

*Part 3 of a series on running LangGraph in production. Part 1 · Part 2.*