Source: dev.to

Running a Whole RAG Agent Offline: LangGraph + Ollama + Embedded Qdrant (Zero API Keys)

A developer demonstrates running a complete RAG agent offline using LangGraph, Ollama, and an embedded Qdrant instance, requiring zero API keys or Docker. The system uses a provider-swap design that allows switching between local Ollama and remote OpenAI via configuration, with a probe trick to automatically detect embedding dimensions. The implementation successfully ingests documents and performs retrieval-augmented generation entirely on a laptop.

5 min read · Published Jun 29, 2026

Most RAG tutorials open with "set your OPENAI_API_KEY." This one doesn't need it. In Part 1 I claimed the LLM and embeddings are behind a swappable boundary: "switch providers via config, not code." Part 3 is me cashing that claim: running the entire RAG agent (ingestion, retrieval, the ReAct loop, source citations) on a laptop with zero API keys and no Docker, just Ollama and an embedded Qdrant.

Everything below is real output from an actual run. Including the one thing that broke.

Three pieces, all local:

  ollama pull qwen3.5:9b   # chat / reasoning
  ollama pull bge-m3       # embeddings (1024-dim, multilingual)
  CHAT_PROVIDER=ollama

That's it. No OPENAI_API_KEY, no docker compose up. The reason this is a flip and not a rewrite is the provider-swap design from Part 1; let's look at the three factories that make it work.

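from functools import lru_cache

from langchain_core.embeddings import Embeddings

# get_settings() is the project's own settings accessor (its import is omitted here).


@lru_cache
def get_embeddings() -> Embeddings:
    """Build the configured embedder once and reuse it."""
    s = get_settings()
    provider = s.embedding_provider.lower()

    if provider == "ollama":
        # Imported lazily so only the selected provider's package is loaded.
        from langchain_ollama import OllamaEmbeddings
        return OllamaEmbeddings(model=s.embedding_model, base_url=s.ollama_url)

    if provider == "openai":
        from langchain_openai import OpenAIEmbeddings
        # OpenAI-compatible endpoint exposed by the gateway at s.litellm_url.
        return OpenAIEmbeddings(base_url=f"{s.litellm_url}/v1",
                                api_key=s.litellm_key, model=s.embedding_model)

    raise ValueError(f"unknown embedding_provider: {s.embedding_provider!r}")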

Both branches return the same LangChain Embeddings interface, so the ingestion and retrieval code never knows which one it got. Local dev → Ollama (offline). Production → OpenAI via the gateway. One caveat that matters later: the two providers produce different vector dimensions, so you can't mix vectors ingested with one and queried with the other. More on that in the gotchas.
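To make the "never knows which one it got" point concrete, here is a minimal, dependency-free sketch of the same factory pattern. FakeLocalEmbedder, FakeRemoteEmbedder, make_embedder, and vector_size are hypothetical names for illustration, not the project's code; the vector sizes simply mirror bge-m3 (1024) and text-embedding-3-small (1536).

```python
from functools import lru_cache
from typing import Protocol


class Embedder(Protocol):
    """The only surface that ingestion and retrieval depend on."""

    def embed_query(self, text: str) -> list[float]: ...


class FakeLocalEmbedder:
    """Hypothetical stand-in for a local model (1024-dim, like bge-m3)."""

    def embed_query(self, text: str) -> list[float]:
        return [0.0] * 1024


class FakeRemoteEmbedder:
    """Hypothetical stand-in for a hosted model (1536-dim)."""

    def embed_query(self, text: str) -> list[float]:
        return [0.0] * 1536


@lru_cache
def make_embedder(provider: str) -> Embedder:
    """Pick an implementation from a config string; cached per argument."""
    provider = provider.lower()
    if provider == "ollama":
        return FakeLocalEmbedder()
    if provider == "openai":
        return FakeRemoteEmbedder()
    raise ValueError(f"unknown embedding_provider: {provider!r}")


def vector_size(embedder: Embedder) -> int:
    """Caller code: works with any Embedder and never branches on provider."""
    return len(embedder.embed_query("probe"))
```

The caller only ever touches embed_query, so swapping the config string changes the object behind the interface without changing the code that uses it.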

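from functools import lru_cache
from qdrant_client import QdrantClient

@lru_cache
def get_client() -> QdrantClient:
    """Return a remote client when a URL is configured, else an embedded one."""
    s = get_settings()
    if s.qdrant_url:
        return QdrantClient(url=s.qdrant_url, api_key=s.qdrant_api_key)  # remote (prod)
    return QdrantClient(path=s.qdrant_path)                             # embedded (local)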

No QDRANT_URL? You get an embedded client that persists to s.qdrant_path, a plain directory. Set QDRANT_URL in prod and the same code talks to a real Qdrant service. The trade-off of embedded mode: it locks the directory to a single process, which becomes gotcha #2.
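The remote-versus-embedded decision can be isolated as a small pure function, which makes it easy to test without touching Qdrant at all. This is a sketch under assumptions: the function name qdrant_client_kwargs, the QDRANT_API_KEY and QDRANT_PATH variable names, and the "./qdrant_data" default are all hypothetical; only url, api_key, and path are real QdrantClient keyword arguments.

```python
from typing import Mapping


def qdrant_client_kwargs(env: Mapping[str, str]) -> dict[str, str]:
    """Choose remote or embedded constructor arguments from env-style config.

    QDRANT_API_KEY, QDRANT_PATH, and the default path are assumptions
    made for this sketch, not names confirmed by the project.
    """
    url = env.get("QDRANT_URL", "").strip()
    if url:
        kwargs = {"url": url}  # remote service (prod)
        api_key = env.get("QDRANT_API_KEY", "")
        if api_key:
            kwargs["api_key"] = api_key
        return kwargs
    # No URL configured: fall back to an on-disk, single-process store.
    return {"path": env.get("QDRANT_PATH", "./qdrant_data")}
```

The result could then be passed along as QdrantClient(**qdrant_client_kwargs(os.environ)).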

The ingest script is the whole pipeline in ~30 lines: load files, split them, probe the embedding dimension, create the collection, upsert.

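from langchain_text_splitters import RecursiveCharacterTextSplitter

# documents, ensure_collection, and get_vector_store are defined elsewhere in the project.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_documents(documents)

# Measure the active embedder's vector size instead of hard-coding it.
dim = len(get_embeddings().embed_query("probe"))
ensure_collection(dim)
get_vector_store().add_documents(chunks)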

The embed_query("probe") trick is worth pausing on: instead of hard-coding 1024 for bge-m3 (or 1536 for OpenAI), it asks the active embedder for one vector and measures it. Swap the provider and the collection is created with the right size automatically.
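The project's ensure_collection isn't shown here, so the following is only a sketch of the idea: probe once, create the collection with that size, and refuse to reuse a collection whose size disagrees. A plain dict stands in for the vector store's collection registry, and ensure_collection_size and FixedSizeEmbedder are hypothetical names.

```python
from typing import Protocol


class Embedder(Protocol):
    def embed_query(self, text: str) -> list[float]: ...


def probe_dimension(embedder: Embedder) -> int:
    """Embed one throwaway string and measure the resulting vector."""
    return len(embedder.embed_query("probe"))


def ensure_collection_size(collections: dict[str, int], name: str, dim: int,
                           reset: bool = False) -> int:
    """Create `name` with size `dim`, or verify an existing one matches.

    `collections` maps collection name to vector size; it is a stand-in
    for a real vector store (an assumption of this sketch).
    """
    if reset:
        collections.pop(name, None)  # drop the old collection first
    existing = collections.setdefault(name, dim)
    if existing != dim:
        raise ValueError(
            f"collection {name!r} holds {existing}-dim vectors but the active "
            f"embedder produces {dim}-dim vectors; re-ingest with reset=True"
        )
    return existing


class FixedSizeEmbedder:
    """Hypothetical embedder that returns zero vectors of a fixed size."""

    def __init__(self, size: int) -> None:
        self.size = size

    def embed_query(self, text: str) -> list[float]:
        return [0.0] * self.size
```

Failing loudly on a size mismatch, rather than silently recreating the collection, keeps an accidental provider switch from wiping ingested data.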

Running it for real:

$ python scripts/ingest.py --reset
[ingest] source=docs  collection=docs  embed=ollama:bge-m3
[ingest] 5 documents → 53 chunks
[ingest] embedding dim = 1024
[ingest] done — 53 points in collection

Five markdown files, 53 chunks, 1024-dim vectors from bge-m3, written to the local Qdrant directory. No network calls left the machine.
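To make chunk_size=1000 and chunk_overlap=150 concrete, here is a deliberately simplified fixed-window chunker. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as paragraph breaks, newlines, and spaces); sliding_chunks is a hypothetical helper that only illustrates how consecutive chunks share their boundary text.

```python
def sliding_chunks(text: str, chunk_size: int = 1000,
                   chunk_overlap: int = 150) -> list[str]:
    """Cut text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With these defaults each window advances 850 characters, so the last 150 characters of one chunk reappear at the start of the next. That overlap is what lets a sentence straddling a boundary still be retrieved whole from at least one chunk.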

You can hit the FastAPI endpoint, but to see the graph think you can also invoke it directly. Here's a real run, asking about something that lives in the docs:

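from langchain_core.messages import HumanMessage

# Run inside an async function (or a REPL with top-level await); graph is the compiled LangGraph graph.
res = await graph.ainvoke({"messages": [HumanMessage(content=
    "How is short-term vs long-term memory implemented in this project?")]})

print([type(m).__name__ for m in res["messages"]])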

That message sequence is the ReAct loop, visible in the state:

HumanMessage: the question
AIMessage with tool_calls=[search_docs(...)]: the model decides to retrieve
ToolMessage: the retrieved chunks come back
AIMessage: the final synthesized answer

And the answer itself, generated entirely by a 9B model on the laptop:

Short-term memory: PostgreSQL (PostgresSaver) stores per-thread
  conversation state; swappable to Redis (RedisSaver) if needed.
Long-term memory: Zep manages the user's persistent knowledge,
  recalled by the app on later turns.

Sources: <doc-a>.md, <doc-b>.md

Grounded in the actual docs, with source attribution, zero API keys. That's the win. Now the part the tutorials skip.

On one run, the exact same question produced this:

[1] AIMessage   content=''   tool_calls=[search_docs(...)]   finish_reason='tool_calls'
[2] ToolMessage content='[1] (source: ...) ## memory layers ...'   ← retrieval worked
[3] AIMessage   content=''   tool_calls=[]   finish_reason='stop'  ← empty answer

Retrieval succeeded. The chunks were right there in step 2. But step 3 (the model's job: read the chunks and answer) came back empty: finish_reason='stop', no tokens, no error. Re-running the same question gave a perfectly good 280-character answer with citations. So it's intermittent: a small local model occasionally produces an empty turn after a tool call.

Two things to take away:

[...] saw_token fallback from ainvoke when no tokens stream, but here ainvoke [...]
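One possible mitigation for the empty turn described above is to detect it and simply re-run the graph from the same input. This is a sketch, not necessarily what the project does: is_empty_answer and ainvoke_with_retry are hypothetical helpers, and they assume only that each message exposes content and tool_calls attributes, as LangChain's AIMessage does.

```python
import asyncio
from typing import Any, Awaitable, Callable


def is_empty_answer(result: dict[str, Any]) -> bool:
    """True when the last message has no content and requests no tool call."""
    last = result["messages"][-1]
    content = getattr(last, "content", "")
    tool_calls = getattr(last, "tool_calls", None)
    # Content may be a string or a list of content blocks; treat blank as empty.
    has_text = bool(content.strip()) if isinstance(content, str) else bool(content)
    return not has_text and not tool_calls


async def ainvoke_with_retry(
    ainvoke: Callable[[dict[str, Any]], Awaitable[dict[str, Any]]],
    state: dict[str, Any],
    max_attempts: int = 3,
) -> dict[str, Any]:
    """Re-run from the same input until the final turn has content."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    result: dict[str, Any] = {}
    for _ in range(max_attempts):
        result = await ainvoke(state)
        if not is_empty_answer(result):
            break
    return result  # may still be empty if every attempt came back blank
```

Usage would look like res = await ainvoke_with_retry(graph.ainvoke, {"messages": [...]}). Each retry repeats retrieval as well as synthesis, so it trades latency for reliability; the caller should still handle the case where every attempt is empty.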

Embedded mode keeps the store in one process. Run the ingest script while the server is up and you'll get a lock error. Order matters: ingest first → let it exit → then start the server. The ingest script even closes the client explicitly to avoid a noisy shutdown traceback.
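The "close the client explicitly" part can be sketched with contextlib.closing, which calls close() even when the work raises. run_then_release is a hypothetical helper; it assumes only that the client has a close() method, which QdrantClient does.

```python
from contextlib import closing
from typing import Callable, Protocol, TypeVar

T = TypeVar("T")


class Closable(Protocol):
    def close(self) -> None: ...


def run_then_release(open_client: Callable[[], Closable],
                     work: Callable[[Closable], T]) -> T:
    """Open the embedded store, do the work, and always release it."""
    with closing(open_client()) as client:
        return work(client)  # close() runs on success and on error
```

Note that if open_client is a cached factory like get_client, the closed instance stays in the cache, so this pattern fits a one-shot script that exits afterward rather than a long-running server.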

bge-m3 is 1024-dim; OpenAI's text-embedding-3-small is 1536. If you ingest with one provider and query with another, the query vector's size no longer matches the collection's, and search breaks. Switching embedding_provider means re-ingesting (--reset). The embed_query("probe") dimension check is exactly what keeps the collection honest per provider.
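Why mismatched sizes can't work is easiest to see in the similarity computation itself. The sketch below uses cosine similarity as a common example metric; the article doesn't say which distance the collection is configured with, so treat that choice as an assumption.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity; only defined for vectors of equal length."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm
```

A 1536-dim query has no meaningful dot product with a 1024-dim stored vector, and even two models with the same size would place texts in unrelated spaces. Either way, changing the embedder means re-embedding the corpus.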

Ollama loads the model into memory on first use. The first request eats that cost; subsequent ones are fast. Don't benchmark the cold start.
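A small timing helper makes "don't benchmark the cold start" mechanical: run a few warm-up calls first and measure only what comes after. timed_runs is a hypothetical helper, and the callable you pass in (for example, a lambda wrapping one embedding or chat request) is up to you.

```python
import time
from typing import Callable


def timed_runs(call: Callable[[], object], warmup: int = 1,
               runs: int = 5) -> list[float]:
    """Return per-call latencies in seconds, excluding `warmup` initial calls."""
    for _ in range(warmup):
        call()  # pays the one-time model load; its timing is discarded
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        latencies.append(time.perf_counter() - start)
    return latencies
```

Summarizing the returned list with statistics.median rather than the mean keeps one slow outlier from skewing the result.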

You can build, debug, and demo the entire RAG agent (graph, retrieval, citations) on a plane with no wifi. Then, for production, you flip two config values (CHAT_PROVIDER, QDRANT_URL) and the same code talks to a hosted model and a real Qdrant cluster. Part 1 claimed the provider boundary; Part 3 ran on both sides of it.

The flip side is honesty about local models: retrieval is rock-solid, but a 9B model's synthesis step is the weak link, and it'll occasionally hand you an empty answer. Know that going in.

Next: persisting conversation threads with a checkpointer, so the agent remembers across requests, and what that adds to the message log you just saw.

Part 3 of a series on running LangGraph in production. Part 1 · Part 2.
