From Manual RAG to Real Retrieval — Embedding-Based RAG with NVIDIA NIM

The article describes how to upgrade a simple AI campus assistant from using a hardcoded knowledge base in the prompt to a proper Retrieval-Augmented Generation (RAG) system using NVIDIA's hosted embedding model. The author explains that instead of pasting all information into every query, the system stores text chunks as vectors and retrieves only the most relevant ones at query time using NumPy and Python lists. The process involves embedding the user's question, comparing it to stored document embeddings, selecting the top matches, and sending only those to the LLM for a response.

In Part 1 (https://dev.to/torkian/build-your-first-ai-app-with-nvidia-nim-in-30-minutes-1i43), we built a USC campus assistant by pasting a five-line knowledge base directly into the prompt. That works when "the data" fits in your head. It stops being cute the moment the campus handbook, club docs, and workshop notes all want a seat at the same prompt window.

The fix is retrieval: store the chunks once, and at query time pull only the few that look relevant. That's what RAG (Retrieval-Augmented Generation) actually means once you strip away the marketing.

This post takes the assistant from Part 1 and bolts on a real retriever, using NVIDIA's hosted embedding model. No vector database, no LangChain, no abstraction layer. A Python list and NumPy are enough to understand what's actually happening. Once you've seen the moving parts, swapping in pgvector or Pinecone later is a fifteen-minute job.

I'm B Torkian, NVIDIA Developer Champion at USC. Same workshop series, same campus, one more capability added.

What you're adding

User question → embed query → compare to stored chunks → pick top-k → send only those to the LLM → answer

The model call itself barely changes. The work is in steps 2–4: turn text into vectors, compare vectors, return the closest chunks.

Why the manual approach from Part 1 breaks

In Part 1, the entire knowledge base sat inside the prompt:

```python
campus_info = """
The USC AI Club meets every Thursday at 5 PM...
The USC GPU computing lab is open Monday to Friday...
...
"""
```

Five lines is fine. But every model has a context window, and every token costs money and latency. You don't want to paste the entire USC student handbook into every question, because most of it is irrelevant to "when does the AI Club meet?"

Retrieval is the answer to "which 3 paragraphs out of 3000 are actually about this question?" You compute that before calling the LLM, then send only the winners.

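
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes about four characters per token, which is a common rule of thumb and not the tokenizer any particular model uses. The `estimate_tokens` helper and the sample paragraph are made up for illustration.

```python
# Back-of-the-envelope prompt sizing. CHARS_PER_TOKEN is an assumed rule of
# thumb, not a real tokenizer; use the model's tokenizer for exact counts.
CHARS_PER_TOKEN = 4


def estimate_tokens(text):
    """Very rough token estimate based on character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)


# Hypothetical handbook paragraph, repeated to look like a longer passage.
paragraph = 'The USC GPU computing lab is open Monday to Friday from 10 AM to 6 PM. ' * 5

per_paragraph = estimate_tokens(paragraph)
paste_everything = per_paragraph * 3000  # whole handbook in every prompt
retrieve_top_k = per_paragraph * 3       # only the three retrieved chunks

print('Tokens per paragraph (estimate):', per_paragraph)
print('Paste everything:', paste_everything)
print('Retrieve top 3:', retrieve_top_k)
print('Reduction factor:', paste_everything // retrieve_top_k)
```

Whatever the exact tokenizer, the ratio is what matters: sending 3 paragraphs instead of 3000 makes the context a thousand times smaller on every call.
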
What an embedding actually is

An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Two texts that mean similar things land near each other in vector space. Two texts that mean different things land far apart.

NVIDIA's nv-embedqa-e5-v5 is an embedding model tuned specifically for question-answer retrieval. It has a quirk worth knowing about up front: it treats queries and passages differently. You tell it which one you're embedding via an `input_type` parameter. Getting this wrong is the most common beginner mistake. The code still runs, but retrieval quality drops noticeably.

- `input_type='passage'` → use for the documents you store
- `input_type='query'` → use for the user's question at search time

That's it. Same model, two modes.

Step 1: Set up the client and `ask` from Part 1

If you're continuing from Part 1, you already have these defined and can skip this cell. If you're starting fresh, paste this in first. Everything later builds on it.

```python
# Notebook magic; in a plain shell, run `pip install openai numpy` instead.
%pip install -q openai numpy

import os, getpass
from openai import OpenAI

# Prompt for the key only if it is not already in the environment.
if not os.getenv('NVIDIA_API_KEY'):
    os.environ['NVIDIA_API_KEY'] = getpass.getpass('Paste your NVIDIA API key (starts with nvapi-): ')

client = OpenAI(
    base_url='https://integrate.api.nvidia.com/v1',
    api_key=os.environ['NVIDIA_API_KEY'],
)

MODEL = 'meta/llama-3.1-8b-instruct'

def ask(system_prompt, user_message):
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user', 'content': user_message},
        ],
        temperature=0.3,
        max_tokens=400,
    )
    return response.choices[0].message.content
```

`client` calls NVIDIA's API Catalog. `ask` is the same chat-completion shape from Part 1. The retriever we're about to build slots in next to these, not instead of them.

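
Because the query/passage mix-up fails silently, one option is to make the mode explicit before any request goes out. The helper below is a sketch, not part of the workshop repo: `embedding_kwargs` and `VALID_INPUT_TYPES` are hypothetical names. It only builds the keyword arguments that `client.embeddings.create` would receive, with `input_type` carried in the OpenAI client's `extra_body` option, so it runs without an API key.

```python
# Hypothetical helper: build the arguments for an embedding request and
# reject anything other than the two modes the model distinguishes.
VALID_INPUT_TYPES = ('passage', 'query')


def embedding_kwargs(texts, input_type, model='nvidia/nv-embedqa-e5-v5'):
    """Return keyword arguments for client.embeddings.create(...)."""
    if input_type not in VALID_INPUT_TYPES:
        raise ValueError(
            f'input_type must be one of {VALID_INPUT_TYPES}, got {input_type!r}'
        )
    return {
        'model': model,
        'input': list(texts),
        # Provider-specific field, passed through the client's extra_body.
        'extra_body': {'input_type': input_type},
    }


# Documents are prepared as passages, the user's question as a query.
store_args = embedding_kwargs(['The USC GPU computing lab is open Monday to Friday.'], 'passage')
search_args = embedding_kwargs(['When is the GPU lab open?'], 'query')

print(store_args['extra_body'])
print(search_args['extra_body'])
```

With a client in scope, the call would then be `client.embeddings.create(**search_args)`. A typo such as `'question'` raises immediately instead of quietly degrading retrieval.
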
Step 2: Build a small knowledge base and embed it as passages

```python
import numpy as np

EMBED_MODEL = 'nvidia/nv-embedqa-e5-v5'

knowledge_base = [
    {'title': 'USC AI Club meeting', 'text': 'The USC AI Club meets every Thursday at 5 PM in the engineering building, room 204.'},
    {'title': 'USC GPU lab hours', 'text': 'The USC GPU computing lab is open Monday to Friday from 10 AM to 6 PM.'},
    {'title': 'NVIDIA Developer Program', 'text': 'USC students can join the NVIDIA Developer Program for free.'},
    {'title': 'Next USC workshop', 'text': 'The next USC AI Club workshop will cover Retrieval Augmented Generation (RAG).'},
    {'title': 'USC AI/ML office hours', 'text': 'Office hours for the USC AI/ML faculty are Tuesdays 2-4 PM.'},
    {'title': 'USC robotics lab', 'text': 'The USC robotics lab requires safety training before students can use the soldering station.'},
    {'title': 'USC tutoring', 'text': 'Peer tutoring for introductory Python at USC is available Wednesdays from 1 PM to 3 PM.'},
]

def embed_texts(texts, input_type='passage'):
    response = client.embeddings.create(
        model=EMBED_MODEL,
        input=texts,
        extra_body={'input_type': input_type},
    )
    return [np.array(item.embedding, dtype=np.float32) for item in response.data]

# Embed every chunk once, as a passage. Store the vector alongside the text.
embeddings = embed_texts([item['text'] for item in knowledge_base], input_type='passage')
for item, embedding in zip(knowledge_base, embeddings):
    item['embedding'] = embedding

print(f'Embedded {len(knowledge_base)} chunks. Vector dim:', embeddings[0].shape[0])
```

Two things to notice:

- The OpenAI Python client doesn't have a native field for NVIDIA's `input_type`, so we pass it through `extra_body`. That's the right way to send provider-specific arguments without forking the client.
- We're storing the embeddings in plain Python dicts. For seven chunks this is fine. For seven thousand, you'd reach for a vector database (and the only thing that changes is where the vectors live; the cosine math is identical).

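
Embedding once only pays off if the vectors survive a notebook restart. One simple option, sketched here as an assumption and not something the workshop repo does, is to save the vectors with NumPy. The random matrix stands in for real embeddings (its width of 8 is arbitrary, not the model's dimension), and `save_embeddings` and `load_embeddings` are hypothetical helper names.

```python
import os
import tempfile

import numpy as np


def save_embeddings(path, embeddings):
    """Stack the per-chunk vectors into one matrix and write it to disk."""
    np.save(path, np.asarray(embeddings, dtype=np.float32))


def load_embeddings(path):
    """Read the matrix back; row i belongs to chunk i of the knowledge base."""
    return np.load(path)


# Stand-in for the real output of the embedding step: 7 chunks, 8 dimensions.
rng = np.random.default_rng(0)
fake_embeddings = rng.random((7, 8), dtype=np.float32)

# A temporary directory keeps the example self-contained.
cache_path = os.path.join(tempfile.mkdtemp(), 'kb_embeddings.npy')
save_embeddings(cache_path, fake_embeddings)
restored = load_embeddings(cache_path)

print('Restored shape:', restored.shape)
print('Identical to what was saved:', np.array_equal(restored, fake_embeddings))
```

Rows are matched to chunks by position, so the knowledge base has to stay in the same order between the run that saves and the run that loads. If a chunk's text changes, its vector needs to be recomputed.
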
Step 3: Retrieve the top-k chunks for a question

```python
def cosine_similarity(a, b):
    denominator = np.linalg.norm(a) * np.linalg.norm(b)
    if denominator == 0:
        return 0.0
    return float(np.dot(a, b) / denominator)

def retrieve_context(question, k=3):
    # The question is embedded in 'query' mode, not 'passage' mode.
    question_embedding = embed_texts([question], input_type='query')[0]
    scored = []
    for item in knowledge_base:
        score = cosine_similarity(question_embedding, item['embedding'])
        scored.append((score, item))
    # Highest similarity first, then keep the top k.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top_items = [item for score, item in scored[:k]]
    return '\n'.join(f"- {item['text']}" for item in top_items)
```

Three things are happening here:

- The question is embedded as a `query`, not a `passage`. This is the part beginners trip over. Same model, different mode.
- Cosine similarity scores how close the question vector is to each stored chunk vector. Numbers near 1.0 mean very similar; numbers near 0 mean unrelated.
- Top-k picks the highest-scoring chunks. Three is a reasonable default for a tiny knowledge base; tune it for yours.

There is no magic in step 3. A vector database would do the same comparison but use indexing tricks to do it fast at scale.

Step 4: Plug retrieval into the same `ask` from Part 1

```python
def ask_with_retrieval(question):
    context = retrieve_context(question)
    system_prompt = f"""You are a USC campus assistant.
Answer ONLY using the context below.
If the answer is not in the context, say "I don't have that information — check with the USC AI Club."

CONTEXT:
{context}
"""
    return ask(system_prompt, question)

for question in [
    'Where does the USC AI Club meet?',
    'When can I get Python tutoring at USC?',
    'What is the wifi password?',
]:
    print(f'Q: {question}')
    print(f'Context:\n{retrieve_context(question)}')
    print(f'A: {ask_with_retrieval(question)}\n')
```

Run it. Three things to read carefully:

- The first question retrieves the AI Club chunk and answers from it. Good.
- The second retrieves the tutoring chunk and answers from it. Notice that "Python tutoring" doesn't appear verbatim in the stored text (the chunk says "introductory Python"), but the embedding model places the two phrases close together. That's the whole point of vector search over keyword search.
- The wifi question retrieves three chunks anyway (top-k always returns k items), but none of them contain a password. The assistant falls back to the refusal line because the "ONLY using the context" rule forces it to. That's the guardrail from Part 1 doing its job, and it's exactly the bridge into Part 3.

Step 5: What you actually did

You replaced the hand-picked `campus_info` string from Part 1 with a real retrieval step. The model call is identical, and the system prompt follows the same guardrail pattern: answer only from the provided context, otherwise fall back. The only structural change is that `{context}` now comes from a function instead of a hardcoded constant.

That swap is the entire mental model behind RAG. Real production systems add chunking strategies, hybrid search, re-ranking, and a vector database, but the spine stays the same: embed once, embed query, compare, pass top-k to the LLM.

In your own work, the seven-line knowledge base becomes hundreds of paragraphs scraped from PDFs, lecture notes, club Slack archives, Notion pages, or a wiki. The retriever code doesn't change. The dict-with-vector storage gets replaced by something like pgvector, Qdrant, or Pinecone the moment you outgrow a Python list.

Get the code

- Repo: github.com/torkian/nvidia-nim-workshop (https://github.com/torkian/nvidia-nim-workshop)
- One-click Colab for Part 2: Open part2_rag.ipynb (https://colab.research.google.com/github/torkian/nvidia-nim-workshop/blob/main/part2_rag.ipynb)
- Local Python: part2_rag.py in the repo (`python3 part2_rag.py` after `pip install -r requirements.txt`)

MIT licensed. I run this at USC. Fork it, swap the knowledge base for your school, your club, your project, and run it wherever you are.

Previously / next in this series

- Part 1: Build Your First AI App with NVIDIA NIM in 30 Minutes (https://dev.to/torkian/build-your-first-ai-app-with-nvidia-nim-in-30-minutes-1i43)
- Part 3 (next): Add Guardrails So It Doesn't Lie, a two-layer approach using prompt scope plus a tiny verifier call. The fallback line that fired on the wifi question above is the foundation we build on.