{"slug": "from-manual-rag-to-real-retrieval-embedding-based-rag-with-nvidia-nim", "title": "From Manual RAG to Real Retrieval — Embedding-Based RAG with NVIDIA NIM", "summary": "The article describes how to upgrade a simple AI campus assistant from using a hardcoded knowledge base in the prompt to a proper Retrieval-Augmented Generation (RAG) system using NVIDIA's hosted embedding model. The author explains that instead of pasting all information into every query, the system stores text chunks as vectors and retrieves only the most relevant ones at query time using NumPy and Python lists. The process involves embedding the user's question, comparing it to stored document embeddings, selecting the top matches, and sending only those to the LLM for a response.", "body_md": "In [Part 1](https://dev.to/torkian/build-your-first-ai-app-with-nvidia-nim-in-30-minutes-1i43), we built a USC campus assistant by pasting a five-line knowledge base directly into the prompt. That works when \"the data\" fits in your head. It stops being cute the moment the campus handbook, club docs, and workshop notes all want a seat at the same prompt window.\n\nThe fix is retrieval — store the chunks once, and at query time pull only the few that look relevant. That's what RAG (Retrieval-Augmented Generation) actually means once you strip away the marketing.\n\nThis post takes the assistant from Part 1 and bolts on a real retriever, using NVIDIA's hosted embedding model. No vector database, no LangChain, no abstraction layer. A Python list and NumPy are enough to understand what's actually happening. Once you've seen the moving parts, swapping in pgvector or Pinecone later is a fifteen-minute job.\n\nI'm B Torkian, NVIDIA Developer Champion at USC. Same workshop series, same campus, one more capability added.\n\n## What you're adding\n\n```\nUser question → embed query → compare to stored chunks → pick top-k → send only those to the LLM → answer\n```\n\nThe model call itself barely changes. 
The work is in steps 2–4: turn text into vectors, compare vectors, return the closest chunks.\n\n## Why the manual approach from Part 1 breaks\n\nIn Part 1, the entire knowledge base sat inside the prompt:\n\n```\ncampus_info = \"\"\"\nThe USC AI Club meets every Thursday at 5 PM...\nThe USC GPU computing lab is open Monday to Friday...\n...\n\"\"\"\n```\n\nFive lines is fine. But every model has a context window, and every token costs money and latency. You don't want to paste the entire USC student handbook into every question — most of it is irrelevant to \"when does the AI Club meet?\"\n\nRetrieval is the answer to \"which 3 paragraphs out of 3000 are actually about this question?\" You compute that *before* calling the LLM, then send only the winners.\n\n## What an embedding actually is\n\nAn embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Two texts that mean similar things land near each other in vector space. Two texts that mean different things land far apart.\n\nNVIDIA's `nv-embedqa-e5-v5` is an embedding model tuned specifically for question-answer retrieval. It has a quirk worth knowing about up front — it treats **queries** and **passages** differently. You tell it which one you're embedding via an `input_type` parameter. Getting this wrong is the most common beginner mistake — it still runs, but retrieval quality drops noticeably.\n\n- `input_type='passage'` → use for the documents you store\n- `input_type='query'` → use for the user's question at search time\n\nThat's it. Same model, two modes.\n\n## Step 1: Set up the client and `ask()` from Part 1\n\nIf you're continuing from Part 1, you already have these defined and can skip this cell. 
If you're starting fresh, paste this in first — everything later builds on it.\n\n``` python\n# Notebook magic (Colab/Jupyter); in a plain script, install with pip from your shell instead.\n%pip install -q openai numpy\n\nimport os, getpass\nfrom openai import OpenAI\n\n# Prompt for the key only if it is not already set in the environment.\nif not os.getenv('NVIDIA_API_KEY'):\n    os.environ['NVIDIA_API_KEY'] = getpass.getpass('Paste your NVIDIA API key (starts with nvapi-): ')\n\n# The NVIDIA endpoint is OpenAI-compatible, so the stock OpenAI client works.\nclient = OpenAI(\n    base_url='https://integrate.api.nvidia.com/v1',\n    api_key=os.environ['NVIDIA_API_KEY'],\n)\n\nMODEL = 'meta/llama-3.1-8b-instruct'\n\ndef ask(system_prompt, user_message):\n    # Send one system message and one user message; return the reply text.\n    response = client.chat.completions.create(\n        model=MODEL,\n        messages=[\n            {'role': 'system', 'content': system_prompt},\n            {'role': 'user',   'content': user_message},\n        ],\n        temperature=0.3,\n        max_tokens=400,\n    )\n    return response.choices[0].message.content\n```\n\n`client` calls NVIDIA's API Catalog. `ask()` is the same chat-completion shape from Part 1. The retriever we're about to build slots in next to these, not instead of them.\n\n## Step 2: Build a small knowledge base and embed it as passages\n\n``` python\nimport numpy as np\n\nEMBED_MODEL = 'nvidia/nv-embedqa-e5-v5'\n\nknowledge_base = [\n    {'title': 'USC AI Club meeting',\n     'text': 'The USC AI Club meets every Thursday at 5 PM in the engineering building, room 204.'},\n    {'title': 'USC GPU lab hours',\n     'text': 'The USC GPU computing lab is open Monday to Friday from 10 AM to 6 PM.'},\n    {'title': 'NVIDIA Developer Program',\n     'text': 'USC students can join the NVIDIA Developer Program for free.'},\n    {'title': 'Next USC workshop',\n     'text': 'The next USC AI Club workshop will cover Retrieval Augmented Generation (RAG).'},\n    {'title': 'USC AI/ML office hours',\n     'text': 'Office hours for the USC AI/ML faculty are Tuesdays 2-4 PM.'},\n    {'title': 'USC robotics lab',\n     'text': 'The USC robotics lab requires safety training before students can use the soldering station.'},\n    {'title': 'USC 
tutoring',\n     'text': 'Peer tutoring for introductory Python at USC is available Wednesdays from 1 PM to 3 PM.'},\n]\n\ndef embed_texts(texts, input_type='passage'):\n    response = client.embeddings.create(\n        model=EMBED_MODEL,\n        input=texts,\n        extra_body={'input_type': input_type},\n    )\n    return [np.array(item.embedding, dtype=np.float32) for item in response.data]\n\n# Embed every chunk once, as a passage. Store the vector alongside the text.\nembeddings = embed_texts([item['text'] for item in knowledge_base], input_type='passage')\nfor item, embedding in zip(knowledge_base, embeddings):\n    item['embedding'] = embedding\n\nprint(f'Embedded {len(knowledge_base)} chunks. Vector dim:', embeddings[0].shape[0])\n```\n\nTwo things to notice:\n\n- The OpenAI Python client doesn't have a native field for NVIDIA's `input_type`, so we pass it through `extra_body`. That's the right way to send provider-specific arguments without forking the client.\n- We're storing the embeddings in plain Python dicts. For seven chunks this is fine. 
For seven thousand, you'd reach for a vector database (and the only thing that changes is *where* the vectors live; the cosine math is identical).\n\n## Step 3: Retrieve the top-k chunks for a question\n\n``` python\ndef cosine_similarity(a, b):\n    denominator = np.linalg.norm(a) * np.linalg.norm(b)\n    # Guard against a zero-length vector so we never divide by zero.\n    if denominator == 0:\n        return 0.0\n    return float(np.dot(a, b) / denominator)\n\ndef retrieve_context(question, k=3):\n    # Embed the question in query mode, not passage mode.\n    question_embedding = embed_texts([question], input_type='query')[0]\n\n    scored = []\n    for item in knowledge_base:\n        score = cosine_similarity(question_embedding, item['embedding'])\n        scored.append((score, item))\n\n    # Highest similarity first, then keep the k best chunks.\n    scored.sort(key=lambda pair: pair[0], reverse=True)\n    top_items = [item for score, item in scored[:k]]\n\n    return '\\n'.join(f\"- {item['text']}\" for item in top_items)\n```\n\nThree things are happening here:\n\n- **The question is embedded as a `query`, not a `passage`.** This is the part beginners trip over. Same model, different mode.\n- **Cosine similarity** scores how close the question vector is to each stored chunk vector. Scores range from -1 to 1: numbers near 1.0 mean very similar; numbers near 0 mean unrelated.\n- **Top-k** picks the highest-scoring chunks. Three is a reasonable default for a tiny knowledge base; tune it for yours.\n\nThere is no magic in Step 3. A vector database would do the same comparison but use indexing tricks, typically approximate nearest-neighbor indexes, to do it fast at scale.\n\n## Step 4: Plug retrieval into the same `ask()` from Part 1\n\n``` python\ndef ask_with_retrieval(question):\n    context = retrieve_context(question)\n\n    system_prompt = f\"\"\"You are a USC campus assistant. Answer ONLY using the\ncontext below. 
If the answer is not in the context, say\n\"I don't have that information — check with the USC AI Club.\"\n\nCONTEXT:\n{context}\n\"\"\"\n\n    return ask(system_prompt, question)\n\nfor question in [\n    'Where does the USC AI Club meet?',\n    'When can I get Python tutoring at USC?',\n    'What is the wifi password?',\n]:\n    print(f'Q: {question}')\n    print(f'Context:\\n{retrieve_context(question)}')\n    print(f'A: {ask_with_retrieval(question)}\\n')\n```\n\nRun it. Three things to read carefully:\n\n- The **first question** retrieves the AI Club chunk and answers from it. Good.\n- The **second** retrieves the tutoring chunk and answers from it. Notice that \"Python tutoring\" doesn't appear verbatim in the stored text — the chunk says \"introductory Python\" — but the embedding model knows those are semantically close. That's the whole point of vector search over keyword search.\n- The **wifi question** retrieves three chunks anyway (top-k always returns *k* items), but none of them contain a password. The assistant should fall back to the refusal line because the `ONLY using the context` rule tells it to. That's the guardrail from Part 1 doing its job — and it's exactly the bridge into Part 3.\n\n## Step 5: What you actually did\n\nYou replaced the hand-picked `campus_info` string from Part 1 with a real retrieval step. The model call is identical, and the system prompt follows the same guardrail pattern — answer only from the provided context, otherwise fall back. The only structural change is that `{context}` now comes from a function instead of a hardcoded constant.\n\nThat swap is the entire mental model behind RAG. 
Real production systems add chunking strategies, hybrid search, re-ranking, and a vector database — but the spine stays the same: embed once, embed query, compare, pass top-k to the LLM.\n\nIn your own work, the seven-entry `knowledge_base` becomes hundreds of paragraphs scraped from PDFs, lecture notes, club Slack archives, Notion pages, or a wiki. The retriever code doesn't change. The dict-with-vector storage gets replaced by something like pgvector, Qdrant, or Pinecone the moment you outgrow a Python list.\n\n## Get the code\n\n**Repo:** [github.com/torkian/nvidia-nim-workshop](https://github.com/torkian/nvidia-nim-workshop)\n\n**One-click Colab for Part 2:** [Open part2_rag.ipynb](https://colab.research.google.com/github/torkian/nvidia-nim-workshop/blob/main/part2_rag.ipynb)\n\n**Local Python:** `part2_rag.py` in the repo (`python3 part2_rag.py` after `pip install -r requirements.txt`).\n\nMIT licensed. I run this at USC — fork it, swap the knowledge base for your school, your club, your project, and run it wherever you are.\n\n## Previously / next in this series\n\n- **Part 1:** [Build Your First AI App with NVIDIA NIM in 30 Minutes](https://dev.to/torkian/build-your-first-ai-app-with-nvidia-nim-in-30-minutes-1i43)\n- **Part 3 (next):** Add Guardrails So It Doesn't Lie — a two-layer approach using prompt scope + a tiny verifier call. 
The fallback line that fired on the wifi question above is the foundation we build on.", "url": "https://wpnews.pro/news/from-manual-rag-to-real-retrieval-embedding-based-rag-with-nvidia-nim", "canonical_source": "https://dev.to/torkian/from-manual-rag-to-real-retrieval-embedding-based-rag-with-nvidia-nim-44fa", "published_at": "2026-05-23 00:33:15+00:00", "updated_at": "2026-05-23 01:03:16.449654+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "developer-tools", "data"], "entities": ["NVIDIA", "B Torkian", "USC", "NVIDIA NIM", "LangChain", "Pinecone", "pgvector", "NumPy"], "alternates": {"html": "https://wpnews.pro/news/from-manual-rag-to-real-retrieval-embedding-based-rag-with-nvidia-nim", "markdown": "https://wpnews.pro/news/from-manual-rag-to-real-retrieval-embedding-based-rag-with-nvidia-nim.md", "text": "https://wpnews.pro/news/from-manual-rag-to-real-retrieval-embedding-based-rag-with-nvidia-nim.txt", "jsonld": "https://wpnews.pro/news/from-manual-rag-to-real-retrieval-embedding-based-rag-with-nvidia-nim.jsonld"}}