{"slug": "rag-pipeline-the-uncle-nephew-complete-learning-guide", "title": "RAG Pipeline: The Uncle-Nephew Complete Learning Guide", "summary": "A developer explains that Retrieval-Augmented Generation (RAG) fixes AI hallucinations by having the model fetch relevant documents before answering, rather than relying solely on memorized training data. The guide breaks down RAG into three steps and highlights that the main challenge is finding the right documents quickly and handling synonyms.", "body_md": "How to Build Systems That Actually Know Your Data (Not Hallucinate About It)\n\n👦 **Nephew:** Uncle, I keep hearing \"RAG this, RAG that\" in tech interviews. When I ask what it means, people throw around words like \"Retrieval-Augmented Generation\" and I just nod like I understand. But honestly? I'm lost.\n\n👨🦳 **Uncle:** (laughing) That's the best honest question I've heard all week. Let me ask you something first. If I gave you a question right now - \"What year did India win the World Cup?\" - how would you answer?\n\n👦 **Nephew:** Well... I'd pull up Google, search for it, read the answer, then tell you.\n\n👨🦳 **Uncle:** Exactly. You don't answer from memory alone. You go *fetch the information* first, then *answer based on what you found*. That's RAG in real life. And that simple idea - fetch first, answer after - fixes almost every problem we face with AI today.\n\n👦 **Nephew:** But uncle, AI can remember things from its training. Why does it need to fetch?\n\n👨🦳 **Uncle:** Ah! That's where we land in trouble. Come, sit...\n\n👨🦳 **Uncle:** Imagine you're hiring for a tech company. You receive 500 resumes for a Senior React Developer role. Now tell me - how would you actually process them?\n\n👦 **Nephew:** I'd... probably make a spreadsheet? List all the candidates with key skills?\n\n👨🦳 **Uncle:** Right. But here's the catch - you can't read all 500 resumes deeply. 
So what do you really do?\n\n👦 **Nephew:** Skim for keywords like \"React\", \"JavaScript\", \"5 years\"?\n\n👨🦳 **Uncle:** Exactly. You skim and hope you don't miss anyone good. Now, here's the problem: what if a candidate wrote \"React.js\" instead of \"React\"? Your eyes might still catch it. But a dumb computer doing exact string matching? It says \"no match\".\n\nWhat if someone wrote \"Built real-time user interfaces with JSX, hooks, and Redux\"? The candidate clearly knows React, but the word \"React\" appears nowhere in that sentence. The computer misses them.\n\nThis is exactly what happens with AI. When you ask an AI a question, it tries to answer purely from what it memorized during training. And it makes mistakes - sometimes big ones.\n\n👦 **Nephew:** So RAG stops these mistakes?\n\n👨🦳 **Uncle:** Precisely. RAG is simple: instead of asking the AI to guess, you hand it the document first. You say, \"Here's the resume, here's the job description - now answer my question.\"\n\nThe AI stops guessing. It reads the evidence. It answers correctly.\n\n👦 **Nephew:** Okay, so RAG = give the AI documents, then ask questions?\n\n👨🦳 **Uncle:** Yes, but we need to be precise about *how* we give it documents. RAG has three steps:\n\n1. **Retrieve** - find the documents relevant to the question\n2. **Augment** - add those documents to the prompt\n3. **Generate** - let the AI answer from that augmented prompt\n\nThat's it. R-A-G.\n\n👦 **Nephew:** But if it's that simple, why is everyone talking about it like it's complicated?\n\n👨🦳 **Uncle:** Because \"finding the right documents\" is the hard part. You have 500 resumes. When someone asks \"Does John have Docker experience?\", you can't search all 500 linearly. That's slow.\n\nAnd you can't use simple keyword search either - what if John wrote \"containerization\" instead of \"Docker\"? What if he wrote \"I work with Kubernetes\" - which strongly suggests he knows containers, and probably Docker, too?\n\nSo the real question becomes: How do you find the *right* documents fast, and how do you understand when two different words mean the same thing?\n\nThis is where everything else - embeddings, vector databases, chunking - comes in. 
They're all in service of solving that one problem.\n\n👦 **Nephew:** Okay, I think I get the big picture. But uncle, why can't we just make the AI smarter?\n\n👨🦳 **Uncle:** Two reasons. First - you train an AI once. After that, it doesn't learn new information. If your company has 100 internal policies created last month, the AI knows nothing about them. It can't learn them instantly.\n\nSecond - even if you could retrain it, hallucinations would still happen. AIs are pattern-matching machines. They're brilliant at patterns. But they sometimes see patterns that aren't there. RAG forces the AI to cite its sources, to point at evidence.\n\nIt's the difference between:\n\nThe second one is what you want. That's RAG.\n\n👦 **Nephew:** Uncle, I have a question. How does a computer know that \"React\" and \"React.js\" are the same thing?\n\n👨🦳 **Uncle:** That's the right question. Let me explain with a story.\n\nI go to a fruit market. I see apples, mangoes, oranges. How do I know which is which? I look at them - color, shape, smell. My brain recognizes patterns and says \"that's a mango\".\n\nNow, how does a computer do that? It doesn't have eyes. All it has are numbers.\n\n👦 **Nephew:** Numbers?\n\n👨🦳 **Uncle:** Yes. Here's the trick: every word - \"React\", \"React.js\", \"JavaScript\", \"Python\" - is just a number to a computer. Or actually, a list of numbers. We call this list an \"embedding\".\n\n👦 **Nephew:** Like a list of... what kind of numbers?\n\n👨🦳 **Uncle:** Imagine I'm describing a person to you. I might say:\n\nThat's 4 numbers describing 1 person. Now imagine I use 1536 numbers instead. Each number describes a different quality - not just physical things, but hidden things like \"how technical is this word\", \"is this related to web development\", \"how often is this used\", and so on.\n\nSo \"React\" becomes a point in a 1536-dimensional space. \"React.js\" becomes another point in that same space. 
And because they mean the same thing, those two points are *very close to each other*.\n\nBut \"React\" and \"Python\"? Those points are far apart.\n\n👦 **Nephew:** Okay, so closeness = similarity?\n\n👨🦳 **Uncle:** Exactly. And here's the beautiful part: you don't have to manually create these embeddings. An AI model does it for you. You feed it the word \"React\" and it says, \"This word should be represented as [0.12, -0.45, 0.78, ..., 1536 numbers]\".\n\nDifferent models might represent it differently, but the same model always represents similar concepts closely. That's what matters.\n\n👦 **Nephew:** So embeddings let computers understand meaning?\n\n👨🦳 **Uncle:** They let computers represent meaning *as a position in space*. And when you represent things as positions, you can do math with them.\n\nFor example, if I tell you: \"Engineer\" - \"Java\" + \"React\" = ?\n\n👦 **Nephew:** That's... weird math?\n\n👨🦳 **Uncle:** Right? But with embeddings, you can actually do this. And you know what the answer is?\n\nThe vector closest to that result is usually \"web developer\" or something similar. The math captures meaning.\n\nNow here's why this matters for RAG:\n\nWhen someone asks \"Does John have Docker experience?\", we convert that question to an embedding - a point in 1536D space. Then we search for resume chunks that are close to that point. The chunks mentioning Docker - whether they say \"Docker\", \"containerization\", \"container orchestration\", or \"I deploy with Docker\" - are all close to the question's embedding.\n\nSo we find them all. And the AI reads them. And the AI answers correctly.\n\nWithout embeddings, simple keyword search would miss half of them.\n\n👦 **Nephew:** Okay, so we have embeddings. But uncle, if I have a company with 10,000 resumes, and each resume has 50 chunks, that's 500,000 embeddings. Each embedding is 1536 numbers. How do I store and search that?\n\n👨🦳 **Uncle:** This is where most people choose poorly. 
They think: \"I'll use Pinecone! I'll use Milvus! I'll use Weaviate!\"\n\nThey add three systems. One for storage, one for search, one for cache. Now they have 3 things that could break. 3 things to debug. 3 bills to pay.\n\nThere's a better way.\n\n👦 **Nephew:** What?\n\n👨🦳 **Uncle:** PostgreSQL.\n\n👦 **Nephew:** The database? Just... PostgreSQL?\n\n👨🦳 **Uncle:** With one addition: the pgvector extension. PostgreSQL is already reliable, it's already running your data, it's already scaling. We just add vector support.\n\nThink about it this way: 500,000 vectors stored in PostgreSQL as data. When you have a question, you convert it to an embedding, and you say:\n\n``` sql\n-- question_embedding is a placeholder for the query vector you pass in\nSELECT * FROM resume_chunks \nORDER BY embedding <=> question_embedding \nLIMIT 5\n```\n\nThat `<=>` operator is pgvector's cosine distance. Together with `ORDER BY` and `LIMIT 5`, it means \"find the 5 vectors closest to my question vector\". With the right index, PostgreSQL finds them in milliseconds. You're done.\n\nNo new infrastructure. No vendor lock-in. Just one database doing one job well.\n\n👦 **Nephew:** But doesn't the `<=>` operator need an index to be fast?\n\n👨🦳 **Uncle:** Good catch. Yes. pgvector does not create an index by default - without one, every query scans all rows. You add one yourself, and IVFFlat is a common choice. Think of it like this:\n\nWhen you have 500,000 chunks and need to find the 5 closest to a question, searching all 500,000 linearly is slow. The index divides the 1536D space into regions (like neighborhoods). When you search, it says \"The question's embedding is closest to this neighborhood\" and only searches that neighborhood. Much faster, at the price of occasionally missing a neighbor that sits just across a boundary.\n\nA linear search of 500K vectors can take seconds. With the index it usually takes tens of milliseconds, depending on hardware and settings.\n\nThat's the magic of the right index.\n\n👦 **Nephew:** How do I actually create this?\n\n👨🦳 **Uncle:** Simple. PostgreSQL + pgvector. 
Here's what you do:\n\n``` sql\n-- Create extension\nCREATE EXTENSION IF NOT EXISTS vector;\n\n-- Create table with vector column\nCREATE TABLE resume_chunks (\n  id SERIAL PRIMARY KEY,\n  resume_id UUID,\n  chunk_text TEXT,\n  embedding vector(1536),\n  created_at TIMESTAMP DEFAULT NOW()\n);\n\n-- Create index for fast search\n-- (build it after loading data, so the lists reflect real vectors)\nCREATE INDEX ON resume_chunks USING ivfflat (embedding vector_cosine_ops)\n  WITH (lists = 100);\n```\n\nNow you have a table. You insert chunks with their embeddings. You query with the `<=>` operator. Done.\n\n👦 **Nephew:** That's it?\n\n👨🦳 **Uncle:** That's it. No API calls to an external vector service. No paying per search. Just a single, reliable database.\n\nThis is the beauty of choosing the right tool.\n\n👦 **Nephew:** Uncle, we have embeddings, we have a database. How does actual retrieval happen?\n\n👨🦳 **Uncle:** Let me walk you through a real scenario. A hiring manager asks: \"Does Priya have blockchain experience?\"\n\n**Step 1: Understand the question**\n\nWe need to find chunks about blockchain skills.\n\n**Step 2: Convert question to embedding**\n\n\"blockchain experience\" becomes a vector: [0.34, -0.12, ..., 1536 numbers]\n\n**Step 3: Search the database**\n\nWe query:\n\n``` sql\n-- 1 - cosine distance = cosine similarity\nSELECT chunk_text,\n       1 - (embedding <=> question_embedding) AS similarity\nFROM resume_chunks \nWHERE resume_id = 'priya-uuid'\nORDER BY embedding <=> question_embedding\nLIMIT 5\n```\n\n**Step 4: Get results** (might be):\n\n**Step 5: Send to AI**\n\nWe put these 5 chunks in the prompt:\n\n```\nBased on the following resume excerpts:\n[chunk 1]\n[chunk 2]\n[chunk 3]\n[chunk 4]\n[chunk 5]\n\nQuestion: Does Priya have blockchain experience?\n```\n\n**Step 6: AI answers**\n\n\"Yes, Priya has strong blockchain experience. She has worked with Ethereum smart contracts, understands Bitcoin protocol, and has built on Web3/DeFi platforms using Solidity.\"\n\nWith evidence.\n\n👦 **Nephew:** This seems perfect. What's the problem?\n\n👨🦳 **Uncle:** The problem is subtle. 
Resume has this line:\n\n\"I worked on distributed ledger technology and consensus protocols\"\n\nThis person knows blockchain. But the word \"blockchain\" is not there.\n\nYou search for \"blockchain\". Embeddings are smart, but not magic. \"Distributed ledger technology\" and \"blockchain\" are close but not identical.\n\nSometimes the vector search returns this chunk. Sometimes it doesn't.\n\nBasic retrieval works maybe 70% of the time. For production, you need 95%+.\n\nThat's why we need advanced techniques.\n\n👦 **Nephew:** So what makes it better?\n\n👨🦳 **Uncle:** Multiple techniques working together. But we'll get there. First, let's talk about preparation.\n\n👦 **Nephew:** Uncle, let me ask something. Why do we break resumes into chunks at all? Why not just embed the whole resume?\n\n👨🦳 **Uncle:** Good question. Two reasons:\n\n**Cost** - You pay per token (usually). Embedding the text costs about the same either way, but when only the relevant chunks go into the prompt, the AI reads a few hundred tokens per question instead of the whole 50-page resume.\n\n**Precision** - If you embed the entire resume as one chunk, and you search for \"Docker experience\", the system returns the whole 50-page resume. The AI has to find the Docker mention itself. But if you break it into chunks, the system returns only the chunks about Docker. The AI reads less noise.\n\nBut here's the problem: how small should chunks be?\n\n👦 **Nephew:** Can't I just pick a number? Like 500 tokens per chunk?\n\n👨🦳 **Uncle:** Let's see what happens with different sizes.\n\n**Attempt 1: Too small (100 tokens)**\n\nChunk 1: \"John has\"\n\nChunk 2: \"5 years of\"\n\nChunk 3: \"React experience\"\n\nProblem: When we read Chunk 2 alone, we've lost context. 5 years of *what*? The AI is confused.\n\n**Attempt 2: Right size (1000 tokens)**\n\n\"John has 5 years of React experience, built e-commerce platforms, Redux for state management, and mentored junior developers.\"\n\nPerfect. 
Context is clear.\n\n**Attempt 3: Too large (5000 tokens)**\n\nEntire work history section, with all jobs, all skills, everything mixed together.\n\nProblem: Too much noise. When the AI searches for \"React\", it gets 5000 tokens to read, but only 2 sentences mention React. It's slow and confusing.\n\n👨🦳 **Uncle:** The sweet spot is usually 1000-1500 tokens per chunk. Not less, not more.\n\n👦 **Nephew:** What's overlap? Overlapping chunks?\n\n👨🦳 **Uncle:** Yes. Imagine this:\n\n**Without overlap:**\n\nChunk 1 ends: \"...John used React...\"\n\nChunk 2 starts: \"...for building e-commerce...\"\n\nWhen you read Chunk 2 alone, you don't know John used React for this. Context is broken.\n\n**With 200-token overlap:**\n\nChunk 1: \"...John has React experience. He built e-commerce...\"\n\nChunk 2: \"...He built e-commerce with Redux and WebSockets...\"\n\nNow each chunk contains enough context. You can read them individually and still understand.\n\n👦 **Nephew:** So overlap = context continuity?\n\n👨🦳 **Uncle:** Exactly. It's like reading a book. Each page overlaps the previous one slightly. You always have context.\n\n👦 **Nephew:** Are there different ways to chunk?\n\n👨🦳 **Uncle:** Yes. Here are the main ones:\n\n| Strategy | Chunk Size | Overlap | Best For | Trade-off |\n|---|---|---|---|---|\n| Naive/Fixed | 512 tokens | None | Simple code | Loses context |\n| Sliding Window | 1000-1500 | 200 | Resumes, documents | Balanced |\n| Semantic | Variable | Variable | Technical docs | Complex to implement |\n| Recursive | 1000 first, then smaller | 200 | Large documents | More processing |\n\nFor resumes? Sliding window, 1000 tokens, 200-token overlap. 
It's what works best.\n\nHere's example code:\n\n``` python\ndef sliding_window_chunk(text, window_size=1000, overlap=200):\n    \"\"\"\n    Break text into chunks with overlap.\n    window_size = tokens per chunk\n    overlap = tokens of overlap between chunks\n    \"\"\"\n    if overlap >= window_size:\n        raise ValueError(\"overlap must be smaller than window_size\")\n\n    tokens = text.split()  # Simplified tokenization: words stand in for tokens\n    chunks = []\n\n    # Each window starts (window_size - overlap) tokens after the previous one\n    for i in range(0, len(tokens), window_size - overlap):\n        chunk = tokens[i:i + window_size]\n        chunks.append(\" \".join(chunk))\n\n        # Stop once a window has reached the end of the text\n        if i + window_size >= len(tokens):\n            break\n\n    return chunks\n\n# Example\nresume_text = \"John has 5 years React... [long text]\"\nchunks = sliding_window_chunk(resume_text, 1000, 200)\n# chunks[0] = \"John has 5 years React... [1000 tokens]\"\n# chunks[1] = \"[last 200 tokens of chunk 0] ... [next 800 tokens]\"\n```\n\nSimple. Effective. Many production systems start from something like this.\n\n👦 **Nephew:** Uncle, embeddings are smart. But are they perfect?\n\n👨🦳 **Uncle:** No. Here's a real problem:\n\nResume has: \"Kubernetes 1.25\"\n\nYou search for: \"Kubernetes 1.26\"\n\nVector search thinks: these two are nearly identical.\n\nFALSE MATCH! The versions are different, but vectors say they're similar.\n\n👦 **Nephew:** So vectors miss exact matches?\n\n👨🦳 **Uncle:** Not miss - they're fuzzy. They blur differences. Sometimes that's good (finding \"React\" when you search \"React.js\"). Sometimes it's bad (confusing version numbers).\n\nThis is why you need both: vectors AND keywords.\n\n👦 **Nephew:** How does keyword search work?\n\n👨🦳 **Uncle:** Simple. Exact string matching.\n\nResume: \"Kubernetes 1.26\"\n\nSearch: \"1.26\"\n\nResult: EXACT MATCH. Yes.\n\nResume: \"Kubernetes 1.25\"\n\nSearch: \"1.26\"\n\nResult: NO MATCH. No.\n\nIt's binary. No fuzzy. Just right or wrong.\n\n👦 **Nephew:** So we use both?\n\n👨🦳 **Uncle:** Exactly. Here's how:\n\n**Search \"Kubernetes 1.26\"**\n\nStep 1: Vector search on all chunks\n\nStep 2: Keyword filter\n\nStep 3: Combine and rerank\n\nResult: Much better accuracy. 
Both semantic understanding AND exact precision.\n\nPostgreSQL can do this:\n\n```\nSELECT chunk_text \nFROM resume_chunks \nWHERE resume_id = 'john-uuid'\n  AND to_tsvector(chunk_text) @@ plainto_tsquery('Kubernetes 1.26')\n  AND embedding <=> question_embedding < 0.2\nORDER BY embedding <=> question_embedding\nLIMIT 5\n```\n\nThis query says:\n\nBoth working together.\n\n👦 **Nephew:** When do I use pure vector?\n\n👨🦳 **Uncle:** When you're searching for concepts:\n\nVectors are great here. They find related concepts even with different words.\n\n👦 **Nephew:** And pure keyword?\n\n👨🦳 **Uncle:** For exact facts:\n\nKeywords are perfect. You don't need fuzzy matching.\n\n👦 **Nephew:** And hybrid?\n\n👨🦳 **Uncle:** For most real scenarios. Maximum accuracy. That's what production uses.\n\n👦 **Nephew:** Uncle, we retrieve 5 chunks. But are they the *best* 5?\n\n👨🦳 **Uncle:** That's a great question. Let me show you a problem:\n\nSearch: \"Docker experience\"\n\nVector search returns:\n\nItems 3-5 shouldn't be in the top 5! Item 3 is negative, items 4-5 are weak.\n\nBasic vector search ranks by similarity score. But similarity doesn't capture quality.\n\n👦 **Nephew:** So we need a better ranker?\n\n👨🦳 **Uncle:** Yes. And here's the trick: we use two models.\n\n👦 **Nephew:** Two models? That sounds expensive.\n\n👨🦳 **Uncle:** Exactly. So we use them smartly.\n\n**Stage 1: Fast retrieval (milliseconds)**\n\nVector search returns 20 candidates. Speed matters here, so we use a fast model.\n\n**Stage 2: Accurate ranking (milliseconds)**\n\nWe rerank those 20 with a better, slower model.\n\nTotal time: 500ms. Still fast. But now the top 5 are *good*.\n\nThe key insight: We don't apply the accurate model to all 500,000 chunks. That's too slow. We apply it only to 20. 
That's feasible.\n\n👦 **Nephew:** What models should I use?\n\n👨🦳 **Uncle:** A few options:\n\n| Reranker | Source | Quality | Cost |\n|---|---|---|---|\n| BGE-Reranker-Base | Open source | Very Good | Free |\n| Cohere Rerank | API | Excellent | $$ |\n| Claude/GPT-4 | API | Best | $$$ |\n\nFor production? Use BGE. It's free, open-source, and performs almost as well as expensive options.\n\nHere's how:\n\n``` python\nfrom sentence_transformers import CrossEncoder\n\n# Load open-source reranker once, at startup\nreranker = CrossEncoder('BAAI/bge-reranker-base')\n\ndef rerank(query, candidate_chunks, top_k=5):\n    \"\"\"Rerank candidate chunk texts (e.g. the top 20 from vector search).\"\"\"\n    # Score each (query, chunk) pair with the cross-encoder\n    scores = reranker.predict([[query, chunk] for chunk in candidate_chunks])\n\n    # Sort by score, highest first\n    ranked = sorted(zip(candidate_chunks, scores), key=lambda x: x[1], reverse=True)\n\n    # Return the top_k (chunk, score) pairs\n    return ranked[:top_k]\n```\n\nThat's it. Takes around 300ms. Makes a huge quality difference.\n\n👦 **Nephew:** Uncle, when someone asks \"Does John have frontend skills?\", does the AI understand that means React, Vue, JavaScript, HTML?\n\n👨🦳 **Uncle:** Not automatically. Here's the problem:\n\nResume has: \"React, JavaScript, HTML, CSS expertise\"\n\nSearch for: \"frontend skills\"\n\nVector search looks for chunks close to the embedding of \"frontend skills\". But the resume doesn't have those words. It has the components, so the match can be weak.\n\nA human reads \"React, JavaScript, HTML, CSS\" and immediately thinks \"frontend\". A computer doesn't reliably make that leap.\n\n👦 **Nephew:** So how do we fix it?\n\n👨🦳 **Uncle:** Query rewriting. You expand the query.\n\n👦 **Nephew:** Expand it how?\n\n👨🦳 **Uncle:** Instead of searching for \"frontend skills\", you search for:\n\n(frontend OR React OR Vue OR Angular OR JavaScript OR HTML OR CSS)\n\nNow you find any mention of these. 
Much better.\n\nHere's code:\n\n``` python\ndef expand_query(query):\n    \"\"\"\n    Expand a query into related terms.\n    \"\"\"\n    # Simple approach using a dict\n    expansions = {\n        'frontend': ['React', 'Vue', 'Angular', 'JavaScript', 'HTML', 'CSS'],\n        'backend': ['Node.js', 'Python', 'Java', 'databases', 'APIs'],\n        'devops': ['Docker', 'Kubernetes', 'AWS', 'CI/CD', 'deployment'],\n    }\n\n    for key, synonyms in expansions.items():\n        if key in query.lower():\n            return query + \" OR \" + \" OR \".join(synonyms)\n\n    return query\n\n# Example\noriginal = \"frontend skills\"\nexpanded = expand_query(original)\n# Result: \"frontend skills OR React OR Vue OR Angular OR JavaScript OR HTML OR CSS\"\n```\n\n👦 **Nephew:** Is there a smarter way?\n\n👨🦳 **Uncle:** Yes! Instead of manually expanding, let an LLM generate variations.\n\nUser asks: \"Real-time system experience?\"\n\nLLM thinks: \"That means WebSockets, Socket.IO, real-time updates, pub/sub, message queues\"\n\nLLM generates multiple queries:\n\nYou search all 5 queries. You get chunks from all angles. You combine results.\n\nCoverage: 10x better.\n\n``` python\nfrom anthropic import Anthropic\n\ndef generate_queries(original_query):\n    \"\"\"\n    Generate multiple search queries from one.\n    \"\"\"\n    client = Anthropic()\n\n    response = client.messages.create(\n        model=\"claude-3-5-sonnet-20241022\",\n        max_tokens=500,\n        messages=[{\n            \"role\": \"user\",\n            \"content\": f\"\"\"\nGiven this question, generate 3-5 related search queries \nthat would find relevant information. 
Return only the queries, \none per line.\n\nOriginal question: {original_query}\n\nQueries:\n\"\"\"\n        }]\n    )\n\n    queries = response.content[0].text.strip().split('\\n')\n    return [q.strip() for q in queries if q.strip()]\n\n# Example\noriginal = \"Real-time system experience?\"\nqueries = generate_queries(original)\n# Returns:\n# [\"WebSockets real-time\", \"Socket.IO\", \"Redis pub/sub\", \"Message queues\", \"Real-time streaming\"]\n\n# Now search each\nall_results = []\nfor q in queries:\n    results = vector_search(q)\n    all_results.extend(results)\n\n# Deduplicate and return top 5\nunique_results = list({r['id']: r for r in all_results}.values())[:5]\n```\n\nThis is elegant. The LLM understands what \"real-time\" means and generates good search queries.\n\n👦 **Nephew:** Uncle, I built a RAG system. It works. But how do I know if it's *good*?\n\n👨🦳 **Uncle:** This is where most teams fail. They build, ship, and hope. No metrics.\n\nBut you can measure. Here are the important metrics:\n\n👦 **Nephew:** What should I measure?\n\n👨🦳 **Uncle:** Four main things:\n\n| Metric | Measures | Target | Impact |\n|---|---|---|---|\n| Recall | Of all relevant docs, how many found? | >90% | Missing information |\n| Precision | Of what found, how relevant? | >85% | Wrong context |\n| Latency | Response time | <2 seconds | User experience |\n| Faithfulness | AI stays factual | >95% | Hallucinations |\n\nLet me explain each.\n\n👦 **Nephew:** What's the difference?\n\n👨🦳 **Uncle:** Let me give you an example. Say there are 10 Docker-related chunks in John's resume.\n\n**Recall**: Of those 10, how many did you find?\n\nHigh recall means: \"I found most of the relevant information\"\n\n**Precision**: Of what you returned, how many were relevant?\n\nHigh precision means: \"Most of what I returned was useful\"\n\nUsually there's a tradeoff:\n\nYou want both high.\n\n👦 **Nephew:** How do I actually measure this?\n\n👨🦳 **Uncle:** You need labeled data. 
A test set.\n\nHere's code:\n\n``` python\ndef calculate_recall_precision(retrieved, relevant):\n    \"\"\"\n    retrieved: chunks your system found\n    relevant: chunks a human labeled as relevant\n    \"\"\"\n    retrieved_ids = set(r['id'] for r in retrieved)\n    relevant_ids = set(r['id'] for r in relevant)\n\n    # Recall: of all relevant, how many found?\n    if len(relevant_ids) == 0:\n        recall = 1.0\n    else:\n        recall = len(retrieved_ids & relevant_ids) / len(relevant_ids)\n\n    # Precision: of what found, how many relevant?\n    if len(retrieved_ids) == 0:\n        precision = 0.0\n    else:\n        precision = len(retrieved_ids & relevant_ids) / len(retrieved_ids)\n\n    return recall, precision\n\n# Example\nretrieved = [\n    {'id': '1', 'text': 'Docker experience'},\n    {'id': '2', 'text': 'Kubernetes'},\n    {'id': '3', 'text': 'Machine learning'},\n]\n\nrelevant = [\n    {'id': '1', 'text': 'Docker experience'},\n    {'id': '2', 'text': 'Kubernetes'},\n    {'id': '4', 'text': 'Container orchestration'},\n]\n\nrecall, precision = calculate_recall_precision(retrieved, relevant)\n# Recall = 2/3 = 67% (found 2 of 3 relevant)\n# Precision = 2/3 = 67% (2 of 3 returned were relevant)\n```\n\n👦 **Nephew:** How do I measure speed?\n\n👨🦳 **Uncle:** Simple:\n\n``` python\nimport time\n\nstart = time.time()\nresults = rag_system.query(\"Does John have Docker experience?\")\nlatency = (time.time() - start) * 1000  # Convert to ms\n\nprint(f\"Latency: {latency}ms\")\n```\n\nTarget: <2000ms (2 seconds). 
Anything faster is bonus.\n\n👦 **Nephew:** What about cost?\n\n👨🦳 **Uncle:** Important for production:\n\n```\n# Cost per query\n# Embedding: $0.00001 per 1K tokens\n# Reranking: $0.0001 per query (if using paid service)\n# LLM answer: $0.001 per query (if using Claude)\n# Total: ~$0.0011 per query\n\n# If you do 1M queries/month:\nmonthly_cost = 1_000_000 * 0.0011  # $1100/month\n```\n\nThis is what you need to know for business decisions.\n\n👦 **Nephew:** Uncle, what's a hallucination?\n\n👨🦳 **Uncle:** Here's a real example:\n\nResume says: \"Worked with React for 2 years\"\n\nHiring manager asks: \"Experience with Kubernetes?\"\n\nAI responds: \"Yes, John has Kubernetes with orchestration.\"\n\nWRONG! The resume never mentions Kubernetes.\n\nThe AI invented an answer. It saw \"Docker\" or \"containers\" (maybe), thought \"Kubernetes is related to containers\", and hallucinated.\n\nThis is a hallucination. The AI is lying, but confidently.\n\n👦 **Nephew:** But why does this happen?\n\n👨🦳 **Uncle:** AIs are pattern-matching machines. They learned during training: \"Docker\" often appears near \"Kubernetes\". So the AI's brain has connected them.\n\nEven if you show the AI only a resume mentioning Docker, its training whispers: \"And Kubernetes is usually near Docker!\"\n\nSo the AI thinks: \"Probably Kubernetes too.\"\n\nIt's an artifact of how AIs learn. Not a bug. A feature used wrongly.\n\n👦 **Nephew:** How do you stop it?\n\n👨🦳 **Uncle:** Five layers of protection:\n\n**Layer 1: Retrieval Boundaries**\n\nShow the AI ONLY the retrieved chunks. Nothing from training.\n\nIn your prompt:\n\n```\nBased ONLY on the following information:\n\n[Retrieved chunks]\n\nAnswer: Does John have Kubernetes experience?\n```\n\nThe word \"ONLY\" is critical.\n\n**Layer 2: Structured Output**\n\nForce JSON format. 
Constrain the answer shape.\n\n``` python\nresponse = client.messages.create(\n    model=\"claude-3-5-sonnet-20241022\",\n    max_tokens=500,  # required by the Messages API\n    messages=[...],  # your prompt with the retrieved chunks\n    system=\"\"\"\nYou MUST respond in JSON format:\n{\n  \"answer\": \"yes\" or \"no\",\n  \"confidence\": 0.0 to 1.0,\n  \"evidence\": [\"chunk 1\", \"chunk 2\"]\n}\n\"\"\",\n)\n```\n\nThis pushes the AI to cite evidence alongside every answer.\n\n**Layer 3: Validation**\n\nCheck the output against the original chunks:\n\n``` python\nimport json\n\ndef validate_answer(response, chunks):\n    \"\"\"Reject answers whose evidence is not found in the retrieved chunks.\"\"\"\n    answer = json.loads(response.content[0].text)\n\n    # Check: does each piece of evidence appear in some retrieved chunk?\n    for evidence in answer['evidence']:\n        if not any(evidence in chunk for chunk in chunks):\n            # Hallucination detected!\n            answer['confidence'] = 0  # Reject\n\n    return answer\n```\n\n**Layer 4: Confidence Gating**\n\nLow confidence? Escalate to human:\n\n``` python\n# Inside the function that returns the final answer\nif answer['confidence'] < 0.7:\n    # Hand off to human\n    return {\n        'answer': 'Unknown - escalated to human',\n        'reason': 'Low confidence'\n    }\n```\n\n**Layer 5: Required Citations**\n\nEvery claim must point to a chunk:\n\n```\nQuestion: Does John know Kubernetes?\nRetrieved chunks:\n[1] \"Docker and container experience\"\n[2] \"Kubernetes cluster management\"\n\nAnswer: Yes, John knows Kubernetes (from chunk 2).\n```\n\nThe AI must cite. No citation? No answer.\n\n👦 **Nephew:** Do you use all five?\n\n👨🦳 **Uncle:** In production? All five. At different stages.\n\nRetrieval boundaries + structured output stop most hallucinations.\n\nValidation + confidence gating catch most of the rest.\n\nCitations let users verify.\n\nTogether? Hallucinations become rare.\n\n👦 **Nephew:** Uncle, my RAG system works on my laptop. 
But production is different?\n\n👨🦳 **Uncle:** Completely different.\n\n**Prototype version:**\n\n**Production version:**\n\nThese are not the same system.\n\n👦 **Nephew:** How do I design it?\n\n👨🦳 **Uncle:** Here's a proven architecture:\n\n```\n                    Users\n                      ↓\n              Load Balancer (Nginx)\n             /          |          \\\n         API-1      API-2      API-3  (scale horizontally)\n             \\          |          /\n                      ↓\n                 Cache Layer (Redis)\n                      ↓\n           PostgreSQL + pgvector (with vector index)\n                      ↓\n              External APIs (Claude, etc.)\n```\n\n**Load Balancer**\n\n**Multiple APIs**\n\n**Cache**\n\n**Database**\n\n**External APIs**\n\nEach layer has a job. Each layer is replaceable.\n\n👦 **Nephew:** Wait, what if I'm hosting this for multiple companies?\n\n👨🦳 **Uncle:** Then you have a SERIOUS security issue if you don't handle isolation.\n\nCompany A's data must NEVER be visible to Company B.\n\nEvery single query must include:\n\n``` sql\nSELECT * FROM resume_chunks \nWHERE resume_id = 'john-123'\n  AND tenant_id = 'company-a'  -- THIS IS CRITICAL\n```\n\nWithout the `tenant_id` check, a hacker or bug could leak data.\n\nThis is not optional. This is \"you'll be sued\" level of important.\n\n👦 **Nephew:** Uncle, I've heard of ATS. What is it?\n\n👨🦳 **Uncle:** Old ATS systems just searched for keywords. \"Does the resume have the word React?\" Yes/No.\n\nModern ATS - the kind you're building with RAG - is intelligent. It understands:\n\nEach resume gets a score. Top scorers get interviewed. Low scorers get rejected.\n\nThe difference between old and new? Old ATS rejects good candidates because they phrased things differently. New ATS finds them anyway.\n\n👦 **Nephew:** How do you score a resume?\n\n👨🦳 **Uncle:** With components. 
Job requirements × candidate abilities.\n\nJob description requires:\n\nLet's say this resume has:\n\nHere's the scoring:\n\n``` python\ndef score_resume(candidate, job_requirements):\n    \"\"\"\n    Score a resume against job requirements.\n    \"\"\"\n    scores = {}\n\n    # Required skills: 60% weight\n    required_score = 0\n    required_count = 0\n    for skill, years_required in job_requirements['required'].items():\n        required_count += 1\n        candidate_years = candidate.get(skill, 0)\n\n        if candidate_years >= years_required:\n            required_score += 100  # Full points\n        elif candidate_years > 0:\n            # Partial credit\n            required_score += (candidate_years / years_required) * 100\n        # else: 0 points\n\n    required_avg = required_score / required_count if required_count > 0 else 0\n\n    # Preferred skills: 25% weight\n    preferred_score = 0\n    preferred_count = 0\n    for skill in job_requirements['preferred']:\n        preferred_count += 1\n        if skill in candidate:\n            preferred_score += 50  # Half credit for preferred\n\n    preferred_avg = preferred_score / preferred_count if preferred_count > 0 else 0\n\n    # Projects: 15% weight\n    project_score = len(candidate.get('projects', [])) * 25\n    project_avg = min(project_score, 100)  # Cap at 100\n\n    # Final score\n    final_score = (\n        required_avg * 0.60 +\n        preferred_avg * 0.25 +\n        project_avg * 0.15\n    )\n\n    return final_score\n\n# Example\ncandidate = {\n    'React': 6,\n    'JavaScript': 7,\n    'Node.js': 4,\n    'projects': ['E-commerce app', 'Real-time chat']\n}\n\nrequirements = {\n    'required': {'React': 5, 'JavaScript': 3},\n    'preferred': ['Node.js', 'AWS']\n}\n\nscore = score_resume(candidate, requirements)\n# React: (6 >= 5) = 100 pts\n# JavaScript: (7 >= 3) = 100 pts\n# Required avg: 100\n# Node.js: has it = 50 pts\n# Preferred avg: 25\n# Projects: 2 × 25 = 50 pts\n# Final: (100 × 
0.60) + (25 × 0.25) + (50 × 0.15) = 60 + 6.25 + 7.5 = 73.75/100\n```\n\nThat's scoring. But wait - how do you extract the data?\n\n👦 **Nephew:** One resume might say \"React\" but another might say \"React.js\". How do you normalize?\n\n👨🦳 **Uncle:** Build a dictionary.\n\n``` python\nskill_aliases = {\n    'React': ['React', 'React.js', 'ReactJS', 'react'],\n    'Node.js': ['Node.js', 'node.js', 'nodejs', 'Node'],\n    'Docker': ['Docker', 'docker', 'containers'],\n}\n\ndef normalize_skill(mentioned_skill):\n    \"\"\"Find the canonical skill name.\"\"\"\n    mentioned_lower = mentioned_skill.lower()\n\n    for canonical, aliases in skill_aliases.items():\n        if mentioned_lower in [a.lower() for a in aliases]:\n            return canonical\n\n    return mentioned_skill  # Unknown skill\n\n# Example\nprint(normalize_skill(\"React.js\"))  # Output: React\nprint(normalize_skill(\"nodejs\"))    # Output: Node.js\n```\n\nNow \"React\" and \"React.js\" are treated as the same skill.\n\n👦 **Nephew:** But what if the resume mentions a skill not in your dictionary?\n\n👨🦳 **Uncle:** Use embeddings to find similar skills.\n\nResume mentions: \"Built trading platform with real-time updates\"\n\nJob requires: \"WebSockets experience\"\n\nThese aren't in your dictionary. But you can check similarity:\n\n``` python\n# get_embedding() and cosine_similarity() are placeholders for your own\n# embedding client and similarity helper.\ndef find_skill_match(mentioned_skill, required_skill):\n    \"\"\"\n    Use embeddings to find matches for unknown skills.\n    \"\"\"\n    # Get embeddings\n    mentioned_embedding = get_embedding(mentioned_skill)\n    required_embedding = get_embedding(required_skill)\n\n    # Calculate similarity (higher = more alike)\n    similarity = cosine_similarity(mentioned_embedding, required_embedding)\n\n    # If similar enough, count as a match\n    return similarity > 0.75  # Threshold; tune it on your own data\n\n# Example\nscore = 0\nif find_skill_match(\"real-time data updates\", \"WebSockets\"):\n    # Candidate probably has WebSockets experience\n    score += 100\n```\n\nThis is powerful. 
It catches variations and similar technologies.\n\n👦 **Nephew:** Uncle, we have RAG working. What comes next?\n\n👨🦳 **Uncle:** Basic RAG answers questions. Advanced RAG (agents) *decides* what to do.\n\nDifference:\n\nBasic RAG: \"Does John know React?\" → Find resume → Answer\n\nAgent RAG: \"Is John qualified for Senior Developer role?\" →\n\nThe agent is intelligent about what to search.\n\n👦 **Nephew:** How do I build an agent?\n\n👨🦳 **Uncle:** Using Claude's tool use (function calling):\n\n``` python\nimport anthropic\n\nclient = anthropic.Anthropic()\n\ndef search_resume(resume_id, query):\n    \"\"\"Search candidate's resume.\"\"\"\n    # Your RAG search logic\n    return rag_system.query(resume_id, query)\n\ndef search_job_description(job_id, query):\n    \"\"\"Search job description.\"\"\"\n    return job_db.query(job_id, query)\n\ndef analyze_candidate(resume_id, job_id):\n    \"\"\"\n    Use Claude as an agent to analyze candidate fit.\n    Claude decides what to search, what to compare.\n    \"\"\"\n    tools = [\n        {\n            \"name\": \"search_resume\",\n            \"description\": \"Search a candidate's resume for information\",\n            \"input_schema\": {\n                \"type\": \"object\",\n                \"properties\": {\n                    \"resume_id\": {\"type\": \"string\"},\n                    \"query\": {\"type\": \"string\"}\n                },\n                \"required\": [\"resume_id\", \"query\"]\n            }\n        },\n        {\n            \"name\": \"search_job\",\n            \"description\": \"Search job requirements\",\n            \"input_schema\": {\n                \"type\": \"object\",\n                \"properties\": {\n                    \"job_id\": {\"type\": \"string\"},\n                    \"query\": {\"type\": \"string\"}\n                },\n                \"required\": [\"job_id\", \"query\"]\n            }\n        }\n    ]\n\n    messages = [{\n        \"role\": \"user\",\n        
\"content\": f\"\"\"\nAnalyze if candidate {resume_id} is qualified for job {job_id}.\n\nUse the available tools to:\n1. Find required skills in the job description\n2. Check if the candidate has those skills\n3. Compare years of experience\n4. Look for relevant projects\n5. Make a recommendation\n\nProvide detailed reasoning.\n\"\"\"\n    }]\n\n    # Agent loop\n    while True:\n        response = client.messages.create(\n            model=\"claude-3-5-sonnet-20241022\",\n            max_tokens=1000,\n            tools=tools,\n            messages=messages\n        )\n\n        # Check if Claude wants to use tools\n        if response.stop_reason == \"tool_use\":\n            # Process tool calls\n            tool_results = []\n            for content in response.content:\n                if content.type == \"tool_use\":\n                    if content.name == \"search_resume\":\n                        result = search_resume(\n                            content.input[\"resume_id\"],\n                            content.input[\"query\"]\n                        )\n                    else:  # search_job\n                        result = search_job_description(\n                            content.input[\"job_id\"],\n                            content.input[\"query\"]\n                        )\n\n                    tool_results.append({\n                        \"type\": \"tool_result\",\n                        \"tool_use_id\": content.id,\n                        \"content\": str(result)\n                    })\n\n            # Add Claude's response and tool results to messages\n            messages.append({\"role\": \"assistant\", \"content\": response.content})\n            messages.append({\"role\": \"user\", \"content\": tool_results})\n\n        else:\n            # Claude is done - extract final answer\n            final_answer = \"\"\n            for content in response.content:\n                if hasattr(content, 'text'):\n                    
final_answer = content.text\n\n            return final_answer\n\n# Use it\nresult = analyze_candidate(\"john-resume\", \"senior-dev-job\")\nprint(result)\n```\n\nThis is elegant. Claude is the brain. It decides what to search, in what order, and how to reason about it.\n\n👦 **Nephew:** What's next? After agents?\n\n👨🦳 **Uncle:** A few directions:\n\n**1. Adaptive RAG**\n\nSystem learns what works. If a question fails, it tries different search strategies automatically.\n\n**2. Multi-modal RAG**\n\nRight now we handle text. Soon: images, tables, videos. A resume with a photo of their projects. A video walkthrough.\n\n**3. Real-time RAG**\n\nSystem continuously updates knowledge. Candidate updates LinkedIn → RAG system knows instantly.\n\n**4. Collaborative RAG**\n\nMultiple agents reasoning together. One searches, one evaluates, one questions.\n\n**5. Explainable RAG**\n\nSystem shows its work. \"I rejected John because he doesn't have 5 years React, only 3 years, and React 5+ was required.\"\n\nThese are coming. RAG is only beginning.\n\n👨🦳 **Uncle:** Let me summarize what you've learned:\n\n👦 **Nephew:** That's a lot, uncle. But I think I understand. RAG isn't magic - it's just engineering.\n\n👨🦳 **Uncle:** Exactly. RAG is:\n\nEvery layer serves one purpose: make AI useful for real people.\n\n👦 **Nephew:** And production?\n\n👨🦳 **Uncle:** Remember: Reliable for 10 users >> Perfect for nobody.\n\nStart simple. Build on PostgreSQL. Measure everything. Iterate fast.\n\nYou don't need fancy systems. You need good engineering.\n\n👦 **Nephew:** Uncle, one last question.\n\n👨🦳 **Uncle:** Go ahead.\n\n👦 **Nephew:** How do I actually start? Like, code?\n\n👨🦳 **Uncle:** That's next lesson, beta. First, you understand the system. 
Then you build it.\n\nBut I'll give you one gift: a simple starter stack:\n\n```\nFrontend: Next.js + React\nBackend: Node.js + Express\nDatabase: PostgreSQL + pgvector\nEmbedding: a dedicated embedding model (e.g. Voyage AI; the Claude API has no embeddings endpoint)\nLLM: Claude API (messages endpoint)\nHosting: AWS Lightsail or Render\n```\n\nSimple. Proven. Scales.\n\nNow go build something amazing.\n\n``` python\nimport voyageai\nfrom anthropic import Anthropic\n\nclient = Anthropic()  # LLM calls; reads ANTHROPIC_API_KEY\n\n# Assumption: Voyage AI via the `voyageai` package. Any embedding model works;\n# a chat model cannot return real embeddings.\nembedder = voyageai.Client()  # reads VOYAGE_API_KEY\n\ndef create_embeddings(texts):\n    \"\"\"Get one embedding vector per input text.\"\"\"\n    result = embedder.embed(texts, model=\"voyage-3\", input_type=\"document\")\n    return result.embeddings\n```\n\n``` sql\n-- query_embedding is the query's vector, passed in as a parameter\nSELECT chunk_text,\n       embedding <=> query_embedding AS vector_distance,\n       ts_rank(to_tsvector(chunk_text),\n               plainto_tsquery('search_term')) AS keyword_rank\nFROM resume_chunks\nWHERE resume_id = 'candidate-uuid'\n  AND tenant_id = 'company-a'\n  AND to_tsvector(chunk_text) @@ plainto_tsquery('search_term')\nORDER BY vector_distance ASC, keyword_rank DESC\nLIMIT 5;\n```\n\n``` python\ndef score_candidate(candidate_skills, required_skills, weight=0.6):\n    \"\"\"Simple scoring: % of required skills matched (weight is unused here).\"\"\"\n    if not required_skills:\n        return 0.0  # Avoid dividing by zero when nothing is required\n    matched = len([s for s in candidate_skills if s in required_skills])\n    score = (matched / len(required_skills)) * 100\n    return score\n```\n\n``` python\nimport json\n\ndef safe_answer(query, chunks, confidence_threshold=0.7):\n    \"\"\"Answer only if confident.\"\"\"\n    response = client.messages.create(\n        model=\"claude-3-5-sonnet-20241022\",\n        max_tokens=500,\n        system=\"\"\"\nYou MUST:\n1. Answer ONLY using the provided chunks\n2. Cite evidence\n3. 
Return JSON with answer, confidence, evidence\n\"\"\",\n        messages=[{\n            \"role\": \"user\",\n            \"content\": f\"\"\"\nChunks: {chunks}\nQuestion: {query}\n\"\"\"\n        }]\n    )\n\n    answer = json.loads(response.content[0].text)\n\n    if answer['confidence'] < confidence_threshold:\n        return \"Uncertain - escalated to human\"\n\n    return answer['answer']\n```\n\nRAG isn't magic. It's engineering:\n\nYou've learned the foundations. Now use them.\n\nBuild something reliable.\n\nBuild something honest.\n\nBuild something useful.\n\nYou've got this.\n\n*Created for developers who want to understand how RAG actually works, not just use it as a black box.*\n\n**Remember: Less noise, more action.**\n\nGo build. Good luck. 🚀", "url": "https://wpnews.pro/news/rag-pipeline-the-uncle-nephew-complete-learning-guide", "canonical_source": "https://dev.to/surajrkhonde/rag-pipeline-the-uncle-nephew-complete-learning-guide-7h4", "published_at": "2026-06-20 09:44:17+00:00", "updated_at": "2026-06-20 10:06:48.254422+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "natural-language-processing", "ai-agents"], "entities": ["RAG", "Retrieval-Augmented Generation", "React", "JavaScript", "Docker", "Kubernetes"], "alternates": {"html": "https://wpnews.pro/news/rag-pipeline-the-uncle-nephew-complete-learning-guide", "markdown": "https://wpnews.pro/news/rag-pipeline-the-uncle-nephew-complete-learning-guide.md", "text": "https://wpnews.pro/news/rag-pipeline-the-uncle-nephew-complete-learning-guide.txt", "jsonld": "https://wpnews.pro/news/rag-pipeline-the-uncle-nephew-complete-learning-guide.jsonld"}}