RAG Pipeline: The Uncle-Nephew Complete Learning Guide

A developer explains that Retrieval-Augmented Generation (RAG) fixes AI hallucinations by having the model fetch relevant documents before answering, rather than relying solely on memorized training data. The guide breaks down RAG into three steps and highlights that the main challenge is finding the right documents quickly and handling synonyms.
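
The three steps mentioned above can be sketched in a few lines. This is only a toy illustration under stated assumptions: the `DOCS` list, the word-overlap ranking in `retrieve`, and the `echo_llm` stand-in are all hypothetical and chosen so the example runs without any external service. A real pipeline would use embeddings for retrieval and an actual model call for generation.

```python
from typing import Callable, List

# Hypothetical mini-corpus; a real system would hold resume chunks in a database.
DOCS = [
    "Priya built smart contracts on Ethereum using Solidity.",
    "John has 5 years of React and mentors junior developers.",
    "John deploys services with Docker and Kubernetes.",
]


def _words(text: str) -> set:
    """Lowercase the text and split it into a set of words, dropping basic punctuation."""
    return set(text.lower().replace("?", "").replace(".", "").split())


def retrieve(question: str, docs: List[str], k: int = 2) -> List[str]:
    """Step 1 (Retrieve): rank documents by how many words they share with the question."""
    q = _words(question)
    return sorted(docs, key=lambda d: len(q & _words(d)), reverse=True)[:k]


def augment(question: str, chunks: List[str]) -> str:
    """Step 2 (Augment): put the retrieved text into the prompt, ahead of the question."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using ONLY these excerpts:\n{context}\n\nQuestion: {question}"


def generate(prompt: str, llm: Callable[[str], str]) -> str:
    """Step 3 (Generate): hand the augmented prompt to whatever model you use."""
    return llm(prompt)


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model call: it only reports the prompt size."""
    return f"[model would answer from a {len(prompt)}-character prompt]"


question = "Does John have Docker experience?"
chunks = retrieve(question, DOCS, k=2)
prompt = augment(question, chunks)
print(generate(prompt, echo_llm))
```

Word overlap is used here only because it needs no dependencies; it has exactly the synonym weakness the guide goes on to discuss, which is why the later sections replace it with embeddings.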

How to Build Systems That Actually Know Your Data (Not Hallucinate About It)

👦 Nephew: Uncle, I keep hearing "RAG this, RAG that" in tech interviews. When I ask what it means, people throw around words like "Retrieval-Augmented Generation" and I just nod like I understand. But honestly? I'm lost.

👨🦳 Uncle: (laughing) That's the best honest question I've heard all week. Let me ask you something first. If I gave you a question right now - "What year did India win the World Cup?" - how would you answer?

👦 Nephew: Well... I'd pull up Google, search for it, read the answer, then tell you.

👨🦳 Uncle: Exactly. You don't answer from memory alone. You go fetch the information first, then answer based on what you found. That's RAG in real life. And that simple idea - fetch first, answer after - fixes almost every problem we face with AI today.

👦 Nephew: But uncle, AI can remember things from its training. Why does it need to fetch?

👨🦳 Uncle: Ah! That's where we land in trouble. Come, sit...

👨🦳 Uncle: Imagine you're hiring for a tech company. You receive 500 resumes for a Senior React Developer role. Now tell me - how would you actually process them?

👦 Nephew: I'd... probably make a spreadsheet? List all the candidates with key skills?

👨🦳 Uncle: Right. But here's the catch - you can't read all 500 resumes deeply. So what do you really do?

👦 Nephew: Skim for keywords like "React", "JavaScript", "5 years"?

👨🦳 Uncle: Exactly. You skim and hope you don't miss anyone good. Now, here's the problem: what if a candidate wrote "React.js" instead of "React"? Your eyes might still catch it. But a dumb computer doing exact string matching? It says "no match". What if someone wrote "Built real-time user interfaces with the React framework"? A human sees at once that this candidate knows React, but a computer matching on the exact phrase "React developer" or "React experience" finds nothing. The computer misses them. This is exactly what happens with AI. When you ask an AI a question, it tries to answer purely from what it memorized during training.
And it makes mistakes - sometimes big ones.

👦 Nephew: So RAG stops these mistakes?

👨🦳 Uncle: Precisely. RAG is simple: instead of asking the AI to guess, you hand it the document first. You say, "Here's the resume, here's the job description - now answer my question." The AI stops guessing. It reads the evidence. It answers from that evidence.

👦 Nephew: Okay, so RAG = give the AI documents, then ask questions?

👨🦳 Uncle: Yes, but we need to be precise about how we give it documents. RAG has three steps:

1. Retrieve: find the documents relevant to the question.
2. Augment: add those documents to the prompt.
3. Generate: let the AI answer from what it was given.

That's it. R-A-G.

👦 Nephew: But if it's that simple, why is everyone talking about it like it's complicated?

👨🦳 Uncle: Because "finding the right documents" is the hard part. You have 500 resumes. When someone asks "Does John have Docker experience?", you can't search all 500 linearly. That's slow. And you can't use simple keyword search either - what if John wrote "containerization" instead of "Docker"? What if he wrote "I work with Kubernetes" - which strongly suggests he knows Docker too? So the real question becomes: How do you find the right documents fast, and how do you understand when two different words mean the same thing? This is where everything else - embeddings, vector databases, chunking - comes in. They're all in service of solving that one problem.

👦 Nephew: Okay, I think I get the big picture. But uncle, why can't we just make the AI smarter?

👨🦳 Uncle: Two reasons. First - you train an AI once. After that, it doesn't learn new information. If your company has 100 internal policies created last month, the AI knows nothing about them. It can't learn them instantly. Second - even if you could retrain it, hallucinations would still happen. AIs are pattern-matching machines. They're brilliant at patterns. But they sometimes see patterns that aren't there. RAG pushes the AI to cite its sources, to point at evidence. It's the difference between an answer guessed from memory and an answer that points at the document it came from. The second one is what you want. That's RAG.

👦 Nephew: Uncle, I have a question.
How does a computer know that "React" and "React.js" are the same thing?

👨🦳 Uncle: That's the right question. Let me explain with a story. I go to a fruit market. I see apples, mangoes, oranges. How do I know which is which? I look at them - color, shape, smell. My brain recognizes patterns and says "that's a mango". Now, how does a computer do that? It doesn't have eyes. All it has are numbers.

👦 Nephew: Numbers?

👨🦳 Uncle: Yes. Here's the trick: every word - "React", "React.js", "JavaScript", "Python" - is just a number to a computer. Or actually, a list of numbers. We call this list an "embedding".

👦 Nephew: Like a list of... what kind of numbers?

👨🦳 Uncle: Imagine I'm describing a person to you. I might give you their height, their weight, their age, and their shoe size. That's 4 numbers describing 1 person. Now imagine I use 1536 numbers instead. Each number describes a different quality - not just physical things, but hidden things like "how technical is this word", "is this related to web development", "how often is this used", and so on. So "React" becomes a point in a 1536-dimensional space. "React.js" becomes another point in that same space. And because they mean the same thing, those two points are very close to each other. But "React" and "Python"? Those points are far apart.

👦 Nephew: Okay, so closeness = similarity?

👨🦳 Uncle: Exactly. And here's the beautiful part: you don't have to manually create these embeddings. An AI model does it for you. You feed it the word "React" and it says, "This word should be represented as [0.12, -0.45, 0.78, ...] (1536 numbers)". Different models might represent it differently, but the same model always places similar concepts close together. That's what matters.

👦 Nephew: So embeddings let computers understand meaning?

👨🦳 Uncle: They let computers represent meaning as a position in space. And when you represent things as positions, you can do math with them. For example, if I tell you: "Engineer" - "Java" + "React" = ?

👦 Nephew: That's... weird math?

👨🦳 Uncle: Right?
But with embeddings, you can actually do this. And you know what the answer is? The vector closest to that result is often something like "web developer". The math captures meaning. Now here's why this matters for RAG: When someone asks "Does John have Docker experience?", we convert that question to an embedding - a point in 1536D space. Then we search for resume chunks that are close to that point. The chunks mentioning Docker - whether they say "Docker", "containerization", "container orchestration", or "I deploy with Docker" - are all close to the question's embedding. So we find them all. And the AI reads them. And the AI answers from them. Without embeddings, simple keyword search would miss half of them.

👦 Nephew: Okay, so we have embeddings. But uncle, if I have a company with 10,000 resumes, and each resume has 50 chunks, that's 500,000 embeddings. Each embedding is 1536 numbers. How do I store and search that?

👨🦳 Uncle: This is where most people choose poorly. They think: "I'll use Pinecone! I'll use Milvus! I'll use Weaviate!" They add three systems. One for storage, one for search, one for cache. Now they have 3 things that could break. 3 things to debug. 3 bills to pay. There's a better way.

👦 Nephew: What?

👨🦳 Uncle: PostgreSQL.

👦 Nephew: The database? Just... PostgreSQL?

👨🦳 Uncle: With one addition: the pgvector extension. PostgreSQL is already reliable, it's already running your data, it's already scaling. We just add vector support. Think about it this way: 500,000 vectors stored in PostgreSQL as data. When you have a question, you convert it to an embedding, and you say:

```sql
-- question_embedding stands for the query vector you pass in as a parameter
SELECT *
FROM resume_chunks
ORDER BY embedding <=> question_embedding
LIMIT 5;
```

That `<=>` operator is pgvector's cosine distance, so the query means "find the 5 vectors closest to my question vector". With an index in place, PostgreSQL finds them in milliseconds. You're done. No new infrastructure. No vendor lock-in. Just one database doing one job well.

👦 Nephew: But doesn't the `<=>` operator need an index to be fast?

👨🦳 Uncle: Good catch. Yes.
pgvector doesn't build an index on its own; without one, every query compares the question against all rows. You create the index yourself, and IVFFlat is a common choice (HNSW is the other one pgvector offers). Think of it like this: When you have 500,000 chunks and need to find the 5 closest to a question, searching all 500,000 linearly is slow. The index divides the 1536D space into regions (like neighborhoods). When you search, it says "The question's embedding is closest to this neighborhood" and only searches that neighborhood. Much faster. A linear search of 500K vectors? Maybe 5 seconds. With the index? More like 50-100ms. That's the magic of the right index.

👦 Nephew: How do I actually create this?

👨🦳 Uncle: Simple. PostgreSQL + pgvector. Here's what you do:

```sql
-- Create extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create table with vector column
CREATE TABLE resume_chunks (
    id SERIAL PRIMARY KEY,
    resume_id UUID,
    chunk_text TEXT,
    embedding vector(1536),
    created_at TIMESTAMP DEFAULT NOW()
);

-- Create index for fast search
CREATE INDEX ON resume_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```

Now you have a table. You insert chunks with their embeddings. You query with the `<=>` operator. Done.

👦 Nephew: That's it?

👨🦳 Uncle: That's it. No API calls to external services for search. No paying per query. Just a single, reliable database. This is the beauty of choosing the right tool.

👦 Nephew: Uncle, we have embeddings, we have a database. How does actual retrieval happen?

👨🦳 Uncle: Let me walk you through a real scenario. A hiring manager asks: "Does Priya have blockchain experience?"

Step 1: Understand the question. We need to find chunks about blockchain skills.
Step 2: Convert question to embedding. "blockchain experience" becomes a vector: [0.34, -0.12, ...] (1536 numbers).

Step 3: Search the database. We query:

```sql
SELECT chunk_text,
       1 - (embedding <=> question_embedding) AS similarity
FROM resume_chunks
WHERE resume_id = 'priya-uuid'
ORDER BY embedding <=> question_embedding
LIMIT 5;
```

Step 4: Get results. The 5 closest chunks come back, ranked by similarity.

Step 5: Send to AI. We put these 5 chunks in the prompt:

```
Based on the following resume excerpts:

[chunk 1]
[chunk 2]
[chunk 3]
[chunk 4]
[chunk 5]

Question: Does Priya have blockchain experience?
```

Step 6: AI answers. "Yes, Priya has strong blockchain experience. She has worked with Ethereum smart contracts, understands Bitcoin protocol, and has built on Web3/DeFi platforms using Solidity." With evidence.

👦 Nephew: This seems perfect. What's the problem?

👨🦳 Uncle: The problem is subtle. Resume has this line: "I worked on distributed ledger technology and consensus protocols". This person knows blockchain. But the word "blockchain" is not there. You search for "blockchain". Embeddings are smart, but not magic. "Distributed ledger technology" and "blockchain" are close but not identical. Sometimes the vector search returns this chunk. Sometimes it doesn't. Basic retrieval works maybe 70% of the time. For production, you need 95%+. That's why we need advanced techniques.

👦 Nephew: So what makes it better?

👨🦳 Uncle: Multiple techniques working together. But we'll get there. First, let's talk about preparation.

👦 Nephew: Uncle, let me ask something. Why do we break resumes into chunks at all? Why not just embed the whole resume?

👨🦳 Uncle: Good question. Two reasons:

1. Cost - You pay for every token the AI reads. If each question sends a whole 50-page resume to the AI, an answer might cost $5; sending only the few relevant chunks might cost $0.50. (Illustrative numbers; embedding itself is billed per token, so chunking does not make the embedding step cheaper.)
2. Precision - If you embed the entire resume as one chunk, and you search for "Docker experience", the system returns the whole 50-page resume. The AI has to find the Docker mention itself. But if you break it into chunks, the system returns only the chunks about Docker.
The AI reads less noise. But here's the problem: how small should chunks be?

👦 Nephew: Can't I just pick a number? Like 500 tokens per chunk?

👨🦳 Uncle: Let's see what happens with different sizes.

Attempt 1: Too small (100 tokens)

- Chunk 1: "John has"
- Chunk 2: "5 years of"
- Chunk 3: "React experience"

Problem: When we read Chunk 3, we've lost context. 5 years of what? The AI is confused.

Attempt 2: Right size (1000 tokens)

"John has 5 years of React experience, built e-commerce platforms, Redux for state management, and mentored junior developers."

Perfect. Context is clear.

Attempt 3: Too large (5000 tokens)

Entire work history section, with all jobs, all skills, everything mixed together.

Problem: Too much noise. When the AI searches for "React", it gets 5000 tokens to read, but only 2 sentences mention React. It's slow and confusing.

👨🦳 Uncle: The sweet spot is usually 1000-1500 tokens per chunk. Not less, not more.

👦 Nephew: What's overlap? Overlapping chunks?

👨🦳 Uncle: Yes. Imagine this:

Without overlap:

- Chunk 1 ends: "...John used React..."
- Chunk 2 starts: "...for building e-commerce..."

When you read Chunk 2 alone, you don't know John used React for this. Context is broken.

With 200-token overlap:

- Chunk 1: "...John has React experience. He built e-commerce..."
- Chunk 2: "...He built e-commerce with Redux and WebSockets..."

Now each chunk contains enough context. You can read them individually and still understand.

👦 Nephew: So overlap = context continuity?

👨🦳 Uncle: Exactly. It's like a book where each page repeats the last few lines of the previous one. You always have context.

👦 Nephew: Are there different ways to chunk?

👨🦳 Uncle: Yes.
Here are the main ones:

| Strategy | Chunk Size (tokens) | Overlap (tokens) | Best For | Trade-off |
|---|---|---|---|---|
| Naive/Fixed | 512 | None | Simple code | Loses context |
| Sliding Window | 1000-1500 | 200 | Resumes, documents | Balanced |
| Semantic | Variable | Variable | Technical docs | Complex to implement |
| Recursive | 1000 first, then smaller | 200 | Large documents | More processing |

For resumes? Sliding window, 1000 tokens, 200-token overlap. It's what works best. Here's example code:

```python
def sliding_window_chunk(text, window_size=1000, overlap=200):
    """
    Break text into chunks with overlap.

    window_size = tokens per chunk
    overlap = tokens of overlap between chunks
    """
    tokens = text.split()  # Simplified tokenization: whitespace words, not model tokens
    chunks = []
    for i in range(0, len(tokens), window_size - overlap):
        chunk = tokens[i:i + window_size]
        chunks.append(" ".join(chunk))
        # Stop once this window has reached the end of the text
        if i + window_size >= len(tokens):
            break
    return chunks


# Example
resume_text = "John has 5 years React... [long text]"
chunks = sliding_window_chunk(resume_text, 1000, 200)
# chunks[0] = "John has 5 years React... [1000 tokens]"
# chunks[1] = "[last 200 tokens of chunk 0]... [next 800 tokens]"
```

Simple. Effective. This is where many production systems start.

👦 Nephew: Uncle, embeddings are smart. But are they perfect?

👨🦳 Uncle: No. Here's a real problem:

- Resume has: "Kubernetes 1.25"
- You search for: "Kubernetes 1.26"
- Vector search thinks: close match. FALSE MATCH.

The versions are different, but vectors say they're similar.

👦 Nephew: So vectors miss exact matches?

👨🦳 Uncle: Not miss - they're fuzzy. They blur differences. Sometimes that's good (finding "React" when you search "React.js"). Sometimes it's bad (confusing version numbers). This is why you need both: vectors AND keywords.

👦 Nephew: How does keyword search work?

👨🦳 Uncle: Simple. Exact string matching.

- Resume: "Kubernetes 1.26". Search: "1.26". Result: EXACT MATCH. Yes.
- Resume: "Kubernetes 1.25". Search: "1.26". Result: NO MATCH. No.

It's binary. No fuzzy. Just right or wrong.

👦 Nephew: So we use both?
👨🦳 Uncle: Exactly. Here's how. Search "Kubernetes 1.26":

- Step 1: Vector search on all chunks
- Step 2: Keyword filter
- Step 3: Combine and rerank

Result: much higher accuracy. Both semantic understanding AND exact precision. PostgreSQL can do this:

```sql
SELECT chunk_text
FROM resume_chunks
WHERE resume_id = 'john-uuid'
  AND to_tsvector(chunk_text) @@ plainto_tsquery('Kubernetes 1.26')
  AND (embedding <=> question_embedding) < 0.2
ORDER BY embedding <=> question_embedding
LIMIT 5;
```

This query says: keep only chunks that contain the keywords, keep only chunks whose vector distance to the question is under 0.2, and return the closest 5. Both working together.

👦 Nephew: When do I use pure vector?

👨🦳 Uncle: When you're searching for concepts. Vectors are great here. They find related concepts even with different words.

👦 Nephew: And pure keyword?

👨🦳 Uncle: For exact facts, like a version number. Keywords are perfect. You don't need fuzzy matching.

👦 Nephew: And hybrid?

👨🦳 Uncle: For most real scenarios. Maximum accuracy. That's what production uses.

👦 Nephew: Uncle, we retrieve 5 chunks. But are they the best 5?

👨🦳 Uncle: That's a great question. Let me show you a problem. Search: "Docker experience". Vector search returns five chunks, but items 3-5 shouldn't be in the top 5: item 3 is a negative mention, and items 4-5 are weak. Basic vector search ranks by similarity score. But similarity doesn't capture quality.

👦 Nephew: So we need a better ranker?

👨🦳 Uncle: Yes. And here's the trick: we use two models.

👦 Nephew: Two models? That sounds expensive.

👨🦳 Uncle: Exactly. So we use them smartly.

Stage 1: Fast retrieval (milliseconds). Vector search returns 20 candidates. Speed matters here, so we use a fast model.

Stage 2: Accurate ranking (milliseconds). We rerank those 20 with a better, slower model.

Total time: 500ms. Still fast. But now the top 5 are good. The key insight: We don't apply the accurate model to all 500,000 chunks. That's too slow. We apply it only to 20. That's feasible.

👦 Nephew: What models should I use?
👨🦳 Uncle: A few options:

| Reranker | Source | Quality | Cost |
|---|---|---|---|
| BGE-Reranker-Base | Open source | Very Good | Free |
| Cohere Rerank | API | Excellent | $$ |
| Claude/GPT-4 | API | Best | $$$ |

For production? Use BGE. It's free, open-source, and performs almost as well as the expensive options. Here's how:

```python
from sentence_transformers import CrossEncoder

# Load open-source reranker
reranker = CrossEncoder('BAAI/bge-reranker-base')


def rerank(query, top_20_chunks):
    """Rerank the top 20 chunks from vector search and keep the best 5."""
    # Score every (query, chunk) pair with the cross-encoder
    scores = reranker.predict([(query, chunk) for chunk in top_20_chunks])
    # Sort by score, highest first
    ranked = sorted(zip(top_20_chunks, scores), key=lambda x: x[1], reverse=True)
    # Return top 5 as (chunk, score) pairs
    return ranked[:5]
```

That's it. Takes about 300ms. Makes a huge quality difference.

👦 Nephew: Uncle, when someone asks "Does John have frontend skills?", does the AI understand that means React, Vue, JavaScript, HTML?

👨🦳 Uncle: Not automatically. Here's the problem:

- Resume has: "React, JavaScript, HTML, CSS expertise"
- Search for: "frontend skills"

Vector search looks for chunks near the embedding of "frontend skills". But the resume doesn't have those words. It has the components. A human reads "React, JavaScript, HTML, CSS" and immediately thinks "frontend". A computer doesn't reliably make that leap.

👦 Nephew: So how do we fix it?

👨🦳 Uncle: Query rewriting. You expand the query.

👦 Nephew: Expand it how?

👨🦳 Uncle: Instead of searching for "frontend skills", you search for:

```
frontend OR React OR Vue OR Angular OR JavaScript OR HTML OR CSS
```

Now you find any mention of these. Much better. Here's code:

```python
def expand_query(query):
    """
    Expand a query into related terms.
    """
    # Simple approach using a dict
    expansions = {
        'frontend': ['React', 'Vue', 'Angular', 'JavaScript', 'HTML', 'CSS'],
        'backend': ['Node.js', 'Python', 'Java', 'databases', 'APIs'],
        'devops': ['Docker', 'Kubernetes', 'AWS', 'CI/CD', 'deployment'],
    }
    for key, synonyms in expansions.items():
        if key in query.lower():
            return query + " OR " + " OR ".join(synonyms)
    return query


# Example
original = "frontend skills"
expanded = expand_query(original)
# Result: "frontend skills OR React OR Vue OR Angular OR JavaScript OR HTML OR CSS"
```

👦 Nephew: Is there a smarter way?

👨🦳 Uncle: Yes! Instead of manually expanding, let an LLM generate variations. User asks: "Real-time system experience?" LLM thinks: "That means WebSockets, Socket.IO, real-time updates, pub/sub, message queues". The LLM generates multiple queries. You search all 5 queries. You get chunks from all angles. You combine results. Coverage improves a lot.

```python
from anthropic import Anthropic


def generate_queries(original_query):
    """
    Generate multiple search queries from one.
    """
    client = Anthropic()
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""
Given this question, generate 3-5 related search queries that would find relevant information.
Return only the queries, one per line.

Original question: {original_query}

Queries:
"""
        }]
    )
    queries = response.content[0].text.strip().split('\n')
    return [q.strip() for q in queries if q.strip()]


# Example
original = "Real-time system experience?"
queries = generate_queries(original)
# Might return: ["WebSockets real-time", "Socket.IO", "Redis pub/sub", "Message queues", "Real-time streaming"]

# Now search each (vector_search is your own retrieval function)
all_results = []
for q in queries:
    results = vector_search(q)
    all_results.extend(results)

# Deduplicate by chunk id and return top 5
unique_results = list({r['id']: r for r in all_results}.values())[:5]
```

This is elegant. The LLM understands what "real-time" means and generates good search queries.

👦 Nephew: Uncle, I built a RAG system. It works.
But how do I know if it's good?

👨🦳 Uncle: This is where most teams fail. They build, ship, and hope. No metrics. But you can measure. Here are the important metrics:

👦 Nephew: What should I measure?

👨🦳 Uncle: Four main things:

| Metric | Measures | Target | Risk if Missed |
|---|---|---|---|
| Recall | Of all relevant docs, how many were found? | 90% | Missing information |
| Precision | Of what was found, how much is relevant? | 85% | Wrong context |
| Latency | Response time | <2 seconds | User experience |
| Faithfulness | AI stays factual | 95% | Hallucinations |

Let me explain each.

👦 Nephew: What's the difference between recall and precision?

👨🦳 Uncle: Let me give you an example. Say there are 10 Docker-related chunks in John's resume.

Recall: Of those 10, how many did you find? High recall means: "I found most of the relevant information".

Precision: Of what you returned, how many were relevant? High precision means: "Most of what I returned was useful".

Usually there's a tradeoff: return more chunks and recall goes up while precision goes down; return fewer and the opposite happens. You want both high.

👦 Nephew: How do I actually measure this?

👨🦳 Uncle: You need labeled data. A test set. Here's code:

```python
def calculate_recall_precision(retrieved, relevant):
    """
    retrieved: chunks your system found
    relevant: chunks a human labeled as relevant
    """
    retrieved_ids = set(r['id'] for r in retrieved)
    relevant_ids = set(r['id'] for r in relevant)

    # Recall: of all relevant, how many found?
    if len(relevant_ids) == 0:
        recall = 1.0
    else:
        recall = len(retrieved_ids & relevant_ids) / len(relevant_ids)

    # Precision: of what found, how many relevant?
    if len(retrieved_ids) == 0:
        precision = 0.0
    else:
        precision = len(retrieved_ids & relevant_ids) / len(retrieved_ids)

    return recall, precision


# Example
retrieved = [
    {'id': '1', 'text': 'Docker experience'},
    {'id': '2', 'text': 'Kubernetes'},
    {'id': '3', 'text': 'Machine learning'},
]
relevant = [
    {'id': '1', 'text': 'Docker experience'},
    {'id': '2', 'text': 'Kubernetes'},
    {'id': '4', 'text': 'Container orchestration'},
]
recall, precision = calculate_recall_precision(retrieved, relevant)
# Recall = 2/3 = 67% (found 2 of 3 relevant)
# Precision = 2/3 = 67% (2 of 3 returned were relevant)
```

👦 Nephew: How do I measure speed?

👨🦳 Uncle: Simple:

```python
import time

start = time.time()
results = rag_system.query("Does John have Docker experience?")
latency = (time.time() - start) * 1000  # Convert to ms
print(f"Latency: {latency:.0f}ms")
```

Target: <2000ms (2 seconds). Anything faster is a bonus.

👦 Nephew: What about cost?

👨🦳 Uncle: Important for production:

```
Cost per query
  Embedding:  $0.00001 per 1K tokens
  Reranking:  $0.0001 per query (if using a paid service)
  LLM answer: $0.001 per query (if using Claude)
  Total:      ~$0.0011 per query

If you do 1M queries/month:
  monthly cost = 1,000,000 × $0.0011 ≈ $1,100/month
```

This is what you need to know for business decisions.

👦 Nephew: Uncle, what's a hallucination?

👨🦳 Uncle: Here's a real example:

- Resume says: "Worked with React for 2 years"
- Hiring manager asks: "Experience with Kubernetes?"
- AI responds: "Yes, John has Kubernetes experience with orchestration." WRONG.

The resume never mentions Kubernetes. The AI invented an answer. It saw "Docker" or "containers" (maybe), thought "Kubernetes is related to containers", and hallucinated. This is a hallucination. The AI is wrong, but confidently.

👦 Nephew: But why does this happen?

👨🦳 Uncle: AIs are pattern-matching machines. They learned during training: "Docker" often appears near "Kubernetes". So the AI's brain has connected them.
Even if you show the AI only a resume mentioning Docker, its training whispers: "And Kubernetes is usually near Docker." So the AI thinks: "Probably Kubernetes too." It's an artifact of how AIs learn. Not a bug. A feature used wrongly.

👦 Nephew: How do you stop it?

👨🦳 Uncle: Five layers of protection:

Layer 1: Retrieval Boundaries. Instruct the AI to answer from the retrieved chunks ONLY. Nothing from training. In your prompt:

```
Based ONLY on the following information:

[Retrieved chunks]

Answer: Does John have Kubernetes experience?
```

The word "ONLY" is critical.

Layer 2: Structured Output. Force JSON format. Constrain the answer shape.

```python
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,  # required by the Messages API
    messages=[...],
    system="""
You MUST respond in JSON format:
{
  "answer": "yes" or "no",
  "confidence": 0.0 to 1.0,
  "evidence": ["chunk 1", "chunk 2"]
}
""",
)
```

This pushes the AI to cite evidence or it can't answer.

Layer 3: Validation. Check the output against the original chunks:

```python
import json


def validate_answer(response, chunks):
    answer = json.loads(response.content[0].text)
    # Check: does each piece of evidence exist in the retrieved chunks?
    for evidence in answer['evidence']:
        if evidence not in chunks:
            # Hallucination detected
            answer['confidence'] = 0  # Reject
    return answer
```

Layer 4: Confidence Gating. Low confidence? Escalate to human:

```python
def gate_by_confidence(answer, threshold=0.7):
    if answer['confidence'] < threshold:
        # Hand off to human
        return {
            'answer': 'Unknown - escalated to human',
            'reason': 'Low confidence'
        }
    return answer
```

Layer 5: Required Citations. Every claim must point to a chunk:

```
Question: Does John know Kubernetes?

Retrieved chunks:
[1] "Docker and container experience"
[2] "Kubernetes cluster management"

Answer: Yes, John knows Kubernetes (from chunk [2]).
```

The AI must cite. No citation? No answer.

👦 Nephew: Do you use all five?

👨🦳 Uncle: In production? All five. At different stages. Retrieval boundaries + structured output stop most hallucinations. Validation + confidence gating catch most of the rest. Citations let users verify. Together? Hallucinations become rare.

👦 Nephew: Uncle, my RAG system works on my laptop.
But production is different?

👨🦳 Uncle: Completely different. The prototype version and the production version are not the same system.

👦 Nephew: How do I design it?

👨🦳 Uncle: Here's a proven architecture:

```
               Users
                 ↓
        Load Balancer (Nginx)
          /      |      \
     API-1    API-2    API-3     (scale horizontally)
          \      |      /
                 ↓
        Cache Layer (Redis)
                 ↓
   PostgreSQL + pgvector (+ pgvector index)
                 ↓
     External APIs (Claude, etc.)
```

Load balancer, multiple APIs, cache, database, external APIs. Each layer has a job. Each layer is replaceable.

👦 Nephew: Wait, what if I'm hosting this for multiple companies?

👨🦳 Uncle: Then you have a SERIOUS security issue if you don't handle isolation. Company A's data must NEVER be visible to Company B. Every single query must include:

```sql
SELECT *
FROM resume_chunks
WHERE resume_id = 'john-123'
  AND tenant_id = 'company-a';  -- THIS IS CRITICAL
```

Without the `tenant_id` check, a hacker or bug could leak data. This is not optional. This is "you'll be sued" level of important.

👦 Nephew: Uncle, I've heard of ATS. What is it?

👨🦳 Uncle: An ATS is an applicant tracking system. Old ATS systems just searched for keywords. "Does the resume have the word React?" Yes/No. Modern ATS - the kind you're building with RAG - is intelligent. It understands what the resume means, not just which words it contains. Resume gets a score. Top scorers get interviewed. Bad scorers get rejected. The difference between old and new? Old ATS rejects good candidates because they phrased things differently. New ATS finds them anyway.

👦 Nephew: How do you score a resume?

👨🦳 Uncle: With components. Job requirements × candidate abilities. Take what the job description requires and what the resume has. Here's the scoring:

```python
def score_resume(candidate, job_requirements):
    """
    Score a resume against job requirements.
    """
    # Required skills: 60% weight
    required_score = 0
    required_count = 0
    for skill, years_required in job_requirements['required'].items():
        required_count += 1
        candidate_years = candidate.get(skill, 0)
        if candidate_years >= years_required:
            required_score += 100  # Full points
        elif candidate_years > 0:
            # Partial credit
            required_score += (candidate_years / years_required) * 100
        # else: 0 points
    required_avg = required_score / required_count if required_count > 0 else 0

    # Preferred skills: 25% weight
    preferred_score = 0
    preferred_count = 0
    for skill in job_requirements['preferred']:
        preferred_count += 1
        if skill in candidate:
            preferred_score += 50  # Half credit for preferred
    preferred_avg = preferred_score / preferred_count if preferred_count > 0 else 0

    # Projects: 15% weight
    project_score = len(candidate.get('projects', [])) * 25
    project_avg = min(project_score, 100)  # Cap at 100

    # Final score
    final_score = (required_avg * 0.60) + (preferred_avg * 0.25) + (project_avg * 0.15)
    return final_score


# Example
candidate = {
    'React': 6,
    'JavaScript': 7,
    'Node.js': 4,
    'projects': ['E-commerce app', 'Real-time chat']
}
requirements = {
    'required': {'React': 5, 'JavaScript': 3},
    'preferred': ['Node.js', 'AWS']
}
score = score_resume(candidate, requirements)
# React: 6 >= 5 = 100 pts
# JavaScript: 7 >= 3 = 100 pts
# Required avg: 100
# Node.js: has it = 50 pts; AWS: missing = 0 pts
# Preferred avg: 25
# Projects: 2 × 25 = 50 pts
# Final: (100 × 0.60) + (25 × 0.25) + (50 × 0.15) = 60 + 6.25 + 7.5 = 73.75/100
```

That's scoring. But wait - how do you extract the data?

👦 Nephew: Resume might say "React" but another might say "React.js". How do you normalize?

👨🦳 Uncle: Build a dictionary.
```python
skill_aliases = {
    'React': ['React', 'React.js', 'ReactJS', 'react'],
    'Node.js': ['Node.js', 'node.js', 'nodejs', 'Node'],
    'Docker': ['Docker', 'docker', 'containers'],
}


def normalize_skill(mentioned_skill):
    """Find the canonical skill name."""
    mentioned_lower = mentioned_skill.lower()
    for canonical, aliases in skill_aliases.items():
        if mentioned_lower in [a.lower() for a in aliases]:
            return canonical
    return mentioned_skill  # Unknown skill


# Example
print(normalize_skill("React.js"))  # Output: React
print(normalize_skill("nodejs"))    # Output: Node.js
```

Now "React" and "React.js" are treated as the same skill.

👦 Nephew: But what if the resume mentions a skill not in your dictionary?

👨🦳 Uncle: Use embeddings to find similar skills.

- Resume mentions: "Built trading platform with real-time updates"
- Job requires: "WebSockets experience"

These aren't in your dictionary. But you can check similarity:

```python
def find_skill_match(mentioned_skill, required_skill):
    """
    Use embeddings to find matches for unknown skills.
    get_embedding and cosine_distance are your own helper functions.
    """
    # Get embeddings
    mentioned_embedding = get_embedding(mentioned_skill)
    required_embedding = get_embedding(required_skill)

    # Cosine distance is small for similar texts, so convert it to a similarity
    similarity = 1 - cosine_distance(mentioned_embedding, required_embedding)

    # If similar enough, count as a match
    return similarity > 0.75  # Threshold


# Example
if find_skill_match("real-time data updates", "WebSockets"):
    # Candidate probably has WebSockets experience
    score += 100
```

This is powerful. It catches variations and similar technologies.

👦 Nephew: Uncle, we have RAG working. What comes next?

👨🦳 Uncle: Basic RAG answers questions. Advanced RAG agents decide what to do. Difference:

- Basic RAG: "Does John know React?" → Find resume → Answer
- Agent RAG: "Is John qualified for Senior Developer role?" → Look up the job's requirements → Search the resume for each one → Compare → Recommend

The agent is intelligent about what to search.

👦 Nephew: How do I build an agent?
👨🦳 Uncle: Using Claude's tool use (function calling):

```python
import anthropic

client = anthropic.Anthropic()


def search_resume(resume_id, query):
    """Search candidate's resume."""
    # Your RAG search logic
    return rag_system.query(resume_id, query)


def search_job_description(job_id, query):
    """Search job description."""
    return job_db.query(job_id, query)


def analyze_candidate(resume_id, job_id):
    """
    Use Claude as an agent to analyze candidate fit.
    Claude decides what to search, what to compare.
    """
    tools = [
        {
            "name": "search_resume",
            "description": "Search a candidate's resume for information",
            "input_schema": {
                "type": "object",
                "properties": {
                    "resume_id": {"type": "string"},
                    "query": {"type": "string"}
                },
                "required": ["resume_id", "query"]
            }
        },
        {
            "name": "search_job",
            "description": "Search job requirements",
            "input_schema": {
                "type": "object",
                "properties": {
                    "job_id": {"type": "string"},
                    "query": {"type": "string"}
                },
                "required": ["job_id", "query"]
            }
        }
    ]

    messages = [{
        "role": "user",
        "content": f"""
Analyze if candidate {resume_id} is qualified for job {job_id}.

Use the available tools to:
1. Find required skills in the job description
2. Check if the candidate has those skills
3. Compare years of experience
4. Look for relevant projects
5. Make a recommendation

Provide detailed reasoning.
"""
    }]

    # Agent loop
    while True:
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1000,
            tools=tools,
            messages=messages
        )

        # Check if Claude wants to use tools
        if response.stop_reason == "tool_use":
            # Process tool calls
            tool_results = []
            for content in response.content:
                if content.type == "tool_use":
                    if content.name == "search_resume":
                        result = search_resume(
                            content.input["resume_id"],
                            content.input["query"]
                        )
                    else:  # search_job
                        result = search_job_description(
                            content.input["job_id"],
                            content.input["query"]
                        )
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": content.id,
                        "content": str(result)
                    })

            # Add Claude's response and tool results to messages
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
        else:
            # Claude is done - extract final answer
            final_answer = ""
            for content in response.content:
                if hasattr(content, 'text'):
                    final_answer = content.text
            return final_answer


# Use it
result = analyze_candidate("john-resume", "senior-dev-job")
print(result)
```

This is elegant. Claude is the brain. It decides what to search, in what order, and how to reason about it.

👦 Nephew: What's next? After agents?

👨🦳 Uncle: A few directions:

1. Adaptive RAG. System learns what works. If a question fails, it tries different search strategies automatically.
2. Multi-modal RAG. Right now we handle text. Soon: images, tables, videos. A resume with a photo of their projects. A video walkthrough.
3. Real-time RAG. System continuously updates knowledge. Candidate updates LinkedIn → RAG system knows instantly.
4. Collaborative RAG. Multiple agents reasoning together. One searches, one evaluates, one questions.
5. Explainable RAG. System shows its work. "I rejected John because he doesn't have 5 years React, only 3 years, and React 5+ was required."

These are coming. RAG is only beginning.

👨🦳 Uncle: Let me summarize what you've learned:

👦 Nephew: That's a lot, uncle. But I think I understand.
RAG isn't magic - it's just engineering.

👨🦳 Uncle: Exactly. RAG is:

Every layer serves one purpose: make AI useful for real people.

👦 Nephew: And production?

👨🦳 Uncle: Remember: Reliable for 10 users > Perfect for nobody. Start simple. Build on PostgreSQL. Measure everything. Iterate fast. You don't need fancy systems. You need good engineering.

👦 Nephew: Uncle, one last question.

👨🦳 Uncle: Go ahead.

👦 Nephew: How do I actually start? Like, code?

👨🦳 Uncle: That's next lesson, beta. First, you understand the system. Then you build it. But I'll give you one gift, a simple starter stack:

- Frontend: Next.js + React
- Backend: Node.js + Express
- Database: PostgreSQL + pgvector
- Embedding: a dedicated embedding model such as Voyage AI (the Claude API has no embeddings endpoint)
- LLM: Claude API messages endpoint
- Hosting: AWS Lightsail or Render

Simple. Proven. Scales. Now go build something amazing.

```python
import voyageai

# The Claude API has no embeddings endpoint, and prompting a chat model to
# "return embeddings" only produces made-up numbers. This sketch swaps in
# Voyage AI, the provider Anthropic's docs point to; any dedicated embedding
# model would do. The client reads VOYAGE_API_KEY from the environment.
voyage = voyageai.Client()

def create_embeddings(texts):
    """Get one embedding vector per input text."""
    result = voyage.embed(texts, model="voyage-3", input_type="document")
    return result.embeddings
```

```sql
-- query_embedding stands for the query vector, passed in as a parameter
SELECT
    chunk_text,
    embedding <=> query_embedding AS vector_distance,
    ts_rank(to_tsvector(chunk_text), plainto_tsquery('search term')) AS keyword_rank
FROM resume_chunks
WHERE resume_id = 'candidate-uuid'
  AND to_tsvector(chunk_text) @@ plainto_tsquery('search term')
ORDER BY vector_distance ASC, keyword_rank DESC
LIMIT 5;
```

```python
def score_candidate(candidate_skills, required_skills, weight=0.6):
    """Simple scoring: % of required skills matched."""
    # weight is unused in this simple version
    required = set(required_skills)
    if not required:
        return 0.0  # Nothing required, nothing to score
    # Sets keep duplicate skills from pushing the score past 100
    matched = len(set(candidate_skills) & required)
    score = matched / len(required) * 100
    return score
```

```python
import json
from anthropic import Anthropic

client = Anthropic()

def safe_answer(query, chunks, confidence_threshold=0.7):
    """Answer only if confident."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        system="""
You MUST:
1. Answer ONLY using the provided chunks
2. Cite evidence
3. Return JSON with answer, confidence, evidence
""",
        messages=[
            {
                "role": "user",
                "content": f"""
Chunks: {chunks}
Question: {query}
""",
            }
        ],
    )

    # Treat unparseable output the same as low confidence
    try:
        answer = json.loads(response.content[0].text)
    except json.JSONDecodeError:
        return "Uncertain - escalated to human"

    if answer.get('confidence', 0) < confidence_threshold:
        return "Uncertain - escalated to human"
    return answer['answer']
```

RAG isn't magic. It's engineering.

You've learned the foundations. Now use them. Build something reliable. Build something honest. Build something useful. You've got this.

Created for developers who want to understand how RAG actually works, not just use it as a black box.

Remember: Less noise, more action. Go build. Good luck. 🚀
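One loose end: the skill-matching snippet earlier calls `cosine_distance` without showing it here. Below is a minimal sketch of what such a helper could look like, assuming embeddings are plain Python lists of floats with equal length. The function name matches that call, but the implementation and the toy vectors are assumptions for illustration, not necessarily what a production pipeline uses (pgvector or NumPy would normally do this for you).

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity.

    0.0 means the vectors point the same way, 1.0 means they are
    orthogonal (unrelated), 2.0 means they point in opposite directions.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for real embeddings (made-up numbers).
websockets = [0.9, 0.1, 0.0]
realtime_updates = [0.8, 0.3, 0.1]
accounting = [0.0, 0.2, 0.9]

# The related pair should end up closer than the unrelated pair.
closer = cosine_distance(websockets, realtime_updates) < cosine_distance(websockets, accounting)
print(closer)  # True
```

Because this is a distance, smaller means more similar. A "similar enough" check therefore compares `1 - distance` against the threshold, or the distance against `1 - threshold`.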