RAG Pipeline: The Uncle-Nephew Complete Learning Guide


How to Build Systems That Actually Know Your Data (Not Hallucinate About It)

👦 Nephew: Uncle, I keep hearing "RAG this, RAG that" in tech interviews. When I ask what it means, people throw around words like "Retrieval-Augmented Generation" and I just nod like I understand. But honestly? I'm lost.

👨🦳 Uncle: (laughing) That's the best honest question I've heard all week. Let me ask you something first. If I gave you a question right now - "What year did India win the World Cup?" - how would you answer?

👦 Nephew: Well... I'd pull up Google, search for it, read the answer, then tell you.

👨🦳 Uncle: Exactly. You don't answer from memory alone. You go fetch the information first, then answer based on what you found. That's RAG in real life. And that simple idea - fetch first, answer after - fixes almost every problem we face with AI today.

👦 Nephew: But uncle, AI can remember things from its training. Why does it need to fetch?

👨🦳 Uncle: Ah! That's where we land in trouble. Come, sit...

👨🦳 Uncle: Imagine you're hiring for a tech company. You receive 500 resumes for a Senior React Developer role. Now tell me - how would you actually process them?

👦 Nephew: I'd... probably make a spreadsheet? List all the candidates with key skills?

👨🦳 Uncle: Right. But here's the catch - you can't read all 500 resumes deeply. So what do you really do?

👦 Nephew: Skim for keywords like "React", "JavaScript", "5 years"?

👨🦳 Uncle: Exactly. You skim and hope you don't miss anyone good. Now, here's the problem: what if a candidate wrote "React.js" instead of "React"? Your eyes might still catch it. But a dumb computer doing exact string matching? It says "no match".

What if someone wrote "Built real-time user interfaces with the React framework"? The candidate clearly knows React, but the word "React" appears nowhere in that sentence. The computer misses them.

This is exactly what happens with AI. When you ask an AI a question, it tries to answer purely from what it memorized during training. And it makes mistakes - sometimes big ones.

👦 Nephew: So RAG stops these mistakes?

👨🦳 Uncle: Precisely. RAG is simple: instead of asking the AI to guess, you hand it the document first. You say, "Here's the resume, here's the job description - now answer my question."

The AI stops guessing. It reads the evidence. It answers correctly.

👦 Nephew: Okay, so RAG = give the AI documents, then ask questions?

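👨🦳 Uncle: Yes, but we need to be precise about how we give it documents. RAG has three steps:

Retrieve - find the documents (or chunks) relevant to the question.

Augment - add what you found to the prompt, alongside the question.

Generate - let the AI write the answer from that prompt.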

That's it. R-A-G.
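To make the retrieve, augment, generate shape concrete, here is a minimal sketch. It is a toy under stated assumptions: the retriever just counts shared words (real systems use embeddings), and `echo_llm` is a hypothetical stand-in for a real model call.

```python
def retrieve(question, documents, top_k=2):
    """Rank documents by how many question words they share (toy retriever)."""
    q_words = set(question.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]


def augment(question, retrieved):
    """Build a prompt that places the retrieved text before the question."""
    context = "\n".join(f"- {doc}" for doc in retrieved)
    return f"Based on the following excerpts:\n{context}\n\nQuestion: {question}"


def generate(prompt, llm):
    """Hand the augmented prompt to whatever model callable you use."""
    return llm(prompt)


documents = [
    "John built container deployments with Docker",
    "Priya wrote Ethereum smart contracts",
    "John mentored junior developers",
]
question = "Does John use Docker"
retrieved = retrieve(question, documents)
prompt = augment(question, retrieved)

# echo_llm is a placeholder: it only reports the prompt size, it answers nothing.
echo_llm = lambda p: f"(model would answer from {len(p)} characters of prompt)"
answer = generate(prompt, echo_llm)
print(prompt)
```

Everything that follows is about making the `retrieve` step good enough to trust.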

👦 Nephew: But if it's that simple, why is everyone talking about it like it's complicated?

👨🦳 Uncle: Because "finding the right documents" is the hard part. You have 500 resumes. When someone asks "Does John have Docker experience?", you can't search all 500 linearly. That's slow.

And you can't use simple keyword search either - what if John wrote "containerization" instead of "Docker"? What if he wrote "I work with Kubernetes" - which means he knows Docker too?

So the real question becomes: How do you find the right documents fast, and how do you understand when two different words mean the same thing?

This is where everything else - embeddings, vector databases, chunking - comes in. They're all in service of solving that one problem.

👦 Nephew: Okay, I think I get the big picture. But uncle, why can't we just make the AI smarter?

👨🦳 Uncle: Two reasons. First - you train an AI once. After that, it doesn't learn new information. If your company has 100 internal policies created last month, the AI knows nothing about them. It can't learn them instantly.

Second - even if you could retrain it, hallucinations would still happen. AIs are pattern-matching machines. They're brilliant at patterns. But they sometimes see patterns that aren't there. RAG forces the AI to cite its sources, to point at evidence.

It's the difference between:

The second one is what you want. That's RAG.

👦 Nephew: Uncle, I have a question. How does a computer know that "React" and "React.js" are the same thing?

👨🦳 Uncle: That's the right question. Let me explain with a story.

I go to a fruit market. I see apples, mangoes, oranges. How do I know which is which? I look at them - color, shape, smell. My brain recognizes patterns and says "that's a mango".

Now, how does a computer do that? It doesn't have eyes. All it has are numbers.

👦 Nephew: Numbers?

👨🦳 Uncle: Yes. Here's the trick: every word - "React", "React.js", "JavaScript", "Python" - is just a number to a computer. Or actually, a list of numbers. We call this list an "embedding".

👦 Nephew: Like a list of... what kind of numbers?

👨🦳 Uncle: Imagine I'm describing a person to you. I might say:

That's 4 numbers describing 1 person. Now imagine I use 1536 numbers instead. Each number describes a different quality - not just physical things, but hidden things like "how technical is this word", "is this related to web development", "how often is this used", and so on.

So "React" becomes a point in a 1536-dimensional space. "React.js" becomes another point in that same space. And because they mean the same thing, those two points are very close to each other.

But "React" and "Python"? Those points are far apart.

👦 Nephew: Okay, so closeness = similarity?

👨🦳 Uncle: Exactly. And here's the beautiful part: you don't have to manually create these embeddings. An AI model does it for you. You feed it the word "React" and it says, "This word should be represented as [0.12, -0.45, 0.78, ..., 1536 numbers]".

Different models might represent it differently, but the same model always represents similar concepts closely. That's what matters.
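Here is a small sketch of "closeness" using cosine similarity. The three-number vectors below are invented for illustration; a real embedding model would produce the 1536 numbers for you.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-number embeddings, made up for this example only.
toy_embeddings = {
    "React":    [0.90, 0.80, 0.10],
    "React.js": [0.88, 0.82, 0.12],
    "Python":   [0.10, 0.30, 0.95],
}

react_vs_reactjs = cosine_similarity(toy_embeddings["React"], toy_embeddings["React.js"])
react_vs_python = cosine_similarity(toy_embeddings["React"], toy_embeddings["Python"])

print(f"React vs React.js: {react_vs_reactjs:.3f}")
print(f"React vs Python:   {react_vs_python:.3f}")
```

With these made-up numbers, the first pair scores very close to 1 and the second pair scores much lower, which is the whole idea: near in space, near in meaning.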

👦 Nephew: So embeddings let computers understand meaning?

👨🦳 Uncle: They let computers represent meaning as a position in space. And when you represent things as positions, you can do math with them.

For example, if I tell you: "Engineer" - "Java" + "React" = ?

👦 Nephew: That's... weird math?

👨🦳 Uncle: Right? But with embeddings, you can actually do this. And you know what the answer is?

The vector closest to that result is usually "web developer" or something similar. The math captures meaning.

Now here's why this matters for RAG:

When someone asks "Does John have Docker experience?", we convert that question to an embedding - a point in 1536D space. Then we search for resume chunks that are close to that point. The chunks mentioning Docker - whether they say "Docker", "containerization", "container orchestration", or "I deploy with Docker" - are all close to the question's embedding.

So we find them all. And the AI reads them. And the AI answers correctly.

Without embeddings, simple keyword search would miss half of them.

👦 Nephew: Okay, so we have embeddings. But uncle, if I have a company with 10,000 resumes, and each resume has 50 chunks, that's 500,000 embeddings. Each embedding is 1536 numbers. How do I store and search that?

👨🦳 Uncle: This is where most people choose poorly. They think: "I'll use Pinecone! I'll use Milvus! I'll use Weaviate!"

They add three systems. One for storage, one for search, one for cache. Now they have 3 things that could break. 3 things to debug. 3 bills to pay.

There's a better way.

👦 Nephew: What?

👨🦳 Uncle: PostgreSQL.

👦 Nephew: The database? Just... PostgreSQL?

👨🦳 Uncle: With one addition: the pgvector extension. PostgreSQL is already reliable, it's already running your data, it's already scaling. We just add vector support.

Think about it this way: 500,000 vectors stored in PostgreSQL as data. When you have a question, you convert it to an embedding, and you say:

SELECT * FROM resume_chunks 
ORDER BY embedding <=> question_embedding 
LIMIT 5

The <=> operator is pgvector's cosine distance, so ordering by it with LIMIT 5 means "find the 5 vectors closest to my question vector". PostgreSQL finds them in milliseconds. You're done.

No new infrastructure. No vendor lock-in. Just one database doing one job well.

👦 Nephew: But doesn't the <=> operator need an index to be fast?

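👨🦳 Uncle: Good catch. Yes. pgvector does not build an index for you: without one, it compares the question against every row, which is exact but slow. You create the index explicitly, and IVFFlat is a common choice (HNSW is the other index type pgvector offers). Think of it like this: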

When you have 500,000 resumes and need to find the 5 closest to a question, searching all 500,000 linearly is slow. The index divides the 1536D space into regions (like neighborhoods). When you search, it says "The question's embedding is closest to this neighborhood" and only searches that neighborhood. Much faster.

A linear search of 500K vectors? 5 seconds. With the index? 50-100ms.

That's the magic of the right index.
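The neighborhood idea can be sketched in a few lines. This is a toy illustration of inverted-file search, not pgvector's actual implementation: the 2-D points and the two centroids are hand-picked, whereas a real index learns its centroids by clustering the data.

```python
import math


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_lists(vectors, centroids):
    """Assign every vector to its nearest centroid (its 'neighborhood')."""
    lists = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: dist(vec, centroids[i]))
        lists[nearest].append(vec_id)
    return lists


def ivf_search(query, vectors, centroids, lists, top_k=2, probes=1):
    """Scan only the `probes` neighborhoods closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: dist(query, centroids[i]))
    candidates = [vid for i in order[:probes] for vid in lists[i]]
    return sorted(candidates, key=lambda vid: dist(query, vectors[vid]))[:top_k]


# Toy data: two obvious clusters.
vectors = {"a": [0.1, 0.2], "b": [0.2, 0.1], "c": [0.9, 0.8], "d": [0.8, 0.9]}
centroids = [[0.15, 0.15], [0.85, 0.85]]

lists = build_lists(vectors, centroids)
result = ivf_search([0.12, 0.18], vectors, centroids, lists)
print(lists, result)
```

The price of the speed-up is that the search is approximate: a true neighbor sitting in a neighborhood you did not scan is missed. Scanning more neighborhoods (the `probes` argument here; pgvector has a matching `ivfflat.probes` setting) trades speed back for recall.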

👦 Nephew: How do I actually create this?

👨🦳 Uncle: Simple. PostgreSQL + pgvector. Here's what you do:

-- Create extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create table with vector column
CREATE TABLE resume_chunks (
  id SERIAL PRIMARY KEY,
  resume_id UUID,
  chunk_text TEXT,
  embedding vector(1536),
  created_at TIMESTAMP DEFAULT NOW()
);

-- Create index for fast search
CREATE INDEX ON resume_chunks USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

Now you have a table. You insert chunks with their embeddings. You query with the <=> operator. Done.

👦 Nephew: That's it?

👨🦳 Uncle: That's it. No API calls to external services. No paying per query. Just a single, reliable database.

This is the beauty of choosing the right tool.
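From application code, the same query might be prepared like this. It is a sketch under assumptions: it only builds the SQL and parameters (in the `%s` placeholder style that psycopg uses) and leaves the database connection to you. The helper names are hypothetical; the `[x,y,z]` text form is how pgvector reads a vector literal.

```python
def to_pgvector_literal(embedding):
    """Format a list of numbers as pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"


# Parameterized SQL: values are passed separately, never pasted into the string.
TOP_K_SQL = """
    SELECT id, chunk_text
    FROM resume_chunks
    ORDER BY embedding <=> %s::vector
    LIMIT %s
"""


def build_top_k_query(question_embedding, k=5):
    """Return (sql, params) ready to hand to a database cursor's execute()."""
    return TOP_K_SQL, (to_pgvector_literal(question_embedding), k)


sql, params = build_top_k_query([0.12, -0.45, 0.78], k=5)
print(params)
```

Keeping the vector as a bound parameter, rather than formatting it into the query text, avoids both quoting bugs and SQL injection.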

👦 Nephew: Uncle, we have embeddings, we have a database. How does actual retrieval happen?

👨🦳 Uncle: Let me walk you through a real scenario. A hiring manager asks: "Does Priya have blockchain experience?"

Step 1: Understand the question

We need to find chunks about blockchain skills.

Step 2: Convert question to embedding

"blockchain experience" becomes a vector: [0.34, -0.12, ..., 1536 numbers]

Step 3: Search the database

We query:

SELECT chunk_text,
       1 - (embedding <=> question_embedding) AS similarity
FROM resume_chunks
WHERE resume_id = 'priya-uuid'
ORDER BY embedding <=> question_embedding
LIMIT 5

Step 4: Get results (might be):

Step 5: Send to AI

We put these 5 chunks in the prompt:

Based on the following resume excerpts:
[chunk 1]
[chunk 2]
[chunk 3]
[chunk 4]
[chunk 5]

Question: Does Priya have blockchain experience?

Step 6: AI answers

"Yes, Priya has strong blockchain experience. She has worked with Ethereum smart contracts, understands Bitcoin protocol, and has built on Web3/DeFi platforms using Solidity."

With evidence.

👦 Nephew: This seems perfect. What's the problem?

👨🦳 Uncle: The problem is subtle. Resume has this line:

"I worked on distributed ledger technology and consensus protocols"

This person knows blockchain. But the word "blockchain" is not there.

You search for "blockchain". Embeddings are smart, but not magic. "Distributed ledger technology" and "blockchain" are close but not identical.

Sometimes the vector search returns this chunk. Sometimes it doesn't.

Basic retrieval works maybe 70% of the time. For production, you need 95%+.

That's why we need advanced techniques.

👦 Nephew: So what makes it better?

👨🦳 Uncle: Multiple techniques working together. But we'll get there. First, let's talk about preparation.

👦 Nephew: Uncle, let me ask something. Why do we break resumes into chunks at all? Why not just embed the whole resume?

👨🦳 Uncle: Good question. Two reasons:

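Cost - Embedding and LLM APIs usually charge per token. Chunking does not make the embedding step itself cheaper, but it does mean each question sends only a few relevant chunks to the LLM instead of the whole 50-page resume, and that is where the savings add up.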

Precision - If you embed the entire resume as one chunk, and you search for "Docker experience", the system returns the whole 50-page resume. The AI has to find the Docker mention itself. But if you break it into chunks, the system returns only the chunks about Docker. The AI reads less noise.

But here's the problem: how small should chunks be?

👦 Nephew: Can't I just pick a number? Like 500 tokens per chunk?

👨🦳 Uncle: Let's see what happens with different sizes.

Attempt 1: Too small (100 tokens)

Chunk 1: "John has"

Chunk 2: "5 years of"

Chunk 3: "React experience"

Problem: When we read Chunk 3, we've lost context. 5 years of what? The AI is confused.

Attempt 2: Right size (1000 tokens)

"John has 5 years of React experience, built e-commerce platforms, Redux for state management, and mentored junior developers."

Perfect. Context is clear.

Attempt 3: Too large (5000 tokens)

Entire work history section, with all jobs, all skills, everything mixed together.

Problem: Too much noise. When the AI searches for "React", it gets 5000 tokens to read, but only 2 sentences mention React. It's slow and confusing.

👨🦳 Uncle: The sweet spot is usually 1000-1500 tokens per chunk. Not less, not more.

👦 Nephew: What's overlap? Overlapping chunks?

👨🦳 Uncle: Yes. Imagine this:

Without overlap:

Chunk 1 ends: "...John used React..."

Chunk 2 starts: "...for building e-commerce..."

When you read Chunk 2 alone, you don't know John used React for this. Context is broken.

With 200-token overlap:

Chunk 1: "...John has React experience. He built e-commerce..."

Chunk 2: "...He built e-commerce with Redux and WebSockets..."

Now each chunk contains enough context. You can read them individually and still understand.

👦 Nephew: So overlap = context continuity?

👨🦳 Uncle: Exactly. It's like reading a book. Each page overlaps the previous one slightly. You always have context.

👦 Nephew: Are there different ways to chunk?

👨🦳 Uncle: Yes. Here are the main ones:

Strategy	Chunk Size	Overlap	Best For	Trade-off
Naive/Fixed	512 tokens	None	Simple code	Loses context
Sliding Window	1000-1500	200	Resumes, documents	Balanced
Semantic	Variable	Variable	Technical docs	Complex to implement
Recursive	1000 first, then smaller	200	Large documents	More processing

For resumes? Sliding window, 1000 tokens, 200-token overlap. It's what works best.

Here's example code:

def sliding_window_chunk(text, window_size=1000, overlap=200):
    """
    Break text into chunks with overlap.
    window_size = tokens per chunk
    overlap = tokens of overlap between chunks
    """
    tokens = text.split()  # Simplified tokenization
    chunks = []

    for i in range(0, len(tokens), window_size - overlap):
        chunk = tokens[i:i + window_size]
        chunks.append(" ".join(chunk))

        if i + window_size >= len(tokens):
            break

    return chunks

resume_text = "John has 5 years React... [long text]"
chunks = sliding_window_chunk(resume_text, 1000, 200)

Simple. Effective. This is what production systems use.
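For comparison, the "Recursive" row of the table could look roughly like this. It is a simplified sketch, not any particular library's splitter: it counts words instead of tokens, tries paragraph breaks first and sentence breaks second, and falls back to a hard cut only when neither helps.

```python
def recursive_chunk(text, max_words=40, separators=("\n\n", ". ")):
    """Split on the coarsest separator first; recurse on pieces still too long."""
    text = text.strip()
    if not text:
        return []
    if len(text.split()) <= max_words:
        return [text]
    if not separators:
        # Nothing left to split on: hard cut by word count.
        words = text.split()
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

    sep, finer = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_chunk(part, max_words, finer))

    # Greedily merge neighbours back together while they still fit.
    merged = []
    for piece in pieces:
        if merged and len((merged[-1] + " " + piece).split()) <= max_words:
            merged[-1] = merged[-1] + sep + piece
        else:
            merged.append(piece)
    return merged


sample = "A B C. D E F\n\nG H I J K L M N"
chunks = recursive_chunk(sample, max_words=4)
print(chunks)
```

The appeal is that chunks tend to end at natural boundaries (a paragraph, a sentence) rather than mid-thought; the cost is the extra processing the table mentions.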

👦 Nephew: Uncle, embeddings are smart. But are they perfect?

👨🦳 Uncle: No. Here's a real problem:

Resume has: "Kubernetes 1.25"

You search for: "Kubernetes 1.26"

Vector search thinks:

FALSE MATCH! The versions are different, but vectors say they're similar.

👦 Nephew: So vectors miss exact matches?

👨🦳 Uncle: Not miss - they're fuzzy. They blur differences. Sometimes that's good (finding "React" when you search "React.js"). Sometimes it's bad (confusing version numbers).

This is why you need both: vectors AND keywords.

👦 Nephew: How does keyword search work?

👨🦳 Uncle: Simple. Exact string matching.

Resume: "Kubernetes 1.26"

Search: "1.26"

Result: EXACT MATCH. Yes.

Resume: "Kubernetes 1.25"

Search: "1.26"

Result: NO MATCH. No.

It's binary. No fuzzy. Just right or wrong.

👦 Nephew: So we use both?

👨🦳 Uncle: Exactly. Here's how:

Search "Kubernetes 1.26"

Step 1: Vector search on all chunks

Step 2: Keyword filter

Step 3: Combine and rerank

Result: Perfect accuracy. Both semantic understanding AND exact precision.

PostgreSQL can do this:

SELECT chunk_text 
FROM resume_chunks 
WHERE resume_id = 'john-uuid'
  AND to_tsvector('english', chunk_text) @@ plainto_tsquery('english', 'Kubernetes 1.26')
  AND (embedding <=> question_embedding) < 0.2
ORDER BY embedding <=> question_embedding
LIMIT 5

This query says:

Both working together.
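The same two-gate logic, written out in plain Python so you can see each gate on its own. The chunks and their 2-number embeddings are invented for illustration, and the keyword gate here is a simple substring check rather than PostgreSQL's full-text search.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity, the quantity pgvector's <=> operator returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def hybrid_search(query_vec, keyword, chunks, max_distance=0.2, top_k=5):
    """Keep chunks that contain the keyword AND sit close to the query vector."""
    hits = []
    for chunk in chunks:
        if keyword.lower() not in chunk["text"].lower():
            continue  # keyword gate: the exact string must be present
        d = cosine_distance(query_vec, chunk["embedding"])
        if d < max_distance:  # semantic gate: must also be close in meaning
            hits.append((d, chunk["text"]))
    hits.sort(key=lambda pair: pair[0])
    return [text for _, text in hits[:top_k]]


# Toy data: hypothetical embeddings.
chunks = [
    {"text": "Ran Kubernetes 1.26 clusters", "embedding": [0.90, 0.10]},
    {"text": "Ran Kubernetes 1.25 clusters", "embedding": [0.89, 0.12]},
    {"text": "Version 1.26 of the cookbook", "embedding": [0.10, 0.90]},
]
results = hybrid_search([0.9, 0.1], "1.26", chunks)
print(results)
```

The 1.25 chunk is semantically almost identical but fails the keyword gate; the cookbook chunk passes the keyword gate but is nowhere near the query in meaning. Only the chunk that passes both survives.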

👦 Nephew: When do I use pure vector?

👨🦳 Uncle: When you're searching for concepts:

Vectors are great here. They find related concepts even with different words.

👦 Nephew: And pure keyword?

👨🦳 Uncle: For exact facts:

Keywords are perfect. You don't need fuzzy matching.

👦 Nephew: And hybrid?

👨🦳 Uncle: For most real scenarios. Maximum accuracy. That's what production uses.

👦 Nephew: Uncle, we retrieve 5 chunks. But are they the best 5?

👨🦳 Uncle: That's a great question. Let me show you a problem:

Search: "Docker experience"

Vector search returns:

Items 3-5 shouldn't be in the top 5! Item 3 is negative, items 4-5 are weak.

Basic vector search ranks by similarity score. But similarity doesn't capture quality.

👦 Nephew: So we need a better ranker?

👨🦳 Uncle: Yes. And here's the trick: we use two models.

👦 Nephew: Two models? That sounds expensive.

👨🦳 Uncle: Exactly. So we use them smartly.

Stage 1: Fast retrieval (milliseconds)

Vector search returns 20 candidates. Speed matters here, so we use a fast model.

Stage 2: Accurate ranking (a few hundred milliseconds)

We rerank those 20 with a better, slower model.

Total time: 500ms. Still fast. But now the top 5 are good.

The key insight: We don't apply the accurate model to all 500,000 chunks. That's too slow. We apply it only to 20. That's feasible.

👦 Nephew: What models should I use?

👨🦳 Uncle: A few options:

Reranker	Source	Quality	Cost
BGE-Reranker-Base	Open source	Very Good	Free
Cohere Rerank	API	Excellent	$$
Claude/GPT-4	API	Best	$$$

For production? Use BGE. It's free, open-source, and performs almost as well as expensive options.

Here's how:

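from sentence_transformers import CrossEncoder

# Load the cross-encoder once; it scores each (query, chunk) pair jointly.
reranker = CrossEncoder('BAAI/bge-reranker-base')

def rerank(query, top_20_chunks, top_k=5):
    """Re-score the candidate chunks against the query and keep the best top_k."""
    scores = reranker.predict([[query, chunk] for chunk in top_20_chunks])

    # Sort chunks by reranker score, highest first
    ranked = sorted(zip(top_20_chunks, scores), key=lambda x: x[1], reverse=True)

    return ranked[:top_k]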

That's it. Takes 300ms. Makes huge quality difference.

👦 Nephew: Uncle, when someone asks "Does John have frontend skills?", does the AI understand that means React, Vue, JavaScript, HTML?

👨🦳 Uncle: Not automatically. Here's the problem:

Resume has: "React, JavaScript, HTML, CSS expertise"

Search for: "frontend skills"

Vector search looks for the exact embedding of "frontend skills". But the resume doesn't have those words. It has the components.

A human reads "React, JavaScript, HTML, CSS" and immediately thinks "frontend". A computer doesn't make that leap.

👦 Nephew: So how do we fix it?

👨🦳 Uncle: Query rewriting. You expand the query.

👦 Nephew: Expand it how?

👨🦳 Uncle: Instead of searching for "frontend skills", you search for:

(frontend OR React OR Vue OR Angular OR JavaScript OR HTML OR CSS)

Now you find any mention of these. Much better.

Here's code:

def expand_query(query):
    """
    Expand a query into related terms.
    """
    expansions = {
        'frontend': ['React', 'Vue', 'Angular', 'JavaScript', 'HTML', 'CSS'],
        'backend': ['Node.js', 'Python', 'Java', 'databases', 'APIs'],
        'devops': ['Docker', 'Kubernetes', 'AWS', 'CI/CD', 'deployment'],
    }

    for key, synonyms in expansions.items():
        if key in query.lower():
            return query + " OR " + " OR ".join(synonyms)

    return query

original = "frontend skills"
expanded = expand_query(original)

👦 Nephew: Is there a smarter way?

👨🦳 Uncle: Yes! Instead of manually expanding, let an LLM generate variations.

User asks: "Real-time system experience?"

LLM thinks: "That means WebSockets, Socket.IO, real-time updates, pub/sub, message queues"

LLM generates multiple queries:

You search all 5 queries. You get chunks from all angles. You combine results.

Coverage: 10x better.

from anthropic import Anthropic

def generate_queries(original_query):
    """
    Generate multiple search queries from one.
    """
    client = Anthropic()

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""
Given this question, generate 3-5 related search queries 
that would find relevant information. Return only the queries, 
one per line.

Original question: {original_query}

Queries:
"""
        }]
    )

    queries = response.content[0].text.strip().split('\n')
    return [q.strip() for q in queries if q.strip()]

original = "Real-time system experience?"
queries = generate_queries(original)

all_results = []
for q in queries:
    results = vector_search(q)
    all_results.extend(results)

unique_results = list({r['id']: r for r in all_results}.values())[:5]

This is elegant. The LLM understands what "real-time" means and generates good search queries.
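One caveat about the last line of that code: taking the first five unique results favors whichever query ran first. A common alternative for merging several ranked lists is reciprocal rank fusion. To be clear, this is a different technique from the dedupe above, shown here as a sketch with made-up chunk ids; `k=60` is a conventional default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists, k=60, top_n=5):
    """Merge ranked id lists: each list contributes 1 / (k + rank) per id."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    ranked = sorted(scores, key=lambda cid: scores[cid], reverse=True)
    return ranked[:top_n]


per_query_results = [
    ["c1", "c2", "c3"],   # ids returned for query 1, best first
    ["c2", "c4"],         # query 2
    ["c2", "c1", "c5"],   # query 3
]
fused = reciprocal_rank_fusion(per_query_results)
print(fused)
```

A chunk that shows up near the top for several queries (here `c2`) rises above one that was first for a single query only.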

👦 Nephew: Uncle, I built a RAG system. It works. But how do I know if it's good?

👨🦳 Uncle: This is where most teams fail. They build, ship, and hope. No metrics.

But you can measure. Here are the important metrics:

👦 Nephew: What should I measure?

👨🦳 Uncle: Four main things:

Metric	Measures	Target	Concern
Recall	Of all relevant docs, how many found?	>90%	Missing information
Precision	Of what found, how relevant?	>85%	Wrong context
Latency	Response time	<2 seconds	User experience
Faithfulness	AI stays factual	>95%	Hallucinations

Let me explain each.

👦 Nephew: What's the difference between recall and precision?

👨🦳 Uncle: Let me give you an example. Say there are 10 Docker-related chunks in John's resume.

Recall: Of those 10, how many did you find?

High recall means: "I found most of the relevant information"

Precision: Of what you returned, how many were relevant?

High precision means: "Most of what I returned was useful"

Usually there's a tradeoff:

You want both high.

👦 Nephew: How do I actually measure this?

👨🦳 Uncle: You need labeled data. A test set.

Here's code:

def calculate_recall_precision(retrieved, relevant):
    """
    retrieved: chunks your system found
    relevant: chunks a human labeled as relevant
    """
    retrieved_ids = set(r['id'] for r in retrieved)
    relevant_ids = set(r['id'] for r in relevant)

    if len(relevant_ids) == 0:
        recall = 1.0
    else:
        recall = len(retrieved_ids & relevant_ids) / len(relevant_ids)

    if len(retrieved_ids) == 0:
        precision = 0.0
    else:
        precision = len(retrieved_ids & relevant_ids) / len(retrieved_ids)

    return recall, precision

retrieved = [
    {'id': '1', 'text': 'Docker experience'},
    {'id': '2', 'text': 'Kubernetes'},
    {'id': '3', 'text': 'Machine learning'},
]

relevant = [
    {'id': '1', 'text': 'Docker experience'},
    {'id': '2', 'text': 'Kubernetes'},
    {'id': '4', 'text': 'Container orchestration'},
]

recall, precision = calculate_recall_precision(retrieved, relevant)

👦 Nephew: How do I measure speed?

👨🦳 Uncle: Simple:

import time

start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
results = rag_system.query("Does John have Docker experience?")
latency = (time.perf_counter() - start) * 1000  # Convert to ms

print(f"Latency: {latency:.0f}ms")

Target: <2000ms (2 seconds). Anything faster is bonus.

👦 Nephew: What about cost?

👨🦳 Uncle: Important for production:


monthly_cost = 1_000_000 * 0.0011  # $1100/month

This is what you need to know for business decisions.

👦 Nephew: Uncle, what's a hallucination?

👨🦳 Uncle: Here's a real example:

Resume says: "Worked with React for 2 years"

Hiring manager asks: "Experience with Kubernetes?"

AI responds: "Yes, John has Kubernetes experience, including container orchestration."

WRONG! The resume never mentions Kubernetes.

The AI invented an answer. It saw "Docker" or "containers" (maybe), thought "Kubernetes is related to containers", and hallucinated.

This is a hallucination. The AI is lying, but confidently.

👦 Nephew: But why does this happen?

👨🦳 Uncle: AIs are pattern-matching machines. They learned during training: "Docker" often appears near "Kubernetes". So the AI's brain has connected them.

Even if you show the AI only a resume mentioning Docker, its training whispers: "And Kubernetes is usually near Docker!"

So the AI thinks: "Probably Kubernetes too."

It's an artifact of how AIs learn. Not a bug. A feature used wrongly.

👦 Nephew: How do you stop it?

👨🦳 Uncle: Five layers of protection:

Layer 1: Retrieval Boundaries

Show the AI ONLY the retrieved chunks. Nothing from training.

In your prompt:

Based ONLY on the following information:

[Retrieved chunks]

Answer: Does John have Kubernetes experience?

The word "ONLY" is critical.
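A prompt builder along these lines keeps that wording consistent across every request. It is a sketch: the function name is hypothetical, and the "Not stated" fallback instruction is my own addition, there to give the model an honest way out when the chunks are silent.

```python
def build_grounded_prompt(question, chunks):
    """Number each chunk and tell the model to answer from them alone."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        "Based ONLY on the following information:\n\n"
        f"{numbered}\n\n"
        # Assumption: an explicit fallback, not part of the original prompt.
        'If the information does not contain the answer, reply "Not stated".\n\n'
        f"Answer: {question}"
    )


prompt = build_grounded_prompt(
    "Does John have Kubernetes experience?",
    ["Worked with React for 2 years", "Docker and container experience"],
)
print(prompt)
```

Numbering the chunks costs nothing now and pays off later, when the model is asked to cite which chunk supports its answer.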

Layer 2: Structured Output

Force JSON format. Constrain the answer shape.

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,  # required by the Messages API
    messages=[...],
    system="""
You MUST respond in JSON format:
{
  "answer": "yes" or "no",
  "confidence": 0.0 to 1.0,
  "evidence": ["chunk 1", "chunk 2"]
}
""",
)

This forces the AI to cite evidence or it can't answer.

Layer 3: Validation

Check the output against the original chunks:

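import json

def validate_answer(response, chunks):
    """Reject the answer if any quoted evidence is not one of the retrieved chunks."""
    answer = json.loads(response.content[0].text)

    for evidence in answer['evidence']:
        if evidence not in chunks:
            answer['confidence'] = 0  # Reject

    return answer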

Layer 4: Confidence Gating

Low confidence? Escalate to human:

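def gate_on_confidence(answer, threshold=0.7):
    """Pass the answer through only if its confidence clears the threshold."""
    if answer['confidence'] < threshold:
        return {
            'answer': 'Unknown - escalated to human',
            'reason': 'Low confidence'
        }
    return answer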

Layer 5: Required Citations

Every claim must point to a chunk:

Question: Does John know Kubernetes?
Retrieved chunks:
[1] "Docker and container experience"
[2] "Kubernetes cluster management"

Answer: Yes, John knows Kubernetes (from chunk 2).

The AI must cite. No citation? No answer.
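"No citation? No answer" can be enforced in code rather than left to trust. The sketch below assumes citations are written either as "(from chunk 2)" or as "[2]"; that format is an assumption taken from the example above, so adjust the pattern to whatever your prompt asks for.

```python
import re


def cited_chunk_numbers(answer_text):
    """Extract chunk numbers from citations like '(from chunk 2)' or '[2]'."""
    found = re.findall(r"chunk\s+(\d+)|\[(\d+)\]", answer_text)
    return sorted({int(a or b) for a, b in found})


def has_valid_citation(answer_text, num_chunks):
    """True only if at least one chunk is cited and every cited number exists."""
    cited = cited_chunk_numbers(answer_text)
    return bool(cited) and all(1 <= n <= num_chunks for n in cited)


good = has_valid_citation("Yes, John knows Kubernetes (from chunk 2).", num_chunks=2)
missing = has_valid_citation("Yes, John knows Kubernetes.", num_chunks=2)
out_of_range = has_valid_citation("Yes (from chunk 7).", num_chunks=2)
print(good, missing, out_of_range)
```

This only checks that a citation exists and points at a real chunk. Whether the chunk actually supports the claim is a harder question, which is why validation and confidence gating sit alongside it.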

👦 Nephew: Do you use all five?

👨🦳 Uncle: In production? All five. At different stages.

Retrieval boundaries + structured output stop most hallucinations.

Validation + confidence gating catch the rest.

Citations let users verify.

Together? Hallucinations become rare, though never strictly impossible.

👦 Nephew: Uncle, my RAG system works on my laptop. But production is different?

👨🦳 Uncle: Completely different.

Prototype version:

Production version:

These are not the same system.

👦 Nephew: How do I design it?

👨🦳 Uncle: Here's a proven architecture:

                    Users
                      ↓
              Load Balancer (Nginx)
             /          |          \
         API-1      API-2      API-3  (scale horizontally)
             \          |          /
                      ↓
                 Cache Layer (Redis)
                      ↓
           PostgreSQL + pgvector + pgvector Index
                      ↓
              External APIs (Claude, etc.)

Load Balancer

Multiple APIs

Cache

Database

External APIs

Each layer has a job. Each layer is replaceable.
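To show what the cache layer buys you, here is an in-memory stand-in for Redis. Everything in it is illustrative: the class name, the `namespace` argument (which could be a resume id or, in a shared deployment, a customer id), and the fake pipeline. A real cache would also need expiry, which this sketch omits.

```python
import hashlib


class AnswerCache:
    """In-memory stand-in for the Redis layer: (namespace, question) -> answer."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(namespace, question):
        # Normalize case and whitespace so trivially different phrasings share a key.
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(f"{namespace}:{normalized}".encode()).hexdigest()

    def get_or_compute(self, namespace, question, compute):
        key = self._key(namespace, question)
        if key not in self._store:
            self._store[key] = compute(question)  # cache miss: run the full pipeline
        return self._store[key]


calls = []

def fake_pipeline(question):
    """Placeholder for retrieval + LLM; records how often it is really invoked."""
    calls.append(question)
    return f"answer to: {question}"


cache = AnswerCache()
first = cache.get_or_compute("john-123", "Does John know Docker?", fake_pipeline)
second = cache.get_or_compute("john-123", "does john   know docker?", fake_pipeline)
print(first == second, len(calls))
```

The second call never reaches the database or the external API, which is exactly the traffic the cache layer exists to absorb.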

👦 Nephew: Wait, what if I'm hosting this for multiple companies?

👨🦳 Uncle: Then you have a SERIOUS security issue if you don't handle isolation.

Company A's data must NEVER be visible to Company B.

Every single query must include:

SELECT * FROM resume_chunks 
WHERE resume_id = 'john-123'
  AND tenant_id = 'company-a'  -- THIS IS CRITICAL

Without the tenant_id check, a hacker or bug could leak data.

This is not optional. This is "you'll be sued" level of important.
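One way to make the check hard to forget is to route every query through a single helper that refuses to run without a tenant. This is a sketch with a hypothetical function name; it builds parameterized SQL (in the `%s` placeholder style) and assumes `resume_chunks` has a `tenant_id` column.

```python
def tenant_scoped_query(tenant_id, resume_id, limit=5):
    """Build a parameterized query that always filters by tenant_id."""
    if not tenant_id:
        # Fail loudly: a query without a tenant must never reach the database.
        raise ValueError("tenant_id is required for every query")
    sql = (
        "SELECT chunk_text FROM resume_chunks "
        "WHERE tenant_id = %s AND resume_id = %s "
        "LIMIT %s"
    )
    return sql, (tenant_id, resume_id, limit)


sql, params = tenant_scoped_query("company-a", "john-123")
print(sql, params)
```

If no other code path is allowed to write raw SQL against this table, forgetting the filter stops being a possible mistake.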

👦 Nephew: Uncle, I've heard of ATS. What is it?

👨🦳 Uncle: Old ATS systems just searched for keywords. "Does the resume have the word React?" Yes/No.

Modern ATS - the kind you're building with RAG - is intelligent. It understands:

Resume gets a score. Top scorers get interviewed. Bad scorers get rejected.

The difference between old and new? Old ATS rejects good candidates because they phrased things differently. New ATS finds them anyway.

👦 Nephew: How do you score a resume?

👨🦳 Uncle: With components. Job requirements × candidate abilities.

Job description requires:

Let's say this resume has:

Here's the scoring:

def score_resume(candidate, job_requirements):
    """
    Score a resume against job requirements.
    """
    scores = {}

    required_score = 0
    required_count = 0
    for skill, years_required in job_requirements['required'].items():
        required_count += 1
        candidate_years = candidate.get(skill, 0)

        if candidate_years >= years_required:
            required_score += 100  # Full points
        elif candidate_years > 0:
            required_score += (candidate_years / years_required) * 100

    required_avg = required_score / required_count if required_count > 0 else 0

    preferred_score = 0
    preferred_count = 0
    for skill in job_requirements['preferred']:
        preferred_count += 1
        if skill in candidate:
            preferred_score += 50  # Half credit for preferred

    preferred_avg = preferred_score / preferred_count if preferred_count > 0 else 0

    project_score = len(candidate.get('projects', [])) * 25
    project_avg = min(project_score, 100)  # Cap at 100

    final_score = (
        required_avg * 0.60 +
        preferred_avg * 0.25 +
        project_avg * 0.15
    )

    return final_score

candidate = {
    'React': 6,
    'JavaScript': 7,
    'Node.js': 4,
    'projects': ['E-commerce app', 'Real-time chat']
}

requirements = {
    'required': {'React': 5, 'JavaScript': 3},
    'preferred': ['Node.js', 'AWS']
}

score = score_resume(candidate, requirements)
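It helps to work the sample candidate through by hand. The snippet below simply repeats the arithmetic `score_resume` performs for that candidate, one component per line.

```python
# Component scores for the sample candidate, following score_resume's rules.
required_avg = (100 + 100) / 2   # React 6 >= 5 and JavaScript 7 >= 3: full points each
preferred_avg = (50 + 0) / 2     # Node.js present (50 points), AWS absent (0)
project_avg = min(2 * 25, 100)   # two projects at 25 points each, capped at 100

final_score = required_avg * 0.60 + preferred_avg * 0.25 + project_avg * 0.15
print(final_score)
```

That is 60 + 6.25 + 7.5 = 73.75. One thing the arithmetic reveals: because a preferred skill earns at most 50 points, the highest score anyone can reach with these weights is 87.5, not 100. Keep that in mind when you set a cut-off.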

That's scoring. But wait - how do you extract the data?

👦 Nephew: Resume might say "React" but another might say "React.js". How do you normalize?

👨🦳 Uncle: Build a dictionary.

skill_aliases = {
    'React': ['React', 'React.js', 'ReactJS', 'react'],
    'Node.js': ['Node.js', 'node.js', 'nodejs', 'Node'],
    'Docker': ['Docker', 'docker', 'containers'],
}

def normalize_skill(mentioned_skill):
    """Find the canonical skill name."""
    mentioned_lower = mentioned_skill.lower()

    for canonical, aliases in skill_aliases.items():
        if mentioned_lower in [a.lower() for a in aliases]:
            return canonical

    return mentioned_skill  # Unknown skill

print(normalize_skill("React.js"))  # Output: React
print(normalize_skill("nodejs"))    # Output: Node.js

Now "React" and "React.js" are treated as the same skill.

👦 Nephew: But what if the resume mentions a skill not in your dictionary?

👨🦳 Uncle: Use embeddings to find similar skills.

Resume mentions: "Built trading platform with real-time updates"

Job requires: "WebSockets experience"

These aren't in your dictionary. But you can check similarity:

def find_skill_match(mentioned_skill, required_skill):
    """
    Use embeddings to find matches for unknown skills.
    get_embedding and cosine_similarity are assumed helpers.
    """
    mentioned_embedding = get_embedding(mentioned_skill)
    required_embedding = get_embedding(required_skill)

    # Higher similarity means closer in meaning (a distance would be the reverse)
    similarity = cosine_similarity(mentioned_embedding, required_embedding)

    return similarity > 0.75  # Threshold

if find_skill_match("real-time data updates", "WebSockets"):
    score += 100

This is powerful. It catches variations and similar technologies.

👦 Nephew: Uncle, we have RAG working. What comes next?

👨🦳 Uncle: Basic RAG answers questions. Advanced RAG (agents) decides what to do.

Difference:

Basic RAG: "Does John know React?" → Find resume → Answer

Agent RAG: "Is John qualified for Senior Developer role?" →

The agent is intelligent about what to search.

👦 Nephew: How do I build an agent?

👨🦳 Uncle: Using Claude's tool use (function calling):

import anthropic

client = anthropic.Anthropic()

def search_resume(resume_id, query):
    """Search candidate's resume."""
    return rag_system.query(resume_id, query)

def search_job_description(job_id, query):
    """Search job description."""
    return job_db.query(job_id, query)

def analyze_candidate(resume_id, job_id):
    """
    Use Claude as an agent to analyze candidate fit.
    Claude decides what to search, what to compare.
    """
    tools = [
        {
            "name": "search_resume",
            "description": "Search a candidate's resume for information",
            "input_schema": {
                "type": "object",
                "properties": {
                    "resume_id": {"type": "string"},
                    "query": {"type": "string"}
                },
                "required": ["resume_id", "query"]
            }
        },
        {
            "name": "search_job",
            "description": "Search job requirements",
            "input_schema": {
                "type": "object",
                "properties": {
                    "job_id": {"type": "string"},
                    "query": {"type": "string"}
                },
                "required": ["job_id", "query"]
            }
        }
    ]

    messages = [{
        "role": "user",
        "content": f"""
Analyze if candidate {resume_id} is qualified for job {job_id}.

Use the available tools to:
1. Find required skills in the job description
2. Check if the candidate has those skills
3. Compare years of experience
4. Look for relevant projects
5. Make a recommendation

Provide detailed reasoning.
"""
    }]

    # Cap the loop so a confused model cannot call tools forever.
    for _ in range(10):
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1000,
            tools=tools,
            messages=messages
        )

        if response.stop_reason != "tool_use":
            # No more tool calls: join every text block into the final answer.
            return "".join(
                block.text for block in response.content if block.type == "text"
            )

        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue
            if block.name == "search_resume":
                result = search_resume(
                    block.input["resume_id"],
                    block.input["query"]
                )
            elif block.name == "search_job":
                result = search_job_description(
                    block.input["job_id"],
                    block.input["query"]
                )
            else:
                result = f"Unknown tool: {block.name}"

            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(result)
            })

        # Echo the assistant turn back, then send the tool results as the user turn.
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

    return "Stopped: too many tool calls without a final answer."

result = analyze_candidate("john-resume", "senior-dev-job")
print(result)

This is elegant. Claude is the brain. It decides what to search, in what order, and how to reason about it.

👦 Nephew: What's next? After agents?

👨🦳 Uncle: A few directions:

1. Adaptive RAG

System learns what works. If a question fails, it tries different search strategies automatically.

2. Multi-modal RAG

Right now we handle text. Soon: images, tables, videos. A resume with a photo of their projects. A video walkthrough.

3. Real-time RAG

System continuously updates knowledge. Candidate updates LinkedIn → RAG system knows instantly.

4. Collaborative RAG

Multiple agents reasoning together. One searches, one evaluates, one questions.

5. Explainable RAG

System shows its work. "I rejected John because he has only 3 years of React experience, and 5+ years were required."
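
A minimal sketch of what "showing its work" could look like in code: every decision carries the reasons that produced it. The `Decision` class and `check_experience` function are hypothetical names made up for this example, not part of any library.

```python
from dataclasses import dataclass, field

@dataclass
class Decision:
    """A verdict plus the plain-language reasons behind it."""
    qualified: bool
    reasons: list = field(default_factory=list)

def check_experience(candidate_years, required_years, skill):
    """Compare years of experience and record why the check passed or failed."""
    if candidate_years >= required_years:
        reason = f"Has {candidate_years} years of {skill}; {required_years}+ required."
        return Decision(True, [reason])
    reason = f"Has only {candidate_years} years of {skill}; {required_years}+ required."
    return Decision(False, [reason])

decision = check_experience(candidate_years=3, required_years=5, skill="React")
print(decision.qualified)
print(decision.reasons[0])
```

Returning the reasons alongside the verdict means a recruiter can check the decision against the resume instead of trusting a bare yes or no.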

These are coming. RAG is only beginning.
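
Of the directions above, the adaptive one is the easiest to prototype today. Here is a rough sketch, assuming you already have several search functions and some way to judge whether results are good enough. The names `adaptive_search`, `keyword_search`, and `vector_search` are hypothetical, and the two search functions are stand-ins so the example runs without a database.

```python
def adaptive_search(question, strategies, is_good_enough):
    """Try each search strategy in order; return the first acceptable result."""
    attempts = []
    for name, search in strategies:
        results = search(question)
        attempts.append(name)
        if is_good_enough(results):
            return {"strategy": name, "results": results, "attempts": attempts}
    # Every strategy failed: return nothing rather than a weak guess.
    return {"strategy": None, "results": [], "attempts": attempts}

# Stand-in strategies, invented for this example.
def keyword_search(question):
    return []  # pretend exact keyword matching found nothing

def vector_search(question):
    return ["Built real-time user interfaces with the React framework"]

outcome = adaptive_search(
    "Does the candidate know React?",
    strategies=[("keyword", keyword_search), ("vector", vector_search)],
    is_good_enough=lambda results: len(results) > 0,
)
print(outcome["strategy"])
print(outcome["attempts"])
```

Logging the `attempts` list over time is one simple way for the system to learn which strategy to try first.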

👨🦳 Uncle: Let me summarize what you've learned:

👦 Nephew: That's a lot, uncle. But I think I understand. RAG isn't magic - it's just engineering.

👨🦳 Uncle: Exactly. RAG is:

Every layer serves one purpose: make AI useful for real people.

👦 Nephew: And production?

👨🦳 Uncle: Remember: Reliable for 10 users >> Perfect for nobody.

Start simple. Build on PostgreSQL. Measure everything. Iterate fast.

You don't need fancy systems. You need good engineering.

👦 Nephew: Uncle, one last question.

👨🦳 Uncle: Go ahead.

👦 Nephew: How do I actually start? Like, code?

👨🦳 Uncle: That's next lesson, beta. First, you understand the system. Then you build it.

But I'll give you one gift: a simple starter stack:

Frontend: Next.js + React
Backend: Node.js + Express
Database: PostgreSQL + pgvector
Embedding: a dedicated embedding model, e.g. Voyage AI (the Claude API has no embeddings endpoint)
LLM: Claude API (messages endpoint)
Hosting: AWS Lightsail or Render

Simple. Proven. Scales.

Now go build something amazing.

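from anthropic import Anthropic
import voyageai  # assumption: Voyage AI SDK; any embedding provider works here

client = Anthropic()    # used for answering questions, not for embeddings
vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

def create_embeddings(texts):
    """Get one embedding vector per input text."""
    # The Claude API has no embeddings endpoint, and asking a chat model to
    # "return embeddings" only produces made-up numbers. Use an embedding model.
    response = vo.embed(texts, model="voyage-3", input_type="document")
    return response.embeddings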
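Hybrid search in PostgreSQL (vector distance plus keyword rank):

-- $1 = query embedding, $2 = search text, $3 = resume id
-- <=> is pgvector's cosine distance operator (lower = more similar).
-- The keyword filter drops chunks that match only by meaning;
-- remove the @@ line if paraphrases should still be found.
SELECT chunk_text,
       embedding <=> $1::vector AS vector_distance,
       ts_rank(to_tsvector('english', chunk_text),
               plainto_tsquery('english', $2)) AS keyword_rank
FROM resume_chunks
WHERE resume_id = $3
  AND to_tsvector('english', chunk_text) @@ plainto_tsquery('english', $2)
ORDER BY vector_distance ASC, keyword_rank DESC
LIMIT 5;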
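
The query above sorts by vector distance first, so keyword rank only matters when two distances tie. One common alternative, which is not what this guide's query does, is reciprocal rank fusion: run the two searches separately and merge their rankings. This is a small sketch with invented chunk IDs; `k=60` is the constant usually quoted for this method, and you may want to tune it.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of IDs into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            # Items near the top of any list earn more; k damps the gap between ranks.
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda item: scores[item], reverse=True)

# Invented chunk IDs, ordered best-first by each search method.
vector_ranking = ["chunk-a", "chunk-b", "chunk-c"]
keyword_ranking = ["chunk-b", "chunk-d", "chunk-a"]

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

A chunk that ranks well in both lists rises to the top, while a chunk found by only one method still makes the list.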
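Simple candidate scoring:

def score_candidate(candidate_skills, required_skills):
    """Simple scoring: % of required skills matched (case-insensitive)."""
    required = {s.lower() for s in required_skills}
    if not required:
        return 0.0  # nothing required, so there is nothing to score against
    matched = required & {s.lower() for s in candidate_skills}
    return len(matched) / len(required) * 100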
Answering only when confident:

import json

def safe_answer(query, chunks, confidence_threshold=0.7):
    """Answer only if confident."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        system="""
You MUST:
1. Answer ONLY using the provided chunks
2. Cite evidence
3. Return ONLY a JSON object with keys: answer, confidence (0 to 1), evidence
""",
        messages=[{
            "role": "user",
            "content": f"""
Chunks: {chunks}
Question: {query}
"""
        }]
    )

    try:
        answer = json.loads(response.content[0].text)
        confidence = float(answer["confidence"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Malformed output counts as "not confident" rather than crashing.
        return "Uncertain - escalated to human"

    if confidence < confidence_threshold:
        return "Uncertain - escalated to human"

    return answer["answer"]

RAG isn't magic. It's engineering:

You've learned the foundations. Now use them.

Build something reliable.

Build something honest.

Build something useful.

You've got this.

Created for developers who want to understand how RAG actually works, not just use it as a black box.

Remember: Less noise, more action.

Go build. Good luck. 🚀

source & further reading

dev.to: original article