RAG Pipeline: Complete Node.js Implementation Guide

Build production RAG systems in Node.js - Know where it breaks, why it works, and when to use it

👦 Nephew: Uncle, why would I build RAG in Node.js? I thought this was AI stuff?

👨🦳 Uncle: Good question. Node.js is a good fit for RAG.

Plus, you probably already have Node.js running your backend. Why add Python?

👦 Nephew: So I can build the whole thing in JavaScript?

👨🦳 Uncle: Yes. Frontend, backend, RAG - all JavaScript. That's the beauty.

But we need to be honest about limitations. Let's talk about that too.

mkdir rag-system
cd rag-system

npm init -y

npm install express dotenv @anthropic-ai/sdk pg pg-promise cors body-parser
npm install --save-dev nodemon typescript @types/node

npm install winston helmet compression
rag-system/
├── src/
│   ├── config/
│   │   ├── database.ts         # PostgreSQL + pgvector setup
│   │   └── embedding.ts         # Claude embeddings
│   ├── services/
│   │   ├── retrieval.ts         # Vector search logic
│   │   ├── reranking.ts         # Two-stage ranking
│   │   ├── queryProcessing.ts   # Query expansion
│   │   └── safety.ts            # Hallucination prevention
│   ├── routes/
│   │   └── rag.ts               # API endpoints
│   ├── utils/
│   │   ├── logger.ts            # Logging (critical for debugging)
│   │   └── metrics.ts           # Track recall, precision
│   └── index.ts                 # Main server
├── .env                          # Secrets
├── package.json
└── tsconfig.json
{
  "compilerOptions": {
    "target": "ES2020",
    "module": "commonjs",
    "lib": ["ES2020"],
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true
  }
}

👨🦳 Uncle: This is your foundation. Get it wrong, everything breaks.

-- Connect to PostgreSQL
psql -U postgres

-- Create database
CREATE DATABASE rag_system;

-- Connect to the database
\c rag_system

-- Install pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create resumes table
CREATE TABLE resumes (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL,
  candidate_name VARCHAR(255) NOT NULL,
  raw_text TEXT NOT NULL,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

  -- CRITICAL: tenant isolation
  CONSTRAINT tenant_isolation UNIQUE(tenant_id, id)
);

-- Create chunks table (where vectors live)
CREATE TABLE resume_chunks (
  id SERIAL PRIMARY KEY,
  resume_id UUID NOT NULL REFERENCES resumes(id) ON DELETE CASCADE,
  tenant_id UUID NOT NULL,
  chunk_text TEXT NOT NULL,
  chunk_index INTEGER NOT NULL,

  -- The vector: 1536 dimensions (must match the length of the embeddings you store)
  embedding vector(1536) NOT NULL,

  -- Metadata for debugging
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create indexes
-- 1. Vector index for fast search (MOST IMPORTANT)
-- Note: ivfflat derives its clusters from existing rows, so recall is better
-- if this index is created (or rebuilt) after data has been loaded.
CREATE INDEX idx_resume_chunks_embedding 
ON resume_chunks USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- 2. Tenant index (security)
CREATE INDEX idx_resume_chunks_tenant 
ON resume_chunks(tenant_id, resume_id);

-- 3. Text search index (keyword search)
CREATE INDEX idx_resume_chunks_text 
ON resume_chunks USING GIN (to_tsvector('english', chunk_text));

-- Create tenants table (multi-tenancy)
CREATE TABLE tenants (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  name VARCHAR(255) NOT NULL,
  api_key VARCHAR(255) UNIQUE NOT NULL,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- CRITICAL: Always check tenant
-- Added here because a foreign key can only reference tenants once that table exists.
ALTER TABLE resume_chunks
  ADD CONSTRAINT tenant_isolation_chunks
  FOREIGN KEY (tenant_id) REFERENCES tenants(id);

-- Create query logs (for metrics)
CREATE TABLE query_logs (
  id SERIAL PRIMARY KEY,
  tenant_id UUID NOT NULL REFERENCES tenants(id),
  query TEXT NOT NULL,
  latency_ms INTEGER NOT NULL,
  recall DECIMAL(3,2),
  precision DECIMAL(3,2),
  cost_cents INTEGER,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create index on query logs for analytics
CREATE INDEX idx_query_logs_tenant 
ON query_logs(tenant_id, created_at DESC);
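
The `query_logs` table has `recall` and `precision` columns, but nothing in this section computes them. Here is a minimal sketch of how the two numbers could be derived for one query, assuming you have a hand-labelled set of relevant chunk indices for that query. The function name and the labelled set are illustrative, not part of the original code.

```typescript
interface RetrievalMetrics {
  recall: number;     // share of relevant chunks that were retrieved
  precision: number;  // share of retrieved chunks that are relevant
}

// Hypothetical helper: compares retrieved chunk indices with a labelled set.
function computeRetrievalMetrics(
  retrievedIndices: number[],
  relevantIndices: number[]
): RetrievalMetrics {
  // De-duplicate so a chunk returned twice is only counted once
  const retrieved = Array.from(new Set(retrievedIndices));
  const relevant = new Set(relevantIndices);

  const hits = retrieved.filter(index => relevant.has(index)).length;

  return {
    recall: relevant.size > 0 ? hits / relevant.size : 0,
    precision: retrieved.length > 0 ? hits / retrieved.length : 0
  };
}

// Retrieved chunks 0, 2, 5; chunks 2, 5, 7, 9 are labelled relevant.
const exampleMetrics = computeRetrievalMetrics([0, 2, 5], [2, 5, 7, 9]);
console.log(exampleMetrics.recall.toFixed(2), exampleMetrics.precision.toFixed(2));
```

Two of the four relevant chunks were found (recall 0.50) and two of the three retrieved chunks were relevant (precision 0.67). Both values fit the `DECIMAL(3,2)` columns.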

👨🦳 Uncle: This is where your first failure point lives.

// src/config/database.ts

import pgPromise from 'pg-promise';
import dotenv from 'dotenv';
import logger from '../utils/logger';

dotenv.config();

const initOptions = {
  // Detailed error info (critical for debugging)
  error(error: any, context: any) {
    logger.error('Database Error', {
      error: error.message,
      query: context.query,
      params: context.params
    });
  },

  // Connection events
  connect(client: any) {
    logger.info('Database connected');
  },

  disconnect(client: any) {
    logger.info('Database disconnected');
  }
};

const pgp = pgPromise(initOptions);

const db = pgp({
  host: process.env.DB_HOST || 'localhost',
  port: parseInt(process.env.DB_PORT || '5432'),
  database: process.env.DB_NAME || 'rag_system',
  user: process.env.DB_USER || 'postgres',
  password: process.env.DB_PASSWORD,

  // Connection pooling
  max: 20,

  // Timeout after 5 seconds
  connectionTimeoutMillis: 5000,

  // Idle timeout
  idleTimeoutMillis: 30000,
});

// Test connection on startup
export async function initializeDatabase() {
  try {
    await db.one('SELECT 1');
    logger.info('✓ Database connection verified');
  } catch (error) {
    logger.error('✗ Database connection failed', { error });
    process.exit(1);
  }
}

export default db;
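
Both the upload route and the search service hand embeddings to PostgreSQL as a text literal such as `[0.1,0.2]` cast to `vector`. Because the column is declared `vector(1536)`, a vector of any other length is rejected by the database, and so are non-finite values. A small guard can fail earlier with a clearer message. This is a sketch: `toVectorLiteral` is a hypothetical helper, and the expected dimension is passed in rather than assumed.

```typescript
// Hypothetical helper: validates an embedding and formats it as a pgvector text literal.
function toVectorLiteral(embedding: number[], expectedDim: number): string {
  if (embedding.length !== expectedDim) {
    throw new Error(
      `Embedding has ${embedding.length} dimensions, expected ${expectedDim}`
    );
  }
  if (!embedding.every(value => Number.isFinite(value))) {
    throw new Error('Embedding contains a non-finite value');
  }
  return `[${embedding.join(',')}]`;
}

// Example with a tiny 3-dimensional vector
console.log(toVectorLiteral([0.1, -0.2, 0.3], 3)); // prints "[0.1,-0.2,0.3]"
```
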

# .env
DB_HOST=localhost
DB_PORT=5432
DB_NAME=rag_system
DB_USER=postgres
DB_PASSWORD=your_secure_password

ANTHROPIC_API_KEY=sk-ant-...
ANTHROPIC_MODEL=claude-3-5-sonnet-20241022

NODE_ENV=production
PORT=3000

ADMIN_API_KEY=super-secret-key-change-this
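
The database config reads these values with fallbacks, so a missing `DB_PASSWORD` or `ANTHROPIC_API_KEY` only surfaces when the first query or API call fails. Here is a sketch of validating them once at startup. `loadConfig` and the `AppConfig` shape are illustrative assumptions, and the environment is passed in as a plain object so the function is easy to test.

```typescript
interface AppConfig {
  dbHost: string;
  dbPort: number;
  dbName: string;
  dbUser: string;
  dbPassword: string;
  anthropicApiKey: string;
  adminApiKey: string;
  port: number;
}

type Env = Record<string, string | undefined>;

// Hypothetical helper: fails fast with one message listing every problem.
function loadConfig(env: Env): AppConfig {
  const problems: string[] = [];

  const required = (key: string): string => {
    const value = env[key];
    if (!value) problems.push(`${key} is required`);
    return value ?? '';
  };

  const portNumber = (key: string, fallback: number): number => {
    const raw = env[key];
    if (raw === undefined || raw === '') return fallback;
    const parsed = Number(raw);
    if (!Number.isInteger(parsed) || parsed < 1 || parsed > 65535) {
      problems.push(`${key} must be a port number, got "${raw}"`);
      return fallback;
    }
    return parsed;
  };

  const config: AppConfig = {
    dbHost: env['DB_HOST'] || 'localhost',
    dbPort: portNumber('DB_PORT', 5432),
    dbName: env['DB_NAME'] || 'rag_system',
    dbUser: env['DB_USER'] || 'postgres',
    dbPassword: required('DB_PASSWORD'),
    anthropicApiKey: required('ANTHROPIC_API_KEY'),
    adminApiKey: required('ADMIN_API_KEY'),
    port: portNumber('PORT', 3000)
  };

  if (problems.length > 0) {
    throw new Error(`Invalid configuration: ${problems.join('; ')}`);
  }
  return config;
}
```

In `src/index.ts` this could be called as `loadConfig(process.env)` before `initializeDatabase()`, so a bad deployment stops with a readable error instead of a failed first request.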

👨🦳 Uncle: This is where the first real cost happens. Know what can fail here.

// src/config/embedding.ts

import Anthropic from '@anthropic-ai/sdk';
import logger from '../utils/logger';

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY
});

interface EmbeddingResult {
  text: string;
  embedding: number[];
}

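/**
 * Get embeddings for text chunks.
 * 
 * NOTE: Anthropic's API has no dedicated embeddings endpoint. This function
 * prompts the chat model to emit vectors as JSON, so the output is not
 * guaranteed to be stable, to have 1536 dimensions, or to fit in max_tokens.
 * A dedicated embedding model is the more dependable choice in production.
 * 
 * ⚠️ FAILURE POINTS:
 * 1. API rate limit (429) - retried after a fixed 5-second delay
 * 2. Text too long - inputs are truncated to 16,000 chars (~4,000 tokens)
 * 3. Network timeout - relies on the SDK's own retry behaviour
 * 4. Cost tracking - logs estimated input cost per call
 */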
export async function getEmbeddings(texts: string[]): Promise<EmbeddingResult[]> {
  const startTime = Date.now();

  try {
    // VALIDATION: Prevent token overrun
    // Rough heuristic: ~1 token = 4 chars of English text
    const validTexts = texts.map(text => {
      if (text.length > 16000) {  // ~4000 tokens
        logger.warn('Text truncated for embedding', { 
          originalLength: text.length,
          truncatedTo: 16000
        });
        return text.substring(0, 16000);
      }
      return text;
    });

    // Call Claude API for embeddings
    const response = await client.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 1024,
      messages: [{
        role: 'user',
        content: `Generate embeddings for the following texts. Return ONLY valid JSON array with "embeddings" key containing array of number arrays.

Texts:
${validTexts.map((t, i) => `${i}: ${t}`).join('\n\n')}

Return format: {"embeddings": [[...], [...], ...]}`
      }]
    });

    // Parse response
    const responseText = response.content[0].type === 'text' 
      ? response.content[0].text 
      : '';

    let embeddings: number[][];
    try {
      const parsed = JSON.parse(responseText);
      embeddings = parsed.embeddings || [];
    } catch (parseError) {
      logger.error('Failed to parse embeddings response', { 
        response: responseText.substring(0, 500) 
      });
      throw new Error('Invalid embeddings response format');
    }

    // Validate embeddings
    if (embeddings.length !== validTexts.length) {
      throw new Error(
        `Embedding count mismatch: got ${embeddings.length}, expected ${validTexts.length}`
      );
    }

    // Estimate cost (Claude 3.5 Sonnet input pricing: $3 per 1M tokens;
    // output tokens are billed separately and are not counted here)
    const inputTokens = response.usage.input_tokens;
    const costCents = (inputTokens / 1_000_000) * 3 * 100;

    const latency = Date.now() - startTime;
    logger.info('Embeddings generated', { 
      count: embeddings.length,
      latency,
      inputTokens,
      costCents: costCents.toFixed(4)
    });

    return validTexts.map((text, i) => ({
      text,
      embedding: embeddings[i]
    }));

  } catch (error: any) {
    logger.error('Embedding API error', {
      error: error.message,
      status: error.status
    });

    // Retry on rate limits after a fixed 5-second delay.
    // NOTE: this retries without a cap. A production system should limit
    // the number of attempts and use exponential backoff.
    if (error.status === 429) {
      logger.warn('Rate limited. Waiting before retry...');
      await new Promise(resolve => setTimeout(resolve, 5000));
      return getEmbeddings(texts);
    }

    throw error;
  }
}

/**
 * Embed a single text (convenience function)
 */
export async function embedText(text: string): Promise<number[]> {
  const results = await getEmbeddings([text]);
  return results[0].embedding;
}
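
`getEmbeddings` retries a 429 after a fixed five seconds and leaves exponential backoff for "a real system". Here is a sketch of what that could look like as a generic wrapper. `withBackoff`, its option names, and the defaults are assumptions for illustration, and the `sleep` function is injectable so the delays can be tested without waiting.

```typescript
interface BackoffOptions {
  maxAttempts?: number;   // total tries, including the first
  baseDelayMs?: number;   // delay before the first retry
  maxDelayMs?: number;    // upper bound on any single delay
  sleep?: (ms: number) => Promise<void>;
}

// Delay doubles with each retry: base, 2x base, 4x base, ... capped at maxDelayMs.
function backoffDelayMs(retryNumber: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(maxDelayMs, baseDelayMs * Math.pow(2, retryNumber));
}

// Hypothetical wrapper: retries only errors whose `status` is 429.
async function withBackoff<T>(
  operation: () => Promise<T>,
  options: BackoffOptions = {}
): Promise<T> {
  const {
    maxAttempts = 5,
    baseDelayMs = 1000,
    maxDelayMs = 30000,
    sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms))
  } = options;

  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error: any) {
      const isLastAttempt = attempt >= maxAttempts - 1;
      // Anything other than a rate limit, or running out of attempts, is rethrown
      if (error?.status !== 429 || isLastAttempt) throw error;
      await sleep(backoffDelayMs(attempt, baseDelayMs, maxDelayMs));
    }
  }
}
```

Inside `getEmbeddings`, the API call could then be wrapped as `withBackoff(() => client.messages.create(...))` in place of the recursive retry. Adding random jitter to each delay is a common refinement when many clients retry at once.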

👨🦳 Uncle: Remember: 1000-1500 tokens, 200-token overlap.

// src/utils/chunking.ts

import logger from './logger';

interface Chunk {
  text: string;
  index: number;
  tokens: number;
}

/**
 * Break text into chunks with sliding window overlap.
 * 
 * ⚠️ FAILURE POINTS:
 * 1. Overlap larger than chunk size
 * 2. Single chunk can't hold meaningful text
 */
export function chunkText(
  text: string,
  windowTokens: number = 1000,
  overlapTokens: number = 200
): Chunk[] {
  // Simple tokenization (1 token ≈ 4 chars for English)
  const estimatedTokens = Math.ceil(text.length / 4);

  if (estimatedTokens < windowTokens) {
    // Text is smaller than chunk size
    logger.debug('Text smaller than chunk window', { 
      estimatedTokens,
      windowTokens 
    });
    return [{
      text,
      index: 0,
      tokens: estimatedTokens
    }];
  }

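  // Calculate character window (1 token ≈ 4 chars)
  const charWindow = windowTokens * 4;
  const charOverlap = overlapTokens * 4;
  const step = charWindow - charOverlap;

  // Guard failure point 1: a non-positive step would never advance the loop
  if (step <= 0) {
    throw new Error('overlapTokens must be smaller than windowTokens');
  }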

  const chunks: Chunk[] = [];
  let index = 0;

  for (let i = 0; i < text.length; i += step) {
    let end = i + charWindow;

    // Find sentence boundary to avoid splitting mid-sentence
    if (end < text.length) {
      const periodIndex = text.lastIndexOf('.', end);
      const newlineIndex = text.lastIndexOf('\n', end);
      const boundaryIndex = Math.max(periodIndex, newlineIndex);

      if (boundaryIndex > i + (charWindow * 0.8)) {
        // Found good boundary
        end = boundaryIndex + 1;
      }
    } else {
      end = text.length;
    }

    const chunk = text.substring(i, end).trim();

    if (chunk.length > 0) {
      chunks.push({
        text: chunk,
        index,
        tokens: Math.ceil(chunk.length / 4)
      });
      index++;
    }

    // Stop if we've reached the end
    if (end >= text.length) break;
  }

  logger.debug('Text chunked', {
    originalLength: text.length,
    chunkCount: chunks.length,
    avgChunkTokens: Math.round(
      chunks.reduce((sum, c) => sum + c.tokens, 0) / chunks.length
    )
  });

  return chunks;
}
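
With the defaults, the window is 1000 × 4 = 4000 characters and the overlap is 200 × 4 = 800, so each chunk starts 3200 characters after the previous one. The sketch below estimates how many chunks (and therefore embedding inputs) a text of a given length produces. `estimateChunkCount` is an illustrative helper, and it ignores the sentence-boundary adjustment in `chunkText`, which can shorten a chunk and occasionally add one more at the end.

```typescript
function estimateChunkCount(
  textLength: number,
  windowTokens: number = 1000,
  overlapTokens: number = 200
): number {
  const charWindow = windowTokens * 4;              // 1 token ≈ 4 chars
  const step = (windowTokens - overlapTokens) * 4;  // distance between chunk starts
  if (step <= 0) {
    throw new Error('overlapTokens must be smaller than windowTokens');
  }
  if (textLength <= 0) return 0;
  if (textLength <= charWindow) return 1;
  // One full window, then one more chunk per step needed to reach the end
  return 1 + Math.ceil((textLength - charWindow) / step);
}

console.log(estimateChunkCount(10000)); // 3 chunks, starting at 0, 3200 and 6400
```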

👨🦳 Uncle: This is the heart. Where everything lives or dies.

// src/services/retrieval.ts

import db from '../config/database';
import { embedText } from '../config/embedding';
import logger from '../utils/logger';

interface RetrievalResult {
  chunkText: string;
  chunkIndex: number;
  vectorDistance: number;
  keywordScore: number;
  combinedScore: number;
}

/**
 * Retrieve relevant chunks using hybrid search.
 * 
 * ⚠️ FAILURE POINTS:
 * 1. Missing tenant_id check → DATA BREACH
 * 2. Vector index not built → Slow queries (10s+ instead of 100ms)
 * 3. Query too long → API error
 * 4. No results → Need to handle gracefully
 * 5. Typos in query → Keyword search might fail
 */
export async function hybridSearch(
  tenantId: string,
  resumeId: string,
  query: string,
  topK: number = 5
): Promise<RetrievalResult[]> {
  const startTime = Date.now();

  try {
    // Validate inputs
    if (!tenantId || !resumeId) {
      throw new Error('tenant_id and resume_id are required');
    }

    if (query.length === 0) {
      throw new Error('Query cannot be empty');
    }

    if (query.length > 500) {
      logger.warn('Query truncated', { originalLength: query.length });
      query = query.substring(0, 500);
    }

    // Step 1: Get query embedding
    logger.debug('Embedding query', { query });
    const queryEmbedding = await embedText(query);

    // Step 2: Vector search (fast)
    // Convert embedding to PostgreSQL format: [0.1, 0.2, ...]
    const embeddingString = `[${queryEmbedding.join(',')}]`;

    const vectorResults = await db.manyOrNone(`
      SELECT 
        chunk_text,
        chunk_index,
        embedding <=> $1::vector AS vector_distance
      FROM resume_chunks
      WHERE 
        tenant_id = $2
        AND resume_id = $3
      ORDER BY vector_distance ASC
      LIMIT $4
    `, [embeddingString, tenantId, resumeId, topK * 2]); // Get 2x to filter

    if (vectorResults.length === 0) {
      logger.warn('No vector results found', { query, resumeId });
      return [];
    }

    // Step 3: Keyword search (precision)
    // Chunks that also match the query terms get a score boost in Step 4
    const keywordResults = await db.manyOrNone(`
      SELECT 
        chunk_text,
        chunk_index,
        ts_rank(
          to_tsvector('english', chunk_text), 
          plainto_tsquery('english', $1)
        ) AS keyword_score
      FROM resume_chunks
      WHERE 
        tenant_id = $2
        AND resume_id = $3
        AND to_tsvector('english', chunk_text) @@ 
            plainto_tsquery('english', $1)
      ORDER BY keyword_score DESC
      LIMIT $4
    `, [query, tenantId, resumeId, topK]);

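    // Step 4: Combine results
    // Chunks that appear in both vector AND keyword search are best.
    // Rows come back with snake_case columns, so map them onto the
    // camelCase RetrievalResult shape that callers read.
    const combined: RetrievalResult[] = vectorResults
      .map(vr => {
        const kr = keywordResults.find(k => k.chunk_index === vr.chunk_index);
        const keywordScore = kr ? Number(kr.keyword_score) : 0;
        const vectorDistance = Number(vr.vector_distance);
        return {
          chunkText: vr.chunk_text,
          chunkIndex: vr.chunk_index,
          vectorDistance,
          keywordScore,
          // Weighted score: 60% vector, 40% keyword
          combinedScore: (1 - vectorDistance) * 0.6 + keywordScore * 0.4
        };
      })
      .sort((a, b) => b.combinedScore - a.combinedScore)
      .slice(0, topK);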

    const latency = Date.now() - startTime;

    logger.info('Hybrid search complete', {
      query,
      resultsCount: combined.length,
      latency,
      vectorResultsCount: vectorResults.length,
      keywordResultsCount: keywordResults.length
    });

    // Log for metrics
    if (combined.length > 0) {
      await db.none(`
        INSERT INTO query_logs (tenant_id, query, latency_ms)
        VALUES ($1, $2, $3)
      `, [tenantId, query.substring(0, 255), latency]);
    }

    return combined as RetrievalResult[];

  } catch (error: any) {
    logger.error('Retrieval error', {
      error: error.message,
      query,
      resumeId,
      tenantId
    });
    throw error;
  }
}

/**
 * Multi-query retrieval - search with multiple variations.
 * 
 * Better recall, but slower and more expensive.
 */
export async function multiQueryRetrieval(
  tenantId: string,
  resumeId: string,
  queries: string[],
  topK: number = 5
): Promise<RetrievalResult[]> {
  try {
    const allResults: RetrievalResult[] = [];

    for (const query of queries) {
      const results = await hybridSearch(tenantId, resumeId, query, topK * 2);
      allResults.push(...results);
    }

    // Deduplicate by chunk text, keep highest score
    const unique = Array.from(
      allResults
        .reduce((map, item) => {
          const existing = map.get(item.chunkText);
          if (!existing || item.combinedScore > existing.combinedScore) {
            map.set(item.chunkText, item);
          }
          return map;
        }, new Map<string, RetrievalResult>())
        .values()
    );

    return unique
      .sort((a, b) => b.combinedScore - a.combinedScore)
      .slice(0, topK);

  } catch (error) {
    logger.error('Multi-query retrieval error', { error });
    throw error;
  }
}
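
The 60/40 weighting in `hybridSearch` mixes two different scales: cosine distance from `<=>` lies between 0 and 2, while `ts_rank` is not normalized to a fixed range by default. Here is a sketch of the same weighting as a pure function, with both inputs clamped to [0, 1] first. The clamping and the `combineScores` name are assumptions added for illustration, not behaviour of the code above.

```typescript
function clamp01(value: number): number {
  return Math.min(1, Math.max(0, value));
}

// Hypothetical helper mirroring the 60% vector / 40% keyword weighting.
function combineScores(
  vectorDistance: number,
  keywordScore: number,
  vectorWeight: number = 0.6
): number {
  const vectorSimilarity = clamp01(1 - vectorDistance);
  const keyword = clamp01(keywordScore);
  return vectorSimilarity * vectorWeight + keyword * (1 - vectorWeight);
}

// A close vector match with no keyword hit vs. a weaker match with a keyword hit
console.log(combineScores(0.1, 0).toFixed(2));   // "0.54"
console.log(combineScores(0.3, 0.5).toFixed(2)); // "0.62"
```

The second chunk wins despite the larger vector distance, which is the point of the keyword boost. Whether 60/40 is the right split is something to tune against your own labelled queries.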

👨🦳 Uncle: Two-stage is where quality happens. First stage is fast, second is accurate.

// src/services/reranking.ts

import Anthropic from '@anthropic-ai/sdk';
import logger from '../utils/logger';

interface RerankedResult {
  text: string;
  score: number;
  rank: number;
}

/**
 * Rerank chunks using Claude (more accurate but slower).
 * 
 * ⚠️ FAILURE POINTS:
 * 1. Claude API timeout (fix with timeout wrapper)
 * 2. Chunks too long (truncate before sending)
 * 3. Response parsing fails
 * 4. Cost explosion (reranking costs money - track it)
 */
export async function rerank(
  query: string,
  chunks: string[],
  topK: number = 5
): Promise<RerankedResult[]> {
  const startTime = Date.now();

  try {
    if (chunks.length === 0) {
      return [];
    }

    const client = new Anthropic({
      apiKey: process.env.ANTHROPIC_API_KEY,
      timeout: 30 * 1000, // 30 second timeout
    });

    // Truncate chunks to prevent token overflow
    const truncatedChunks = chunks.map(c => 
      c.length > 2000 ? c.substring(0, 2000) + '...' : c
    );

    // Build reranking prompt
    const chunksFormatted = truncatedChunks
      .map((chunk, i) => `[${i}] ${chunk}`)
      .join('\n\n---\n\n');

    const prompt = `You are a search relevance expert. Rank the following chunks by relevance to the query.

Query: "${query}"

Chunks to rank:
${chunksFormatted}

Return ONLY valid JSON with this format:
{
  "rankings": [
    {"index": 0, "relevance_score": 0.95},
    {"index": 1, "relevance_score": 0.72}
  ]
}

Relevance score: 0.0 (irrelevant) to 1.0 (highly relevant)
Sort by relevance_score descending.`;

    const response = await client.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 1024,
      messages: [{
        role: 'user',
        content: prompt
      }]
    });

    // Parse response
    const responseText = response.content[0].type === 'text'
      ? response.content[0].text
      : '';

    let rankings: any[];
    try {
      // Extract JSON from response (might be wrapped in markdown)
      const jsonMatch = responseText.match(/\{[\s\S]*\}/);
      const jsonStr = jsonMatch ? jsonMatch[0] : responseText;
      const parsed = JSON.parse(jsonStr);
      rankings = parsed.rankings || [];
    } catch (parseError) {
      logger.error('Failed to parse reranking response', {
        response: responseText.substring(0, 500)
      });
      // Fallback: return original order
      return chunks.slice(0, topK).map((text, i) => ({
        text,
        score: 1.0 - (i * 0.1),
        rank: i + 1
      }));
    }

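    // Convert to results. Sort here rather than trusting the model's order,
    // and drop entries whose index does not point at a real chunk.
    const results = rankings
      .filter(r => Number.isInteger(r.index) && r.index >= 0 && r.index < chunks.length)
      .sort((a, b) => b.relevance_score - a.relevance_score)
      .map((r, rank) => ({
        text: chunks[r.index],
        score: r.relevance_score,
        rank: rank + 1
      }))
      .slice(0, topK);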

    const latency = Date.now() - startTime;

    logger.info('Reranking complete', {
      query,
      inputCount: chunks.length,
      outputCount: results.length,
      latency,
      topScore: results[0]?.score
    });

    return results;

  } catch (error: any) {
    logger.error('Reranking error', {
      error: error.message,
      chunksCount: chunks.length,
      query: query.substring(0, 100)
    });

    // Fallback: return original order
    return chunks.slice(0, topK).map((text, i) => ({
      text,
      score: 1.0 - (i * 0.1),
      rank: i + 1
    }));
  }
}
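
The reranker depends on the model returning well-formed rankings. Here is a sketch of a stricter parser that can be tested without calling the API: it pulls the JSON object out of the response, keeps only entries with a valid index and a numeric score, drops duplicate indices, and sorts by score itself. `parseRankings` and the `Ranking` type are illustrative names.

```typescript
interface Ranking {
  index: number;
  relevance_score: number;
}

function parseRankings(responseText: string, chunkCount: number): Ranking[] {
  // The model may wrap the JSON in markdown or prose, so extract the outermost object
  const jsonMatch = responseText.match(/\{[\s\S]*\}/);
  if (!jsonMatch) return [];

  let parsed: unknown;
  try {
    parsed = JSON.parse(jsonMatch[0]);
  } catch (e) {
    return [];
  }

  const raw = (parsed as { rankings?: unknown }).rankings;
  if (!Array.isArray(raw)) return [];

  const seen = new Set<number>();
  const valid: Ranking[] = [];
  for (const entry of raw) {
    const index = entry?.index;
    const score = entry?.relevance_score;
    // Skip entries that point outside the chunk list or lack a numeric score
    if (!Number.isInteger(index) || index < 0 || index >= chunkCount) continue;
    if (typeof score !== 'number' || !Number.isFinite(score)) continue;
    if (seen.has(index)) continue; // keep the first mention of each chunk
    seen.add(index);
    valid.push({ index, relevance_score: score });
  }

  return valid.sort((a, b) => b.relevance_score - a.relevance_score);
}
```

An empty result can be treated the same way as a parse failure in `rerank`: fall back to the original retrieval order.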

👨🦳 Uncle: Expand the query so you find more relevant chunks.

// src/services/queryProcessing.ts

import Anthropic from '@anthropic-ai/sdk';
import logger from '../utils/logger';

/**
 * Expand a query into related search terms.
 * 
 * ⚠️ FAILURE POINTS:
 * 1. LLM generates irrelevant expansions
 * 2. Original query lost in expansion
 * 3. Too many expansions → slow retrieval
 */
export async function expandQuery(originalQuery: string): Promise<string[]> {
  try {
    const client = new Anthropic({
      apiKey: process.env.ANTHROPIC_API_KEY
    });

    const response = await client.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 200,
      messages: [{
        role: 'user',
        content: `Given this query about a job candidate, generate 2-3 alternative phrasings or related concepts that would help find relevant information.

Original query: "${originalQuery}"

Return ONLY a JSON array of strings:
["alternative1", "alternative2", "alternative3"]

These should help find the same information using different keywords.`
      }]
    });

    const responseText = response.content[0].type === 'text'
      ? response.content[0].text
      : '';

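    let expansions: string[];
    try {
      // The model may wrap the array in markdown or prose, so extract it first
      const arrayMatch = responseText.match(/\[[\s\S]*\]/);
      const parsed = JSON.parse(arrayMatch ? arrayMatch[0] : responseText);
      if (!Array.isArray(parsed)) throw new Error('Expected a JSON array');
      // Keep only strings, and cap the count so retrieval stays fast
      expansions = parsed
        .filter((item): item is string => typeof item === 'string')
        .slice(0, 3);
    } catch (e) {
      logger.warn('Failed to parse query expansion', { response: responseText });
      return [originalQuery];
    }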

    // Always include original query
    const allQueries = [originalQuery, ...expansions].filter(Boolean);

    logger.debug('Query expanded', {
      original: originalQuery,
      expansions: allQueries.length
    });

    return allQueries;

  } catch (error) {
    logger.error('Query expansion error', { error });
    return [originalQuery]; // Fallback
  }
}

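/**
 * Normalize query (lowercase, collapse whitespace, strip punctuation).
 * Note: this does not correct typos, and stripping punctuation also
 * changes terms such as "C++" (becomes "c") or "Node.js" (becomes "nodejs").
 */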
export function normalizeQuery(query: string): string {
  return query
    .toLowerCase()
    .trim()
    // Remove extra spaces
    .replace(/\s+/g, ' ')
    // Remove special characters (keep alphanumeric and spaces)
    .replace(/[^\w\s]/g, '');
}
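
`normalizeQuery` strips every non-word character, which also turns "C++" into "c" and "Node.js" into "nodejs". For resume search those tokens carry meaning in the text sent to the embedding and reranking steps. Here is a sketch of a variant that keeps `+`, `#`, and dots inside a term. `normalizeTechQuery` is a hypothetical name, and the set of preserved characters is an assumption you would tune to your data.

```typescript
function normalizeTechQuery(query: string): string {
  return query
    .toLowerCase()
    // Replace anything that is not a word character, whitespace, '+', '#' or '.'
    .replace(/[^\w\s+#.]/g, ' ')
    // Drop dots that end a term (sentence punctuation), keep dots inside terms
    .replace(/\.(?!\w)/g, ' ')
    // Collapse runs of whitespace and trim the ends
    .replace(/\s+/g, ' ')
    .trim();
}

console.log(normalizeTechQuery('  Senior C++ / Node.js dev!! ')); // "senior c++ node.js dev"
console.log(normalizeTechQuery('Knows C#, .NET and Go.'));        // "knows c# .net and go"
```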

👨🦳 Uncle: This is where you prevent the AI from lying. Critical.

// src/services/safety.ts

import Anthropic from '@anthropic-ai/sdk';
import logger from '../utils/logger';

interface SafeAnswer {
  answer: string;
  confidence: number;
  evidence: string[];
  isSafe: boolean;
  reason?: string;
}

/**
 * Get answer from AI with multiple safety layers.
 * 
 * ⚠️ FAILURE POINTS:
 * 1. AI answer not in JSON format
 * 2. Confidence calculation wrong
 * 3. Evidence doesn't exist in chunks
 * 4. Excessive cost for failed attempts
 */
export async function safeAnswer(
  query: string,
  chunks: string[],
  confidenceThreshold: number = 0.7
): Promise<SafeAnswer> {
  const startTime = Date.now();

  try {
    if (chunks.length === 0) {
      return {
        answer: 'No relevant information found.',
        confidence: 0,
        evidence: [],
        isSafe: false,
        reason: 'No source chunks provided'
      };
    }

    const client = new Anthropic({
      apiKey: process.env.ANTHROPIC_API_KEY
    });

    // Layer 1: Retrieval boundaries
    // Show ONLY the chunks, nothing from training
    const chunksText = chunks
      .map((c, i) => `[Chunk ${i}]\n${c}`)
      .join('\n\n---\n\n');

    const prompt = `You are evaluating a candidate resume based on specific chunks.

INSTRUCTIONS:
1. Answer ONLY based on the provided chunks
2. Do NOT use any knowledge from training data
3. If information is not in chunks, say "Unknown"
4. Always cite which chunk supports your answer
5. Return valid JSON ONLY - no other text

Query: "${query}"

Chunks provided:
${chunksText}

Return JSON in this exact format:
{
  "answer": "your answer here",
  "confidence": 0.0 to 1.0,
  "evidence_chunks": [0, 1, 2],
  "explanation": "why you're confident"
}`;

    // Layer 2: Structured output
    const response = await client.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 500,
      messages: [{
        role: 'user',
        content: prompt
      }]
    });

    const responseText = response.content[0].type === 'text'
      ? response.content[0].text
      : '';

    // Parse response
    let parsed: any;
    try {
      const jsonMatch = responseText.match(/\{[\s\S]*\}/);
      const jsonStr = jsonMatch ? jsonMatch[0] : responseText;
      parsed = JSON.parse(jsonStr);
    } catch (e) {
      logger.error('Failed to parse safety response', {
        response: responseText.substring(0, 300)
      });
      return {
        answer: 'Error processing answer',
        confidence: 0,
        evidence: [],
        isSafe: false,
        reason: 'Invalid response format'
      };
    }

    // Layer 3: Validation
    // Check evidence chunks actually exist
    const validEvidenceIndices = (parsed.evidence_chunks || [])
      .filter((i: number) => i >= 0 && i < chunks.length);

    if (validEvidenceIndices.length === 0 && parsed.answer !== 'Unknown') {
      logger.warn('No valid evidence for answer', { 
        answer: parsed.answer,
        requestedIndices: parsed.evidence_chunks,
        chunksCount: chunks.length
      });
    }

    const evidence = validEvidenceIndices.map((i: number) => chunks[i]);

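    // Layer 4: Confidence gating
    // Treat a missing or non-numeric confidence as 0, and require cited
    // evidence for any answer other than "Unknown".
    const confidence = typeof parsed.confidence === 'number' ? parsed.confidence : 0;
    const hasSupport = evidence.length > 0 || parsed.answer === 'Unknown';
    const isSafe = confidence >= confidenceThreshold && hasSupport;

    if (!isSafe) {
      logger.warn('Low confidence or unsupported answer', {
        answer: parsed.answer,
        confidence: parsed.confidence,
        threshold: confidenceThreshold,
        hasSupport
      });
    }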

    const latency = Date.now() - startTime;

    logger.info('Safe answer generated', {
      query: query.substring(0, 50),
      confidence: parsed.confidence,
      isSafe,
      latency,
      evidenceCount: evidence.length
    });

    return {
      answer: parsed.answer,
      confidence: parsed.confidence,
      evidence,
      isSafe,
      reason: isSafe ? 'Confident' : 'Low confidence'
    };

  } catch (error: any) {
    logger.error('Safety check error', { error: error.message });
    return {
      answer: 'Error',
      confidence: 0,
      evidence: [],
      isSafe: false,
      reason: error.message
    };
  }
}

/**
 * Validate that answer is faithful to evidence.
 * Post-check: does answer match the chunks?
 */
export async function validateFaithfulness(
  answer: string,
  evidence: string[],
  threshold: number = 0.8
): Promise<{ isFaithful: boolean; score: number }> {
  try {
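    // Simple check: what fraction of the answer's key terms appear in the evidence?
    // Key terms are words longer than 3 characters with punctuation stripped,
    // so short filler words do not drag the score down.
    const keyTerms = answer
      .toLowerCase()
      .split(/\s+/)
      .map(term => term.replace(/[^\w]/g, ''))
      .filter(term => term.length > 3);
    const evidenceText = evidence.join(' ').toLowerCase();

    const matchedTerms = keyTerms.filter(term => evidenceText.includes(term));

    const score = keyTerms.length > 0 
      ? matchedTerms.length / keyTerms.length 
      : 0;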

    return {
      isFaithful: score >= threshold,
      score
    };

  } catch (error) {
    logger.error('Faithfulness validation error', { error });
    return { isFaithful: false, score: 0 };
  }
}
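
The validation and gating layers in `safeAnswer` can be pulled into a pure function, which makes the rules easy to unit test without an API call. This sketch assumes the parsed model output has the `answer`, `confidence`, and `evidence_chunks` fields requested in the prompt. `gateAnswer` and its return shape are illustrative, and the rule that any answer other than "Unknown" needs at least one valid citation is a stricter policy chosen for this example.

```typescript
interface ModelOutput {
  answer?: unknown;
  confidence?: unknown;
  evidence_chunks?: unknown;
}

interface GateResult {
  isSafe: boolean;
  reason: string;
  evidenceIndices: number[];
}

function gateAnswer(
  output: ModelOutput,
  chunkCount: number,
  confidenceThreshold: number = 0.7
): GateResult {
  // Keep only citations that point at a chunk we actually sent
  const cited: unknown[] = Array.isArray(output.evidence_chunks) ? output.evidence_chunks : [];
  const evidenceIndices = cited.filter(
    (i): i is number => typeof i === 'number' && Number.isInteger(i) && i >= 0 && i < chunkCount
  );

  // A missing or non-numeric confidence counts as zero
  const confidence = typeof output.confidence === 'number' ? output.confidence : 0;

  if (output.answer !== 'Unknown' && evidenceIndices.length === 0) {
    return { isSafe: false, reason: 'No valid evidence cited', evidenceIndices };
  }
  if (confidence < confidenceThreshold) {
    return { isSafe: false, reason: 'Low confidence', evidenceIndices };
  }
  return { isSafe: true, reason: 'Confident', evidenceIndices };
}
```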

👨🦳 Uncle: This is what the client calls. Make it robust.

// src/routes/rag.ts

import express, { Router, Request, Response } from 'express';
import db from '../config/database';
import { hybridSearch, multiQueryRetrieval } from '../services/retrieval';
import { rerank } from '../services/reranking';
import { expandQuery } from '../services/queryProcessing';
import { safeAnswer, validateFaithfulness } from '../services/safety';
import { chunkText } from '../utils/chunking';
import { getEmbeddings } from '../config/embedding';
import logger from '../utils/logger';

const router = Router();

// Middleware: Check authentication
function authMiddleware(req: Request, res: Response, next: express.NextFunction) {
  const apiKey = req.headers['x-api-key'] as string;

  if (!apiKey) {
    return res.status(401).json({ error: 'Missing API key' });
  }

  // In production, validate against database
  if (apiKey !== process.env.ADMIN_API_KEY) {
    return res.status(401).json({ error: 'Invalid API key' });
  }

  next();
}

router.use(authMiddleware);

/**
 * Upload and process a resume.
 * POST /rag/upload
 */
router.post('/upload', async (req: Request, res: Response) => {
  try {
    const { tenantId, candidateName, resumeText } = req.body;

    if (!tenantId || !candidateName || !resumeText) {
      return res.status(400).json({ 
        error: 'Missing required fields: tenantId, candidateName, resumeText' 
      });
    }

    // Step 1: Save resume
    const resumeResult = await db.one(`
      INSERT INTO resumes (tenant_id, candidate_name, raw_text)
      VALUES ($1, $2, $3)
      RETURNING id
    `, [tenantId, candidateName, resumeText]);

    const resumeId = resumeResult.id;

    // Step 2: Chunk the resume
    const chunks = chunkText(resumeText, 1000, 200);
    logger.info('Resume chunked', { resumeId, chunkCount: chunks.length });

    // Step 3: Get embeddings for all chunks
    const chunkTexts = chunks.map(c => c.text);
    const embeddingResults = await getEmbeddings(chunkTexts);

    // Step 4: Save chunks with embeddings in one transaction, so a failed
    // insert does not leave a partially indexed resume behind
    await db.tx(async t => {
      for (let i = 0; i < chunks.length; i++) {
        const chunk = chunks[i];
        const embedding = embeddingResults[i].embedding;
        const embeddingArray = `[${embedding.join(',')}]`;

        await t.none(`
          INSERT INTO resume_chunks 
          (resume_id, tenant_id, chunk_text, chunk_index, embedding)
          VALUES ($1, $2, $3, $4, $5::vector)
        `, [resumeId, tenantId, chunk.text, chunk.index, embeddingArray]);
      }
    });

    logger.info('Resume uploaded successfully', { resumeId, chunkCount: chunks.length });

    res.json({
      success: true,
      resumeId,
      chunkCount: chunks.length,
      message: `Resume for ${candidateName} processed successfully`
    });

  } catch (error: any) {
    logger.error('Upload error', { error: error.message });
    res.status(500).json({ error: error.message });
  }
});

/**
 * Query a resume.
 * POST /rag/query
 */
router.post('/query', async (req: Request, res: Response) => {
  try {
    const { tenantId, resumeId, question, useExpansion = false } = req.body;

    if (!tenantId || !resumeId || !question) {
      return res.status(400).json({
        error: 'Missing required fields: tenantId, resumeId, question'
      });
    }

    const startTime = Date.now();

    // Step 1: Expand query if requested
    let queries = [question];
    if (useExpansion) {
      queries = await expandQuery(question);
      logger.debug('Query expanded', { count: queries.length });
    }

    // Step 2: Retrieve chunks (multi-query if expanded)
    const retrieved = useExpansion
      ? await multiQueryRetrieval(tenantId, resumeId, queries, 10)
      : await hybridSearch(tenantId, resumeId, question, 10);

    if (retrieved.length === 0) {
      return res.json({
        answer: 'No relevant information found in resume.',
        confidence: 0,
        evidence: [],
        isSafe: false,
        latency: Date.now() - startTime
      });
    }

    // Step 3: Rerank for accuracy
    const chunks = retrieved.map(r => r.chunkText);
    const reranked = await rerank(question, chunks, 5);
    const topChunks = reranked.map(r => r.text);

    // Step 4: Get safe answer with evidence
    const safeAns = await safeAnswer(question, topChunks, 0.7);

    // Step 5: Validate faithfulness (optional)
    const faithfulness = await validateFaithfulness(safeAns.answer, safeAns.evidence);

    const latency = Date.now() - startTime;

    res.json({
      success: true,
      answer: safeAns.answer,
      confidence: safeAns.confidence,
      evidence: safeAns.evidence,
      isSafe: safeAns.isSafe,
      faithfulness: faithfulness.score,
      latency,
      chunksRetrieved: retrieved.length,
      chunksReranked: reranked.length
    });

  } catch (error: any) {
    logger.error('Query error', { error: error.message });
    res.status(500).json({ error: error.message });
  }
});

/**
 * Get metrics for a tenant.
 * GET /rag/metrics/:tenantId
 */
router.get('/metrics/:tenantId', async (req: Request, res: Response) => {
  try {
    const { tenantId } = req.params;

    // Query logs aggregation
    const metrics = await db.one(`
      SELECT 
        COUNT(*) as query_count,
        AVG(latency_ms) as avg_latency,
        MAX(latency_ms) as max_latency,
        MIN(latency_ms) as min_latency,
        AVG(recall) as avg_recall,
        AVG(precision) as avg_precision,
        SUM(cost_cents) / 100.0 as total_cost_dollars
      FROM query_logs
      WHERE tenant_id = $1
    `, [tenantId]);

    res.json({
      success: true,
      metrics
    });

  } catch (error: any) {
    logger.error('Metrics error', { error: error.message });
    res.status(500).json({ error: error.message });
  }
});

export default router;
// src/index.ts

import express from 'express';
import cors from 'cors';
import compression from 'compression';
import helmet from 'helmet';
import dotenv from 'dotenv';
import ragRoutes from './routes/rag';
import { initializeDatabase } from './config/database';
import logger from './utils/logger';

dotenv.config();

const app = express();
const PORT = process.env.PORT || 3000;

// Middleware
app.use(helmet()); // Security headers
app.use(compression()); // Compress responses
app.use(cors());
app.use(express.json());

// Health check
app.get('/health', (req, res) => {
  res.json({ status: 'ok', timestamp: new Date().toISOString() });
});

// RAG routes
app.use('/rag', ragRoutes);

// Error handler
app.use((err: any, req: express.Request, res: express.Response, next: express.NextFunction) => {
  logger.error('Unhandled error', { error: err.message });
  res.status(500).json({ error: 'Internal server error' });
});

// Start server
async function start() {
  try {
    // Initialize database
    await initializeDatabase();

    app.listen(PORT, () => {
      logger.info(`Server running on port ${PORT}`);
    });
  } catch (error) {
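    // Error objects serialize to {} in JSON logs, so log the message explicitly
    logger.error('Failed to start server', {
      error: error instanceof Error ? error.message : String(error)
    });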
    process.exit(1);
  }
}

start();
// src/utils/logger.ts

import winston from 'winston';

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [
    // Console in development
    new winston.transports.Console({
      format: winston.format.combine(
        winston.format.colorize(),
        winston.format.printf(({ timestamp, level, message, ...meta }) => {
          return `${timestamp} [${level}] ${message} ${
            Object.keys(meta).length ? JSON.stringify(meta, null, 2) : ''
          }`;
        })
      )
    }),
    // File for production
    new winston.transports.File({ 
      filename: 'logs/error.log',
      level: 'error'
    }),
    new winston.transports.File({ 
      filename: 'logs/combined.log'
    })
  ]
});

export default logger;

Failure Point	Symptom	Root Cause	Fix
Missing tenant_id	Data leak between companies	No isolation check	Add WHERE tenant_id = X to EVERY query
Vector index missing	Queries take 10+ seconds	Sequential scan of 500K vectors	Create an IVFFlat index on the embedding column
Query too long	API error: 4096 tokens exceeded	Question >16000 chars	Truncate queries to 500 chars
No results	Empty array returned	Chunks don't exist or embeddings are wrong	Check that chunks were saved; verify the vector distance threshold
Hallucination	AI invents information	No retrieval boundaries in prompt	Use safety layers (5 layers as described)
Rate limit (429)	API call fails	Too many requests to Claude	Implement exponential backoff, queue requests
Database connection lost	"Cannot connect to server"	Network issue, DB down, wrong credentials	Add retry logic, connection pooling, health checks
Embedding dimension mismatch	"Vector dimension 1536 != 768"	Using a different embedding model	Use the same embedding model (and dimension) for indexing and querying
Memory overload	Node.js crashes	Trying to embed an entire 100MB file	Chunk before embedding, process in batches
Cost explosion	Unexpected $10k bill	Each embedding/rerank/answer call costs money	Track costs, log them, set spending limits
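
The "Memory overload" fix (chunk before embedding, process in batches) can be sketched roughly like this. It assumes fixed-size character chunks with overlap; `chunkText`, `inBatches`, and the default sizes are hypothetical names and values for illustration, not the chunker used elsewhere in this guide.

```typescript
// Hypothetical helper: split text into fixed-size chunks that overlap,
// so a sentence cut at a boundary still appears whole in one chunk.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error('Require chunkSize > 0 and 0 <= overlap < chunkSize');
  }
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this chunk reached the end
  }
  return chunks;
}

// Hypothetical helper: group items so only one batch is embedded at a time.
function inBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize <= 0) throw new Error('batchSize must be positive');
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

Embedding one batch at a time, and awaiting each before starting the next, keeps memory bounded by the batch size rather than the file size.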

Without RAG: "Does John know Docker?" → Guess → "maybe, looks like it"
With RAG: "Does John know Docker?" → Evidence → "Yes. His resume says: 'Docker, Kubernetes, 4 years'"
Complexity: a single LLM call is simple. RAG chains Embeddings → Vector DB → Retrieval → Reranking → Safety checks, and each stage can fail independently.

Semantic gaps. Example problem:

Resume: "Worked with distributed ledger technology"
Search: "blockchain"
Result: Miss (the embeddings may not place "distributed ledger technology" close to "blockchain")

Cost: at 100K queries/month, API calls alone run to several hundred dollars per month (not including infrastructure, salaries, etc.).

Latency: users expect <200ms. RAG adds delay.

Data quality: a resume full of typos ("React" → "Rreact") confuses the embeddings
→ System can't find React skills
→ Answer is wrong

Scenario	Use RAG?	Why
Customer support	YES	Up-to-date, explainable, no hallucinations
Medical diagnosis	YES	Safety-critical, needs evidence
Resume screening	YES	Domain-specific, needs accuracy
General chatbot	NO	Training data sufficient, latency matters
Quick facts	NO	Simple lookup is faster
Creative writing	NO	Hallucinations are features, not bugs
Code search	MAYBE	Depends on code freshness
Legal documents	YES	Must cite sources, no mistakes

1. Simple lookup → Use database
2. Conversational → Use base LLM
3. Speed critical → Too slow (600ms+)
4. Data quality poor → Garbage in/out
5. Training data sufficient → No value-add
6. Cost-sensitive → Each query costs money

Per-Query Costs (approximate; LLM prices are per 1K tokens):

1. Embedding query:
   - 50 tokens @ $0.000003/token = $0.00015

2. Vector search + keyword filter:
   - Database operation ≈ $0 (hosted: ~$0.00001)

3. Reranking (Claude):
   - 500 tokens input @ $0.003/1K + 100 tokens output @ $0.015/1K = $0.0015 + $0.0015 = $0.003

4. Final answer:
   - 500 tokens input @ $0.003/1K + 200 tokens output @ $0.015/1K = $0.0015 + $0.003 = $0.0045

Total per query: ≈ $0.0077 (~0.77 cents)

At scale:
- 1K queries/month ≈ $7.70
- 100K queries/month ≈ $770
- 1M queries/month ≈ $7,700
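
The same arithmetic fits in a small function, which makes it easy to re-run when prices or token counts change. This is a sketch under stated assumptions: the LLM prices are read as per 1K tokens, and `Prices`, `LlmCall`, `perQueryCost`, and `monthlyCost` are hypothetical names. Check your provider's current price list before trusting any figure.

```typescript
// Assumed prices in USD, taken from the estimate above.
interface Prices {
  embeddingPerToken: number; // e.g. 0.000003
  inputPer1K: number;        // e.g. 0.003
  outputPer1K: number;       // e.g. 0.015
}

interface LlmCall {
  inputTokens: number;
  outputTokens: number;
}

// Cost of one LLM call (reranking or final answer).
function llmCallCost(call: LlmCall, p: Prices): number {
  return (call.inputTokens / 1000) * p.inputPer1K
       + (call.outputTokens / 1000) * p.outputPer1K;
}

// Embedding + database + every LLM call made for a single query.
function perQueryCost(
  embeddingTokens: number,
  calls: LlmCall[],
  p: Prices,
  dbCost = 0.00001 // assumed hosted-database cost per query
): number {
  const embedding = embeddingTokens * p.embeddingPerToken;
  const llm = calls.reduce((sum, c) => sum + llmCallCost(c, p), 0);
  return embedding + dbCost + llm;
}

function monthlyCost(queriesPerMonth: number, costPerQuery: number): number {
  return queriesPerMonth * costPerQuery;
}

const prices: Prices = { embeddingPerToken: 0.000003, inputPer1K: 0.003, outputPer1K: 0.015 };
const costPerQuery = perQueryCost(50, [
  { inputTokens: 500, outputTokens: 100 }, // reranking
  { inputTokens: 500, outputTokens: 200 }, // final answer
], prices);
const at100K = monthlyCost(100000, costPerQuery);
```

Output tokens cost five times as much as input tokens at these prices, so shortening the final answer saves more than trimming the prompt by the same number of tokens.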

Cache results

Batch processing

Smart reranking

Cheaper models
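
Of the four ideas above, caching is usually the cheapest to try first. Here is a minimal in-memory sketch, assuming a single Node.js process (a shared store would be needed across several instances). `TtlCache` and `cacheKey` are hypothetical names, not part of the code earlier in this guide.

```typescript
// Hypothetical in-memory cache with a time-to-live per entry.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so expiry can be tested without waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: evict lazily on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// tenantId is part of the key so one tenant can never read another tenant's cached answer.
function cacheKey(tenantId: string, resumeId: string, question: string): string {
  return JSON.stringify([tenantId, resumeId, question.trim().toLowerCase()]);
}
```

A cache hit skips the embedding, reranking, and answer calls entirely. Remember to invalidate entries for a resume when it is re-uploaded, or keep the TTL short.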


FROM node:18-alpine

WORKDIR /app

COPY package*.json ./
RUN npm ci --omit=dev

COPY dist ./dist

EXPOSE 3000

CMD ["node", "dist/index.js"]

version: '3.8'
services:
  postgres:
    # The pgvector/pgvector image ships PostgreSQL with the pgvector extension
    # preinstalled, so no separate pgvector service is needed
    image: pgvector/pgvector:pg15
    environment:
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: rag_system
    volumes:
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"

  app:
    build: .
    environment:
      DB_HOST: postgres
      DB_USER: postgres
      DB_PASSWORD: postgres
      DB_NAME: rag_system
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
    ports:
      - "3000:3000"
    depends_on:
      - postgres

volumes:
  postgres_data:
curl -X POST http://localhost:3000/rag/upload \
  -H "Content-Type: application/json" \
  -H "X-API-Key: super-secret-key" \
  -d '{
    "tenantId": "company-1",
    "candidateName": "John Doe",
    "resumeText": "John has 5 years React experience, built e-commerce platforms with Node.js..."
  }'


curl -X POST http://localhost:3000/rag/query \
  -H "Content-Type: application/json" \
  -H "X-API-Key: super-secret-key" \
  -d '{
    "tenantId": "company-1",
    "resumeId": "uuid-123",
    "question": "Does John have React experience?",
    "useExpansion": true
  }'


curl -X GET http://localhost:3000/rag/metrics/company-1 \
  -H "X-API-Key: super-secret-key"

Problem: Queries are slow (10+ seconds)

Cause: Missing vector index

Fix:

CREATE INDEX ON resume_chunks 
USING ivfflat (embedding vector_cosine_ops) 
WITH (lists = 100);
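
The `lists = 100` value is a starting point, not a constant. pgvector's guidance is to start with rows / 1000 lists for up to 1M rows and sqrt(rows) above that, and to start `ivfflat.probes` at sqrt(lists). A small sketch of that rule of thumb follows; `suggestIvfflatParams` is a hypothetical helper, and the numbers it returns still need tuning against your own recall and latency measurements.

```typescript
// Starting points only, following pgvector's rule of thumb for IVFFlat.
function suggestIvfflatParams(rowCount: number): { lists: number; probes: number } {
  if (!Number.isFinite(rowCount) || rowCount < 0) {
    throw new Error('rowCount must be a non-negative number');
  }
  // rows / 1000 up to 1M rows, sqrt(rows) beyond that
  const rawLists = rowCount <= 1000000 ? rowCount / 1000 : Math.sqrt(rowCount);
  const lists = Math.max(1, Math.round(rawLists));
  // more probes means better recall and slower queries
  const probes = Math.max(1, Math.round(Math.sqrt(lists)));
  return { lists, probes };
}

// For the 500K-vector table mentioned earlier:
const suggested = suggestIvfflatParams(500000);
// Build the index WITH (lists = suggested.lists),
// then per session: SET ivfflat.probes = <suggested.probes>;
```

Build the index after the table has data in it. IVFFlat picks its cluster centers from the rows that exist at build time.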

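Problem: The same question returns different answers

Cause: Stale embeddings or non-deterministic reranking

Fix: Always embed with the same model and re-embed when a document changes; run reranking at temperature 0 and break score ties in a fixed order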

Problem: The AI invents information (hallucination)

Cause: Insufficient safety layers

Fix: Add confidence thresholding, require citations, validate faithfulness
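
Those three checks can sit in one gate that runs after the model drafts an answer. This is a sketch, not the `safeAnswer` implementation from earlier: it assumes `evidence` is a list of verbatim quotes, and `DraftAnswer`, `gateAnswer`, and the refusal text are hypothetical.

```typescript
// Assumed shape: evidence holds verbatim quotes taken from the retrieved chunks.
interface DraftAnswer {
  answer: string;
  confidence: number; // 0..1, as reported by the answering step
  evidence: string[];
}

interface GatedAnswer extends DraftAnswer {
  isSafe: boolean;
  reason?: string;
}

const REFUSAL = 'Not enough evidence in the resume to answer.';

function gateAnswer(draft: DraftAnswer, chunks: string[], minConfidence = 0.7): GatedAnswer {
  const refuse = (reason: string): GatedAnswer => ({
    answer: REFUSAL,
    confidence: draft.confidence,
    evidence: [],
    isSafe: false,
    reason,
  });

  // Check 1: confidence threshold
  if (draft.confidence < minConfidence) return refuse('confidence below threshold');

  // Check 2: citations are mandatory
  if (draft.evidence.length === 0) return refuse('no citations');

  // Check 3: every quote must be non-empty and appear in a retrieved chunk
  // (case and whitespace are normalized before comparing)
  const normalize = (s: string) => s.toLowerCase().replace(/\s+/g, ' ').trim();
  const haystack = chunks.map(normalize);
  const unsupported = draft.evidence.filter((quote) => {
    const q = normalize(quote);
    return q.length === 0 || !haystack.some((chunk) => chunk.includes(q));
  });
  if (unsupported.length > 0) return refuse('evidence not found in retrieved chunks');

  return { ...draft, isSafe: true };
}
```

Substring matching only proves the quote exists in the source. It does not prove the quote supports the answer, which is why a separate faithfulness check is still worth running.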

Problem: Data leaks between tenants

Cause: Missing tenant_id check in WHERE clause

Fix: Add WHERE tenant_id = $X to EVERY query
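
"EVERY query" is easy to say and easy to forget. One way to make forgetting harder is to route all SQL through a helper that refuses queries without the tenant filter. The sketch below is deliberately crude: `tenantScoped` is a hypothetical helper, it assumes the convention that `$1` is always the tenant ID, and a regex check is a guardrail, not a proof of isolation.

```typescript
interface ScopedQuery {
  text: string;
  values: unknown[];
}

// Hypothetical guard: rejects SQL that does not filter on tenant_id = $1,
// and always binds the tenant ID as the first parameter.
function tenantScoped(tenantId: string, sql: string, params: unknown[] = []): ScopedQuery {
  if (!tenantId || tenantId.trim() === '') {
    throw new Error('tenantId is required');
  }
  // (?!\d) stops "$10" from being accepted as "$1"
  if (!/\btenant_id\s*=\s*\$1(?!\d)/i.test(sql)) {
    throw new Error('Query must filter on tenant_id = $1');
  }
  return { text: sql, values: [tenantId, ...params] };
}

const scoped = tenantScoped(
  'company-1',
  'SELECT chunk_text FROM resume_chunks WHERE tenant_id = $1 AND resume_id = $2',
  ['uuid-123']
);
// With pg-promise this would be passed on as: db.any(scoped.text, scoped.values)
```

The column and table names in the usage line are assumptions for illustration. The point is the shape: the tenant ID comes from the authenticated request, never from a place the caller can leave out.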

Problem: Rate limit errors (429)

Cause: Too many rapid requests to Claude

Fix: Implement exponential backoff, queue requests, batch operations
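
Exponential backoff is only a few lines. The sketch below retries on HTTP 429 only and adds random jitter so that many clients do not retry in lockstep. It assumes the thrown error carries a numeric `status` field; `withBackoff`, its options, and the default delays are hypothetical.

```typescript
interface RetryOptions {
  maxRetries?: number;
  baseDelayMs?: number;
  maxDelayMs?: number;
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

// Assumption: rate-limit errors expose the HTTP status as `status`.
function isRateLimit(err: unknown): boolean {
  return typeof err === 'object' && err !== null
    && (err as { status?: number }).status === 429;
}

async function withBackoff<T>(fn: () => Promise<T>, opts: RetryOptions = {}): Promise<T> {
  const {
    maxRetries = 5,
    baseDelayMs = 500,
    maxDelayMs = 30000,
    sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
  } = opts;

  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Only rate limits are retried; other errors and exhausted retries propagate
      if (!isRateLimit(err) || attempt >= maxRetries) throw err;
      const cap = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
      await sleep(Math.random() * cap); // "full jitter": wait a random time up to the cap
    }
  }
}
```

Wrap each Claude call as `withBackoff(() => callClaude(...))`, where `callClaude` stands for whatever function makes the request. If the API response includes a retry-after hint, prefer that over the computed delay.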

Problem: Unexpected bills

Cause: No cost tracking, inefficient queries, excessive reranking

Fix: Log cost per operation, implement budgets, use cheaper models for easy tasks
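
A budget can start as a counter per tenant that is checked before each paid call. This is a minimal in-memory sketch with hypothetical names (`BudgetTracker`, `canSpend`, `record`). In the real system the totals would come from the `cost_cents` column that the metrics endpoint already aggregates.

```typescript
// Hypothetical per-tenant budget, tracked in cents to match cost_cents.
class BudgetTracker {
  private readonly limitCents: number;
  private readonly spentCents = new Map<string, number>();

  constructor(limitCents: number) {
    if (limitCents < 0) throw new Error('limitCents must be non-negative');
    this.limitCents = limitCents;
  }

  spent(tenantId: string): number {
    return this.spentCents.get(tenantId) ?? 0;
  }

  // Call before a paid operation with its estimated cost.
  canSpend(tenantId: string, estimatedCents: number): boolean {
    return this.spent(tenantId) + estimatedCents <= this.limitCents;
  }

  // Call after the operation with its actual cost.
  record(tenantId: string, costCents: number): void {
    this.spentCents.set(tenantId, this.spent(tenantId) + costCents);
  }
}
```

When `canSpend` returns false, you can reject the request, skip the reranking step, or switch to a cheaper model. Which one is right depends on what the tenant is paying for.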

👨🦳 Uncle's Final Word:

RAG is powerful but complex. Every layer serves a purpose:

You don't need all of this on day one. Start simple:

Day 1: PostgreSQL + embeddings + basic search

Week 1: Add reranking

Month 1: Add safety layers

Month 3: Add monitoring and optimization

Each layer buys you something. Know what you're buying.

👦 Nephew: When should I NOT use RAG?

👨🦳 Uncle: When:

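A simple lookup would do (use the database)
The task is purely conversational (use the base LLM)
Speed is critical (RAG adds 600ms+)
Data quality is poor (garbage in, garbage out)
The model's training data is sufficient (no value-add)
You are cost-sensitive (each query costs money)

Otherwise? RAG is the way.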

Now go build. Start simple. Measure everything. Ship fast.

Good luck. ---SurajK

Source: dev.to (original article)
