{"slug": "i-built-a-rag-pipeline-from-scratch-and-one-wrong-answer-made-me-dive-even-into", "title": "I built a RAG pipeline from scratch, and one wrong answer made me dive even deeper into AI Engineering", "summary": "A backend engineer built a RAG pipeline from scratch using Python, the Gemini API, and ChromaDB, avoiding high-level abstractions like LangChain to learn the fundamentals. After the system incorrectly answered a question about NestJS controllers in Portuguese, the engineer discovered that a chunking bug had caused the relevant document fragment to be excluded from the retrieved context. The experience deepened the engineer's understanding of AI engineering, particularly the critical role of chunking size and overlap in retrieval-augmented generation systems.", "body_md": "A backend engineer's first step into AI Engineering: embeddings, vector search, and the chunking bug that made everything click.\n\nI have been a backend engineer for a while now: TypeScript, NestJS, distributed systems, APIs in production. I like that work. But at some point I started paying attention to a specific career trajectory I came across: someone with a background almost identical to mine who had moved into AI Engineering. Not abandoned backend, extended it.\n\nThat reframed everything for me. This wasn't a pivot away from what I knew. It was a direction to grow into. And I decided to start from the fundamentals, not from the tooling.\n\nSo instead of installing LangChain and following a tutorial, I built a RAG pipeline from scratch, no abstractions, no magic. Just Python, the Gemini API, and ChromaDB. Here is what I learned.\n\nBefore writing a line of code, I needed a mental model that made sense to me as an engineer.\n\nRAG stands for Retrieval-Augmented Generation. The idea is simple: LLMs have frozen knowledge (their training cutoff) and a limited context window. You cannot feed an entire codebase or document library into a single prompt. 
RAG solves this by fetching only the relevant fragments at query time and injecting them into the context before the LLM responds.\n\nThink of it as hiring a brilliant consultant who knows nothing about your company. Instead of retraining them from scratch, you hand them the relevant documents before each meeting. That is RAG.\n\nThe pipeline has two phases:\n\n```\nINDEXING (runs once):\nDocument → chunking → embeddings → vector database\n\nQUERYING (runs on every question):\nQuestion → embedding → similarity search → top K chunks → LLM → answer\n```\n\nThe concept that unlocked everything for me was embeddings. An embedding is a vector, nothing more than a list of numbers, that represents the semantic meaning of a piece of text. Similar meanings produce similar vectors. Dissimilar meanings produce distant vectors.\n\nThis is not keyword matching. It is geometry. When you search a vector database, you are finding the nearest neighbors in a high-dimensional space. A question about \"payment processing failures\" can match a chunk that talks about \"error handling in transactions\", even if they share no words.\n\nThe model learned these relationships from co-occurrence patterns across billions of sentences. It never \"saw\" what a dog looks like, but it learned that \"dog\" and \"cat\" appear in similar contexts, pet care articles, veterinary advice, adoption stories, while \"car\" appears in entirely different ones. 
That contrast is encoded into their vector coordinates: dog and cat end up geometrically close, car ends up far away.\n\nIn my project, each chunk produced a vector with 3072 dimensions using gemini-embedding-001.\n\n```\nrag-project/\n├── src/\n│   ├── chunking.py      # text splitting logic\n│   ├── embeddings.py    # embedding generation via Gemini API\n│   ├── vector_store.py  # ChromaDB setup\n│   └── llm.py           # prompt construction and response generation\n├── main.py              # orchestrates the full pipeline\n└── .env                 # API keys\n```\n\nEach module exports only functions. No logic runs on import. main.py is the only place that decides what executes and in what order.\n\nChunking means dividing your document into fragments before generating embeddings. The size matters more than I expected.\n\n```python\ndef chunk_text(text, chunk_size=400, overlap=50):\n    # overlap must be smaller than chunk_size, or the window never advances\n    if not 0 <= overlap < chunk_size:\n        raise ValueError('overlap must be >= 0 and smaller than chunk_size')\n    chunks = []\n    start = 0\n    while start < len(text):\n        end = start + chunk_size\n        chunks.append(text[start:end])\n        if end >= len(text):\n            break  # last window reached; avoid a redundant overlap-only chunk\n        # step back by `overlap` so neighboring chunks share some context\n        start = end - overlap\n    return chunks\n```\n\nI asked the system (in Portuguese): \"O que são controllers no NestJS?\" (\"What are controllers in NestJS?\")\n\nThe response (in Portuguese): \"Não sabe.\" (\"Does not know.\")\n\nThe LLM was Gemini. Gemini absolutely knows what NestJS controllers are. I had explicitly instructed it to answer only from the provided context, so when the context was wrong, it answered honestly that it did not know.\n\nI inspected the context being sent to the model:\n\n```\nControllers no NestJS são responsáveis  os controllers via injeção de dependência. (\"Controllers in NestJS are responsible the controllers via dependency injection.\")\n```\n\nThe chunk had been cut in the middle of a sentence. The fix was increasing the chunk size from 200 to 400 characters (the default in the function above). The system then answered correctly.\n\nThis is the failure mode that matters in production RAG. The pipeline does not crash. 
It runs without errors and produces a wrong answer. The actual problem was upstream, in the chunking strategy.\n\nChunk size directly affects answer quality. Too small: the embedding captures a fragment without enough semantic content. Too large: the embedding averages over too much content and loses specificity.\n\nRAG is simpler to implement than I expected. The hard part is not the code; it is the judgment. Knowing when a chunk is too small. Knowing when retrieved context is semantically close but factually irrelevant. Knowing when to restrict the LLM to context and when to let it reason freely.\n\nThe libraries abstract the mechanics. The engineering is in the decisions around them.\n\nRetrieval quality determines answer quality. The LLM is the last step. If the chunks going in are wrong, no model, however capable, can produce a correct answer from them.\n\nThis was a minimal implementation on purpose. The next version will index a real corpus, the parsed books of A Song of Ice and Fire, with structure-aware chunking by chapter, metadata filters by POV character and book, and conversation history for a proper chatbot experience.\n\nAfter that: evals. Measuring whether the system actually answers correctly at scale is what separates a working demo from a production system.\n\nIf you are a backend engineer considering a move toward AI Engineering: start here. Build it without the frameworks first. 
The abstractions make much more sense once you know what they are hiding.", "url": "https://wpnews.pro/news/i-built-a-rag-pipeline-from-scratch-and-one-wrong-answer-made-me-dive-even-into", "canonical_source": "https://dev.to/felipearaujobs/i-built-a-rag-pipeline-from-scratch-and-one-wrong-answer-made-me-dive-even-deeper-into-ai-4npg", "published_at": "2026-05-30 02:53:17+00:00", "updated_at": "2026-05-30 03:11:42.818782+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "generative-ai", "ai-infrastructure"], "entities": ["Gemini", "ChromaDB", "LangChain", "TypeScript", "NestJS"], "alternates": {"html": "https://wpnews.pro/news/i-built-a-rag-pipeline-from-scratch-and-one-wrong-answer-made-me-dive-even-into", "markdown": "https://wpnews.pro/news/i-built-a-rag-pipeline-from-scratch-and-one-wrong-answer-made-me-dive-even-into.md", "text": "https://wpnews.pro/news/i-built-a-rag-pipeline-from-scratch-and-one-wrong-answer-made-me-dive-even-into.txt", "jsonld": "https://wpnews.pro/news/i-built-a-rag-pipeline-from-scratch-and-one-wrong-answer-made-me-dive-even-into.jsonld"}}