# I built a RAG pipeline from scratch, and one wrong answer made me dive even deeper into AI Engineering

> Source: <https://dev.to/felipearaujobs/i-built-a-rag-pipeline-from-scratch-and-one-wrong-answer-made-me-dive-even-deeper-into-ai-4npg>
> Published: 2026-05-30 02:53:17+00:00

A backend engineer's first step into AI Engineering: embeddings, vector search, and the chunking bug that made everything click.

I have been a backend engineer for a while now: TypeScript, NestJS, distributed systems, APIs in production. I like that work. But at some point I started paying attention to a specific career trajectory I came across: someone with a background almost identical to mine who had moved into AI Engineering. Not abandoned backend, extended it.

That reframed everything for me. This wasn't a pivot away from what I knew. It was a direction to grow into. And I decided to start from the fundamentals, not from the tooling.

So instead of installing LangChain and following a tutorial, I built a RAG pipeline from scratch: no abstractions, no magic. Just Python, the Gemini API, and ChromaDB. Here is what I learned.

Before writing a line of code, I needed a mental model that made sense to me as an engineer.

RAG stands for Retrieval-Augmented Generation. The idea is simple: LLMs have frozen knowledge (their training cutoff) and a limited context window. You cannot feed an entire codebase or document library into a single prompt. RAG solves this by fetching only the relevant fragments at query time and injecting them into the context before the LLM responds.

Think of it as hiring a brilliant consultant who knows nothing about your company. Instead of retraining them from scratch, you hand them the relevant documents before each meeting. That is RAG.

The pipeline has two phases:

```
INDEXING (runs once):
Document → chunking → embeddings → vector database

QUERYING (runs on every question):
Question → embedding → similarity search → top K chunks → LLM → answer
```

The concept that unlocked everything for me was embeddings. An embedding is a vector, nothing more than a list of numbers, that represents the semantic meaning of a piece of text. Similar meanings produce similar vectors. Dissimilar meanings produce distant vectors.

This is not keyword matching. It is geometry. When you search a vector database, you are finding the nearest neighbors in a high-dimensional space. A question about "payment processing failures" can match a chunk that talks about "error handling in transactions", even if they share no words.

The model learned these relationships from co-occurrence patterns across billions of sentences. It never "saw" what a dog looks like, but it learned that "dog" and "cat" appear in similar contexts (pet care articles, veterinary advice, adoption stories), while "car" appears in entirely different ones. That contrast is encoded into their vector coordinates: dog and cat end up geometrically close, car ends up far away.
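To make the geometry concrete, here is a minimal sketch of cosine similarity, one common way to measure how close two embeddings are. The three-dimensional vectors are made-up toy values chosen for illustration, not real model output, and the `nearest` helper is a hypothetical name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, hand-picked for illustration only.
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def nearest(word, vectors):
    """Return the other word whose vector is most similar to `word`."""
    others = [w for w in vectors if w != word]
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))

print(nearest("dog", toy_vectors))  # cat
```

A vector database does the same comparison, only over many more vectors and dimensions, usually with an index that avoids checking every candidate.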

In my project, each chunk produced a vector with 3072 dimensions using gemini-embedding-001.

```
rag-project/
├── src/
│   ├── chunking.py      # text splitting logic
│   ├── embeddings.py    # embedding generation via Gemini API
│   ├── vector_store.py  # ChromaDB setup
│   └── llm.py           # prompt construction and response generation
├── main.py              # orchestrates the full pipeline
└── .env                 # API keys
```

Each module exports only functions. No logic runs on import. main.py is the only place that decides what executes and in what order.
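A minimal sketch of that structure, collapsed into one file so it runs on its own. The function bodies are placeholder stand-ins (the real ones call the Gemini API and ChromaDB), and the names `embed` and `build_index` are hypothetical.

```python
# Stand-ins for the functions each module would export. Names and bodies
# are assumptions for illustration, not the project's actual code.

def chunk_text(text, chunk_size=400):
    """Stand-in for src/chunking.py: fixed-size split, no overlap."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def embed(text):
    """Stand-in for src/embeddings.py: a fake two-number 'embedding'."""
    return [float(len(text)), float(text.count(" "))]

def build_index(chunks):
    """Stand-in for src/vector_store.py: pair each chunk with its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def main():
    """The only place that decides what runs, and in what order."""
    document = "Controllers handle incoming requests. Providers hold business logic."
    index = build_index(chunk_text(document, chunk_size=40))
    print(f"indexed {len(index)} chunks")
    return index

# Importing this file defines functions but executes nothing;
# the pipeline only runs when the file is invoked as a script.
if __name__ == "__main__":
    main()
```

Keeping side effects out of import time also makes each function easy to test in isolation.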

Chunking is dividing your document into fragments before generating embeddings. The size matters more than I expected.

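```python
def chunk_text(text, chunk_size=400, overlap=50):
    """Split text into fixed-size character chunks that overlap slightly."""
    if not 0 <= overlap < chunk_size:
        # Otherwise `start` would never advance and the loop would not end.
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # last chunk reached; avoid a redundant overlap-only tail
        start = end - overlap  # step back so neighbors share `overlap` chars
    return chunks
```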

I asked the system in Portuguese: "O que são controllers no NestJS?" ("What are controllers in NestJS?")

The response, also in Portuguese: "Não sabe." ("Does not know.")

The LLM was Gemini. Gemini absolutely knows what NestJS controllers are. But I had explicitly instructed it to answer only from the provided context, so when the retrieved context was broken, it answered honestly that it did not know.
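A context-only instruction of that kind might look like the sketch below. The prompt wording and the `build_prompt` name are assumptions for illustration; the exact prompt used in the project is not shown here.

```python
def build_prompt(question, context_chunks):
    """Build a prompt that restricts the model to the retrieved context."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using ONLY the context below.\n"
        "If the context does not contain the answer, reply: I don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical question and chunk, for illustration only.
prompt = build_prompt(
    "What are controllers in NestJS?",
    ["Controllers are responsible for handling incoming requests."],
)
print(prompt)
```

With an instruction like this, a garbled chunk leads to "I don't know" rather than a confident answer from the model's own training data, which is the behavior described above.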

I inspected the context being sent to the model:

```
Controllers no NestJS são responsáveis  os controllers via injeção de dependência. ("Controllers in NestJS are responsible the controllers via dependency injection.")
```

The chunk had been cut in the middle of a sentence. The fix was increasing the chunk size from 200 to 400 characters. The system then answered correctly.
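One way to catch this kind of cut before it reaches the model is a quick check over the chunks. The sketch below is a rough heuristic, not something from the original project: it flags any chunk that does not end in sentence punctuation. The helper names and sample chunks are hypothetical.

```python
def ends_mid_sentence(chunk):
    """Heuristic: a chunk that does not end in ., ! or ? was probably cut."""
    stripped = chunk.rstrip()
    return bool(stripped) and stripped[-1] not in ".!?"

def report_cut_chunks(chunks):
    """Return the indexes of chunks that appear to stop mid-sentence."""
    return [i for i, chunk in enumerate(chunks) if ends_mid_sentence(chunk)]

# Hypothetical sample chunks for illustration.
sample_chunks = [
    "Controllers handle incoming requests.",
    "Controllers in NestJS are responsible",  # cut before the sentence ends
    "Providers are injected via dependency injection.",
]
print(report_cut_chunks(sample_chunks))  # [1]
```

Text containing code, lists, or headings will produce false positives, so treat the result as a prompt to inspect chunks, not as a verdict.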

This is the failure mode that matters in production RAG. The pipeline does not crash. It runs without errors and produces a wrong answer. The actual problem was upstream, in the chunking strategy.

Chunk size directly affects answer quality. Too small: the embedding captures a fragment without enough semantic content. Too large: the embedding averages over too much content and loses specificity.
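An alternative to tuning a fixed character count is to split on sentence boundaries and pack whole sentences into each chunk. This is a different technique from the fixed-size splitter used in this project, sketched here with a deliberately naive regex; `chunk_by_sentences` and `max_chars` are hypothetical names.

```python
import re

def chunk_by_sentences(text, max_chars=400):
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    # Naive split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)  # current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = "Controllers handle requests. Providers hold logic. Modules group both."
for chunk in chunk_by_sentences(text, max_chars=55):
    print(repr(chunk))
```

The regex will mis-split abbreviations such as "e.g." and does not understand code blocks, so a real corpus would likely need a more careful splitter.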

RAG is simpler to implement than I expected. The hard part is not the code; it is the judgment. Knowing when a chunk is too small. Knowing when retrieved context is semantically close but factually irrelevant. Knowing when to restrict the LLM to context and when to let it reason freely.

The libraries abstract the mechanics. The engineering is in the decisions around them.

Retrieval quality determines answer quality. The LLM is the last step. If the chunks going in are wrong, even the best model, when restricted to that context, cannot produce a correct answer.

This was a minimal implementation on purpose. The next version will index a real corpus, the parsed books of A Song of Ice and Fire, with structure-aware chunking by chapter, metadata filters by POV character and book, and conversation history for a proper chatbot experience.

After that: evals. Measuring whether the system actually answers correctly at scale is what separates a working demo from a production system.
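As a starting point, a retrieval eval can be as simple as checking whether the chunk that should answer each question shows up in the top K results. The sketch below assumes a hypothetical `retrieve(question, k)` function that returns chunk ids; the canned retriever and the eval cases are made up for illustration.

```python
def hit_rate_at_k(eval_cases, retrieve, k=3):
    """Fraction of questions whose expected chunk id appears in the top k."""
    if not eval_cases:
        return 0.0
    hits = sum(
        1 for question, expected_id in eval_cases
        if expected_id in retrieve(question, k)
    )
    return hits / len(eval_cases)

# Hypothetical stand-in retriever: returns canned ids instead of querying
# a real vector database.
def fake_retrieve(question, k):
    canned = {
        "What are controllers?": ["c1", "c7", "c3"],
        "What are providers?": ["c4", "c9", "c8"],
    }
    return canned.get(question, [])[:k]

cases = [("What are controllers?", "c1"), ("What are providers?", "c2")]
print(hit_rate_at_k(cases, fake_retrieve))  # 0.5
```

This only measures retrieval, not whether the final answer is correct, but it isolates the stage where the chunking bug above would have shown up.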

If you are a backend engineer considering a move toward AI Engineering: start here. Build it without the frameworks first. The abstractions make much more sense once you know what they are hiding.
