{"slug": "building-a-rag-system-from-scratch-design-decisions-explained", "title": "Building a RAG System from Scratch — Design Decisions Explained", "summary": "A developer built a RAG system from scratch using pgvector and Gemini embeddings, explaining design decisions such as choosing pgvector over dedicated vector databases, using 768-dimensional embeddings, and employing different task types for ingestion and querying. The system uses HNSW indexing and Gemini 2.5 Flash for answer generation, with a scaling plan from local pgvector to managed cloud solutions.", "body_md": "In the [previous article](https://dev.to/hiroki-kameyama/building-a-rag-system-from-scratch-with-pgvector-and-gemini-implementation-3n28), we built a working RAG pipeline. Now let's step back and ask *why* we made each design decision, and what alternatives exist when your requirements change.\n\nHere's what we built:\n\n```\nIngest phase\n  Text → gemini-embedding-001 (RETRIEVAL_DOCUMENT, 768 dims)\n       → pgvector (HNSW index, cosine similarity)\n\nQuery phase\n  Question → gemini-embedding-001 (RETRIEVAL_QUERY, 768 dims)\n           → pgvector search (top-k)\n           → Gemini 2.5 Flash (answer generation)\n```\n\nEvery element in this diagram was a choice. Let's examine each one.\n\nWe used pgvector, a PostgreSQL extension, rather than a purpose-built vector database like Pinecone, Weaviate, or Qdrant.\n\n**Why pgvector works here:**\n\nFilter by `category`, join with other tables, all in one round-trip.\n\n**When to consider a dedicated vector DB:**\n\n| Signal | Consider moving to |\n|---|---|\n| > 10M documents | Pinecone, Weaviate |\n| Multi-modal search (text + image) | Weaviate, Qdrant |\n| Managed cloud with SLA | Pinecone |\n| On-premise, full control | Qdrant |\n\nFor most enterprise RAG applications at typical document volumes, pgvector is the right starting point. 
Migrate when you hit actual limits, not anticipated ones.\n\n`gemini-embedding-001` outputs 3072 dimensions by default. We set `output_dimensionality=768`.\n\n**The constraint:** pgvector's HNSW index has a hard limit of 2000 dimensions for the `vector` type.\n\n**Why not 2000?** We chose 768 because:\n\n**Dimension vs. quality trade-off:**\n\n| Dimensions | Index build | Storage | Retrieval quality |\n|---|---|---|---|\n| 256 | Fastest | Smallest | Noticeably lower |\n| 768 | Fast | Small | Near full quality |\n| 1536 | Moderate | Moderate | Full quality |\n| 3072 | Slow | Largest | Full quality (no HNSW) |\n\nWe used different `task_type` values for ingestion and querying:\n\n```\n# Ingestion\nconfig=types.EmbedContentConfig(task_type=\"RETRIEVAL_DOCUMENT\", ...)\n\n# Query\nconfig=types.EmbedContentConfig(task_type=\"RETRIEVAL_QUERY\", ...)\n```\n\n**Why this matters:** Gemini's embedding model is trained with asymmetric objectives. A document and a query about the same topic are represented differently in embedding space: the model learns to map queries *toward* relevant documents, not to the same point. Using the same task type for both degrades retrieval accuracy.\n\nThis is analogous to how you'd phrase a document differently from a search query in natural language: \"F1 Score is the harmonic mean of Precision and Recall\" (document) vs. \"how to calculate F1\" (query).\n\npgvector supports two index types. We chose HNSW.\n\n| | HNSW | IVFFlat |\n|---|---|---|\n| Query speed | Fast | Moderate |\n| Build time | Moderate | Fast |\n| Memory | Higher | Lower |\n| Accuracy at scale | Higher | Lower |\n| Requires training data | No | Yes (build the index after loading data) |\n\nHNSW is the better default for production. 
IVFFlat is worth considering only when you have very tight memory constraints and can afford slower queries.\n\n**HNSW parameter guide:**\n\n```\nWITH (\n    m = 16,              -- max connections per node\n    ef_construction = 64 -- search width during build\n)\n```\n\n- `m`: higher = better recall, more memory. Typical range: 4–64. Default 16 works for most cases.\n- `ef_construction`: higher = better index quality, slower build. Typical range: 16–512. Default 64 is a good production starting point.\n\nWe used `gemini-2.5-flash` rather than the more capable `gemini-2.5-pro` model.\n\n**Reasoning:**\n\n**When to upgrade the generation model:**\n\n**When to upgrade the embedding model:**\n\nThe embedding model matters more for retrieval quality. The generation model matters more for answer quality. Optimize them independently.\n\nThis architecture scales predictably:\n\n```\nPhase 1 (now): pgvector local → works to ~1M docs\nPhase 2:       pgvector + Supabase → managed PostgreSQL, easy scaling\nPhase 3:       pgvector + read replicas → horizontal query scaling\nPhase 4:       Dedicated vector DB → if you genuinely outgrow pgvector\n```\n\nMost teams never reach Phase 4. Start at Phase 1, move when you have evidence you need to.\n\n**Chunking strategy matters more than model choice.** If your documents are long (PDFs, reports), how you split them into chunks dramatically affects retrieval quality. A naive split at 512 tokens often breaks context mid-sentence. Consider semantic chunking or overlapping chunks.\n\n**Don't embed the question alone.** For complex questions, consider HyDE (Hypothetical Document Embeddings): generate a hypothetical answer to the question, embed that, then search. This often retrieves better documents than embedding the raw question.\n\n**Reranking improves precision.** After vector search returns top-k candidates, a cross-encoder reranker (like Cohere Rerank) re-scores each candidate against the query. 
Add this when recall is good but final answer quality is inconsistent.\n\nIn the next article, we'll give the LLM the ability to call these search functions autonomously using Tool Use.\n\n*Full source code: github.com/qameqame/pgvector-tutorial*", "url": "https://wpnews.pro/news/building-a-rag-system-from-scratch-design-decisions-explained", "canonical_source": "https://dev.to/hiroki-kameyama/building-a-rag-system-from-scratch-design-decisions-explained-40hd", "published_at": "2026-06-27 22:08:41+00:00", "updated_at": "2026-06-27 22:35:58.784508+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "developer-tools", "machine-learning", "ai-infrastructure"], "entities": ["pgvector", "Gemini", "Gemini 2.5 Flash", "Supabase", "Pinecone", "Weaviate", "Qdrant", "PostgreSQL"], "alternates": {"html": "https://wpnews.pro/news/building-a-rag-system-from-scratch-design-decisions-explained", "markdown": "https://wpnews.pro/news/building-a-rag-system-from-scratch-design-decisions-explained.md", "text": "https://wpnews.pro/news/building-a-rag-system-from-scratch-design-decisions-explained.txt", "jsonld": "https://wpnews.pro/news/building-a-rag-system-from-scratch-design-decisions-explained.jsonld"}}