Building a RAG System from Scratch — Design Decisions Explained


Source: dev.to. Published 2026-06-27T22:08Z. Topic: large-language-models.

A developer built a RAG system from scratch using pgvector and Gemini embeddings, explaining design decisions such as choosing pgvector over dedicated vector databases, using 768-dimensional embeddings, and employing different task types for ingestion and querying. The system uses HNSW indexing and Gemini 2.5 Flash for answer generation, with a scaling plan from local pgvector to managed cloud solutions.

In the previous article, we built a working RAG pipeline. Now let's step back and ask why we made each design decision — and what alternatives exist when your requirements change.

Here's what we built:

Ingest phase
  Text → gemini-embedding-001 (RETRIEVAL_DOCUMENT, 768 dims)
       → pgvector (HNSW index, cosine similarity)

Query phase
  Question → gemini-embedding-001 (RETRIEVAL_QUERY, 768 dims)
           → pgvector search (top-k)
           → Gemini 2.5 Flash (answer generation)

Every element in this diagram was a choice. Let's examine each one.

We used pgvector, a PostgreSQL extension, rather than a purpose-built vector database like Pinecone, Weaviate, or Qdrant.

Why pgvector works here:

You can filter on ordinary columns such as category, join with other tables, and run the vector search, all in one round-trip.

When to consider a dedicated vector DB:

Signal	Consider moving to
> 10M documents	Pinecone, Weaviate
Multi-modal search (text + image)	Weaviate, Qdrant
Managed cloud with SLA	Pinecone
On-premise, full control	Qdrant

For most enterprise RAG applications at typical document volumes, pgvector is the right starting point. Migrate when you hit actual limits, not anticipated ones.
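To make the single round-trip point concrete, here is a minimal sketch of a query that filters, joins, and ranks by cosine distance in one statement. The table and column names (documents, authors, category, embedding) are hypothetical and not taken from the tutorial's schema, and the placeholder style assumes a psycopg-like driver. The `<=>` operator is pgvector's cosine distance.

```python
# Hypothetical schema: documents(id, content, category, author_id, embedding)
# and authors(id, name). Only the query shape is the point here.
SEARCH_SQL = """
SELECT d.id, d.content, a.name AS author,
       d.embedding <=> %(query_vec)s::vector AS cosine_distance
FROM documents AS d
JOIN authors AS a ON a.id = d.author_id
WHERE d.category = %(category)s
ORDER BY d.embedding <=> %(query_vec)s::vector
LIMIT %(top_k)s
"""


def to_pgvector_literal(vec):
    """Format a sequence of numbers in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


def build_search_params(query_vec, category, top_k=5):
    """Bundle the named parameters that SEARCH_SQL expects."""
    return {
        "query_vec": to_pgvector_literal(query_vec),
        "category": category,
        "top_k": top_k,
    }
```

With a driver that supports named parameters, something like `cursor.execute(SEARCH_SQL, build_search_params(vec, "metrics"))` would run the whole lookup in one call.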

gemini-embedding-001 outputs 3072 dimensions by default. We set output_dimensionality=768.

The constraint: pgvector's HNSW index has a hard limit of 2000 dimensions for the standard vector type, so the 3072-dimensional default cannot be indexed as-is.

Why not 2000? We chose 768 because:

Dimension vs. quality trade-off:

Dimensions	Index build	Storage	Retrieval quality
256	Fastest	Smallest	Noticeably lower
768	Fast	Small	Near full quality
1536	Moderate	Moderate	Full quality
3072	Slow	Largest	Full quality (no HNSW)
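The storage column can be estimated with simple arithmetic. The sketch below assumes pgvector's documented size of 4 bytes per dimension plus 8 bytes of overhead per vector, and uses a hypothetical corpus of one million chunks. It counts raw vector data only, not row overhead or the HNSW index itself.

```python
def vector_bytes(dims):
    """Approximate on-disk size of one pgvector `vector` value."""
    return 4 * dims + 8


def raw_storage_gb(dims, n_vectors):
    """Raw vector storage in gigabytes (10**9 bytes), excluding indexes."""
    return vector_bytes(dims) * n_vectors / 1e9


if __name__ == "__main__":
    # One million vectors is an illustrative figure, not from the article.
    for dims in (256, 768, 1536, 3072):
        gb = raw_storage_gb(dims, 1_000_000)
        print(f"{dims:>5} dims: {vector_bytes(dims):>6} bytes/vector, {gb:.2f} GB")
```

Under these assumptions, 768 dimensions needs about a quarter of the raw space that 3072 dimensions would.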

task_type

We used different task_type values for ingestion and querying:

# Ingestion: embed the stored documents
config=types.EmbedContentConfig(task_type="RETRIEVAL_DOCUMENT", ...)

# Querying: embed the user's question
config=types.EmbedContentConfig(task_type="RETRIEVAL_QUERY", ...)

Why this matters: Gemini's embedding model is trained with asymmetric objectives. A document and a query about the same topic are represented differently in embedding space — the model learns to map queries toward relevant documents, not to the same point. Using the same task type for both degrades retrieval accuracy.

This is analogous to how you'd phrase a document differently from a search query in natural language: "F1 Score is the harmonic mean of Precision and Recall" (document) vs. "how to calculate F1" (query).
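One way to avoid mixing the two task types by accident is to route every embedding call through a pair of small helpers. This is a sketch, not the tutorial's code: `embed_fn` is a hypothetical wrapper around whatever embedding client you use, assumed to accept `task_type` and `dims` keyword arguments.

```python
DOCUMENT_TASK = "RETRIEVAL_DOCUMENT"
QUERY_TASK = "RETRIEVAL_QUERY"


def embed_documents(texts, embed_fn, dims=768):
    """Embed texts for storage; always uses the document task type."""
    return [embed_fn(text, task_type=DOCUMENT_TASK, dims=dims) for text in texts]


def embed_query(question, embed_fn, dims=768):
    """Embed a user question for search; always uses the query task type."""
    return embed_fn(question, task_type=QUERY_TASK, dims=dims)
```

Because the task type is fixed inside each helper, ingestion code cannot silently embed with the query type, and vice versa.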

pgvector supports two index types. We chose HNSW.

Property	HNSW	IVFFlat
Query speed	Fast	Moderate
Build time	Moderate	Fast
Memory	Higher	Lower
Accuracy at scale	Higher	Lower
Requires training data	No	Yes (build the index after loading data)

HNSW is the better default for production. IVFFlat is worth considering only when you have very tight memory constraints and can afford slower queries.
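The accuracy row can be checked on your own data by comparing what an approximate index returns against an exact scan. The sketch below is pure Python and only illustrates the measurement: `exact_top_k` does a brute-force cosine ranking, and `recall_at_k` reports what fraction of the true neighbours the index returned. It assumes non-zero vectors; the function names are illustrative.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k):
    """Brute-force nearest neighbours. `vectors` maps id -> vector."""
    ranked = sorted(vectors, key=lambda i: cosine_distance(query, vectors[i]))
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate search also returned."""
    exact = set(exact_ids)
    if not exact:
        return 1.0
    return len(exact & set(approx_ids)) / len(exact)
```

In practice you would take `approx_ids` from the indexed query and `exact_ids` from the same query run without the index, then average recall over a sample of questions.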

HNSW parameter guide:

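CREATE INDEX ON documents                 -- table name is illustrative
USING hnsw (embedding vector_cosine_ops)  -- column name is illustrative
WITH (
    m = 16,              -- max connections per node
    ef_construction = 64 -- search width during build
);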

m: higher = better recall, more memory. Typical range: 4–64. The default of 16 works for most cases.

ef_construction: higher = better index quality, slower build. Typical range: 16–512. The default of 64 is a good production starting point.

We used gemini-2.5-flash rather than the more capable gemini-2.5-pro model.

Reasoning:

When to upgrade the generation model:

When to upgrade the embedding model:

The embedding model matters more for retrieval quality. The generation model matters more for answer quality. Optimize them independently.

This architecture scales predictably:

Phase 1 (now): pgvector local → works to ~1M docs
Phase 2:       pgvector + Supabase → managed PostgreSQL, easy scaling
Phase 3:       pgvector + read replicas → horizontal query scaling
Phase 4:       Dedicated vector DB → if you genuinely outgrow pgvector

Most teams never reach Phase 4. Start at Phase 1, move when you have evidence you need to.

Chunking strategy matters more than model choice. If your documents are long (PDFs, reports), how you split them into chunks dramatically affects retrieval quality. A naive split at 512 tokens often breaks context mid-sentence. Consider semantic chunking or overlap.
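As a minimal sketch of the overlap idea, the function below splits text into fixed-size windows that share a few words with their neighbours. It counts whitespace-separated words as a rough stand-in for tokens, which is an assumption; a real pipeline would likely use the model's tokenizer and respect sentence boundaries. The default sizes are illustrative.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into word windows of `chunk_size` sharing `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap means a sentence cut at a window boundary still appears whole in one of the two neighbouring chunks, at the cost of storing some text twice.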

Don't embed the question alone. For complex questions, consider HyDE (Hypothetical Document Embeddings): generate a hypothetical answer to the question, embed that answer, then search with its vector. This often retrieves better documents than embedding the raw question.
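The HyDE flow can be sketched in a few lines. Both `generate_fn` and `embed_fn` are hypothetical callables standing in for your LLM and embedding clients, and the prompt wording is only an example. Which task type to use when embedding the hypothetical passage is left to `embed_fn` and is worth testing on your own data.

```python
HYDE_PROMPT = (
    "Write a short passage that plausibly answers the question below. "
    "It does not need to be verified; it is only used for retrieval.\n\n"
    "Question: {question}"
)


def hyde_query_vector(question, generate_fn, embed_fn):
    """Return the embedding of a generated hypothetical answer to `question`."""
    # Step 1: ask the LLM for a hypothetical answer passage.
    hypothetical_doc = generate_fn(HYDE_PROMPT.format(question=question))
    # Step 2: embed that passage and use its vector for the search.
    return embed_fn(hypothetical_doc)
```

The returned vector replaces the raw question vector in the search step; the hypothetical passage itself is discarded and never shown to the user.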

Reranking improves precision. After vector search returns top-k candidates, a cross-encoder reranker (like Cohere Rerank) re-scores them for precision. Add this when recall is good but final answer quality is inconsistent.
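A reranking step might be wired in as below. This is a sketch with an injected `score_fn`, a hypothetical callable that returns a relevance score for a (question, passage) pair; in a real system it would call a cross-encoder or a hosted rerank API rather than anything defined here.

```python
def rerank(question, candidates, score_fn, top_n=3):
    """Re-order vector-search candidates by score_fn, highest score first."""
    scored = [(score_fn(question, passage), passage) for passage in candidates]
    # Sort on the score only; ties keep their original vector-search order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_n]]
```

A common pattern is to retrieve a wider top-k from pgvector than you intend to use, then pass only the reranked top few passages to the generation model.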

In the next article, we'll give the LLM the ability to call these search functions autonomously using Tool Use.

Full source code: github.com/qameqame/pgvector-tutorial


Read original on dev.to → dev.to/hiroki-kameyama/building-a-rag-system-fro…
