Vector Search in Elasticsearch: From Keywords to Meaning - Building Semantic Search and RAG Pipelines


You type "k8s deployment troubleshooting" into your documentation search. The top result is a page about Kubernetes architecture that never mentions the word "troubleshooting." It is exactly what you need. BM25 would have missed it entirely.

This is the promise of vector search: finding documents by meaning, not just matching words. In 2025 and 2026, vector search has moved from niche ML engineering to a core Elasticsearch capability. If you are building search for AI applications - RAG pipelines, semantic Q&A, recommendation systems - understanding how Elasticsearch handles vectors is no longer optional.

I have spent the past year building RAG pipelines at Cloudera, and I have learned that vector search is powerful but easy to misuse. This post covers what works, what does not, and how to implement it in production.

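BM25, which we covered in a previous post, is brilliant at matching exact terms. But it is fundamentally lexical. It does not understand that "k8s" and "Kubernetes" name the same thing, or that a page can answer a troubleshooting question without ever using the word "troubleshooting."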

Vector search solves this by converting text into high-dimensional numerical vectors (embeddings) where semantically similar content lives close together in vector space. A query for "k8s deployment troubleshooting" gets embedded into a vector, and Elasticsearch finds the nearest document vectors - even if they do not share a single keyword.
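To make "close together in vector space" concrete, here is a minimal sketch using cosine similarity on toy vectors. The three-dimensional "embeddings" and document names are made up for illustration; a real model would produce hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: the first axis loosely stands for "infrastructure",
# the last for "cooking". Real embedding axes are not interpretable like this.
query = [0.9, 0.1, 0.0]
documents = {
    "kubernetes_architecture": [0.8, 0.2, 0.1],
    "chocolate_cake_recipe": [0.0, 0.1, 0.9],
}

# Rank documents by similarity to the query, most similar first.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranked)
```

The infrastructure page ranks first even though no keyword is involved: only the direction of the vectors is compared.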

But vector search is not a replacement for BM25. It is a complement.

BM25 is faster, requires no ML infrastructure, and excels at exact-term matching. Vector search is slower, requires embedding models, and shines at conceptual similarity. The best search systems in 2026 use both.

Elasticsearch introduced the dense_vector field type in version 7.x and has dramatically improved it through 8.x and into 2026. Here is how it works under the hood.

A dense vector is simply an array of floating-point numbers. A 768-dimensional embedding from a model like E5 looks like this:

[0.023, -0.156, 0.089, ..., 0.041]  // 768 numbers total

In your index mapping:

PUT /products
{
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "description": { "type": "text" },
      "description_vector": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}

Key parameters:

- dims: The vector dimension (must match your embedding model)
- index: Whether to build an ANN index (set to true for search, false if only storing)
- similarity: Distance metric - l2_norm (Euclidean), dot_product, or cosine
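If you generate mappings from code, a small helper can catch a wrong metric name or a dimension mismatch before anything reaches the cluster. This is only a sketch: dense_vector_mapping and check_vector are hypothetical helper names, and the allowed set covers just the three metrics discussed here.

```python
import json

# Only the metrics discussed in this post; Elasticsearch may accept others.
ALLOWED_SIMILARITIES = {"l2_norm", "dot_product", "cosine"}

def dense_vector_mapping(field, dims, similarity="cosine", index=True):
    """Build a mapping body containing a single dense_vector field."""
    if dims <= 0:
        raise ValueError("dims must be a positive integer")
    if similarity not in ALLOWED_SIMILARITIES:
        raise ValueError(f"unsupported similarity: {similarity}")
    field_def = {"type": "dense_vector", "dims": dims, "index": index}
    if index:
        # The similarity metric is only meaningful for an indexed field.
        field_def["similarity"] = similarity
    return {"mappings": {"properties": {field: field_def}}}

def check_vector(vector, dims):
    """Raise if a vector does not match the mapped dimension."""
    if len(vector) != dims:
        raise ValueError(f"expected {dims} dimensions, got {len(vector)}")

mapping = dense_vector_mapping("description_vector", dims=768)
print(json.dumps(mapping, indent=2))
```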

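Elasticsearch uses HNSW (Hierarchical Navigable Small World) for approximate nearest neighbor (ANN) search. HNSW builds a multi-layer graph where the upper layers are sparse and act as long-range shortcuts, the bottom layer contains every vector, and a search starts at the top and greedily descends toward the query's nearest neighbors.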

HNSW is fast (sub-10ms for million-vector indexes) but approximate. It may miss the true nearest neighbor in exchange for speed. You can tune this trade-off:

PUT /products
{
  "mappings": {
    "properties": {
      "description_vector": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine",
        "index_options": {
          "type": "hnsw",
          "m": 16,
          "ef_construction": 100
        }
      }
    }
  }
}

- m: Number of bi-directional links per node (higher = more accurate, more memory)
- ef_construction: Search depth during index building (higher = better graph quality, slower indexing)

For query-time accuracy tuning, use num_candidates:

GET /products/_search
{
  "knn": {
    "field": "description_vector",
    "query_vector": [0.023, -0.156, ...],
    "k": 10,
    "num_candidates": 100
  }
}

num_candidates is how many candidate vectors Elasticsearch considers on each shard before returning the top k. Higher values improve recall but increase latency. A common rule of thumb: set num_candidates to 10x k for good recall.
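The 10x rule of thumb is easy to encode when you build queries programmatically. The sketch below assumes a hypothetical helper, knn_search_body, that returns the request body as a plain dict; the cap reflects Elasticsearch's limit of 10,000 on num_candidates.

```python
MAX_NUM_CANDIDATES = 10_000  # Elasticsearch rejects larger values

def knn_search_body(field, query_vector, k=10, candidate_factor=10):
    """Build a kNN search body with num_candidates derived from k."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Never go below k, never above the server-side limit.
    num_candidates = min(max(k * candidate_factor, k), MAX_NUM_CANDIDATES)
    return {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        }
    }

body = knn_search_body("description_vector", [0.023, -0.156, 0.089], k=10)
print(body["knn"]["num_candidates"])
```

Raising candidate_factor is the knob to turn when measured recall is too low; lowering it trades recall for latency.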

Retrieval-Augmented Generation (RAG) is the dominant architecture for grounding LLMs in private data. The pipeline looks like this:

Document -> Chunk -> Embed -> Index -> Query -> Retrieve -> Generate

Here is how to implement it in Elasticsearch.

LLMs have context limits, so long documents must be split into chunks. A common approach:

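# Requires the langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,      # measured in characters by default, not tokens
    chunk_overlap=50,    # characters shared between neighboring chunks
    separators=["\n\n", "\n", ". ", " ", ""]
)

# long_document is the full text of one source document (a str)
chunks = text_splitter.split_text(long_document)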

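Chunk size depends on your embedding model. E5 and BGE models typically accept at most 512 tokens per input. OpenAI text-embedding-3-large accepts roughly 8K tokens. Keep in mind that RecursiveCharacterTextSplitter counts characters by default, so a chunk_size of 512 stays well below a 512-token limit.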
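To see what overlap actually does, here is a dependency-free sketch of fixed-size chunking with a sliding window. It is deliberately simpler than a recursive splitter: it cuts purely by character position and ignores separators. chunk_text is a hypothetical helper name.

```python
def chunk_text(text, chunk_size=512, overlap=50):
    """Split text into character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

# A synthetic 1000-character document stands in for real content.
sample = "".join(chr(97 + i % 26) for i in range(1000))
pieces = chunk_text(sample)
print(len(pieces), [len(p) for p in pieces])
```

The last 50 characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still fully present in at least one chunk.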

You need an embedding model. Options in 2026:

Model	Type	Dimensions	Best For
ELSER v2	Sparse (learned)	~2000 terms	Built-in, no external service
multilingual-e5-large	Dense	1024	Cross-lingual, high quality
BGE-large-en-v1.5	Dense	1024	Open source, competitive
OpenAI text-embedding-3-large	Dense	3072	Highest quality, API cost

For dense vectors with a local model:

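from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")
# E5 models expect a "passage: " prefix on documents and "query: " on queries
embeddings = model.encode([f"passage: {c}" for c in chunks], normalize_embeddings=True)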

Important: Use normalize_embeddings=True if you are using dot_product similarity. Elasticsearch requires unit-length vectors for dot_product on float vectors, and in return it can skip the normalization step for faster searches.
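The reason normalization matters can be checked in a few lines: once two vectors have unit length, their dot product is their cosine similarity, so the cheaper operation gives the same ranking. A minimal sketch with made-up two-dimensional vectors:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [1.0, 2.0]
unit_a, unit_b = l2_normalize(a), l2_normalize(b)

# Same value, but the normalized form needs no square roots at query time.
print(cosine(a, b), dot(unit_a, unit_b))
```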

Index each chunk together with its vector:

POST /_bulk
{ "index": { "_index": "knowledge_base", "_id": "doc_1_chunk_0" } }
{ "content": "To troubleshoot Kubernetes deployments...", "source_doc": "k8s_guide.pdf", "category": "devops", "content_vector": [0.023, -0.156, ...] }

Then retrieve the nearest chunks with a kNN query:

GET /knowledge_base/_search
{
  "knn": {
    "field": "content_vector",
    "query_vector": [0.041, 0.089, ...],
    "k": 5,
    "num_candidates": 50
  }
}

The query vector is the embedding of the user question: "How do I fix a Kubernetes deployment that will not start?"

Pure vector search has a problem: it misses exact matches. If a user searches for "Error code 503," a vector search might return documents about "server errors" in general but miss the exact troubleshooting page for HTTP 503.

The solution is hybrid search: run BM25 and kNN in parallel, then merge results.

Elasticsearch 8.15+ provides the retrievers API for this:

GET /knowledge_base/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "multi_match": {
                "query": "k8s deployment troubleshooting",
                "fields": ["content", "title"]
              }
            }
          }
        },
        {
          "knn": {
            "field": "content_vector",
            "query_vector": [0.041, 0.089, ...],
            "k": 10,
            "num_candidates": 100
          }
        }
      ],
      "rank_constant": 60,
      "rank_window_size": 50
    }
  }
}

RRF (Reciprocal Rank Fusion) combines rankings without normalizing scores (which are incomparable across BM25 and cosine similarity). The formula is simple:

rrf_score = sum(1 / (rank + k))

Where k (rank_constant, default 60) prevents top ranks from dominating. Documents that rank well in both retrievers bubble to the top.
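The formula is short enough to reproduce client-side, which helps when you want to reason about why a document landed where it did. This is a sketch of the fusion step only, with hypothetical document ids; Elasticsearch does the equivalent work inside the rrf retriever.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, rank_constant=60):
    """Fuse ranked id lists: each appearance adds 1 / (rank_constant + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (rank_constant + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical result lists from the two retrievers.
bm25_hits = ["doc_503", "doc_a", "doc_k8s"]
knn_hits = ["doc_k8s", "doc_503", "doc_b"]

fused = reciprocal_rank_fusion([bm25_hits, knn_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Documents that appear in both lists collect two contributions and outrank documents that appear in only one, regardless of how the raw BM25 and cosine scores compare.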

This is the architecture behind modern RAG systems. BM25 ensures exact matches surface. Vector search ensures conceptual matches surface. RRF merges them intelligently.

Standalone vector databases (Pinecone, Weaviate) are great at pure vector search. But Elasticsearch has an advantage: you can combine vector search with the full power of Elasticsearch filtering, aggregations, and text search in a single query.

Example: Find semantically similar products, but only in the "electronics" category, with price under $500, and in stock:

GET /products/_search
{
  "knn": {
    "field": "description_vector",
    "query_vector": [0.041, 0.089, ...],
    "k": 10,
    "num_candidates": 100,
    "filter": {
      "bool": {
        "must": [
          { "term": { "category": "electronics" } },
          { "range": { "price": { "lte": 500 } } },
          { "term": { "in_stock": true } }
        ]
      }
    }
  }
}

Elasticsearch applies the filter during the HNSW graph traversal (pre-filtering), so only matching vectors are considered. Unlike post-filtering, where the top k are retrieved first and then filtered, this still returns k results whenever enough documents match the filter.
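The difference between filtering before and after the nearest-neighbor step is easiest to see on a tiny dataset. The sketch below uses brute-force scoring as a stand-in for HNSW, so it illustrates the semantics only, not how Elasticsearch implements the traversal. The products and vectors are invented.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(candidates, query, k):
    """Brute-force nearest neighbors by dot product."""
    return sorted(candidates, key=lambda d: dot(d["vector"], query), reverse=True)[:k]

def pre_filter_knn(docs, query, k, predicate):
    # Filter first, then take the k nearest among what is left.
    return top_k([d for d in docs if predicate(d)], query, k)

def post_filter_knn(docs, query, k, predicate):
    # Take the k nearest overall, then drop the ones that fail the filter.
    return [d for d in top_k(docs, query, k) if predicate(d)]

docs = [
    {"id": "tv", "category": "electronics", "vector": [0.9, 0.1]},
    {"id": "sofa", "category": "furniture", "vector": [1.0, 0.0]},
    {"id": "lamp", "category": "furniture", "vector": [0.95, 0.05]},
    {"id": "radio", "category": "electronics", "vector": [0.5, 0.5]},
]
query = [1.0, 0.0]

def is_electronics(doc):
    return doc["category"] == "electronics"

pre = [d["id"] for d in pre_filter_knn(docs, query, 2, is_electronics)]
post = [d["id"] for d in post_filter_knn(docs, query, 2, is_electronics)]
print("pre-filter:", pre)
print("post-filter:", post)
```

With post-filtering, the two nearest vectors are both furniture, so the filter leaves nothing. Pre-filtering still returns two electronics items.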

This pattern - semantic similarity + structured filters - is why many teams choose Elasticsearch over dedicated vector databases. You get vectors AND the query DSL you already know.

Not every team wants to run an embedding model. Elasticsearch provides ELSER (Elastic Learned Sparse EncodeR), a built-in model that generates sparse vectors using term expansion.

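ELSER works differently from dense vectors: instead of a fixed-length array of floats, it expands each text into a set of weighted terms, which Elasticsearch stores in a sparse_vector field and searches through its inverted index rather than an HNSW graph. A typical setup runs the model in an ingest pipeline: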

PUT /_ingest/pipeline/elser_pipeline
{
  "processors": [
    {
      "inference": {
        "model_id": ".elser_model_2",
        "input_output": [
          { "input_field": "content", "output_field": "content_embedding" }
        ]
      }
    }
  ]
}

ELSER v2 (released in late 2023 with Elasticsearch 8.11) is competitive with dense embedding models for English text. For multilingual or domain-specific content, custom dense models still win. But for teams that want semantic search with zero ML infrastructure, ELSER is the fastest path.
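Conceptually, a learned sparse model turns both the document and the query into a bag of weighted terms, and relevance is roughly the sum of weight products over the terms they share. The sketch below illustrates that idea only. The terms and weights are invented, and real ELSER output contains many more expanded terms.

```python
def sparse_score(query_weights, doc_weights):
    """Sum of weight products over terms present in both expansions."""
    return sum(
        weight * doc_weights[term]
        for term, weight in query_weights.items()
        if term in doc_weights
    )

# Hypothetical expansions. Note that "debug" appears in the query expansion
# even if the user never typed it: that is the term-expansion effect.
query_expansion = {"kubernetes": 1.8, "deployment": 1.2, "troubleshoot": 0.9, "debug": 0.4}
k8s_page = {"kubernetes": 1.5, "pod": 1.1, "debug": 0.7}
cake_page = {"chocolate": 2.0, "oven": 1.3}

print(sparse_score(query_expansion, k8s_page))
print(sparse_score(query_expansion, cake_page))
```

Because the representation is term-based, it fits the same inverted-index machinery as BM25, which is part of why no separate vector infrastructure is needed.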

Vector search is not free. Here is what you need to plan for.

Vectors are memory-hungry. A single 768-dimensional float32 vector uses:

768 dimensions * 4 bytes = 3 KB per vector

One million vectors = 3 GB. Plus HNSW graph overhead (roughly 2x the vector memory). For 10 million vectors at 768 dimensions, expect 30-60 GB of memory.
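The arithmetic above generalizes to any corpus size and dimension. A small sketch for capacity planning follows; the doubling for graph overhead simply applies the rule of thumb from the text and is not a measured figure.

```python
def raw_vector_bytes(num_vectors, dims, bytes_per_dim=4):
    """Bytes needed to hold the raw vectors (float32 by default)."""
    return num_vectors * dims * bytes_per_dim

def to_gb(num_bytes):
    """Decimal gigabytes, matching the rounded figures in the text."""
    return num_bytes / 1e9

for count in (1_000_000, 10_000_000):
    raw_gb = to_gb(raw_vector_bytes(count, dims=768))
    # Upper bound uses the 2x rule of thumb quoted above (an assumption).
    print(f"{count:>10,} vectors: {raw_gb:.1f} GB raw, up to ~{2 * raw_gb:.1f} GB with overhead")
```

Passing a different bytes_per_dim lets you compare storage formats, for example 1 for one byte per dimension.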

Mitigations:

- int8 quantization (available in Elasticsearch 8.13+): 768 dimensions * 1 byte = 768 bytes per vector. 4x memory reduction with minimal quality loss.

The bottleneck is usually embedding generation, not Elasticsearch indexing. A local GPU can generate 100-500 embeddings/second. CPU inference might manage 10-50/second.

Mitigations:

ANN search with HNSW is fast, but not as fast as BM25.

For user-facing search, this is usually acceptable. For high-throughput batch pipelines, consider pre-filtering with metadata to reduce the vector search space.

If your users search by exact product SKUs, error codes, or names, vector search adds latency and complexity with no benefit. Start with BM25. Add vectors when you see queries where keyword matching fails.

- cosine: Best for semantic similarity when vector magnitude does not matter (most text embeddings)
- dot_product: Best when vectors are normalized and you want speed (skip the cosine calculation)
- l2_norm: Best when vector magnitude carries signal (less common for text)

Using l2_norm with unnormalized text embeddings will give poor results. Check your model documentation.

HNSW trades recall for speed. With default settings, expect 95-99% recall@10, meaning that on average 95-99% of the true top-10 neighbors appear in the returned top 10. If your use case requires 100% recall, use exact brute-force search (index: false with script_score). If near-exact is enough, increase num_candidates significantly.

Track recall@k in production. If it drops below your threshold, increase num_candidates or ef_construction. Do not deploy vector search without measuring whether it finds the right documents.
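Measuring recall@k needs a ground truth, which you can get by brute-force scoring a sample of queries offline. The sketch below shows the calculation on toy data; the ANN result list is hypothetical and would come from your actual kNN query in practice.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def exact_top_k(vectors, query, k):
    """Brute-force ground truth: score every vector, keep the k best ids."""
    ranked = sorted(vectors, key=lambda doc_id: dot(vectors[doc_id], query), reverse=True)
    return ranked[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate search returned."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approx_ids)) / len(exact)

vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.8, 0.2], "d": [0.1, 0.9]}
query = [1.0, 0.0]

truth = exact_top_k(vectors, query, k=3)
ann_result = ["a", "c", "d"]  # hypothetical output of an approximate search

recall = recall_at_k(ann_result, truth)
print(f"recall@3 = {recall:.2f}")
```

Averaging this value over a few hundred representative queries gives a number you can alert on.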

Vector fields are indexed during the refresh cycle, just like text. If you index documents and search immediately, you might not find them. For real-time RAG where documents are ingested and immediately queried, ensure your refresh interval is appropriate (default 1s, or use ?refresh=true for testing).

Use Case	Primary Approach	Secondary/Tuning
Exact keyword search (SKUs, codes)	BM25 only	No vectors needed
Semantic Q&A ("how do I...")	Dense vectors + kNN	Hybrid with BM25 for exact matches
Product search with filters	Hybrid (BM25 + kNN + metadata filters)	RRF for ranking, filters for pruning
Cross-lingual search	multilingual-e5 or BGE dense vectors	BM25 for exact term fallback
Zero-ML-infrastructure team	ELSER sparse vectors	Built-in inference, no external model
High-volume log search	BM25 with filters	Vectors only for semantic anomaly detection
Documentation/knowledge base	Hybrid search (RRF)	Vectors for conceptual, BM25 for exact

Vector search in Elasticsearch has matured from an experimental feature to a production-ready capability. The combination of dense vectors, HNSW indexing, hybrid search with RRF, and metadata filtering makes Elasticsearch a compelling platform for semantic search and RAG pipelines.

The key takeaways:

- Tune num_candidates and monitor recall@k for your use case.

The next time someone asks whether Elasticsearch can handle AI-powered search, the answer is yes - and it can do so while preserving everything that makes Elasticsearch powerful: distributed scale, rich querying, and operational maturity.

I am Prithvi S, Staff Software Engineer at Cloudera and Opensource Enthusiast. I build search systems, data pipelines, and the occasional distributed system. Follow my work on GitHub: https://github.com/iprithv
