Vector Search in Elasticsearch: From Keywords to Meaning - Building Semantic Search and RAG Pipelines

A Cloudera engineer describes how vector search in Elasticsearch enables semantic search and RAG pipelines by finding documents based on meaning rather than exact keywords. The post explains that vector search complements BM25 and covers practical implementation details including HNSW indexing and query tuning for production use.

You type "k8s deployment troubleshooting" into your documentation search. The top result is a page about Kubernetes architecture that never mentions the word "troubleshooting." It is exactly what you need. BM25 would have missed it entirely.

This is the promise of vector search: finding documents by meaning, not just matching words. In 2025 and 2026, vector search has moved from niche ML engineering to a core Elasticsearch capability. If you are building search for AI applications - RAG pipelines, semantic Q&A, recommendation systems - understanding how Elasticsearch handles vectors is no longer optional.

I have spent the past year building RAG pipelines at Cloudera, and I have learned that vector search is powerful but easy to misuse. This post covers what works, what does not, and how to implement it in production.

BM25, which we covered in a previous post, is brilliant at matching exact terms. But it is fundamentally lexical. It does not understand that a query and a document can mean the same thing while sharing few or no words, as in the example above.

Vector search solves this by converting text into high-dimensional numerical vectors (embeddings), where semantically similar content lives close together in vector space. A query for "k8s deployment troubleshooting" gets embedded into a vector, and Elasticsearch finds the nearest document vectors - even if they do not share a single keyword.

But vector search is not a replacement for BM25. It is a complement. BM25 is faster, requires no ML infrastructure, and excels at exact-term matching. Vector search is slower, requires embedding models, and shines at conceptual similarity. The best search systems in 2026 use both.

Elasticsearch introduced the `dense_vector` field type in version 7.x and has dramatically improved it through 8.x and into 2026. Here is how it works under the hood.

A dense vector is simply an array of floating-point numbers.
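
To make "close together in vector space" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors and their labels are invented for illustration; real embeddings have hundreds of dimensions and come from a model, and Elasticsearch computes the similarity for you.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (not produced by any real model).
query = [0.9, 0.1, 0.0]        # "k8s deployment troubleshooting"
doc_k8s = [0.8, 0.2, 0.1]      # a Kubernetes architecture page
doc_other = [0.0, 0.1, 0.9]    # an unrelated page

scores = {
    "doc_k8s": cosine_similarity(query, doc_k8s),
    "doc_other": cosine_similarity(query, doc_other),
}

# The nearest neighbor is the document with the highest similarity.
nearest = max(scores, key=scores.get)
```

Here `nearest` is the Kubernetes page even though no keyword matching was involved; the ranking comes entirely from the geometry of the vectors.
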
A 768-dimensional embedding from a model like E5 looks like this:

```
[0.023, -0.156, 0.089, ..., 0.041]  // 768 numbers total
```

In your index mapping:

```
PUT /products
{
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "description": { "type": "text" },
      "description_vector": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}
```

Key parameters:

- `dims`: The vector dimension (must match your embedding model)
- `index`: Whether to build an ANN index (set to `true` for search, `false` if only storing)
- `similarity`: Distance metric - `l2_norm` (Euclidean), `dot_product`, or `cosine`

Elasticsearch uses HNSW (Hierarchical Navigable Small World) for approximate nearest neighbor (ANN) search. HNSW builds a multi-layer graph where:

- the upper layers contain progressively fewer nodes and act as long-range shortcuts
- the bottom layer contains every vector
- a search enters at the top layer and greedily moves toward the query, descending one layer at a time

HNSW is fast (sub-10ms for million-vector indexes) but approximate. It may miss the true nearest neighbor in exchange for speed. You can tune this trade-off:

```
PUT /products
{
  "mappings": {
    "properties": {
      "description_vector": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine",
        "index_options": {
          "type": "hnsw",
          "m": 16,
          "ef_construction": 100
        }
      }
    }
  }
}
```

- `m`: Number of bi-directional links per node (higher = more accurate, more memory)
- `ef_construction`: Search depth during index building (higher = better graph quality, slower indexing)

For query-time accuracy tuning, use `num_candidates`:

```
GET /products/_search
{
  "knn": {
    "field": "description_vector",
    "query_vector": [0.023, -0.156, ...],
    "k": 10,
    "num_candidates": 100
  }
}
```

`num_candidates` is how many vectors Elasticsearch considers on each shard before returning the top `k`. Higher values improve recall but increase latency. A common rule of thumb: set `num_candidates` to about 10x `k` for good recall.

Retrieval-Augmented Generation (RAG) is the dominant architecture for grounding LLMs in private data. The pipeline looks like this:

Document -> Chunk -> Embed -> Index -> Query -> Retrieve -> Generate

Here is how to implement it in Elasticsearch.

LLMs have context limits, so long documents must be split into chunks.
A common approach:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Note: chunk_size and chunk_overlap count characters by default, not tokens.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
)

# long_document is the full text of one source document (a str).
chunks = text_splitter.split_text(long_document)
```

Chunk size depends on your embedding model. E5 and BGE models typically use 512 tokens. OpenAI text-embedding-3-large supports up to 8192 tokens.

You need an embedding model. Options in 2026:

| Model | Type | Dimensions | Best For |
|---|---|---|---|
| ELSER v2 | Sparse (learned) | ~2000 terms | Built-in, no external service |
| multilingual-e5-large | Dense | 1024 | Cross-lingual, high quality |
| BGE-large-en-v1.5 | Dense | 1024 | Open source, competitive |
| OpenAI text-embedding-3-large | Dense | 3072 | Highest quality, API cost |

For dense vectors with a local model:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

# One 1024-dimensional, unit-length vector per chunk.
embeddings = model.encode(chunks, normalize_embeddings=True)
```

Important: Use `normalize_embeddings=True` if you are using `dot_product` similarity, which expects unit-length vectors. Elasticsearch can then skip the normalization step internally for faster searches.

Index each chunk together with its vector:

```
POST /_bulk
{ "index": { "_index": "knowledge_base", "_id": "doc_1_chunk_0" } }
{ "content": "To troubleshoot Kubernetes deployments...", "source_doc": "k8s_guide.pdf", "category": "devops", "content_vector": [0.023, -0.156, ...] }
```

Then retrieve the nearest chunks for a query:

```
GET /knowledge_base/_search
{
  "knn": {
    "field": "content_vector",
    "query_vector": [0.041, 0.089, ...],
    "k": 5,
    "num_candidates": 50
  }
}
```

The `query_vector` is the embedding of the user question: "How do I fix a Kubernetes deployment that will not start?"

Pure vector search has a problem: it misses exact matches. If a user searches for "Error code 503," a vector search might return documents about "server errors" in general but miss the exact troubleshooting page for HTTP 503.

The solution is hybrid search: run BM25 and kNN in parallel, then merge results.
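
To illustrate the "merge" step, here is a minimal plain-Python sketch of reciprocal rank fusion (RRF), the merging strategy this post relies on. The function name, the document IDs, and the two hit lists are hypothetical; the default of 60 mirrors the rank constant discussed in this post.

```python
def reciprocal_rank_fusion(rankings, rank_constant=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (rank_constant + rank)) over the lists
    it appears in, where rank is 1 for the top hit of a list.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the query "Error code 503".
bm25_hits = ["http_503_page", "load_balancer_faq", "server_errors_overview"]
knn_hits = ["server_errors_overview", "http_503_page", "retry_policies"]

fused = reciprocal_rank_fusion([bm25_hits, knn_hits])
```

The exact-match page ranks first because it places near the top of both lists, while documents found by only one retriever fall behind. In a real deployment there is no need to fuse results in application code.
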
Elasticsearch 8.15+ provides the retrievers API for this:

```
GET /knowledge_base/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": {
              "multi_match": {
                "query": "k8s deployment troubleshooting",
                "fields": ["content", "title"]
              }
            }
          }
        },
        {
          "knn": {
            "field": "content_vector",
            "query_vector": [0.041, 0.089, ...],
            "k": 10,
            "num_candidates": 100
          }
        }
      ],
      "rank_constant": 60,
      "rank_window_size": 50
    }
  }
}
```

RRF (Reciprocal Rank Fusion) combines rankings without normalizing scores (which are incomparable across BM25 and cosine similarity). The formula is simple:

$$\text{rrf\_score}(d) = \sum_{r \,\in\, \text{retrievers}} \frac{1}{k + \text{rank}_r(d)}$$

Here $\text{rank}_r(d)$ is the position of document $d$ in retriever $r$'s result list, and $k$ (`rank_constant`, default 60) prevents top ranks from dominating. Documents that rank well in both retrievers bubble to the top.

This is the architecture behind modern RAG systems. BM25 ensures exact matches surface. Vector search ensures conceptual matches surface. RRF merges them intelligently.

Standalone vector databases (Pinecone, Weaviate) are great at pure vector search. But Elasticsearch has an advantage: you can combine vector search with the full power of Elasticsearch filtering, aggregations, and text search in a single query.

Example: Find semantically similar products, but only in the "electronics" category, with price under $500, and in stock:

```
GET /products/_search
{
  "knn": {
    "field": "description_vector",
    "query_vector": [0.041, 0.089, ...],
    "k": 10,
    "num_candidates": 100,
    "filter": {
      "bool": {
        "must": [
          { "term": { "category": "electronics" } },
          { "range": { "price": { "lte": 500 } } },
          { "term": { "in_stock": true } }
        ]
      }
    }
  }
}
```

Elasticsearch applies the filter during the HNSW graph traversal (pre-filtering), so only matching vectors are considered. This is faster than retrieving vectors and filtering afterward.

This pattern - semantic similarity + structured filters - is why many teams choose Elasticsearch over dedicated vector databases. You get vectors AND the query DSL you already know.

Not every team wants to run an embedding model.
Elasticsearch provides ELSER (Elastic Learned Sparse EncodeR), a built-in model that generates sparse vectors using term expansion. ELSER works differently from dense vectors: instead of a fixed-length array of floats, it produces a set of weighted terms for each document. You can run it at ingest time with an inference pipeline:

```
PUT /_ingest/pipeline/elser_pipeline
{
  "processors": [
    {
      "inference": {
        "model_id": ".elser_model_2",
        "input_output": {
          "input_field": "content",
          "output_field": "content_embedding"
        }
      }
    }
  ]
}
```

ELSER v2 (released in 2024) is competitive with dense embedding models for English text. For multilingual or domain-specific content, custom dense models still win. But for teams that want semantic search with zero ML infrastructure, ELSER is the fastest path.

Vector search is not free. Here is what you need to plan for.

Vectors are memory-hungry. A single 768-dimensional float32 vector uses 768 dimensions × 4 bytes = 3,072 bytes (about 3 KB). One million vectors take about 3 GB, plus HNSW graph overhead (roughly 2x the vector memory). For 10 million vectors at 768 dimensions, expect 30-60 GB of memory.

Mitigations:

- int8 quantization (available in Elasticsearch 8.13+): 768 dimensions × 1 byte = 768 bytes per vector. 4x memory reduction with minimal quality loss.

For indexing speed, the bottleneck is usually embedding generation, not Elasticsearch indexing. A local GPU can generate 100-500 embeddings/second. CPU inference might manage 10-50/second.

ANN search with HNSW is fast but not as fast as BM25. For user-facing search, the added latency is usually acceptable. For high-throughput batch pipelines, consider pre-filtering with metadata to reduce the vector search space.

If your users search by exact product SKUs, error codes, or names, vector search adds latency and complexity with no benefit. Start with BM25. Add vectors when you see queries where keyword matching fails.
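
The memory arithmetic earlier in this section can be checked with a short calculation. This is a rough planning sketch, not a sizing tool: the function name is hypothetical, gigabytes are decimal (10^9 bytes), and `graph_overhead=2.0` encodes this post's "roughly 2x" rule of thumb as an upper bound rather than a measured value.

```python
def estimate_vector_memory_gb(num_vectors, dims, bytes_per_dim=4, graph_overhead=2.0):
    """Return (raw_gb, upper_bound_gb) for storing dense vectors.

    raw_gb counts only the vector values; upper_bound_gb multiplies that by
    an assumed HNSW overhead factor (a rule of thumb, not a measurement).
    """
    raw_gb = num_vectors * dims * bytes_per_dim / 1e9
    return raw_gb, raw_gb * graph_overhead


# 10 million 768-dimensional vectors stored as float32 (4 bytes per dimension).
float32_raw, float32_upper = estimate_vector_memory_gb(10_000_000, 768)

# The same vectors with int8 quantization (1 byte per dimension).
int8_raw, int8_upper = estimate_vector_memory_gb(10_000_000, 768, bytes_per_dim=1)
```

With these inputs the float32 estimate spans roughly 31 to 61 GB, in line with the 30-60 GB range quoted above, and the int8 figures are a quarter of that.
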
Match the similarity metric to your embeddings:

- `cosine`: Best for semantic similarity when vector magnitude does not matter (most text embeddings)
- `dot_product`: Best when vectors are normalized and you want speed (skips the cosine calculation)
- `l2_norm`: Best when vector magnitude carries signal (less common for text)

Using `l2_norm` with unnormalized text embeddings will give poor results. Check your model documentation.

HNSW trades recall for speed. With default settings, expect 95-99% recall@10 (95-99% of the true top-10 neighbors appear in the returned top 10). If your use case requires 100% recall, use exact brute-force search (`index: false` with `script_score`) or increase `num_candidates` significantly.

Track recall@k in production. If it drops below your threshold, increase `num_candidates` or `ef_construction`. Do not deploy vector search without measuring whether it finds the right documents.

Vector fields are indexed during the refresh cycle, just like text. If you index documents and search immediately, you might not find them. For real-time RAG where documents are ingested and immediately queried, ensure your refresh interval is appropriate (default 1s, or use `?refresh=true` for testing).

| Use Case | Primary Approach | Secondary/Tuning |
|---|---|---|
| Exact keyword search (SKUs, codes) | BM25 only | No vectors needed |
| Semantic Q&A ("how do I...") | Dense vectors + kNN | Hybrid with BM25 for exact matches |
| Product search with filters | Hybrid (BM25 + kNN) + metadata filters | RRF for ranking, filters for pruning |
| Cross-lingual search | multilingual-e5 or BGE dense vectors | BM25 for exact term fallback |
| Zero-ML-infrastructure team | ELSER sparse vectors | Built-in inference, no external model |
| High-volume log search | BM25 with filters | Vectors only for semantic anomaly detection |
| Documentation/knowledge base | Hybrid search (RRF) | Vectors for conceptual, BM25 for exact |

Vector search in Elasticsearch has matured from an experimental feature to a production-ready capability.
The combination of dense vectors, HNSW indexing, hybrid search with RRF, and metadata filtering makes Elasticsearch a compelling platform for semantic search and RAG pipelines.

A key takeaway: tune `num_candidates` and monitor recall@k for your use case.

The next time someone asks whether Elasticsearch can handle AI-powered search, the answer is yes - and it can do so while preserving everything that makes Elasticsearch powerful: distributed scale, rich querying, and operational maturity.

I am Prithvi S, Staff Software Engineer at Cloudera and open source enthusiast. I build search systems, data pipelines, and the occasional distributed system. Follow my work on GitHub: https://github.com/iprithv