Are Vector Databases Enough for Modern AI Workloads? Y/N

Zilliz launched Vector Lakebase, an evolution of its cloud platform from a pure vector database into a lake-native data foundation for AI workloads. The company confirmed it is not moving away from vector databases, arguing that the technology remains essential but that enterprises now require broader infrastructure capable of improving, reorganizing, and analyzing unstructured data for continuously running AI agents.

Why We Built Vector Lakebase: Rethinking Unstructured Data Architecture for AI

Recently, we launched Zilliz Vector Lakebase (https://zilliz.com/blog/from-vector-database-to-vector-lakebase), the next evolution of Zilliz Cloud from a pure vector database system into a unified, lake-native data foundation for AI workloads.

The announcement got a lot of interest. It also surfaced questions almost immediately about where Zilliz was headed. Was Zilliz stepping away from vector databases? Or, put more directly: are vector databases already becoming obsolete?

I understand why these questions came up. For years, Zilliz has been known for building production-ready vector database systems: open-source Milvus (https://github.com/milvus-io/milvus) and fully managed Zilliz Cloud (https://zilliz.com/). So when we started talking about evolving to a lake-native data foundation for AI, some people naturally wondered whether this meant a change in direction.

The short answer is NO. Absolutely NOT. If anything, Vector Lakebase is our answer to what happens after vector databases succeed.

Over the past several years, vector databases have become one of the foundational infrastructure layers of the AI stack. Adoption has grown faster than we could have imagined when we started Milvus nearly a decade ago. The category is real, and the need for semantic retrieval is only becoming more important.

But something else has become clear to us as well: vector retrieval is no longer the whole problem.

As AI systems move from static assistants into continuously running agents, enterprises are asking for something broader from their unstructured data infrastructure. They do not just want a system that can retrieve information. They want a system that can improve the data, reorganize it, analyze it, refine it, and feed those improvements back into production. That changes the architecture.
That shift reminds me of an earlier cycle in infrastructure history: the evolution of databases during the mobile internet era. The details are different, but the pattern is familiar. A new kind of application creates a new kind of data pressure. The first generation of infrastructure solves the immediate serving problem. Then, as the data grows, the architecture has to expand. I think vector databases are entering that next stage now.

Mobile internet already went through this cycle once

Around 2010, as mobile applications exploded, MongoDB became one of the defining infrastructure products of that period. The reason was straightforward. Mobile applications generated massive amounts of semi-structured data: user events, social activity, device telemetry, behavioral signals, product logs. None of these fit neatly into the relational database patterns most teams were using at the time. Product teams were shipping fast, schemas were changing constantly, and the first problem was simply to accept the data without slowing the application down.

MongoDB solved that immediate problem very well: ingest the data first. Structure and analysis could come later.

Several years later, the industry started asking a different question. Once all this data existed, how could businesses actually use it? That shift helped drive the rise of modern data warehouses such as Snowflake and Redshift. The focus moved from operational storage to analytical insight. Companies wanted BI reports, user cohorts, attribution, forecasting, and growth analysis. Data stopped being only an operational byproduct and became a business asset.

Then another bottleneck emerged. The divide between transactional systems and analytical systems became increasingly painful. Data pipelines between OLTP and OLAP environments were fragile, expensive, and operationally exhausting. The same datasets were copied repeatedly across systems, often with synchronization delays and subtle inconsistencies.
That was the environment that gave rise to the Lakehouse architecture. Databricks, Iceberg, Hudi, and related systems all converged around the same basic idea: a single logical copy of data should support multiple computation models without requiring endless movement between systems.

Looking back, the progression feels almost inevitable. But at the time, none of it was obvious. MongoDB’s rise did not predict Snowflake. Snowflake did not predict the Lakehouse. Each transition emerged because the previous generation of infrastructure succeeded at scale and then exposed a new class of constraints.

That pattern matters because AI infrastructure increasingly feels like it is following a similar path.

Retrieval solved the first problem, not the final one

When large language models broke into mainstream adoption in 2023, vector databases became one of the earliest breakout infrastructure categories. The reason was practical. RAG systems needed a native way to store embeddings and perform semantic retrieval. Most traditional databases were not designed for high-dimensional vector search, ANN indexes, hybrid retrieval, and low-latency filtering at scale.

In many ways, vector databases solved the same kind of problem MongoDB solved earlier. A new application pattern created a new data abstraction, and developers needed infrastructure that could support it. This time, the abstraction was semantic representation: embeddings generated from unstructured data by neural models.

That first phase of adoption happened very quickly. But only a few years later, the questions we hear from customers have become much more complex. They no longer ask only how to retrieve vectors efficiently. They ask:

- How do we continuously deduplicate and refine training data?
- How do we analyze billions of embeddings for clustering and quality issues?
- How do we identify drift, bias, or redundancy in multimodal datasets?
- How do we trace and optimize agent execution histories?
- How do we reprocess and improve data as models evolve?
- How do we search cold data without keeping all compute running all the time?
- How do we use data that already lives in Iceberg, Lance, Parquet, and object storage for multiple AI workloads?

These are no longer purely retrieval problems. They require large-scale offline processing, iterative discovery workflows, data governance, analytical exploration, and continuous feedback loops between online systems and offline computation.

Increasingly, we noticed something important among advanced AI teams: the bottleneck was no longer just model capability. It was iteration speed.

One experience made this painfully obvious. We saw teams trying to reprocess large vector datasets: reclustering embeddings, removing duplication, regenerating indexes, re-embedding entire corpora. In some cases, simply moving a billion vectors from one system to another could take days. Not hours. Days.

Meanwhile, the iteration cycles inside leading AI teams are moving in the opposite direction. Researchers want to experiment continuously. Data engineers are under pressure to clean, evaluate, and refresh datasets faster. Models improve. Embedding models change. Agents create new traces every day. But the infrastructure stack underneath them was not designed for continuous refinement loops on unstructured data.

That was the point where we started to think the industry was framing the problem too narrowly. Unstructured data infrastructure is not merely a retrieval layer. It is becoming a continuously operating system.

From retrieval systems to continuous systems: CS/CD

Internally, we began describing this architecture as a continuous loop between serving and discovery. Over time, we started calling it CS/CD: Continuous Serving and Continuous Discovery. The idea is conceptually simple.
On one side, there is the serving layer: high-throughput ingestion, low-latency retrieval for online RAG systems, recommendation systems, personalization, AI memory, and real-time agents. On the other side, there is the discovery layer: clustering, deduplication, re-embedding, offline evaluation, quality analysis, model fine-tuning preparation, and agent trajectory analysis.

The important point is that these are not independent workflows. They form a flywheel. Serving systems continuously generate feedback and new data. Discovery systems analyze and improve that data. The resulting improvements, including better embeddings, cleaner datasets, improved indexes, and refined metadata, then flow back into the serving layer. Every iteration should improve the next one.

At least in theory. In practice, most organizations still cannot operate this loop efficiently because the underlying infrastructure remains fragmented.

Today, if a team wants to perform large-scale offline processing on production vector data, the typical workflow is still painfully manual. Data must first be exported from the vector database into a lake or batch environment. Indexes usually cannot be reused. Synchronization pipelines become brittle. Incremental updates are difficult. Processed results must eventually be re-imported into the serving system, often with no atomic consistency guarantee between the new data and the new indexes.

The result is a workflow that is slow, fragile, and expensive. And because it is so expensive to maintain, many organizations simply avoid doing continuous discovery. The data sits there, retrievable but largely unexplored.

This increasingly reminded us of the historical gap between OLTP and OLAP systems, except now the fragmentation is between online semantic retrieval and offline unstructured data processing.
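To make the "atomic consistency guarantee" mentioned above more concrete, here is a minimal sketch of what publishing new data and new indexes together could look like. The `Snapshot` and `Catalog` classes and the file names are hypothetical and exist only for illustration; they are not the Zilliz or Milvus API.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Immutable, versioned view pairing data files with the indexes built over them."""
    version: int
    data_files: tuple
    index_files: tuple


class Catalog:
    """Hypothetical catalog exposing exactly one current snapshot to readers."""

    def __init__(self, initial: Snapshot):
        self._current = initial
        self._lock = threading.Lock()

    def current(self) -> Snapshot:
        # Readers take a reference; the snapshot they hold never changes under them.
        return self._current

    def publish(self, data_files, index_files) -> Snapshot:
        # The next snapshot is built off to the side, then made visible by
        # swapping a single reference, so data and indexes appear together.
        with self._lock:
            nxt = Snapshot(self._current.version + 1,
                           tuple(data_files), tuple(index_files))
            self._current = nxt
            return nxt


catalog = Catalog(Snapshot(1, ("data-001",), ("index-001",)))
reader_view = catalog.current()  # a query that started before the publish
catalog.publish(("data-001", "data-002"), ("index-002",))
print(reader_view.version, catalog.current().version)
```

A reader that started before the publish keeps a consistent version 1, while new readers see version 2 with its matching index. An export and re-import pipeline between two separate systems has no single reference to swap, which is why that guarantee is hard to get there.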
Why existing architectures eventually hit their limits

One thing we became increasingly convinced of is that neither side of the current infrastructure stack is wrong. Vector databases and Lakehouse systems both solve important problems. The issue is that each architecture was optimized around only one half of the emerging workload.

Vector databases were designed primarily for online retrieval. Take the open-source Milvus (https://github.com/milvus-io/milvus) as an example. It solves vector search at scale extremely well. But when workloads move beyond serving and into large-scale discovery, natural architectural boundaries appear. Cold data processing becomes expensive. Billion-scale distributed clustering is difficult to express through online search APIs. Many systems assume data must remain loaded into online infrastructure in order to remain queryable. Enterprises that already store massive unstructured datasets in lake environments face migration cost and governance fragmentation when they are asked to move everything into a dedicated retrieval system.

These are not implementation bugs. They are consequences of optimizing for low-latency online retrieval.

Lakehouses solve storage efficiency and batch processing, but were designed around structured data abstractions

The opposite approach, starting from the Lakehouse side, introduces a different set of tradeoffs. Lakehouses solve storage efficiency and batch processing elegantly. But they were designed around structured data abstractions. In most lake architectures, vectors are still treated as long arrays of floats rather than first-class semantic objects. File formats like Parquet were not designed around ANN indexes, inverted indexes, or low-latency semantic retrieval paths.

We saw this directly with a pharmaceutical customer doing molecular similarity search. A brute-force Spark scan across lake data was roughly 1000x slower than indexed vector retrieval using IVF-based search.
The exact number depends on data distribution, index parameters, and hardware, but the lesson is stable: without the right index, many semantic workloads are not economically practical.

There is also a more basic storage problem. Object storage can introduce severe I/O amplification for retrieval-oriented workloads. Semantic search often finds a small number of IDs, but the application still needs the full records behind those IDs. With traditional columnar formats, retrieving a few small records can require reading large storage blocks. That is fine for scans. It is a poor fit for low-latency serving.

Over time, our conclusion became difficult to avoid: the industry should not have to choose between vector databases and lake architectures. It needs an architecture where retrieval and large-scale discovery are native parts of the same operational system.

What we mean by Vector Lakebase

That realization led us toward what we now call Vector Lakebase (https://zilliz.com/blog/from-vector-database-to-vector-lakebase).

The core idea is not “a vector database plus a data lake.” I think that framing misses the deeper architectural point. The goal is to create a unified operational layer for unstructured data, one where online serving, offline discovery, and elastic compute all operate against the same logical data foundation.

For raw data, that means vectors, documents, metadata, logs, and indexes are managed together on lake-native storage. For data that already lives in Iceberg, Lance, Parquet, or object storage, it means the system can map and index that data without forcing a full migration.

Once you start from that requirement, the architecture has to solve several hard problems at once. Compute needs to scale independently from storage. Indexes need to become part of the data layer, not an external acceleration trick. New data and new indexes need to be published together as consistent snapshots.
And existing lake data needs to become searchable without creating another copy.

Those ideas sound simple. Making them work while preserving the performance people expect from a vector database is the hard part. That is where the lower-level engineering decisions start to matter.

The cost of separating storage and compute and how we address it

Storage-compute separation is necessary for the CS/CD loop, but it is not free.

Slow cold start

If compute can scale down to zero, the first query in an on-demand or offline workflow may hit pure cold data. The node has no local index, no warm cache, and no resident data. Everything has to come from object storage. For small datasets, that is manageable. For large vector workloads, it quickly becomes unacceptable.

Consider one billion 768-dimensional vectors. A conventional HNSW index can be around 340 GB. Pulling that full index from S3 can take more than four minutes. Nobody wants to wait four minutes before a search can begin.

Our answer is to make the cold path much smaller. Using RaBitQ-style 1+3 bit quantization, we can compress that roughly 340 GB index to about 13 GB. Search runs in two stages. The first stage uses a 1-bit representation for coarse filtering, with roughly 85 to 90 percent recall while reducing the data size to about one thirtieth of the original. The second stage uses the 1+3 bit representation to rerank and refine results to around 95 percent recall. That brings cold start from minutes down to roughly 5 to 10 seconds.

We then use IVF clustering to reduce the amount of data touched per query. In a representative setup, each query scans around 3 percent of the data. The path becomes: 340 GB of conventional index, compressed to 13 GB, with a single query touching roughly 400 MB after pruning.

This is the difference between elastic vector search as an idea and elastic vector search as a usable system.

I/O amplification

Cold start is only one side of the problem. The other side is record access.
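The two-stage search described in the cold-start discussion above (a cheap 1-bit coarse filter followed by a more precise rerank) can be sketched in a few lines. This is an illustration of the coarse-then-rerank pattern only, not RaBitQ itself: it uses plain sign bits for the first stage and, as a simplification, full-precision vectors in place of the finer quantized codes for the second stage. All names and parameters here are assumptions made for the example.

```python
import random


def sign_code(vec):
    """Pack one bit per dimension (1 if the component is positive) into an int."""
    code = 0
    for x in vec:
        code = (code << 1) | (1 if x > 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_stage_search(query, vectors, codes, shortlist=20, k=5):
    # Stage 1: coarse filter using only the compact 1-bit codes.
    q_code = sign_code(query)
    candidates = sorted(range(len(codes)),
                        key=lambda i: hamming(q_code, codes[i]))[:shortlist]
    # Stage 2: rerank the shortlist with a more precise distance.
    # Simplification: full-precision vectors stand in for finer quantized codes.
    return sorted(candidates, key=lambda i: squared_l2(query, vectors[i]))[:k]


random.seed(0)
dim, n = 64, 500
vectors = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
codes = [sign_code(v) for v in vectors]

query = vectors[42]  # querying with a stored vector should return it first
top = two_stage_search(query, vectors, codes)
print(top)
```

The design point is that stage 1 touches only the small codes, so only those need to be fetched on a cold start. The more expensive representation is read for the shortlist alone.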
Vector search returns IDs. But applications need full records: text chunks, metadata, document pointers, permissions, timestamps, image attributes, or other fields.

In a standard Parquet layout, a small point read can force the system to download a large row group. A query may need only a few kilobytes of useful data but end up pulling tens of megabytes from object storage. Shrinking row groups helps point reads, but it hurts compression and scan efficiency.

That is why we built Loon, the rebuilt storage engine behind Zilliz Vector Lakebase. Loon uses mixed file formats, row alignment, and manifest-based versioning. Scalar fields can use columnar layouts that remain efficient for filtering and scans. Vector fields and point-query-heavy data can use layouts that are better suited for low-latency retrieval. Column groups align row IDs so the system can fetch the fields it needs without dragging large unrelated blocks through the network.

Under the hood, Loon uses Vortex (https://github.com/spiraldb/vortex), an open-source file format under the Linux Foundation. Vortex supports flexible layouts and nested encodings, including point queries without decompressing large irrelevant blocks.

In one internal test with 3 million rows, 128-dimensional vectors, S3 storage, and 256 concurrent readers, Parquet point reads downloaded about 9.4 MB per read. Vortex downloaded about 0.07 MB. That is a 135x reduction in downloaded data. Full-scan throughput was also higher in that setup.

The point is not just that one format is faster for one benchmark. The point is that serving and discovery need different access patterns over the same logical data. Online systems need fast point reads. Batch systems need efficient scans. A Vector Lakebase has to support both without forcing users to maintain two copies of the data.

Vector Lakebase: one data foundation, multiple compute modes

Once the data layer is shared, compute cannot be one-size-fits-all.
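As a back-of-the-envelope check, the storage figures quoted in the two preceding subsections can be worked through in a few lines. All inputs are the rounded numbers from the text. The one added assumption, flagged in the code, is that the full index pull takes exactly four minutes and that the compressed index downloads at the same sustained rate.

```python
def reduction(before, after):
    """How many times smaller `after` is than `before`."""
    return before / after


# Cold path, in GB (figures from the text).
full_index_gb = 340
compressed_gb = 13
scan_fraction = 0.03  # share of data touched per query after IVF pruning

per_query_mb = compressed_gb * 1000 * scan_fraction
print(f"index compression: {reduction(full_index_gb, compressed_gb):.0f}x")
print(f"data touched per query: {per_query_mb:.0f} MB")

# Assumption: the full pull takes 4 minutes (the text says "more than four"),
# and the compressed index downloads at the same sustained rate.
throughput_gb_s = full_index_gb / (4 * 60)
compressed_pull_s = compressed_gb / throughput_gb_s
print(f"compressed index pull: {compressed_pull_s:.1f} s")

# Point reads, in MB downloaded per read (figures from the text).
point_read_reduction = reduction(9.4, 0.07)
print(f"point-read reduction: {point_read_reduction:.0f}x")
```

Under these assumptions the compressed pull lands inside the quoted 5 to 10 second range, and 3 percent of 13 GB is about 390 MB, in line with the "roughly 400 MB" figure. The rounded point-read inputs give about 134x, which is close to the 135x quoted, presumably because that number was computed from unrounded measurements.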
Different AI workloads have very different shapes. Some need predictable low latency all day. Some need an interactive search session for ten minutes. Some need a large batch job that runs overnight and then disappears. That is why Zilliz Vector Lakebase supports three compute modes.

Long-running compute is for production serving. The cluster stays resident. Hot indexes and data are preloaded into memory and local disk. Queries return in milliseconds. This is the right mode for production RAG, real-time recommendation, personalization, online agents, and any workload where latency is part of the user experience.

On-demand compute is for interactive work. It starts in seconds and is billed at minute-level granularity. This is useful for similarity exploration, anomaly inspection, cold-data retrieval, or ML engineering workflows where the dataset should remain searchable but does not justify a 24/7 cluster.

Offline Batch compute is for large processing jobs: vector clustering, training-data deduplication, full re-embedding, index rebuilding, and data quality scans. Resources are allocated to the job and released when the job completes.

The handoff back to serving happens through the Catalog as a new snapshot. Serving continues reading the old snapshot until the new data and indexes are ready. Then the new version becomes visible atomically. That atomic switch matters. Discovery is only useful if improvements can flow back into production without exposing half-built indexes or inconsistent data.

One customer example shows why the distinction matters. An autonomous-driving customer had one billion 768-dimensional vectors, but only needed about 20 minutes of online query time per day. Running the workload as a long-running cluster cost roughly $7,000 per month. Moving it to the on-demand mode reduced the monthly cost to around $500. The same customer had a deduplication workflow that previously spent about 70 hours doing ANN searches one by one.
Reworking it as an offline batch job reduced the compute time to around 10 hours on the same resource class.

The lesson is not that one compute mode is better than another. The lesson is that AI data workloads are not one shape, and the architecture should not force them into one.

Resource scheduling becomes part of the Vector Lakebase

The three compute modes only work if resource scheduling is just as elastic as the compute itself.

Traditional database schedulers usually assume a fixed pool of machines. Given these nodes, the system decides where to place data and how to balance loads. That model works well when the workload is steady. It is a poor fit for AI workloads that appear in bursts: an on-demand search session, a short inspection of cold data, an overnight deduplication job, then hours of nothing.

In that world, the better question is not only where the data should run. It is whether the compute should be running at all.

This is why Vector Lakebase has to schedule data and resources together. In practice, that means keeping a Warm Pool of prepared nodes, attaching data quickly when work arrives, keeping resources warm briefly after the request, and releasing them when they are no longer useful.

It also changes the economics. This is not the same as per-request serverless pricing, and it is not the same as dedicated monthly capacity. For many AI data workloads, minute-level usage is the more natural unit: pay for compute while the loop is running, then let it disappear.

There is a larger architecture shift behind this, from a control plane that manages a mostly static kernel to a kernel that understands resources, cache state, snapshots, and cost. That deserves its own post. For this article, the important point is simpler: without this resource model, Long-running, On-demand, and Offline Batch would be three separate deployment choices, not three parts of the same elastic data system.
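The keep-warm-then-release behavior described above can be illustrated with a small sketch. The `WarmPool` class, its method names, and the 300-second idle window are hypothetical, chosen only for this example. The sketch covers idle-timeout reuse and release, not pre-provisioning of nodes or data attachment, and it takes time as an explicit argument so the behavior is deterministic.

```python
class WarmPool:
    """Hypothetical pool that keeps nodes warm for a short idle window, then releases them."""

    def __init__(self, idle_ttl_s: float):
        self.idle_ttl_s = idle_ttl_s
        self._idle = {}       # node_id -> time the node became idle
        self._busy = set()
        self._next_id = 0

    def acquire(self, now: float) -> tuple:
        """Return (node_id, was_warm). Reuse an idle warm node if any, else start cold."""
        self.reap(now)
        if self._idle:
            node_id = next(iter(self._idle))
            del self._idle[node_id]
            self._busy.add(node_id)
            return node_id, True
        node_id = self._next_id
        self._next_id += 1
        self._busy.add(node_id)
        return node_id, False

    def release(self, node_id: int, now: float) -> None:
        """Work finished: keep the node warm instead of shutting it down at once."""
        self._busy.discard(node_id)
        self._idle[node_id] = now

    def reap(self, now: float) -> list:
        """Shut down nodes that have been idle longer than the TTL."""
        expired = [n for n, t in self._idle.items() if now - t > self.idle_ttl_s]
        for n in expired:
            del self._idle[n]
        return expired

    def running(self) -> int:
        """Number of nodes currently held, whether busy or idle-but-warm."""
        return len(self._idle) + len(self._busy)


pool = WarmPool(idle_ttl_s=300)
node, warm = pool.acquire(now=0)       # first request: cold start
pool.release(node, now=60)
node2, warm2 = pool.acquire(now=120)   # follow-up inside the window: warm reuse
pool.release(node2, now=180)
pool.reap(now=1000)                    # long idle gap: compute scales to zero
print(warm, warm2, pool.running())
```

The idle window is the knob that trades cost against latency: a longer window makes follow-up requests more likely to hit a warm node, while a shorter one gets closer to paying only for the minutes in which work is actually running.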
External Collection: meeting data where it already lives

There is one more reality we had to design for. Most enterprises already have large amounts of unstructured data in lake environments: Lance tables, Iceberg tables, Parquet datasets, and object storage directories. Asking them to move everything into a new system before they can use it is not realistic.

That is why we built External Collection within Zilliz Vector Lakebase. External Collection is not only a zero-copy mapping. It builds an independent indexing layer on top of external data. The original data stays where it is and remains governed by the customer’s existing platform, while Zilliz builds and manages the vector indexes, inverted indexes, and JSON indexes needed to make that data searchable through the same retrieval path as native data.

Our internal principle became simple: One Data. One Index. No duplicated storage. No dual-write pipelines. No fragmented discovery paths.

This means the CS/CD loop can cover more than the data already imported into a vector database. It can include the unstructured data assets enterprises already have in their lakes.

What defines the first generation of Vector Lakebase

These ideas are not just paper architecture. We are already shipping them in Zilliz Vector Lakebase, and the process of building it has made our view of the category much more concrete. A first-generation Vector Lakebase has to get a few things right at the same time.

First, storage-compute separation with multi-layer caching. Data lives in object storage, and compute can scale independently, including down to zero. But separation alone is not enough. Online vector search still needs memory, local disk, warm nodes, and cache-aware execution to keep hot queries ms-level fast.

Second, unified management for multimodal unstructured data. The system should manage not only vectors, but also source documents, images, audio, video, embeddings, scalar metadata, permissions, and indexes.
A system that only stores vectors is an index service, not a data foundation.

Third, native vector database capabilities. Millisecond ANN search, index lifecycle management, hybrid search, scalar filtering, full-text retrieval, JSON filtering, and multiple similarity metrics must be built in. Connecting a Lakehouse to an external vector database does not remove fragmentation. It just creates another pipeline.

Fourth, multiple compute modes. Online serving, on-demand interaction, and offline batch processing need to operate over the same logical data. On-demand compute is especially important because it becomes the bridge between production serving and large-scale offline processing.

Fifth, open formats and no forced migration. The storage layer should be readable by external engines such as Spark, Ray, and Daft. Existing Iceberg tables, Lance datasets, and Parquet files should be able to join the system without unnecessary copying. Data belongs to the user, not the engine.

Sixth, resources should follow the data. Compute can disappear when it is not needed, while metadata remains visible and queryable. A request can bring resources back in seconds. Idle tenants should not pay for dedicated compute they are not using. This is not just autoscaling; it requires the engine to make resource decisions together with data decisions.

These are our current beliefs, not the final word. We will keep revising them as the system matures. But one pressure seems unlikely to change: unstructured data will keep growing, while infrastructure budgets will not grow at the same rate. That means AI systems need to become more iterative, more efficient, and more continuously adaptive.

Vector databases are not disappearing

So, returning to the original question: does this mean vector databases are going away? Not at all.

If anything, semantic retrieval becomes more important in this architecture. But its role changes.
Vector databases become the serving engine inside a larger unstructured data system, much like transactional databases remained essential inside the broader Lakehouse era. OLTP systems were not replaced by Lakehouses. They became one layer inside a larger architecture stack. I believe vector databases are undergoing the same transition now.

The broader shift happening underneath AI infrastructure is not simply about retrieval. It is about building continuous operational loops around unstructured data itself. Serving generates feedback. Discovery improves data quality. Those improvements flow back into production. Every turn of the loop makes the system better.

Everything else, including storage formats, caching hierarchies, indexing systems, elastic compute models, and resource scheduling, exists to make that flywheel economically viable at scale.

We still do not know exactly what Vector Lakebase will become over the next five years. When we started Milvus nearly a decade ago, we also could not have predicted where vector databases themselves would lead.

But one thing feels clear now. Unstructured data will continue growing. Models will continue changing. Agents will generate more traces, feedback, and state. Teams will need to improve their data faster without letting infrastructure cost grow without limit. The systems that succeed will be the ones that make continuous serving and continuous discovery feel like part of the same machine.

That is the direction we are building toward.

Zilliz Vector Lakebase is available in public preview

We've launched the public preview of Zilliz Vector Lakebase (https://zilliz.com/blog/from-vector-database-to-vector-lakebase), a major evolution of Zilliz Cloud from a managed vector database to a unified semantic data platform, combining low-latency vector serving with the openness, scalability, and economics of a data lake.
Zilliz Vector Lakebase core capabilities:

- Tiered serving optimized for different real-time performance-cost trade-offs
- On-demand search for large-scale or exploratory workloads without always-on compute
- External data lake search: index and search directly over your existing lake data
- Full-spectrum search across vectors, text, JSON, and geospatial data with hybrid retrieval and reranking
- Unified lake-native storage built on Vortex, an open format with faster and cheaper random reads than Lance or Parquet

If your current stack splits serving and discovery into separate systems, Vector Lakebase might be worth a look. Try it on Zilliz Cloud (https://cloud.zilliz.com/), where new work email signups get $100 free credits, or talk to us (https://zilliz.com/contact-sales) about your use case.

Note: The performance and cost figures in this article come from open source VectorDB Benchmark results, internal testing, and anonymized customer scenarios. Actual results vary based on data scale, distribution, index parameters, workload shape, and resource configuration.