Embedding pipelines are the new ETL

Teams building AI applications are discovering that embedding pipelines for retrieval-augmented generation (RAG) systems are fundamentally a data engineering problem, not a new AI discipline. The process of ingesting, chunking, and indexing raw documents into vector databases mirrors traditional ETL (extract, transform, load) workflows, with versioning, data freshness, and lineage issues that teams have already solved in data infrastructure. Treating these pipelines as production-grade data systems rather than weekend prototypes prevents the common failure of AI systems that deliver outdated answers and lose user trust months after launch.

I’ve seen a lot of promising AI prototypes fall apart after launch, and it’s rarely because the model was bad. More often, the problem starts much earlier: teams treat the data layer like something they can figure out later. They’ll spend weeks fine-tuning prompts, testing models and debating evaluation scores, then throw together the retrieval pipeline over a weekend and move on. At first, everything looks great in demos. But a few months later, the system gives outdated answers, the embeddings no longer match the source documents, and nobody fully understands what changed. What started as an impressive prototype slowly becomes difficult to trust in production.

The teams that avoid this tend to realize one thing early: Embedding pipelines are fundamentally a data engineering problem, not an entirely new AI discipline. It’s still ETL (extract, transform, load) at its core, but with embeddings and vector stores as the destination instead of a warehouse. Once you start looking at it that way, a lot of things become clearer. Problems like versioning, data freshness, lineage and retries stop feeling “AI-specific.” They’re data infrastructure problems we’ve already spent years learning how to solve.

Large language models are extraordinary reasoners trapped inside a time capsule. When training ends, the model’s knowledge is sealed. It does not know what your team decided in last quarter’s strategy review. It has never read the support ticket that came in this morning. It cannot find the clause buried on page 47 of your master service agreement. It’s brilliant, but blind to anything specific to your organization.

Layer on top of that a hard context window limit, a ceiling on how much text the model can process in a single interaction, and you have a clear problem: you cannot just hand it everything you own. The answer the industry converged on is retrieval-augmented generation (https://arxiv.org/abs/2005.11401), or RAG.
Instead of stuffing everything into the context window, you build a retrieval layer that fetches only the most relevant pieces of information at the moment a question is asked and passes just those to the model. That retrieval layer is powered by a vector database (https://www.pinecone.io/learn/vector-database/), and the process that populates it, taking raw documents and transforming them into searchable semantic representations, is what I mean when I say embedding pipeline. Every team building an internal AI assistant, a smarter enterprise search tool, an automated customer support agent or a document Q&A system needs one. The question is not whether to build it. The question is whether you build it like a prototype or like infrastructure.

An embedding pipeline has three stages: ingestion, chunking and indexing. Here is what each one means and how I relate them to a typical ETL process.

Ingestion means getting your raw content (PDFs, wiki pages, Word documents, database records, transcripts) out of wherever it lives and into the pipeline. This is ETL’s extract stage, almost verbatim. I see teams cut corners here more than anywhere else, and it’s often where production systems first start to fail. A document gets updated, but the pipeline doesn’t pick it up. A file gets deleted, but its chunks remain in the index, still returning outdated answers months later. And because there’s no obvious error, no one reports it. The fix is change data capture, or CDC (https://www.confluent.io/learn/change-data-capture/).
In practice, this means maintaining a manifest of every document you have ingested, along with a content hash and a timestamp. On each run, compare the sources against that manifest, re-ingest what changed and delete what is gone. Treat your documents the way you would treat any source table you are syncing incrementally.

Once your documents are in the pipeline, you cannot embed them whole. A 30-page technical report is too long to represent meaningfully as a single vector, and even if it were not, returning the entire report in response to a narrow question would bury the model in irrelevant context. Chunking is the process of breaking each document into smaller pieces that are focused enough to embed accurately and retrieve precisely. This is ETL’s transform stage, and it deserves the same level of design discipline.

The most common mistake I see is treating chunk size as a default configuration option rather than a product decision. It is not. The right chunk size depends entirely on the nature of your content and the nature of your queries. Dense technical documentation needs finer granularity than a collection of FAQs. A legal contract with clause-level logic needs different treatment than a set of onboarding emails. What works for one document type will actively degrade retrieval quality on another.

My strong preference is to treat your chunking configuration as a versioned pipeline parameter, not hardcoded logic. When you change it, and trust me, you will, you need to re-chunk in a controlled, observable way, compare retrieval quality before and after, and roll back if it degrades. That is just good transform-layer hygiene. It is no different from versioning a data cleaning rule or a field mapping.

The final stage, indexing, is where chunked text gets converted into vectors and stored in a vector database, where it can be searched by semantic similarity rather than keyword match. This is ETL’s load stage, with the vector store as the destination.
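To make the ingestion manifest described above more concrete, here is a minimal sketch of the comparison step, assuming documents are available as plain text keyed by an ID. The names `ManifestEntry`, `content_hash` and `plan_sync`, along with the sample document IDs, are hypothetical and only for illustration. A real pipeline would persist the manifest and would also need to remove the deleted documents' chunks from the index.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestEntry:
    content_hash: str  # SHA-256 of the document text at last ingestion
    ingested_at: str   # ISO-8601 timestamp of last ingestion


def content_hash(text: str) -> str:
    """Hash the document body so changes can be detected without a diff."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(
    sources: dict[str, str], manifest: dict[str, ManifestEntry]
) -> tuple[list[str], list[str]]:
    """Compare current sources against the manifest.

    Returns (to_ingest, to_delete): IDs that are new or changed, and IDs
    that were ingested earlier but no longer exist at the source.
    """
    to_ingest = sorted(
        doc_id
        for doc_id, text in sources.items()
        if doc_id not in manifest
        or manifest[doc_id].content_hash != content_hash(text)
    )
    to_delete = sorted(doc_id for doc_id in manifest if doc_id not in sources)
    return to_ingest, to_delete


# Hypothetical state: one document changed, one is new, one was removed.
manifest = {
    "handbook.md": ManifestEntry(content_hash("v1 text"), "2024-01-01T00:00:00Z"),
    "old-policy.md": ManifestEntry(content_hash("retired"), "2024-01-01T00:00:00Z"),
}
sources = {"handbook.md": "v2 text", "faq.md": "new doc"}

to_ingest, to_delete = plan_sync(sources, manifest)
print(to_ingest)  # ['faq.md', 'handbook.md']
print(to_delete)  # ['old-policy.md']
```

Hashing content rather than relying on modification times is one reasonable choice here, since it ignores touch-only updates; other change signals would work with the same structure.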
In the conversion step, embedding is handled by a model specifically trained to turn text into dense numerical representations that encode meaning. Two chunks expressing the same idea in different words will produce vectors that cluster close together in that mathematical space. Two chunks discussing entirely different topics will sit far apart. When a user asks a question, the system embeds that question the same way, finds the chunks whose vectors are nearest, and returns them as context for the model to reason over. That is a genuinely new capability. But the discipline around indexing is not.

One data engineering principle I keep coming back to is versioning. In embedding pipelines, every chunk in your index should be tagged with the embedding model name and version used to generate it. This is non-negotiable. Embedding models evolve, and vectors produced by different versions are not comparable in a reliable way. You cannot safely search across them as if they are interchangeable.

This exact problem shows up when teams upgrade embedding models mid-pipeline without a proper migration plan. You end up mixing vectors from different generations in the same index, and retrieval starts to degrade in ways that are hard to detect. The system just quietly begins returning subtly wrong answers. I treat an embedding model upgrade the same way I treat a schema migration: Plan it explicitly, execute it in full and validate retrieval quality on a representative query set. The stakes are the same as any breaking change to your data model.

Once an embedding pipeline is running in production, the question shifts from “did it run” to “did it run correctly.” That distinction matters more here than in most pipelines, because failures are rarely loud: the index looks fine, queries return without errors and the system quietly surfaces wrong answers until someone notices the AI has stopped being useful.
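The version-tagging rule from the indexing stage can be sketched as follows. This is an illustrative toy, not a vector database client: the `IndexedChunk` type, the `search` and `stale_chunks` helpers, the model name `example-embedder` and the two-dimensional vectors are all assumptions made up for the example. The point it shows is that a query only ranks chunks embedded with the same model and version, and that anything else is reported as migration backlog.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedChunk:
    chunk_id: str
    vector: tuple[float, ...]
    model: str          # embedding model name (hypothetical values below)
    model_version: str  # version of that model used for this vector


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query_vector, model, model_version, k=2):
    """Rank only chunks embedded with the same model and version as the query."""
    compatible = [
        c for c in index if (c.model, c.model_version) == (model, model_version)
    ]
    ranked = sorted(
        compatible, key=lambda c: cosine(c.vector, query_vector), reverse=True
    )
    return [c.chunk_id for c in ranked[:k]]


def stale_chunks(index, model, model_version):
    """Chunks tagged with another model or version: the migration backlog."""
    return [
        c.chunk_id
        for c in index
        if (c.model, c.model_version) != (model, model_version)
    ]


index = [
    IndexedChunk("refund-policy", (0.9, 0.1), "example-embedder", "v2"),
    IndexedChunk("shipping-times", (0.1, 0.9), "example-embedder", "v2"),
    IndexedChunk("refund-policy-old", (0.95, 0.05), "example-embedder", "v1"),
]

top = search(index, (1.0, 0.0), "example-embedder", "v2", k=1)
backlog = stale_chunks(index, "example-embedder", "v2")
print(top)      # ['refund-policy']
print(backlog)  # ['refund-policy-old']
```

Note that the v1 chunk would have scored highest on raw similarity, which is exactly the kind of silent mixing the tag is meant to prevent.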
The same observability discipline that makes any data pipeline trustworthy applies directly here. Once you treat embedding pipelines as production systems, you stop thinking in isolated steps and start thinking in signals.

For example, chunk counts per document become a simple but powerful health check. A sudden drop is usually not a model issue, but a sign of broken ingestion or upstream parsing failures.

You also need a “golden set” of queries with known-good outputs. Run it after every pipeline change, much like data quality checks after a transformation. This is how you catch regressions that don’t show up as explicit failures.

On top of that, you can track lineage: which embedding model version produced which chunks, and when each document was last ingested. That makes it possible to trace retrieval issues back to specific changes instead of guessing.

And finally, freshness becomes a first-class signal. If documents start going stale beyond an acceptable threshold, that should surface in monitoring long before users experience degraded results.

The metric that ties it all together is retrieval quality over time. Treat it like any other pipeline SLA: measured, tracked and owned.

Embedding pipelines definitely come with a lot of new language, new tools and a genuinely different capability in the semantic layer. But the funny thing is, the principles that actually make them reliable in production are not new at all. Versioning, freshness, quality checks and monitoring are problems data engineering has already spent years solving. The real work is taking that same discipline and applying it to a pipeline that just happens to output vectors instead of rows in a table.

Once you start seeing it that way, a lot of the chaos around AI systems becomes much easier to reason about. That’s the difference between building a cool AI demo and building something people can actually depend on. One is a prototype, whereas the other is infrastructure.
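As a closing illustration of two of the signals discussed above, the golden-set check and the chunk-count health check, here is a minimal sketch. Everything in it is an assumption for the example: the helper names, the recall-at-k metric, the 0.8 and 50% thresholds, and the stand-in retriever, which would be replaced by a call into your actual retrieval layer.

```python
def recall_at_k(retrieved: list[str], expected: set[str], k: int) -> float:
    """Fraction of the expected chunk IDs that appear in the top k results."""
    if not expected:
        return 1.0
    return len(expected.intersection(retrieved[:k])) / len(expected)


def run_golden_set(retrieve, golden_set, k=3, threshold=0.8):
    """Score every golden query; return (mean recall, queries below threshold)."""
    scores = {
        query: recall_at_k(retrieve(query), expected, k)
        for query, expected in golden_set.items()
    }
    mean = sum(scores.values()) / len(scores)
    failures = sorted(q for q, s in scores.items() if s < threshold)
    return mean, failures


def chunk_count_drops(previous, current, max_drop=0.5):
    """Documents whose chunk count fell by more than max_drop since last run."""
    return sorted(
        doc_id
        for doc_id, before in previous.items()
        if current.get(doc_id, 0) < before * (1 - max_drop)
    )


# Hypothetical golden set: query -> chunk IDs a good retriever should return.
golden_set = {
    "how do refunds work": {"refund-policy"},
    "when will my order ship": {"shipping-times"},
}

# Stand-in retriever with canned results; the second query has regressed.
fake_results = {
    "how do refunds work": ["refund-policy", "faq"],
    "when will my order ship": ["faq", "returns"],
}


def fake_retrieve(query):
    return fake_results.get(query, [])


mean_recall, failures = run_golden_set(fake_retrieve, golden_set)
dropped = chunk_count_drops({"a.pdf": 10, "b.pdf": 8}, {"a.pdf": 10, "b.pdf": 2})
print(mean_recall)  # 0.5
print(failures)     # ['when will my order ship']
print(dropped)      # ['b.pdf']
```

Run after each pipeline change, the mean recall becomes a data point for the retrieval-quality trend, and the failure list tells you which queries to investigate first.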
This article is published as part of the Foundry Expert Contributor Network. Want to join? https://www.infoworld.com/expert-contributor-network/