# Evolving Dataflow to process massive datasets for machine learning

> Source: <https://cloud.google.com/blog/products/data-analytics/ai-focused-innovations-in-dataflow/>
> Published: 2026-05-28 16:00:00+00:00

Google created [MapReduce](https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf) more than 20 years ago to solve the scaling problems in data processing that the then young company was running into. The AI era that we are in now demands efficient, large-scale data processing for everything from training frontier models like Gemini by Google DeepMind to powering fully autonomous vehicles like Waymo.

Many aspects of machine learning, including data ingestion, transformation, and feature extraction, rely heavily on processing massive datasets. To meet this astronomical scale required by efforts across Google, we evolved our data platform, Flume, the successor to the original MapReduce, with innovations focused on **scalability**, **efficiency**, and a better **developer experience**. And many of those innovations are available as part of [Dataflow](https://cloud.google.com/products/dataflow), our fully managed batch and streaming platform built on the same core technology Google uses to power its most demanding internal workloads.. In this blog, we provide an overview of the many innovations in the Flume platform, and a glimpse into how Google Cloud customers are putting those features into action with Dataflow.

The scale of data processing at Google has exploded over the last 20 years and continues to drive innovation. To tackle the challenges of immense scale, we introduced several features within Google's data processing platform, which are also available in Dataflow::

**Liquid sharding** dynamically splits work units (shards) during execution for on-the-fly rebalancing. This helps pipelines with uneven data distribution and stragglers to maximize worker efficiency as data grows.

**Global compute** enables enormous scaling by dynamically scheduling workloads across Google's global infrastructure. The system automatically determines the optimal location based on factors like data locality and resource availability.

**Automatic pipeline optimization** fuses consecutive operations into a single stage. This reduces I/O and stage-transition overhead, allowing large-scale execution to scale more gracefully.

**Rate-limiting external API calls** manages load on external services. This is essential for modern ML pipelines that frequently call external APIs for tasks like model evaluation, preventing high data volumes from overloading systems.

**Tandem pools** facilitate serverless remote inference. This feature helps overcome scalability limitations often found in remote inference systems by efficiently hosting, sharing, managing, and autoscaling external model servers.

Doing more with less isn't just a constraint; it fuels our progress. By finding ways to run more efficiently, we create the space and capacity needed for rapid innovation. This is particularly evident for teams that use accelerators like TPUs for their workloads. To improve utilization and cost efficiency, our engineers devised several novel features for our platform, now part of Dataflow:

**Heterogeneous worker pools** allow developers to specify custom resource requirements for different pipeline stages. For example, TPU-intensive work runs on TPU-equipped workers, while other stages use standard CPU workers. This ensures optimal resource allocation.

**TPU-aware autoscaling** prevents excessive initial assignment of TPU workers and improves efficiency during subsequent autoscaling events.

**Duty-cycle policy enforcement** automatically scales down TPU workloads when the accelerator's duty cycle (the fraction of time it is active) is low, scaling back up only when utilization improves.

**TPU fungibility**: By working with other infrastructure teams, we developed optimizations to encourage scheduling jobs to the most suitable TPU version and cell location based on quota and resource availability.

Considering the wide mix of backgrounds and tools across Google, rapid prototyping, iteration, and reliable production operations are extremely important. Google has invested in significant capabilities to enhance the overall user experience:

**Language flexibility** is provided through a versatile SDK with a simple API in C++ (internal to Google), Java, Python, and Go (with SQL support). This allows users to build batch, ML, and streaming pipelines.

**Integration** with ML frameworks like [JAX](https://docs.jax.dev/) is available, along with native support for LLM-specific optimizations. The underlying platform also provides building blocks for robust agentic inference pipelines and supports simple transitions between bulk and streaming paradigms.

**Unified batch and streaming** enables users to use the same code for both historical batch and live streaming data. This simplifies the architecture, which traditionally would have required separate pipelines for batch and streaming data processing.

**Observability** for production pipelines is available through the monitoring UI, which offers comprehensive control and essential diagnostic data. Detailed performance metrics, such as stage-level TPU utilization graphs, provide transparency for troubleshooting and optimization tasks.

**Advanced developer workflows** for quicker day 0 and day 2 operations include features like sampling and dry-run to help ensure code accuracy. Users can also test pipelines on small in-memory collections, and even pause and resume production pipelines.

Dataflow is built upon Google's internal platform, sharing many core components, including the execution engine and the Apache Beam SDK (which originated from Flume’s APIs). This close relationship means that the cutting-edge solutions we build to handle Google’s internal data processing challenges, like pipelines that process hundreds of billions of documents, directly benefit Dataflow users. In fact, unique Dataflow features like vertical scaling, right fitting, dynamic sharding, and straggler detection all resulted from solutions developed for Google’s internal workloads.

This is one of the reasons many Google Cloud customers rely on Dataflow for critical ML applications: **Spotify** uses Dataflow for [large-scale generation of ML podcast previews](https://engineering.atspotify.com/2023/04/large-scale-generation-of-ml-podcast-previews-at-spotify-with-google-dataflow)**. Etsy** leverages Dataflow for [data preparation and ETL](https://cloud.google.com/customers/etsy-ai?hl=en&e=48754805) for its ML workloads. And **Moloco** uses Dataflow to process [terabytes of data a day to update its prediction model](https://cloud.google.com/customers/moloco) for real-time ad bidding.

The momentum continues: Last quarter we launched support for TPU in Dataflow in addition to supporting GPUs. Looking ahead, we are working on an advanced reliability feature called speculative execution and are enhancing the developer experience with features like failure isolation and replay and pause/resume, which are coming soon. To learn more or get started with Dataflow visit [https://docs.cloud.google.com/dataflow/docs/get-started](https://docs.cloud.google.com/dataflow/docs/get-started).
