{"slug": "ray-data-2-56-improving-reliability-for-ai-data-pipelines", "title": "Ray Data 2.56: Improving Reliability for AI Data Pipelines", "summary": "Ray Data 2.56 introduces memory-aware execution and improved prefetching to reduce out-of-memory failures and unnecessary object spilling in AI data pipelines. The update includes automatic batch size tuning and better process termination prioritization, leading to fewer crashes and improved reliability for batch inference and training workloads.", "body_md": "# Ray Data 2.56: Improving Reliability for AI Data Pipelines\n\n[Balaji Veeramani](/blog?author=balaji-veeramani), [Justin Yu](/blog?author=justin-yu), [Ray Huang](/blog?author=ray-huang), [Ayush Kumar](/blog?author=ayush-kumar), [David Dai](/blog?author=david-dai), [Akshay Malik](/blog?author=akshay-malik) and [Richard Liaw](/blog?author=richard-liaw) | June 30, 2026\n\nRay Data is one of the most common solutions for large-scale batch inference and training ingest pipelines. In Ray Data 2.56, we focused on two of the biggest reliability challenges users reported: out-of-memory (OOM) failures and unnecessary object spilling.\n\nBoth of these issues significantly slow down Ray Data pipelines and can often crash the job outright. For example, spilling can lead to out-of-disk errors, and significant memory pressure can cause the operating system to terminate critical driver processes.\n\nIn the development of Ray Data 2.56, we’ve spent time tackling both problems, aiming to provide users with a much more reliable experience. 
With the latest updates, we saw reduced OOMs and spilling on a variety of internal workloads.\n\nThese gains were primarily driven by:\n\n**Memory-aware execution.** Ray Data now registers task memory more accurately, automatically tunes batch sizes for CPU `map_batches` workloads, and prioritizes terminating idle workers instead of critical processes during memory pressure.\n\n**Improved prefetching for training.** Ray Data now has less contention on the GPU stream and reduced memory usage on training threads, leading to better performance and stability.\n\nWe cover these changes and investigations in the sections below.\n\nTry out Ray Data 2.56 by running `pip install -U ray`!\n\n## Fewer OOMs, fewer crashes\n\nUsers running Ray Data prior to 2.56 often reported out-of-memory (OOM) errors, which sometimes crashed their pipelines. The usual cause is memory oversubscription: multiple tasks run on the same node and together request and provision more memory than is available on the machine. Oversubscription then triggers the operating system to start killing processes, resulting in OOM warning messages, dying components, and failing pipelines.\n\nTo address this issue, we need to:\n\nMake sure Ray Data tasks register the appropriate amount of memory with the Ray scheduler.\n\nMake sure that, when out-of-memory errors do occur, processes are terminated lowest priority first.\n\nFor memory registration, Ray Data 2.56 offers two main improvements. First, Ray Data features a flag, `DataContext.get_current().default_map_logical_memory_enabled`. When this flag is set, all map tasks (which can include tasks executed under `map_batches`, `flat_map`, `map`, and reads) have their logical memory configured to 4 GB per CPU, if not already configured.\n\nWe also introduced a batch size tuning feature for CPU `map_batches` calls, enabled by passing `batch_size=\"auto\"` to your `map_batches` calls. 
Now, instead of users manually tuning batch sizes, the system profiles row sizes and automatically chooses a batch size that keeps the size of the inputs within a safe threshold (in this case, under 16 MB).\n\nWe’ve also improved Ray’s process management for high-memory scenarios. Ray now handles OOM process termination itself rather than leaving it to the underlying operating system. This lets Ray terminate idle workers instead of critical driver processes, preventing node deaths and unnecessary task kills. In addition, Ray now reserves a fixed memory overhead for system processes like the raylet, giving more headroom for application memory usage.\n\nTo measure the impact of the changes, we ran an audio transcription benchmark ([link](https://github.com/ray-project/ray/blob/eeb194f3bdce65a7e21bba971e20aecc068d668f/release/nightly_tests/multimodal_inference_benchmarks/audio_transcription/ray_data_main.py)) on 8 g6.xlarge instances on AWS. Going from 2.55 to 2.56, OOMs dropped from **300+** to **0**, and end-to-end runtime fell from **1055s** to **447s**.\n\n[Figure: benchmark comparison, panels **2.55** and **2.56**]\n\nFor a deeper guide on diagnosing and avoiding OOMs, see the new user guide: https://docs.ray.io/en/master/data/how-to-avoid-ooms.html\n\n## Less spilling, faster pipelines\n\nRay Data also often spills objects on workloads such as training ingestion. When investigating these issues, we found that a major cause was incorrect estimation of object store memory usage.\n\nRay Data maintains an internal estimate of how much object store memory is currently used. This internal metric is also used to trigger backpressure (i.e., signaling to upstream operators to slow down production). 
However, if the estimate is wrong, backpressure will misfire or not kick in at all, causing unnecessary spilling in the data pipeline.\n\nTwo of the main causes of this were **Pandas block formats** and, in the case of training dataloading, **not fully accounting for object store memory used by the prefetching and collation pipeline on the training workers.**\n\n**Pandas block format:** Ray Data previously had multiple block formats for internal data representation, one of those being Pandas DataFrames. However, Pandas DataFrames did not provide an effective way to account for total memory size, especially for generic Python objects, which led to inaccurate estimates. In Ray Data 2.56, we’ve consolidated block formats and now use PyArrow as the only block format, which gives much more accurate in-memory size estimates.\n\n**Improving prefetching:** In our investigation into the reported spilling issues, we noticed that Ray Data prefetched several times the configured number of `prefetch_batches` per worker without any improvement in throughput. The extra prefetched data contributed to object store memory usage and was mostly untracked by Ray Data’s usage estimation, which would cause spilling to disk if the data exceeded the object store spilling threshold.\n\nFurther, these prefetched batches were pinned in the object store but not counted toward object store memory usage, meaning that Ray Data was underestimating how much object store memory it was using on the training worker nodes.\n\nBy reducing the number of prefetched batches and accounting for them properly, we were able to reduce object store usage, leading to less spilling and lower peak object store memory usage.\n\nWe also noticed that prefetching from CPU to GPU was contending with the default CUDA stream, resulting in queuing of compute kernels on the GPU side and thus lowering training throughput. 
By reducing the amount of prefetching, we also saw training throughput improvements.\n\n**Experiments:** In our internal object store memory [backpressure stress test](https://github.com/ray-project/ray/blob/14ee057472ffaa034ded79fbca4d7dcb82846119/release/nightly_tests/dataset/backpressure_benchmark.py#L4), spilling was eliminated in 2.56 (down from 70 GiB in 2.55), and peak object store memory fell by 41%.\n\nIn another benchmark where dataloading is the bottleneck, we saw a 25% increase in training step throughput, due to reduced CUDA stream contention from prefetching fewer unnecessary batches to GPU. See this [PR](https://github.com/ray-project/ray/pull/63682) for more details.\n\n## Other Reliability Improvements\n\nBeyond the two sets of issues above (OOMs and spilling), we’ve introduced a variety of other reliability and stability improvements:\n\n**Multiple dataset support:** Ray Data 2.56 adds multiple dataset support with subcluster label scheduling. This unlocks common patterns like running validation (sync or async) alongside training, dataset multi-tenancy, and keeping training-dataset preprocessing workers off your training GPU nodes. We have documented these improvements.\n\n**Training shuffle improvements:** We improved local shuffle buffer performance, reducing memory usage by up to 2.5x and delivering 3x higher throughput on a variety of workloads. We also documented training shuffle best practices.\n\n**Scheduling loop scalability:** We received reports of scheduling loop latency degradation at high worker counts. We have landed a number of improvements to the scheduling loop to increase throughput, and currently see scheduling loop P90 latencies drop by up to 6x at large scale (2000+ workers). 
We are continuing this work in preparation for Ray Data 2.57.\n\n## Looking forward to 2.57\n\nWe still have more stability improvements and features slated for 2.57 – we’ll have another blog post when that is released. A quick preview of Ray Data 2.57 improvements:\n\nFully fixed accounting for prefetched training batches. 2.56 mitigated the issue and reduced budget overages significantly, but 2.57 will fix the root cause fully.\n\nMid-epoch resumption for faster checkpoint-based recovery for training workloads.\n\nDatasource V2, which will have much better schema handling and inference.\n\nFault-tolerant shuffles.\n\nUpgrade to 2.56 to pick up these stability and performance gains. As always, we'd love your feedback -- try out Ray today via `pip install -U ray`.", "url": "https://wpnews.pro/news/ray-data-2-56-improving-reliability-for-ai-data-pipelines", "canonical_source": "https://anyscale.com/blog/ray-data-256-updates", "published_at": "2026-06-30 00:00:00+00:00", "updated_at": "2026-06-30 16:24:48.739244+00:00", "lang": "en", "topics": ["ai-infrastructure", "ai-tools", "machine-learning"], "entities": ["Ray Data", "Balaji Veeramani", "Justin Yu", "Ray Huang", "Ayush Kumar", "David Dai", "Akshay Malik", "Richard Liaw"], "alternates": {"html": "https://wpnews.pro/news/ray-data-2-56-improving-reliability-for-ai-data-pipelines", "markdown": "https://wpnews.pro/news/ray-data-2-56-improving-reliability-for-ai-data-pipelines.md", "text": "https://wpnews.pro/news/ray-data-2-56-improving-reliability-for-ai-data-pipelines.txt", "jsonld": "https://wpnews.pro/news/ray-data-2-56-improving-reliability-for-ai-data-pipelines.jsonld"}}