# Massive AI Storage Demand Creates a New Memory Wall

> Source: <https://www.eetimes.com/massive-ai-storage-demand-creates-a-new-memory-wall/>
> Published: 2026-06-10 14:22:01+00:00

The term “Memory Wall” was [coined](https://link.springer.com/rwe/10.1007/978-0-387-09766-4_234#citeas) in the early 1990s to describe a bottleneck in computer performance: the speed gap between processors and memory, specifically DRAM. The premise quickly insinuated itself into the engineering vernacular with DRAM cast as a laggard technology dragging computing efficiency. That wall remains, but in the AI era, the metaphor takes on a new meaning as DRAM and DRAM-based high-bandwidth memory (HBM) strive to meet the skyrocketing memory needs of large language models (LLMs).

Over the past 30 years, designers have kept pace with performance scaling requirements by working around DRAM's limits with techniques such as cache hierarchies, prefetching, and memory interleaving. They developed larger and faster on-chip caches and introduced ways to predict and pre-load data before it was needed. However, these techniques did not solve the fundamental issue of capacity scaling.

Today’s rapidly expanding AI models are placing an unprecedented strain on the ability of conventional memory architectures to expand capacity ahead of data storage demand. The signs are everywhere—rising DRAM and HBM design and manufacturing costs, higher energy use and heat dissipation, and diminishing scalability options.

**AI inference modeling redefines data retrieval patterns and priorities**

The constraints placed on DRAM-based memories, such as HBM and graphics double data rate (GDDR), come as LLMs expand from billions to trillions of parameters. At the same time, AI inference context sizes, driven by complex prompts, retrieval-augmented generation (RAG), chain-of-thought reasoning, and user-specific data, often require key-value (KV) caches larger than the models themselves.
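To make the KV cache claim concrete, here is a rough sizing sketch. It uses the standard accounting for transformer attention (one key and one value vector per layer, per KV head, per token). The model shape, parameter count, and context length are hypothetical values chosen for illustration, not figures from this article or from any specific product.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   context_tokens, batch_size=1, bytes_per_value=2):
    """Bytes needed to hold keys and values for every layer and token."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * context_tokens * batch_size * bytes_per_value)


def weight_bytes(num_parameters, bytes_per_value=2):
    """Bytes needed to hold the model weights at a given precision."""
    return num_parameters * bytes_per_value


# Hypothetical model shape and workload (assumptions for illustration only).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
PARAMETERS = 70_000_000_000
CONTEXT_TOKENS = 128_000

per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, context_tokens=1)
one_request = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT_TOKENS)
eight_requests = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT_TOKENS,
                                batch_size=8)
weights = weight_bytes(PARAMETERS)

print(f"KV cache per token:      {per_token / 1e3:.1f} kB")
print(f"KV cache, 1 request:     {one_request / 1e9:.1f} GB")
print(f"KV cache, 8 requests:    {eight_requests / 1e9:.1f} GB")
print(f"Model weights (16-bit):  {weights / 1e9:.1f} GB")
```

Under these assumed numbers, a single 128,000-token request needs about 42 GB of KV cache against 140 GB of 16-bit weights, while eight concurrent requests need about 336 GB, more than twice the size of the model.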


The DRAM architecture used today as system memory is less relevant in AI inference workloads, which are predominantly read-heavy and latency-tolerant due to predictable memory access patterns that enable prefetching and buffering. This renders HBM’s narrow focus on raw bandwidth insufficient for workloads requiring both capacity and bandwidth.

These challenges underscore the need for new memory architectures that optimize capacity and bandwidth specifically for AI inference. Rather than relying on the consistent, cache-friendly access patterns that conventional hierarchies were built for, AI inference models process highly variable and multidimensional data types. Their memory accesses are nonetheless deterministic, prefetch-friendly, and large in granularity. All of this makes caching hierarchies less relevant than raw sequential bandwidth.
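The value of deterministic, prefetch-friendly accesses can be illustrated with a simplified timing model. The sketch below compares fetching and computing one block at a time against double buffering, where the next block is fetched while the current one is processed. The function names and timings are hypothetical, and the model assumes exactly one block is fetched ahead.

```python
def serial_time(fetch, compute):
    """Total time when each block is fetched, then computed, with no overlap."""
    return sum(f + c for f, c in zip(fetch, compute))


def pipelined_time(fetch, compute):
    """Total time with double buffering: fetch block i+1 while computing block i."""
    total = fetch[0]  # the first fetch cannot be hidden behind compute
    for i in range(len(compute)):
        next_fetch = fetch[i + 1] if i + 1 < len(fetch) else 0.0
        # Each step lasts as long as the slower of the two overlapped activities.
        total += max(compute[i], next_fetch)
    return total


# Hypothetical per-block timings in arbitrary units (assumptions only).
compute = [3.0] * 4
fast_fetch = [2.0] * 4   # fetch is faster than compute: latency is hidden
slow_fetch = [4.0] * 4   # fetch is slower than compute: bandwidth sets the pace

for label, fetch in (("fast fetch", fast_fetch), ("slow fetch", slow_fetch)):
    print(f"{label}: serial={serial_time(fetch, compute)}, "
          f"pipelined={pipelined_time(fetch, compute)}")
```

With these assumed timings, prefetching cuts the total from 20 to 14 units when fetches are faster than compute. When fetches are slower, the total drops from 28 to 19 units and is governed almost entirely by fetch time, which is why sequential bandwidth rather than access latency becomes the limiting resource.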

The resulting AI computing paradigm succeeds not by brute-forcing memory bandwidth but by optimizing when and how data is retrieved. That shift opens the door to alternative memory technologies tailored for high capacity and sequential bandwidth as a smarter way to address the problem. The new memory challenge becomes optimizing a mosaic of flowing data as opposed to a linear increase in speed.

Historically, data center designers addressed the lack of balance between compute and memory capacity needs by partitioning and distributing AI inference workloads across multiple, expensive accelerators. This technique often wasted computing capacity, but the added cost and power consumption were justified when amortized across large data centers with large user bases whose requests could be processed in large batches. It was only when distributed processing was applied to smaller enterprises with limited user bases, or to large data centers serving disparate customers, that batching became inefficient.

**A new approach to AI memory**

As AI inference workloads grow in scale and complexity, high-bandwidth flash emerges as an alternative. Unlike DRAM and HBM, which are costly, power-hungry, and capacity-limited, high-bandwidth flash leverages the high-density advantages of NAND flash. By using stacking techniques and wafer bonding, such as CMOS directly bonded to array (CBA) technology, these emerging architectures demonstrate higher memory capacity than HBM.

While high-bandwidth flash latency is higher than DRAM, AI inference workloads are increasingly bandwidth-bound rather than latency-sensitive. These new memory designs exploit high-density NAND-based technology tailored to deliver high bandwidth for large-granularity read operations through concurrent accesses across multiple arrays of memory cells. This makes them suitable for LLM storage and read-intensive inference.
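One way to see why higher latency matters less for large-granularity reads is a simple first-order model: the delivered bandwidth of a read is its size divided by the fixed access latency plus the transfer time. The latency and peak bandwidth below are hypothetical placeholders, not measurements of any real flash or DRAM device, and the peak figure stands in for the aggregate of many arrays accessed concurrently.

```python
def effective_bandwidth(read_bytes, latency_s, peak_bytes_per_s):
    """Delivered bytes per second for one read of a given size."""
    transfer_s = read_bytes / peak_bytes_per_s
    return read_bytes / (latency_s + transfer_s)


# Hypothetical device characteristics (assumptions for illustration only).
LATENCY_S = 10e-6          # fixed access latency per read
PEAK_BYTES_PER_S = 100e9   # aggregate peak bandwidth across all arrays

small_read = effective_bandwidth(4 * 1024, LATENCY_S, PEAK_BYTES_PER_S)
large_read = effective_bandwidth(64 * 1024 * 1024, LATENCY_S, PEAK_BYTES_PER_S)

for label, value in (("4 KiB", small_read), ("64 MiB", large_read)):
    print(f"{label} read: {value / 1e9:.2f} GB/s "
          f"({value / PEAK_BYTES_PER_S:.1%} of peak)")
```

In this model a small read is dominated by the fixed latency and delivers only a tiny fraction of peak bandwidth, while a large sequential read spends nearly all of its time transferring data and approaches the peak. That is the sense in which large, predictable reads are bandwidth-bound rather than latency-sensitive.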

In these high-bandwidth environments, high energy dissipation makes thermal stability a critical requirement. High-bandwidth flash, based on NAND technology, is potentially more stable at high temperatures and better suited for these environments than DRAM. Its non-volatility and enhanced endurance relative to standard NAND flash also position high-bandwidth flash to hold persistent KV cache data that can be reused to mimic long-term memory.

As AI computing demands continue to grow, relying solely on DRAM and HBM may limit architectural innovation. High-bandwidth flash offers data center and edge AI designers a scalable, efficient memory alternative tailored to the evolving needs of AI, where performance is no longer determined by latency but by the efficiency of inference-driven data orchestration.

##### Resources:

McKee, S.A., Wisniewski, R.W. (2011). Memory Wall. In: Padua, D. (eds) Encyclopedia of Parallel Computing. Springer, Boston, MA. [https://doi.org/10.1007/978-0-387-09766-4_234](https://doi.org/10.1007/978-0-387-09766-4_234)

##### Read also:

**The Memory Wall Is Real, Here Is the Door**

Relief is coming. The three companies that control virtually all of the world’s DRAM production are investing at a scale the industry hasn’t seen in decades. The harder truth is that those fabs come online in 2027 at the earliest, with meaningful relief unlikely before 2028.
