IBM’s Storage Scale parallel filesystem can be used for distributed KV cache management with Nvidia’s Dynamo in large-scale AI inference deployments.
An IBM Redbook, Context Without Limits: A High-Performance KV Cache Platform for Large-Scale AI Inference, provides a reference architecture for this, using Supermicro Petascale Storage Servers, Nvidia’s Spectrum-X Ethernet, and Storage Scale Erasure Coding Edition (ECE) as a high-performance shared storage tier. Redbooks are IBM technical publications, generally produced by its International Technical Support Organization (ITSO), that provide in-depth, practical “how-to” information about deploying IBM products.
This Redbook, written by IBM, Supermicro and Nvidia engineers, explains that long-context workloads, including multi-turn assistants, retrieval-augmented generation (RAG) applications, and autonomous agent pipelines, generate large volumes of key-value (KV) cache data in the GPU server’s High Bandwidth Memory (HBM). This data needs to be retained across requests so that it does not have to be recomputed after being evicted from HBM.
The KV cache scheme employs a multi-layer cache comprising:
GPU Node HBM (G1 layer)
CPU Node DRAM (G2 layer)
Local SSD (G3 layer)
Pod-level shared flash tier, with SSD storage front-ended by BlueField DPUs that link directly to BlueField DPUs in the GPU servers (G3.5 layer)
External shared storage linked to the GPU servers across Ethernet (G4 layer)
Together, these tiers provide a continuum of capacity and latency targets, enabling Nvidia Dynamo to intelligently place, evict, and reload context across the full storage stack depending on workload access patterns and cost constraints.
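The tiering idea above can be illustrated with a minimal sketch. It assumes a simple least-recently-used policy in which a block evicted from one tier is demoted to the next tier down, and a block found in a lower tier is promoted back to G1. The class name, method names, and block-count capacities are hypothetical, chosen for illustration only; this is not Nvidia Dynamo's API or its actual placement logic, which the article says also weighs access patterns and cost.

```python
from collections import OrderedDict

# Tier names follow the article, fastest first.
TIERS = ["G1", "G2", "G3", "G3.5", "G4"]


class TieredKVCache:
    """Toy LRU hierarchy (hypothetical; not Nvidia Dynamo's API)."""

    def __init__(self, capacities):
        # capacities maps tier name -> maximum number of KV blocks held there
        self.capacities = capacities
        self.tiers = {name: OrderedDict() for name in TIERS}

    def put(self, key, block, level=0):
        """Insert a block at a tier, demoting least-recently-used blocks."""
        name = TIERS[level]
        tier = self.tiers[name]
        tier[key] = block
        tier.move_to_end(key)  # mark as most recently used
        while len(tier) > self.capacities[name]:
            old_key, old_block = tier.popitem(last=False)  # oldest entry
            if level + 1 < len(TIERS):
                self.put(old_key, old_block, level + 1)
            # Blocks falling out of the last tier are dropped and would
            # have to be recomputed on the GPU.

    def get(self, key):
        """Return (block, tier_found_in) and promote to G1; None on a miss."""
        for name in TIERS:
            tier = self.tiers[name]
            if key in tier:
                block = tier.pop(key)
                self.put(key, block, 0)
                return block, name
        return None
```

With one block of capacity per upper tier, inserting three blocks in turn pushes the first one down to G3; reading it back returns it from G3 and promotes it to G1, which in turn demotes the blocks above it by one tier.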
The G4 tier, Storage Scale ECE running on the Supermicro servers, can be used for KV cache content that is not latency critical, such as inactive multi-turn session state, shared agent context, and historical query artifacts.
The Redbook authors say that, with this G4 layer, customers can accelerate production GenAI and agentic AI inference using a validated reference architecture. They say that, in single-request testing that measured TTFT (Time To First Token) against a GPU server system without a Storage Scale external KV cache store, TTFT remained nearly flat across all prompt sizes, delivering a 56x speedup at an input sequence length of 130k tokens and eliminating the sensitivity of inference latency to prompt length.
Under concurrent load, throughput rose from 0.19 requests per second (RPS) to 4.26 RPS, a 22x improvement. Total processing time for 200 requests dropped by 95 percent, indicating significantly improved GPU utilization and scalability for high-volume inference workloads.
In a noisy-neighbor stress test, with four concurrent clients generating 200 GB/s of competing network I/O, Storage Scale ECE sustained inference at 3.6 RPS and completed all 200 requests in 55.56 seconds. That is an 18x throughput improvement over the GPU recompute baseline.
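The quoted throughput figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers reported above; the function names are illustrative, and it assumes that the 0.19 RPS baseline applies to both the concurrent-load and the noisy-neighbor comparisons.

```python
def speedup(baseline_rps, improved_rps):
    """Throughput ratio of the improved system over the baseline."""
    return improved_rps / baseline_rps


def requests_per_second(requests, seconds):
    """Average throughput for a batch of requests."""
    return requests / seconds


def time_reduction(baseline_rps, improved_rps):
    """Fractional drop in time to finish a fixed number of requests."""
    return 1 - baseline_rps / improved_rps


BASELINE_RPS = 0.19  # GPU recompute baseline, as reported

concurrent_speedup = speedup(BASELINE_RPS, 4.26)        # about 22.4
concurrent_saving = time_reduction(BASELINE_RPS, 4.26)  # about 0.955
noisy_rps = requests_per_second(200, 55.56)             # about 3.6
noisy_speedup = speedup(BASELINE_RPS, noisy_rps)        # just under 19
```

The first three values line up with the reported 22x, 95 percent, and 3.6 RPS. The last ratio comes out slightly under 19 from these rounded inputs, so the 18x figure presumably reflects unrounded measurements or truncation.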
The authors write: “For enterprises seeking to maximize the return on their GPU infrastructure investment, this architecture delivers a clear and immediately deployable path to higher throughput, lower latency, greater concurrency, and a fundamentally more cost-efficient inference platform.”