Qumulo says its Cloud AI Accelerator offering gets data from distributed on-prem and public cloud sites to GPU accelerators without the data needing to be copied and staged to all-flash storage closely coupled to the GPU servers.
It tells us that, according to a recent analysis, average enterprise GPU utilization hovers around a staggering 5 percent. This means hundreds of billions of dollars’ worth of accelerated compute infrastructure sits idle roughly 95 percent of the time because data must be staged, replicated, and moved into position before a workload can even start. Improved tokenomics has to consider total token creation time, staging included, not just the last mile.
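To put the utilization claim in cost terms, here is a minimal Python sketch of the arithmetic. Only the 5 percent utilization figure comes from the claim above; the $3.00 hourly GPU rate and the function name are hypothetical, chosen purely for illustration.

```python
def cost_per_useful_gpu_hour(hourly_rate: float, utilization: float) -> float:
    """Return the effective cost of one hour of real GPU work.

    utilization is the fraction of paid-for time the GPU is busy (0 < u <= 1).
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in the range (0, 1]")
    return hourly_rate / utilization


# Hypothetical price, for illustration only; not a figure from Qumulo.
ASSUMED_RATE_USD = 3.00

for utilization in (0.05, 0.25, 0.50):
    cost = cost_per_useful_gpu_hour(ASSUMED_RATE_USD, utilization)
    print(f"{utilization:.0%} utilization -> ${cost:.2f} per useful GPU-hour")
```

Whatever the real hourly rate is, at 5 percent utilization each hour of useful GPU work costs 20 times that rate, which is the point behind the tokenomics remark.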
Qumulo CEO Doug Gourlay said: “Every enterprise we talk to is focused on GPU availability, but availability is only half the problem. The deeper issue is utilization, and the culprit is data gravity.”
“The industry's response has been to sell enterprises more tightly-coupled storage attached directly to GPU clusters, which optimizes a tiny window of active compute time while doing nothing about the idle time that surrounds it. This only leads to more expensive tokens and storage islands to maintain. Cloud AI Accelerator was built to solve the actual problem of getting the data to the GPUs instantly, wherever they are, without ever copying it.”
The company says that its Cloud AI Accelerator creates GPU liquidity by building an intelligent data fabric that integrates its Cloud Native Qumulo (CNQ), Cloud Data Fabric, and NeuralCache offerings across on-premises, edge, and multi-cloud environments.
This allows enterprises to run workloads wherever GPU capacity is available rather than, it says, wherever the data happens to be trapped.
Cloud Data Platform (CDP) is Qumulo’s scale-out, clustered filesystem software that runs on-premises. Cloud Native Qumulo (CNQ) is CDP running natively in AWS, Azure, the Google Cloud Platform, and Oracle Cloud Infrastructure. The company announced its Cloud Data Fabric (CDF) in February last year. It has a central file and object data core repository with coherent caches at the edge. The core repository is a distributed file and object data storage cluster that runs on most system vendors’ server hardware or on public cloud infrastructure.
NeuralCache predictive caching was added to the Cloud Data Fabric in April 2025, and uses AI and machine learning models to dynamically optimize read/write caching.
The company actually introduced its Cloud AI Accelerator last November. It is a way of moving data from Qumulo Cloud Data Fabric stores to a GPU server, using NeuralCache technology to predictively cache and reduce GPU data load times by up to 64 percent.
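As a rough illustration of what a load-time cut of that size could mean for a whole job, the Python sketch below uses a deliberately simple model: the GPU sits idle for the entire load phase and then computes. That serial model, the function name, and the 10-hour load and 5-hour compute figures are all assumptions for illustration. Only the "up to 64 percent" reduction comes from Qumulo, and it is a best case.

```python
def job_times(load_hours: float, compute_hours: float, load_reduction: float = 0.0):
    """Return (total_hours, gpu_utilization) for a load-then-compute job.

    Simplifying assumption: the GPU is idle for the whole load phase.
    load_reduction is the fraction by which the load phase shrinks (0 to 1).
    """
    if load_hours < 0 or compute_hours <= 0:
        raise ValueError("load_hours must be >= 0 and compute_hours > 0")
    if not 0.0 <= load_reduction <= 1.0:
        raise ValueError("load_reduction must be in the range [0, 1]")
    reduced_load = load_hours * (1.0 - load_reduction)
    total = reduced_load + compute_hours
    return total, compute_hours / total


LOAD_HOURS = 10.0          # hypothetical data load time
COMPUTE_HOURS = 5.0        # hypothetical GPU compute time
MAX_LOAD_REDUCTION = 0.64  # "up to 64 percent", per Qumulo

for label, reduction in (("baseline", 0.0), ("best case", MAX_LOAD_REDUCTION)):
    total, utilization = job_times(LOAD_HOURS, COMPUTE_HOURS, reduction)
    print(f"{label}: {total:.1f} h total, GPU busy {utilization:.0%} of the time")
```

Under these assumed numbers, the gain depends heavily on how large the load phase is relative to compute: a job dominated by loading benefits far more than one dominated by compute.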
Now it says that the Cloud AI Accelerator’s AI-focused data fabric makes providing data to GPUs “a flexible scheduling operation, delivering any enterprise dataset in real time to any GPU farm in any cloud.” Enterprise customers can:
- Connect Without Copying: Seamlessly and securely connect on-premises or cloud-native Qumulo systems to Microsoft AI Foundry, AWS Bedrock, and Google Vertex AI without copying data.
- Capture Global GPU Capacity: Run AI workloads wherever and whenever GPU capacity becomes available, across any region, cloud, or availability zone.
- Eliminate Staging Delays: Wipe out the weeks-long data-staging delays that keep GPU infrastructure idle before training or inference workloads begin.
- Eradicate Storage Islands: Avoid maintaining multiple, isolated, and replicated storage silos across every environment where GPUs might be sourced.
- Slash Idle Compute Costs: Drastically reduce idle GPU costs by eliminating the heavy load phase into GPU-attached flash storage.
Qumulo emphasizes that its Cloud AI Accelerator drastically reduces idle GPU costs by eliminating the heavy load phase into GPU-attached flash storage. We understand that, with Qumulo, data streams at block level from source sites (edge or data center on-prem, cloud, cross-region, or CNQ S3-backed) to the Accelerator's cache in CPU DRAM, and then directly to the GPUs. Cloud AI Accelerator shrinks overall AI training and inference time, Qumulo says, not just the last few inches of data movement from a previously loaded all-flash box tightly linked to a GPU server. In effect, it lets GPU resources become available to data wherever that data is, which is what Qumulo means by GPU liquidity.
The company does not directly support Nvidia’s STX reference architecture and its KV caching scheme, which Gourlay might call “a tiny window of active compute time.” Such support would entail Qumulo’s CDP running on Nvidia’s BlueField 4 DPUs and supporting the relevant Nvidia software services, such as Dynamo.
Qumulo’s Cloud AI Accelerator has Cisco networking and security linking and safeguarding its CDP sites. Together, Cisco and Qumulo “enable enterprises to build agile AI infrastructure that adapts in minutes to changing GPU availability, providing the operational flexibility that makes GPU liquidity achievable at enterprise scale.”
The Cloud AI Accelerator is available now across AWS, Azure, Google Cloud, and Oracle Cloud Infrastructure (OCI), with hybrid deployment support for Cisco UCS on-premises environments.
Qumulo will be present at the Cisco Live 2026 event in Las Vegas, booth #4018, May 31 - June 4.
Comment
Our understanding is that, through its partnership with Nvidia, Cisco offers its AI PODs as part of the Secure AI Factory with Nvidia. These are pre-validated, modular designs using Cisco UCS servers, Nexus networking, and third-party storage.
These AI PODs are different from Nvidia’s BasePODs and SuperPODs, as they are based on Cisco's own validated designs (CVDs) rather than the Nvidia specifications that Dell and Supermicro use.