WEKA’s NeuralMesh and Augmented Memory Grid software delivers 10x higher token throughput, serves 10x more concurrent users, and produces 7x more tokens per GPU on Oracle Cloud Infrastructure (OCI) than a DRAM-only OCI configuration, according to joint test results.
WEKA’s Augmented Memory Grid lets AI models extend GPU server memory for inferencing into NeuralMesh’s external storage, using it as a KV cache with microsecond latencies and multi-TB/s bandwidth, and adding up to petabytes of memory address space. It supports Nvidia’s SX KV caching architecture. NeuralMesh is WEKA’s high-performance AI file system software. The joint test results were validated on a nine-node OCI bare-metal H100 cluster with 100,000-token context windows.
Pablo Selem, senior director, software development for OCI, said: “Enterprise AI workloads are pushing context windows and GPU utilization to new limits. These benchmarks show how WEKA’s NeuralMesh platform with Augmented Memory Grid on OCI helps remove memory bottlenecks so customers can support larger, more demanding inference workloads without simply adding more GPUs.”
WEKA says that, as inference demand grows, inefficiencies in AI infrastructure have an ever greater effect. Every key-value (KV) cache eviction is effectively a tax “on GPU cycles, latency, user experience, and the cost of every token served. For long-context and agentic workloads, where inputs routinely run to 100,000 tokens or more, that tax is not a rounding error. It is a direct hit on the unit economics of every organization running production AI.”
The OCI H100 cluster test set-up featured nine nodes, 72 GPUs, 100,000-token context windows, and thousands of concurrent users.
NeuralMesh with Augmented Memory Grid scaled past 5,000 concurrent users vs. about 600 for DRAM-only configurations. By expanding the active cache working set from 8.64 TiB of DRAM to 287 TiB of usable NVMe flash storage, it eliminates the failure cliff that hits when the cache saturates. In addition, more users per GPU means the same GPU investment stretches further. Overall, 10x more concurrent users were served without adding GPU compute or memory capacity.
NeuralMesh with Augmented Memory Grid reached approximately 2 million tokens/sec, compared to under 200,000 for the DRAM-only baseline, a 10x increase in token throughput.
NeuralMesh with Augmented Memory Grid served five billion tokens, compared to 700 million for the DRAM-only baseline, in a single one-hour, 2,400-user test.
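The headline multiples can be checked against the figures quoted above. The following is a minimal Python sketch, not part of the published benchmark; the dictionary keys and the helper name `improvement` are illustrative, and the inputs are the rounded numbers from this article ("past 5,000", "about 600", "under 200,000"), so the results are approximate.

```python
# Figures as quoted in the article: (with Augmented Memory Grid, DRAM-only).
# Several are rounded or stated as bounds, so the ratios are approximate.
REPORTED = {
    "concurrent_users": (5_000, 600),
    "tokens_per_second": (2_000_000, 200_000),
    "tokens_served_in_one_hour": (5_000_000_000, 700_000_000),
    "cache_working_set_tib": (287.0, 8.64),
}
GPUS = 72  # nine-node H100 cluster, as stated in the test set-up


def improvement(with_amg, dram_only):
    """Return how many times larger the first figure is than the second."""
    return with_amg / dram_only


ratios = {name: improvement(*pair) for name, pair in REPORTED.items()}

# Tokens served per GPU over the one-hour, 2,400-user test.
tokens_amg, tokens_dram = REPORTED["tokens_served_in_one_hour"]
tokens_per_gpu_amg = tokens_amg / GPUS
tokens_per_gpu_dram = tokens_dram / GPUS

for name, value in ratios.items():
    print(f"{name}: {value:.1f}x")
print(f"tokens per GPU in the hour: {tokens_per_gpu_amg:,.0f} vs {tokens_per_gpu_dram:,.0f}")
```

On these rounded inputs the throughput ratio is 10x, the one-hour token count about 7.1x, and the cache working set about 33.2x. The user-count ratio comes out near 8.3x, so the 10x headline for concurrent users presumably rests on the exact measurements in OCI's published results rather than on the rounded figures quoted here.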
For organizations running agentic workflows, DRAM saturation drains GPU capacity through constant recomputation, creating a direct hit on cost per token and ROI. With 7x more tokens served from the same GPUs, the cost per token is much lower. WEKA points out that, for product teams running real-time AI features, including search, summarization, code assist, and multi-turn agents, the throughput number determines the ceiling for how many users can be served, how fast features respond, and how much revenue the infrastructure can support. The 10x higher token throughput meant more output from every GPU in the cluster.
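The cost-per-token argument can be made concrete with a short Python sketch. It assumes the GPU cluster costs the same per hour in both configurations and ignores the cost of the NVMe storage and WEKA software, which the article does not quote. The hourly price is a made-up placeholder, not an OCI price, and the function name is illustrative; the token counts are the five billion and 700 million from the one-hour test.

```python
# Hypothetical placeholder: 72 GPUs at an invented $10 per GPU-hour.
# This is NOT a real OCI price; only the ratio between the two results matters.
HYPOTHETICAL_CLUSTER_COST_PER_HOUR = 720.0

# Tokens served in the one-hour, 2,400-user test, as quoted in the article.
TOKENS_WITH_AMG = 5_000_000_000
TOKENS_DRAM_ONLY = 700_000_000


def cost_per_million_tokens(cluster_cost_per_hour, tokens_per_hour):
    """GPU cost attributed to each million tokens served in one hour."""
    return cluster_cost_per_hour / tokens_per_hour * 1_000_000


dram_only = cost_per_million_tokens(HYPOTHETICAL_CLUSTER_COST_PER_HOUR, TOKENS_DRAM_ONLY)
with_amg = cost_per_million_tokens(HYPOTHETICAL_CLUSTER_COST_PER_HOUR, TOKENS_WITH_AMG)

# The placeholder price cancels out of this ratio.
relative_cost = with_amg / dram_only

print(f"DRAM-only: ${dram_only:.3f} per million tokens")
print(f"With Augmented Memory Grid: ${with_amg:.3f} per million tokens")
print(f"Relative GPU cost per token: {relative_cost:.0%}")
```

Because the hourly price cancels, the GPU cost per token falls to about 14 percent of the DRAM-only figure under these assumptions, whatever the real price is. A full comparison would also need the storage and licensing costs.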
Using the WEKA software meant many more users could be supported and many more tokens processed, at a significantly lower cost. WEKA CEO Liran Zvibel said: “Inference is bottlenecked by how much effective memory is available to GPUs. These results prove that AI token economics aren’t solved by hardware alone; they’re solved by eliminating the memory wall that has been the real ceiling on what existing hardware can do. NeuralMesh with Augmented Memory Grid running on OCI brings orders of magnitude more tokens to customers in an extremely cost-efficient way.”
OCI published the background, full benchmark methodology, system configuration, and results on its AI & Data Science blog.
NeuralMesh with Augmented Memory Grid is generally available to WEKA customers and on the Oracle Marketplace, with OCI as WEKA’s exclusive cloud launch partner. Organizations running long-context inference on OCI can deploy a validated, production-ready architecture today.