NVIDIA DSX OS Delivers Open, Modular Software for Operating AI Factories at Scale

NVIDIA released DSX OS, an open and modular software platform designed to operate and scale AI factories, aiming to improve efficiency and lower token production costs. The software provides open-source components that coordinate compute, infrastructure, and power systems, enabling operators to run up to 40% more GPUs within a fixed power budget. This release addresses the growing need for standardized, scalable infrastructure to manage the complex networks required for large-scale AI workload deployment.

AI is now essential infrastructure, powered by AI factories that generate intelligence in the form of tokens. As demand grows, these factories must scale faster, operate more efficiently, and lower the cost of intelligence across the five-layer stack (https://blogs.nvidia.com/blog/ai-5-layer-cake/): energy, chips, infrastructure, models, and applications.

The NVIDIA DSX platform (https://www.nvidia.com/en-us/data-center/products/dsx/) provides the complete playbook for designing, simulating, building, and operating AI factories, aligning every layer of the stack across compute, software, facilities, and partner technologies through a common co-designed architecture. The DSX platform now includes DSX OS software (https://docs.nvidia.com/dsx/home) to accelerate AI factory deployments and improve operational efficiency.

DSX OS includes open source, modular software components and related NVIDIA technologies purpose-built for operating and scaling multi-tenant AI factories. Together, DSX OS components enable NVIDIA DSX’s AI factory ecosystem to adopt the latest in agentic AI infrastructure software across the full stack, improving tokens per watt (https://blogs.nvidia.com/blog/revenue-potential-ai-factories/), lowering token cost, accelerating deployment, and strengthening operational reliability and resiliency.

Why DSX OS matters to the AI factory ecosystem

AI factories must perform optimally to maximize the number of tokens they produce relative to the watts they consume and to bring real value to their operators.
To achieve this, the complex network of components (https://developer.nvidia.com/blog/scaling-token-factory-revenue-and-ai-efficiency-by-maximizing-performance-per-watt/) that goes into operating AI workloads at scale across data centers must function in close harmony. That requires coordination across chips; systems; facilities infrastructure such as building management controls, cooling, and power distribution units; the power grid; the software and partner technologies running all of these; and the AI platforms and services running on top.

DSX OS software is designed for this entire ecosystem of components and provides a comprehensive set of open and extensible technologies and capabilities that can be integrated into existing platforms and software. These capabilities have been designed and optimized around a common architecture, enabling all of the components involved to work together to deliver three main outcomes that drive AI factory economics:

1. Faster time to revenue: NVIDIA builds and operates infrastructure and platform software on NVIDIA DGX Cloud (https://www.nvidia.com/en-us/data-center/dgx-cloud), and now this software is being released as open source. NVIDIA ecosystem partners can leverage these components to deliver AI services rather than rebuild from scratch, eliminating months of custom development.

2. Better efficiency: Power is the limiting factor in an AI factory, and DSX treats power and grid behavior as part of the platform rather than as a facilities concern separated from the rest of the AI infrastructure. With DSX software, AI factories can run up to 40% more GPUs at peak energy efficiency within a fixed power budget, with minimal impact on inference workload performance.

3. Higher reliability and resiliency: AI factories run continuous large-scale workloads through hardware faults, grid events, and operational changes.
DSX OS shifts cluster operations from reactive alerting to automated remediation, keeps runtime versions consistent across regions, and gives operators fleet-wide visibility.

How DSX OS enables gigawatt-scale AI factories

The open source, modular components in DSX OS provide the foundational technologies for building and operating AI factories, and are designed to solve challenges unique to operating AI workloads efficiently and reliably at gigawatt scale. They do so by providing a co-designed set of core capabilities, including standardized communication, power and efficiency optimization, provisioning and lifecycle operations, health monitoring and remediation, and intelligent platform services. More details about how DSX OS provides these capabilities follow.

Standardized communication across the data center, enabled for agentic interfaces

An AI factory spans compute, networking, power, and cooling systems that all need to interoperate seamlessly. DSX Exchange (http://github.com/NVIDIA/dsx-exchange) bridges these components with an MQTT-based IT/OT communication hub that makes facility-level signals, such as grid events, thermal data, and power anomalies, visible to the software managing the rest of the AI factory. This enables components such as DSX Flex, MaxLPS, and partner software to react to each other’s state in real time, improving coordination and efficiency.

DSX OS software components across the full DSX stack will also provide MCP servers for provisioning, networking, observability, and more. Using these MCP servers, AI agents can discover the entire operational surface of the factory as a unified tool catalog, enabling them to interface across every system and perform cross-domain correlation. With an agentic AI factory, operators can easily connect a GPU health event with a thermal anomaly, or a network issue with a performance issue, among other scenarios.
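
As a rough illustration of the cross-domain correlation described above, the following Python sketch pairs GPU health events with thermal anomalies reported for the same rack within a short time window. The Event type, the topic strings, the field names, and the 60-second window are all hypothetical and chosen only for illustration; they are not taken from DSX Exchange or from any DSX OS MCP server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One signal from the factory (all field names are hypothetical)."""
    topic: str        # MQTT-style topic, e.g. "facility/rack-07/thermal"
    rack: str         # rack the signal refers to
    kind: str         # event category, e.g. "gpu_health" or "thermal_anomaly"
    timestamp: float  # seconds since some shared reference time


def correlate(events, kind_a, kind_b, window_s=60.0):
    """Pair each kind_a event with kind_b events on the same rack
    that occurred within window_s seconds of it."""
    pairs = []
    for a in events:
        if a.kind != kind_a:
            continue
        for b in events:
            same_rack = b.rack == a.rack
            close_in_time = abs(a.timestamp - b.timestamp) <= window_s
            if b.kind == kind_b and same_rack and close_in_time:
                pairs.append((a, b))
    return pairs


# Illustrative data: one thermal anomaly and two GPU health events.
events = [
    Event("facility/rack-07/thermal", "rack-07", "thermal_anomaly", 100.0),
    Event("compute/rack-07/gpu-health", "rack-07", "gpu_health", 130.0),
    Event("compute/rack-12/gpu-health", "rack-12", "gpu_health", 135.0),
]

matches = correlate(events, "gpu_health", "thermal_anomaly")
for gpu_event, thermal_event in matches:
    print(f"{gpu_event.topic} may be related to {thermal_event.topic}")
```

Only the rack-07 GPU event is paired here, because rack-12 has no thermal anomaly in the window. A real deployment would presumably receive these events from the communication hub rather than from an in-memory list, and would use a more efficient join than this nested loop.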

Power and efficiency optimization

Static power allocation strands capacity, reactive cooling creates thermal oscillations, and disconnected IT/OT systems make grid events a manual fire drill. DSX MaxLPS includes software that treats power as a programmable resource by dynamically enforcing policies at the GPU, rack, cooling, and workload level, enabling AI factories to recover stranded power and run additional compute at optimal utilization. DSX Flex extends this beyond the factory walls, with libraries for connecting workloads to grid services so AI factories can automatically adapt to demand response, load shedding, and renewable energy availability. Partners including CoreWeave, Firmus, Lambda, Nscale, and Phaidra are deploying MaxLPS, while Emerald AI (https://www.emeraldai.co), ENGIE, Silicon Valley Power, and UK National Grid (https://www.ngpartners.com/stories/emerald-ai-whitepaper) are leveraging DSX Flex.

Provisioning and multi-tenant lifecycle operations

At scale, provisioning is a continuous workflow: nodes cycle through tenant assignments, hardware is replaced, and every transition must be auditable and secure. NVIDIA Infra Controller (NICo, https://docs.nvidia.com/infra-controller/documentation/home) makes this programmable with API-driven bare-metal lifecycle management and hardware-enforced tenant isolation through NVIDIA BlueField DPUs (https://www.nvidia.com/en-us/networking/products/data-processing-unit/) and the NVIDIA DOCA Platform Framework (https://www.nvidia.com/en-us/networking/products/software/doca/). NVIDIA AI Cluster Runtime (AICR, https://developer.nvidia.com/blog/validate-kubernetes-for-gpu-infrastructure-with-layered-reproducible-recipes/) complements this by capturing validated runtime configurations as version-locked recipes, eliminating the configuration drift that causes silent failures across large fleets. IREN, OpenNebula Systems, Mirantis, Rafay, Red Hat, and Supermicro are among the partners integrating these components.
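
To make the idea of a version-locked recipe concrete, here is a minimal Python sketch that compares the component versions found on a node against a pinned recipe and reports any drift. The recipe layout, the component names, and the version numbers are invented for illustration and do not reflect the actual AICR recipe schema or tooling.

```python
def find_drift(recipe, installed):
    """Return {component: (expected, actual)} for every component whose
    installed version differs from the pinned recipe version.
    A component missing from the node is reported with actual=None."""
    drift = {}
    for component, expected in recipe.items():
        actual = installed.get(component)
        if actual != expected:
            drift[component] = (expected, actual)
    return drift


# Hypothetical pinned recipe: every component locked to one exact version.
recipe = {
    "gpu-driver": "1.2.3",
    "container-runtime": "2.0.1",
    "device-plugin": "0.9.0",
}

# Hypothetical inventory reported by one node in the fleet.
node = {
    "gpu-driver": "1.2.3",
    "container-runtime": "2.1.0",  # upgraded outside the recipe
    # "device-plugin" is missing entirely
}

drift = find_drift(recipe, node)
for component, (expected, actual) in drift.items():
    print(f"{component}: expected {expected}, found {actual}")
```

Exact-match comparison is a deliberate choice in this sketch: with a version-locked recipe, even a newer version counts as drift, since the node no longer matches the validated configuration.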

Health monitoring and automation tooling

In a large GPU fleet, hardware degradation is a daily occurrence, and the traditional alert-page-investigate cycle is too manual to minimize the impact on workloads. NVIDIA NVSentinel (https://developer.nvidia.com/blog/automate-kubernetes-ai-cluster-health-with-nvsentinel/) provides Kubernetes-native GPU fault detection and automated remediation, cordoning unhealthy compute nodes and draining workloads in seconds rather than minutes or hours. NVIDIA Fleet Intelligence (https://developer.nvidia.com/blog/introducing-nvidia-fleet-intelligence-for-real-time-gpu-fleet-visibility-and-optimization/) provides fleet-wide visibility, integrity verification, and health monitoring across global deployments. Lambda is an early adopter of Fleet Intelligence.

Intelligent AI workload scheduling and platform services

AI workloads need more than GPU access; they need topology-aware intelligent scheduling, distributed inference, and production APIs. KAI Scheduler (https://developer.nvidia.com/blog/nvidia-open-sources-runai-scheduler-to-foster-community-collaboration/) and NVIDIA Run:ai (https://www.nvidia.com/en-us/software/run-ai) provide GPU-aware workload placement with fractional allocation and hierarchical quotas. NVIDIA Dynamo (https://developer.nvidia.com/dynamo) and NVIDIA Grove (https://developer.nvidia.com/grove) deliver distributed inference serving with disaggregated prefill/decode and per-stage autoscaling. NVIDIA Cloud Functions (NVCF, https://developer.nvidia.com/dgx-cloud/nvcf) ties it together with unified APIs across inference, fine-tuning, and batch workloads with built-in multi-tenancy. Partners including Aible, Beyond AI, Bhashini, Crusoe, DCAI, Mirantis, Nebius, Rafay, Sarvam, Simplismart, Spectro Cloud, vCluster, Vultr, and Yotta are using many of these components in production.

Getting started

DSX OS components are available on GitHub and designed for incremental adoption and integration with existing software stacks.
Start with the component that addresses your most immediate requirements and build from there, using the capabilities and technologies provided to accelerate your AI factory deployment and improve operational efficiency. Some examples:

- IT/OT communications: DSX Exchange (https://github.com/NVIDIA/dsx-exchange)
- Bare-metal lifecycle management and tenant isolation: NVIDIA Infra Controller (https://github.com/NVIDIA/infra-controller-core) and DOCA Platform Framework (https://github.com/NVIDIA/doca-platform/)
- Fleet visibility, health, and integrity: NVIDIA Fleet Intelligence (https://github.com/NVIDIA/fleet-intelligence-agent/)
- Unified AI inference APIs: NVIDIA Cloud Functions (https://github.com/NVIDIA/nvcf)

Review the NVIDIA DSX documentation (https://docs.nvidia.com/dsx) for more details about all of the components of DSX OS, implementation and reference design guides, quickstarts, and integration guidance.