AKS Goes Bare Metal to Capture the AI Workload

wpnews.pro

Microsoft strips the hypervisor and expands fleet management to turn Kubernetes into the default runtime for distributed AI.

Emeka Okafor

Kubernetes was built to orchestrate lightweight, stateless microservices. Today, the industry is trying to force-fit massive, stateful, GPU-bound artificial intelligence workloads into that same architecture. At Microsoft Build 2026, the company made its play to resolve this tension, rolling out a suite of upgrades to Azure Kubernetes Service (AKS) designed to turn the orchestrator into a first-class AI runtime.

The strategy is clear. Instead of letting developers build bespoke, fragmented infrastructure stacks for machine learning, Microsoft wants to consolidate everything under the Kubernetes API. To do this, it is stripping away the virtualization layer with AKS on Bare Metal, expanding multi-cluster orchestration via Azure Kubernetes Fleet Manager, and introducing native AI tooling like the Kubernetes AI Toolchain Operator (KAITO). This is a calculated bet that the future of AI engineering belongs to platform teams, not just data scientists.

Stripping the Hypervisor: The Bare Metal Bet

For high-performance AI training and low-latency inference, virtualization is a tax. Traditional cloud Kubernetes nodes run inside virtual machines, introducing a hypervisor layer that sits between the container and the physical hardware. While this provides excellent isolation and flexibility, it can add overhead on the path to hardware-level interconnects such as NVLink and to Remote Direct Memory Access (RDMA) networking.

To address this, Microsoft introduced AKS on Bare Metal in public preview. By deploying Kubernetes clusters directly on physical hardware without a hypervisor, workloads gain direct, unimpeded access to the underlying silicon. This is particularly critical for large language model (LLM) training, where nodes must constantly exchange massive weight matrices over high-speed networks.

However, running bare-metal AKS is not a simple drop-in replacement. It requires Azure Arc to bridge the gap between physical hardware and the Azure control plane. Platform engineers must manage physical provisioning, and they lose the instant elasticity of virtual machine scale sets. The trade-off is raw performance and predictable latency versus the operational convenience of the public cloud. For organizations running massive inference pipelines where a small performance gain translates to significant cost savings, the operational overhead of bare metal is a price worth paying.
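The cost argument above can be made concrete with some back-of-the-envelope arithmetic. The sketch below is illustrative only: the function names and every figure in the usage example are hypothetical, and it assumes that a throughput gain of g lets the same work run on 1 / (1 + g) of the hardware, which real fleets will only approximate.

```python
def net_monthly_savings(monthly_gpu_spend, throughput_gain, monthly_ops_overhead):
    """Net monthly savings from moving to bare metal (negative means a loss).

    Assumes a throughput gain of g means the same work needs
    1 / (1 + g) of the original hardware spend.
    """
    if throughput_gain <= -1:
        raise ValueError("throughput_gain must be greater than -1")
    spend_after = monthly_gpu_spend / (1.0 + throughput_gain)
    return (monthly_gpu_spend - spend_after) - monthly_ops_overhead


def break_even_spend(throughput_gain, monthly_ops_overhead):
    """Monthly GPU spend at which the savings exactly cover the added overhead."""
    if throughput_gain <= 0:
        raise ValueError("throughput_gain must be positive")
    saved_fraction = throughput_gain / (1.0 + throughput_gain)
    return monthly_ops_overhead / saved_fraction


if __name__ == "__main__":
    # Hypothetical inputs: $1M/month GPU spend, 5% gain, $30k/month extra ops cost.
    print(f"{net_monthly_savings(1_000_000, 0.05, 30_000):,.0f}")
    print(f"{break_even_spend(0.05, 30_000):,.0f}")
```

Under these made-up inputs, a 5% gain only pays for itself once GPU spend clears roughly $630,000 a month, which is why the calculus favors large inference fleets rather than small ones.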

Managing the Fleet: Multi-Cluster Orchestration

As Kubernetes adoption grows, managing single, massive clusters is giving way to operating distributed fleets. A survey by The Futurum Group highlights this shift in enterprise adoption:

Figure (pie chart), "Kubernetes Workload Adoption": Some Workloads 41, Majority of Workloads 19, Other / None 40.

Operating dozens of clusters across different regions, on-premises data centers, and edge environments quickly becomes an operational nightmare. Microsoft addressed this by announcing the general availability of Azure Kubernetes Fleet Manager for Arc-enabled clusters.

Rather than treating clusters as isolated islands, Fleet Manager provides a centralized control plane to enforce policies, manage workload placement, and orchestrate staged rollouts. A key addition is cross-cluster networking powered by a managed Cilium cluster mesh. This allows services running in different clusters to communicate directly, backed by a global service registry for service discovery.

Fleet Manager also introduces multi-cluster auto-upgrades. Upgrading Kubernetes is historically risky; automating this across a fleet with built-in eviction controls and rollout strategies mitigates the risk of cascading failures. For developers, this means the underlying infrastructure behaves more like a single, global utility than a collection of fragile, bespoke environments.
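The general idea behind a staged rollout can be sketched in a few lines. This is a generic simulation, not the Fleet Manager API: the `staged_rollout` function, the `upgrade` callback, and the cluster names are all hypothetical, and the real service's stage and failure semantics may differ.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RolloutResult:
    upgraded: List[str]
    failed: List[str]
    skipped: List[str]


def staged_rollout(
    stages: List[List[str]],
    upgrade: Callable[[str], bool],
    max_failures_per_stage: int = 0,
) -> RolloutResult:
    """Upgrade clusters stage by stage, halting once a stage exceeds its failure budget."""
    upgraded: List[str] = []
    failed: List[str] = []
    skipped: List[str] = []
    halted = False
    for stage in stages:
        if halted:
            # A previous stage blew its budget: leave later stages untouched.
            skipped.extend(stage)
            continue
        stage_failures = 0
        for cluster in stage:
            if upgrade(cluster):
                upgraded.append(cluster)
            else:
                failed.append(cluster)
                stage_failures += 1
        if stage_failures > max_failures_per_stage:
            halted = True
    return RolloutResult(upgraded, failed, skipped)


if __name__ == "__main__":
    # Hypothetical fleet: one canary, then two regional stages; "eu-2" fails.
    stages = [["canary"], ["eu-1", "eu-2"], ["us-1", "us-2"]]
    print(staged_rollout(stages, upgrade=lambda name: name != "eu-2"))
```

The point of the structure is containment: a failure in the second stage stops the rollout before it reaches the third, so a bad upgrade never touches the whole fleet at once.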


The Developer Angle: Running AI Workloads on AKS

For application developers, the most immediate impact of these updates is how they change the workflow for deploying and scaling AI models. Historically, deploying an LLM required writing complex manifests, configuring GPU drivers, and manually setting up inference servers like vLLM.

Microsoft is attempting to abstract this complexity through KAITO and AI Runway. KAITO, which has been integrated with Retrieval-Augmented Generation (RAG) capabilities and default vLLM support, acts as an intelligent controller. When a developer specifies a model, KAITO handles the GPU provisioning, selects the optimal runtime, and configures the networking.

To scale these workloads dynamically, AKS integrates with Kubernetes Event-driven Autoscaling (KEDA) and the Gateway API. For distributed training, the public preview of Anyscale on Azure brings managed Ray orchestration directly into AKS.
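Event-driven scaling of this kind ultimately reduces to a simple calculation: KEDA feeds an external metric to the Horizontal Pod Autoscaler, which for an average-value target divides the metric total by the per-replica target and rounds up. The sketch below reproduces that arithmetic; the function name, the choice of pending inference requests as the metric, and the numbers are assumptions for illustration.

```python
import math


def desired_replicas(metric_total, target_per_replica, min_replicas, max_replicas):
    """Replica count for an average-value target, clamped to the configured bounds."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    raw = math.ceil(metric_total / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Hypothetical: 130 pending requests, each replica should handle about 20.
    print(desired_replicas(130, 20, min_replicas=1, max_replicas=10))
    # With a minimum of zero, an idle queue scales the workload away entirely.
    print(desired_replicas(0, 20, min_replicas=0, max_replicas=10))
```

For GPU-backed inference the bounds matter as much as the formula: the maximum caps spend on expensive nodes, while a minimum of zero trades cold-start latency for idle-time savings.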

Consider a typical workflow for deploying an open-source model using KAITO. Instead of writing low-level container specifications, a developer defines a custom resource:

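# Field layout follows the upstream KAITO Workspace examples, where `resource`
# and `inference` sit at the top level rather than under `spec`.
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: llama-3-inference
resource:
  # GPU VM size that KAITO provisions for this workspace
  instanceType: Standard_NC24ads_A100_v4
  labelSelector:
    matchLabels:
      apps: kaito-inference
inference:
  # vLLM is described above as the default runtime, so no explicit toggle is set
  preset:
    # Illustrative preset name; check it against KAITO's supported model list
    name: llama-3-8b-instruct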

Under the hood, the operator validates the GPU requirements, provisions the node pool (utilizing Managed System Node Pools in AKS Automatic to keep system services from competing with the GPU workload), and exposes the endpoint.
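To get a feel for what validating GPU requirements involves, consider a rough memory-fit check. This is not KAITO's actual logic: the function names, the 30% headroom reserved for KV cache and activations, and the 80 GiB figure passed in for the GPU are assumptions for illustration. The 8 billion parameter count is taken from the model name.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weights_gib(num_params, dtype):
    """Approximate size of the model weights in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


def fits_on_gpu(num_params, dtype, gpu_mem_gib, headroom=0.3):
    """True if the weights fit after reserving a fraction of memory as headroom."""
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_gib = gpu_mem_gib * (1 - headroom)
    return weights_gib(num_params, dtype) <= usable_gib


if __name__ == "__main__":
    # An 8B-parameter model in fp16 against a hypothetical 80 GiB GPU.
    print(round(weights_gib(8_000_000_000, "fp16"), 1))
    print(fits_on_gpu(8_000_000_000, "fp16", gpu_mem_gib=80))
    # A 70B-parameter model at the same precision does not fit on one such GPU.
    print(fits_on_gpu(70_000_000_000, "fp16", gpu_mem_gib=80))
```

An operator that performs this kind of check up front can reject an undersized instance type before any expensive node is provisioned.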

The trade-off here is control versus convenience. While tools like AI Runway and KAITO make deployment simple, they introduce opinionated defaults. If your workload requires highly customized CUDA kernels or non-standard model architectures, you may find yourself fighting the abstractions.

The Reality Check

Are these features ready for production? The answer depends on which part of the announcement you look at.

The fleet management capabilities, Azure Container Linux, and Managed System Node Pools are generally available and ready for enterprise adoption. They solve real, immediate operational pain points around multi-cluster governance and OS maintenance.

The AI-specific infrastructure pieces, however, require a more cautious approach. AKS on Bare Metal and Anyscale on Azure remain in public preview. Bare-metal deployments via Azure Arc introduce significant hardware-management responsibilities that many cloud-native teams are not equipped to handle.

Microsoft's updates show that Kubernetes is no longer just for web applications. By addressing the physical layer with bare metal and the application layer with KAITO, AKS is positioning itself as the pragmatic choice for enterprise AI. Platform teams must resist the urge to adopt every new abstraction immediately; the key is to adopt the mature fleet management tools today while carefully prototyping the bare-metal and Ray integrations for tomorrow's heavy-duty workloads.

Sources & further reading

- Microsoft Expands Azure Kubernetes Service with Bare Metal, Fleet Management and AI Infrastructure (infoq.com)
- Best of 2025: Microsoft Simplifies Kubernetes Management with AI Integration - Cloud Native Now (cloudnativenow.com)
- What is Azure Kubernetes Service on bare metal? (preview) - AKS enabled by Azure Arc | Microsoft Learn (learn.microsoft.com)
- Azure Kubernetes Service (AKS) | Microsoft Azure (azure.microsoft.com)

Emeka Okafor · Security Editor

Emeka has spent over a decade tracking threat actors, vulnerability disclosures, and the evolving landscape of application security, bringing a sharp continent-spanning perspective to his reporting. He's known for translating dense CVE advisories into clear, actionable context that developers and security teams alike actually read.


source & further reading

devclubhouse.com (original article)