# Best GPU Optimization Tools for Kubernetes and AI Workloads (2026)

> Source: <https://cast.ai/blog/best-gpu-optimization-tools-for-kubernetes-ai/>
> Published: 2026-06-30 16:17:00+00:00

## What GPU optimization tools do

A [GPU optimization tool](https://cast.ai/gpu-optimization/) is any component that improves the ratio between GPU cost and GPU output. That includes node lifecycle automation, partition management, workload sharing, Spot orchestration, and cost attribution. No single tool covers all five. The effective ones work together inside a closed loop.

We call that loop the **GPU Cost Optimization Loop**:

- **Measure** – collect per-GPU, per-pod utilization and cost data (DCGM Exporter, OpenCost)
- **Allocate** – right-size GPU partitions to workload demand (NVIDIA MIG, GPU Operator)
- **Share** – serve multiple workloads from one physical GPU (time-slicing, MIG)
- **Automate** – provision on demand, terminate on idle, shift to Spot when available (Karpenter, Cast AI)

### Key Takeaways

- Average GPU utilization across production Kubernetes clusters is **5%** – CPU is 8%, memory is 20% ([Cast AI 2026 State of Kubernetes Optimization Report](https://cast.ai/reports/state-of-kubernetes-optimization/)).
- Fewer than 2% of GPUs in those clusters run on Spot, despite AWS Spot saving 60–91% vs on-demand.
- NVIDIA MIG splits one A100 into up to 7 isolated partitions; combine with time-slicing for up to 28 pods per GPU.
- Inference workloads rarely exceed 40% GPU utilization; training can reach 85–95% – the optimization strategy differs significantly.
- The [GPU Cost Optimization](https://cast.ai/blog/best-kubernetes-cost-optimization-tools/) Loop (Measure → Allocate → Share → Automate) requires multiple tools working together; Cast AI orchestrates the Automate layer and ties the others into a single control plane.
- One team achieved 70%+ total savings vs SageMaker by combining time-slicing, Spot, and bin-packing on an ML inference workload (Cast AI customer case study, 2025 – [contact us for the full case study](https://cast.ai/get-demo)).

## Why GPU cost is the new battleground

GPU economics are moving against you. The [Cast AI 2026 State of Kubernetes Optimization Report](https://cast.ai/reports/state-of-kubernetes-optimization/) measured utilization across tens of thousands of production clusters. The average GPU utilization: **5%**. For comparison, CPU sits at 8% and memory at 20% – both bad, but GPU is the worst of the three, at a quarter of memory’s utilization.

At the same time, GPU hardware costs have only gone up. An H100 on `p5.48xlarge` runs ~$12.29/GPU/hr on-demand. AWS cut P5 prices 45% in June 2025 – which sounds like relief until you realize H200 Capacity Blocks were raised 15% in January 2026, the first GPU price increase in 20 years. The cost floor is climbing. The utilization floor is not.

Laurent Gil, Co-Founder & President of Cast AI, put it directly: *“A GPU sitting idle costs dollars per hour. A CPU sitting idle costs cents. And 95% of GPU capacity is doing nothing. Autonomous optimization is the only rational response to infrastructure economics that are moving against you.”*

The Spot opportunity compounds this. Fewer than 2% of GPUs in the Cast AI dataset run on Spot. AWS Spot saves 60–91% vs on-demand on GPU instance families. GCP Preemptible saves 60–80%. Azure Spot saves 60–90%. An A100 (`p4d.24xlarge`) at ~$4.10/GPU/hr on-demand costs roughly $1.60/hr on Spot at the low end. At scale, that gap becomes the largest line item in your AI infrastructure budget – and almost nobody is capturing it.
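
The Spot arithmetic above can be checked with a few lines of Python. This is a minimal sketch that applies the quoted AWS discount range to the quoted A100 on-demand price; the function name `spot_price` is illustrative, and real Spot prices vary by region, zone, and time.

```python
# On-demand price per GPU-hour quoted in the text for an A100 on p4d.24xlarge.
ON_DEMAND_A100 = 4.10


def spot_price(on_demand: float, discount: float) -> float:
    """Price after a fractional Spot discount (0.60 means 60% off)."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return on_demand * (1.0 - discount)


# AWS discount range quoted in the text: 60% to 91%.
for discount in (0.60, 0.91):
    price = spot_price(ON_DEMAND_A100, discount)
    print(f"{discount:.0%} off -> ${price:.2f}/GPU/hr")
```

At the 60% end of the range this lands at about $1.64 per GPU-hour, which matches the "roughly $1.60" figure above.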

The best-case utilization Cast AI has observed: 136 H200s sustaining 49% average utilization. That’s a well-tuned inference fleet, including LLM inference services keeping GPU nodes alive 24/7 to serve sub-100ms requests. It’s also the exception, not the norm. The teams hitting those numbers use every tool in the stack deliberately.

## The four ways GPU money leaks, and what fixes each

GPU waste isn’t random. It concentrates in four patterns, each with a specific root cause and a specific fix. Before you reach for a tool, know which leak you’re actually solving.

| Leak | Root Cause | Fix | Tool(s) |
|---|---|---|---|
| Idle GPU nodes running 24/7 | No node lifecycle automation – nodes provision at cluster creation and never terminate | GPU-aware autoscaling: scale to zero between jobs, provision on demand | Cast AI, Karpenter |
| Oversized GPU allocation | Workloads request a full GPU but only need a fraction; MIG not configured | Multi-Instance GPU (MIG): partition one GPU into 2–7 isolated slices | NVIDIA GPU Operator (MIG Manager), Cast AI MIG support |
| One workload per GPU, serially | Default Kubernetes device plugin: 1 GPU = 1 pod. No time-sharing. | Time-slicing: configure 1–48 virtual replicas per physical GPU | NVIDIA GPU Operator (time-slicing), Cast AI managed ConfigMap |
| All workloads on on-demand | No Spot automation, fear of interruption, no fallback logic | Automated Spot with on-demand fallback; interruption-aware scheduling | Cast AI (Spot GPU automation), Karpenter (NodePool disruption budgets) |

These leaks are compounding. A team with no MIG, no time-slicing, and no Spot is paying full on-demand rates for 5% utilization. Fix all four simultaneously, not sequentially, and the math changes dramatically.
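
To see why the leaks compound, it helps to price a GPU-hour of useful work rather than a GPU-hour of rented capacity. The sketch below is a deliberately simple model, not a Cast AI formula: it assumes cost per utilized GPU-hour equals the hourly price, reduced by any Spot discount, divided by utilization. The 60% discount and the 49% utilization are taken from figures quoted earlier in this article and combined here purely for illustration.

```python
def cost_per_utilized_gpu_hour(price: float, utilization: float,
                               spot_discount: float = 0.0) -> float:
    """Dollars paid per hour of GPU time that actually does work."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    return price * (1.0 - spot_discount) / utilization


# Baseline: H100 on-demand at the 5% average utilization from the report.
baseline = cost_per_utilized_gpu_hour(12.29, 0.05)

# Illustrative target: 60% Spot discount and 49% utilization.
optimized = cost_per_utilized_gpu_hour(12.29, 0.49, spot_discount=0.60)

print(f"baseline:  ${baseline:.2f} per utilized GPU-hour")
print(f"optimized: ${optimized:.2f} per utilized GPU-hour")
print(f"ratio:     {baseline / optimized:.1f}x")
```

Because the discount and the utilization multiply, the two fixes together move the result far more than either does alone.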

## The tools, by what they fix

The GPU tooling ecosystem is layered. Think of it as: observability at the base, partitioning in the middle, scheduling above that, and orchestration across all of it. Each layer builds on the one below. Cast AI is not a replacement for NVIDIA’s stack – it’s the automation layer that manages and coordinates it.

#### GPU Optimization Tool Comparison

| Tool | Category | What it fixes | Open source? | Kubernetes-native? | Multi-cloud? | Key limitation |
|---|---|---|---|---|---|---|
| NVIDIA DCGM Exporter | Observability (Measure) | No GPU utilization visibility; exposes per-GPU metrics to Prometheus via DaemonSet | Yes | Yes (DaemonSet) | No (single-cluster) | No per-container attribution when time-slicing is enabled |
| OpenCost | Cost attribution (Measure) | No per-pod GPU cost attribution; integrates cloud billing APIs down to the pod level | Yes (CNCF Incubating) | Yes | Yes (multi-cloud billing integration) | Single-cluster only; no multi-cluster aggregation (use Kubecost for that) |
| Kubecost | Cost attribution (Measure) | No multi-cluster GPU cost visibility; enables cross-cluster showback and chargeback | No (commercial, freemium tier) | Yes | Yes | Commercial pricing for full feature set; now part of IBM’s FinOps portfolio |
| NVIDIA GPU Operator (MIG) | Partitioning (Allocate) | Oversized GPU allocation; partitions one GPU into up to 7 isolated memory and compute slices | Yes | Yes | No (single cluster) | Requires a MIG-capable GPU (A100, A30, H100, H200, or newer); not supported on older GPU architectures |
| NVIDIA GPU Operator (time-slicing) | Sharing (Share) | One workload per GPU limitation; enables 1–48 virtual replicas per physical GPU | Yes | Yes | No (single cluster) | No memory isolation between pods; one saturating workload degrades all co-tenants |
| NVIDIA MPS | Sharing (Share) | Context-switch overhead between CUDA processes; allows concurrent GPU context sharing | Yes | Yes | No (single cluster) | Homogeneous CUDA workloads only; no memory isolation; not recommended for multi-tenant environments |
| Karpenter | Automation (Automate) | Idle GPU nodes; provisions right-sized GPU instances in seconds and terminates when empty | Yes | Yes | No (AWS-primary; limited Azure/GCP provider maturity) | Single-cluster scope; no MIG or time-slicing ConfigMap lifecycle management |
| Cast AI | Orchestration (Automate) | All four GPU cost leaks; Spot automation, MIG/time-slicing lifecycle, multi-cloud provisioning, and per-workload cost attribution in one control plane | No (commercial) | Yes | Yes (AWS, GCP, Azure) | Commercial product; requires connecting clusters to Cast AI control plane |

### Measure: NVIDIA DCGM Exporter + OpenCost

[NVIDIA DCGM Exporter](https://github.com/NVIDIA/dcgm-exporter) runs as a DaemonSet and exposes GPU metrics to Prometheus – utilization, memory, temperature, power draw, SM occupancy. Load the official Grafana dashboard (ID [12239](https://grafana.com/grafana/dashboards/12239)) and you get cluster-wide GPU visibility in under 10 minutes.
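
Once the metrics are in Prometheus, you can also pull them programmatically. The sketch below queries the Prometheus HTTP API for `DCGM_FI_DEV_GPU_UTIL`, the utilization gauge DCGM Exporter publishes, averaged over the last hour. The Prometheus address is an assumption, and so are the `Hostname` and `gpu` label names used for grouping; check both against your own deployment.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable at this in-cluster address.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"

# Average utilization (percent) per GPU over the last hour.
# Assumption: the series carry "Hostname" and "gpu" labels.
QUERY = "avg by (Hostname, gpu) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h]))"


def build_query_url(base_url: str, query: str) -> str:
    """URL for a Prometheus instant query."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})


def parse_instant_vector(payload: dict) -> dict:
    """Map (hostname, gpu index) -> value from an instant-query response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    result = {}
    for series in payload["data"]["result"]:
        labels = series["metric"]
        _timestamp, value = series["value"]  # value arrives as a string
        key = (labels.get("Hostname", "?"), labels.get("gpu", "?"))
        result[key] = float(value)
    return result


def fetch_gpu_utilization(base_url: str = PROMETHEUS_URL,
                          query: str = QUERY) -> dict:
    """Run the query against a live Prometheus and parse the result."""
    url = build_query_url(base_url, query)
    with urllib.request.urlopen(url, timeout=10) as response:
        return parse_instant_vector(json.load(response))
```

Calling `fetch_gpu_utilization()` from inside the cluster returns one entry per physical GPU, which is enough to spot nodes sitting near the 5% average.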

One hard limitation: DCGM Exporter does not associate metrics to individual containers when time-slicing is enabled. GPU time-slicing shares one physical device across multiple pods, and the device-level metrics don’t carry container context. You’ll see aggregate device utilization, not per-workload attribution. This is a documented limitation of the NVIDIA stack when time-slicing is enabled – not a Cast AI limitation, but one you need to plan around when designing your observability stack for shared-GPU clusters.

[OpenCost](https://github.com/opencost/opencost) (CNCF Incubating, opencost.io) provides free per-pod GPU cost allocation. It integrates with cloud billing APIs to assign actual GPU instance costs down to the pod level – something that’s surprisingly absent from most Kubernetes cost tools. For enterprise-scale cost attribution with multi-cluster aggregation, Kubecost (acquired by IBM in 2024) builds on the same foundation. If you’re running AI workloads across teams or projects and need showback/chargeback, OpenCost is the place to start.
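
To make showback concrete, here is a minimal sketch of the underlying arithmetic: each pod is charged for the GPUs it held, for as long as it held them, and the charges are rolled up by team. This is not OpenCost's actual allocation algorithm or API; the `PodUsage` type, the team names, and the flat per-GPU-hour price are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    name: str
    team: str
    gpus: int      # whole GPUs requested by the pod
    hours: float   # how long the pod held them


def showback(pods: list, price_per_gpu_hour: float) -> dict:
    """Total GPU cost per team, charged by GPUs held x hours held."""
    costs = {}
    for pod in pods:
        charge = pod.gpus * pod.hours * price_per_gpu_hour
        costs[pod.team] = costs.get(pod.team, 0.0) + charge
    return costs


# Hypothetical usage, priced at the A100 on-demand rate quoted earlier.
usage = [
    PodUsage("ranker-0", "search", gpus=2, hours=10.0),
    PodUsage("ranker-1", "search", gpus=1, hours=4.0),
    PodUsage("embedder-0", "ads", gpus=1, hours=5.0),
]
for team, cost in showback(usage, price_per_gpu_hour=4.10).items():
    print(f"{team}: ${cost:.2f}")
```

A production tool also has to price idle capacity and shared GPUs, which is where the time-slicing attribution gap described above starts to matter.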

The [FinOps Foundation’s AI GPU guide](https://www.finops.org/wg/finops-for-ai-overview/) covers cost allocation methodology in more depth for organizations formalizing their GPU FinOps practice.

### Allocate and Share: NVIDIA GPU Operator

The [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html) is the Helm chart that bootstraps everything: drivers, the device plugin, DCGM Exporter, MIG Manager, and the time-slicing configuration. If you’re not using the GPU Operator, you’re managing all of those components individually and you will eventually get them out of sync.

#### Installing the GPU Operator

```
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator
```

**MIG driver requirement:** Enabling MIG on an A100 requires NVIDIA driver version 450.80.02 or later. Verify with `nvidia-smi -q | grep 'Driver Version'` before configuring the MIG strategy.

MIG (Multi-Instance GPU) partitions a physical GPU into isolated slices with dedicated memory and compute. On an A100, you can create up to 7 × 1g.5gb partitions per GPU. Each partition appears as a separate device to Kubernetes, meaning 7 separate pods can each get a guaranteed GPU fraction. MIG is supported on A100, A30, H100, H200, and newer NVIDIA architectures.

| GPU Model | Profile | Max per GPU | Memory per slice |
|---|---|---|---|
| A100 40GB | 1g.5gb | 7 | 5 GB |
| A100 40GB | 2g.10gb | 3 | 10 GB |
| A100 40GB | 3g.20gb | 2 | 20 GB |
| A100 40GB | 7g.40gb | 1 | 40 GB (full GPU) |
| A100 80GB | 1g.10gb | 7 | 10 GB |
| A100 80GB | 2g.20gb | 3 | 20 GB |
| A100 80GB | 3g.40gb | 2 | 40 GB |
| A100 80GB | 4g.40gb | 1 | 40 GB |
| A100 80GB | 7g.80gb | 1 | 80 GB (full GPU) |
| H100 80GB | 1g.10gb | 7 | 10 GB |
| H100 80GB | 2g.20gb | 3 | 20 GB |
| H100 80GB | 3g.40gb | 2 | 40 GB |
| H100 80GB | 4g.40gb | 1 | 40 GB |
| H100 80GB | 7g.80gb | 1 | 80 GB (full GPU) |
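
Choosing a profile from this table is mostly a lookup: find the smallest slice whose memory fits the model. The sketch below encodes the table rows as a dictionary and picks the smallest fitting profile, preferring the one that yields more instances when two profiles have the same memory. The function name and the tie-break rule are illustrative choices, and memory is only one constraint; compute share matters too.

```python
# Rows copied from the table above:
# profile -> (max instances per GPU, memory per slice in GB)
MIG_PROFILES = {
    "A100 40GB": {"1g.5gb": (7, 5), "2g.10gb": (3, 10),
                  "3g.20gb": (2, 20), "7g.40gb": (1, 40)},
    "A100 80GB": {"1g.10gb": (7, 10), "2g.20gb": (3, 20), "3g.40gb": (2, 40),
                  "4g.40gb": (1, 40), "7g.80gb": (1, 80)},
    "H100 80GB": {"1g.10gb": (7, 10), "2g.20gb": (3, 20), "3g.40gb": (2, 40),
                  "4g.40gb": (1, 40), "7g.80gb": (1, 80)},
}


def smallest_fitting_profile(gpu_model: str, required_gb: float):
    """Return (profile, max instances per GPU), or None if nothing fits."""
    candidates = [
        (memory_gb, -count, name)
        for name, (count, memory_gb) in MIG_PROFILES[gpu_model].items()
        if memory_gb >= required_gb
    ]
    if not candidates:
        return None
    # Smallest memory first; on a tie, the profile with more instances.
    _memory_gb, negative_count, name = min(candidates)
    return name, -negative_count


print(smallest_fitting_profile("A100 40GB", 8))
print(smallest_fitting_profile("A100 80GB", 30))
```

A model that needs 8 GB on an A100 40GB lands on `2g.10gb`, so three copies fit on one physical GPU instead of one.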

Time-slicing takes sharing further. Where MIG provides hard memory isolation, time-slicing is temporal sharing with no memory isolation – multiple pods take turns on the same GPU. Configure via a ConfigMap applied through the GPU Operator. You can set 1–48 replicas per GPU. In practice, 4–8 replicas is a reasonable starting point for inference workloads, and you adjust from there based on DCGM utilization data.

Combine MIG and time-slicing: 7 MIG partitions × 4 time-slice replicas = 28 virtual GPU devices from one A100. That changes the economics of inference serving entirely.
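
The multiplication is simple, but the memory side deserves a second look. Time-slicing has no memory isolation, so the pods sharing a MIG slice also share that slice's memory. The sketch below computes the device count and an even-split average of memory per pod; the even split is an assumption for illustration, not something the device plugin enforces.

```python
def virtual_devices(mig_partitions: int, time_slice_replicas: int) -> int:
    """Schedulable GPU devices advertised from one physical GPU."""
    return mig_partitions * time_slice_replicas


def average_memory_per_pod_gb(slice_memory_gb: float,
                              time_slice_replicas: int) -> float:
    """Even-split average only: pods on a slice compete for its memory."""
    return slice_memory_gb / time_slice_replicas


# One A100 40GB: seven 1g.5gb slices, four time-slice replicas each.
print(virtual_devices(7, 4))             # 28
print(average_memory_per_pod_gb(5, 4))   # 1.25
```

At 28 pods per A100 40GB, each pod has about 1.25 GB on average, so this density only suits small models.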

#### Time-slicing ConfigMap (4 replicas per GPU)

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4
```

#### Patch the ClusterPolicy to activate time-slicing

```
kubectl patch clusterpolicy/cluster-policy \
  -n gpu-operator \
  --type merge \
  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'
```

The ConfigMap alone has no effect until the ClusterPolicy references it. This patch tells the GPU Operator to use `time-slicing-config` as the device plugin configuration on every node. The GPU Operator then propagates the configuration automatically – no manual daemonset restarts required.
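
If you script this step, building the patch as a data structure avoids hand-escaping JSON inside shell quotes. This is a small sketch that produces the same merge patch and `kubectl` arguments as the command above; the helper names are illustrative, and it assumes `kubectl` is on your PATH with access to the cluster.

```python
import json


def build_time_slicing_patch(config_name: str, default_key: str) -> str:
    """Merge-patch body pointing the device plugin at a ConfigMap."""
    patch = {
        "spec": {
            "devicePlugin": {
                "config": {"name": config_name, "default": default_key}
            }
        }
    }
    return json.dumps(patch)


def kubectl_patch_args(patch_json: str) -> list:
    """Argument list equivalent to the kubectl command shown above."""
    return [
        "kubectl", "patch", "clusterpolicy/cluster-policy",
        "-n", "gpu-operator",
        "--type", "merge",
        "-p", patch_json,
    ]


args = kubectl_patch_args(build_time_slicing_patch("time-slicing-config", "any"))
print(" ".join(args[:-1]), f"'{args[-1]}'")
```

To apply it, pass `args` to `subprocess.run(args, check=True)`. The `default` value must match a key under `data` in the ConfigMap, which is `any` in the example above.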

**Latency caveat for inference SLOs:** Time-slicing introduces context-switch overhead that can meaningfully increase p99 latency for latency-sensitive inference workloads. For services with strict SLOs – particularly large language model inference endpoints with sub-100ms p99 targets – use MIG for hardware isolation instead of time-slicing.

**NVIDIA MPS:** NVIDIA MPS (Multi-Process Service) is an alternative sharing mechanism for homogeneous CUDA workloads – it allows multiple processes to share a single GPU context simultaneously, with lower context-switch overhead than time-slicing but no memory isolation. See the [NVIDIA MPS documentation](https://docs.nvidia.com/deploy/mps/index.html) for production considerations before adopting it in a multi-tenant environment.

### Automate: Karpenter

[Karpenter](https://karpenter.sh/docs/) is the node provisioner. It watches for unschedulable pods and provisions the right instance type in seconds – not the minutes it takes the cluster autoscaler. For GPU workloads, you define a NodePool that includes GPU instance families, sets a taint so only GPU-requesting pods land there, and configures Spot + on-demand as the capacity strategy.

#### Karpenter NodePool for GPU workloads

```
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-nodepool
spec:
  template:
    metadata:
      labels:
        node-type: gpu
    spec:
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - p3.2xlarge
            - p3.8xlarge
            - p4d.24xlarge
            - p5.48xlarge
            - g5.xlarge
            - g5.4xlarge
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodeclass
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
  limits:
    nvidia.com/gpu: "100"
```

#### EC2NodeClass for GPU nodes

```
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-nodeclass
spec:
  amiFamily: AL2023
  amiSelectorTerms:
    - alias: al2023@latest
  role: KarpenterNodeRole
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  tags:
    Name: karpenter-gpu-node
```

The `consolidateAfter: 5m` setting, together with `consolidationPolicy: WhenEmptyOrUnderutilized`, tells Karpenter to wait 5 minutes before consolidating GPU nodes that are empty or underutilized. That’s the node lifecycle fix for idle GPUs – and it’s the lever most teams don’t pull because the default cluster autoscaler takes 10+ minutes and frequently gets it wrong on GPU instances.
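
The value of a short consolidation window is easy to quantify. The sketch below prices the idle tail of a GPU node using the H100 on-demand rate quoted earlier and the 8 GPUs on a `p5.48xlarge`. It ignores Spot discounts and any per-second billing details, so treat the numbers as rough.

```python
def idle_cost(gpus_per_node: int, price_per_gpu_hour: float,
              idle_minutes: float) -> float:
    """Dollars spent on a node that sits idle for the given time."""
    return gpus_per_node * price_per_gpu_hour * idle_minutes / 60.0


GPUS_PER_P5 = 8          # a p5.48xlarge carries 8 H100 GPUs
H100_ON_DEMAND = 12.29   # $/GPU/hr, as quoted in the text

# Idle tail with consolidateAfter: 5m, per scale-down event.
print(f"5 minutes: ${idle_cost(GPUS_PER_P5, H100_ON_DEMAND, 5):.2f}")

# The same node left running idle for a full day.
print(f"24 hours:  ${idle_cost(GPUS_PER_P5, H100_ON_DEMAND, 24 * 60):.2f}")
```

The 5-minute window costs about $8 per scale-down event. A single node forgotten for a day costs well over $2,000.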

#### GPU pod spec

```
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: inference
      image: nvcr.io/nvidia/tritonserver:24.08-py3
      resources:
        requests:
          nvidia.com/gpu: "1"
          memory: "16Gi"
          cpu: "4"
        limits:
          nvidia.com/gpu: "1"
          memory: "16Gi"
          cpu: "4"
```

Note the toleration matching the taint on the NodePool. Without it, the pod won’t schedule onto GPU nodes. Also note that `requests` and `limits` should match for GPU resources – partial GPU requests aren’t meaningful in the standard device plugin model (MIG and time-slicing change this via virtual device advertisement).

### Orchestrate everything: Cast AI

After you’ve deployed DCGM Exporter, configured the GPU Operator, and set up Karpenter, you have three separate systems that don’t talk to each other. DCGM tells you GPU utilization; the GPU Operator manages drivers and partition configs on each node; Karpenter provisions and terminates nodes for a single cluster. None of them automatically adjusts MIG profiles when workload demand changes, none manages Spot fallback across clouds, and none gives you a single view of GPU cost across multiple clusters. That’s the gap.

Cast AI’s **OMNI Compute for AI** is the orchestration layer that closes it. It connects to your existing Kubernetes clusters – including clusters already running Karpenter – and adds GPU-aware bin-packing, automated MIG and time-slicing lifecycle management, Spot automation with on-demand fallback, and multi-cloud provisioning on top of whatever you already have. Karpenter handles node provisioning for a single cluster well; Cast AI extends that capability across AWS, GCP, and Azure while managing the NVIDIA configuration layers that Karpenter doesn’t touch.

**Karpenter and Cast AI are complementary, not competing.** If you’re already running Karpenter, Cast AI doesn’t replace it. It layers GPU-aware scheduling, MIG/time-slicing ConfigMap lifecycle management, and multi-cloud Spot orchestration on top – giving you capabilities that Karpenter’s node-provisioning scope doesn’t cover.

The MIG support covers A100, A30, H100, H200, B200, and GB200 on EKS and GKE. Cast AI manages the MIG ConfigMap lifecycle – it applies the right partition profile based on workload demand and cleans up when no longer needed. The same applies to time-slicing: 1–48 replicas per GPU, ConfigMap managed automatically. No manual kubectl patching when you want to change replica counts mid-deployment.

Cast AI also provides GPU metrics and cost attribution per workload – directly addressing the DCGM Exporter gap on time-sliced clusters. Where DCGM loses container context, Cast AI’s cost attribution layer maintains it via the Kubernetes scheduler and billing API integration, so you always know which team, job, or service drove which portion of the GPU bill.

Cast AI already supports DRA (Dynamic Resource Allocation)-based GPU scheduling via Kubernetes ResourceClaims – connect your cluster to have DRA-aware provisioning activated automatically.

If you want to see what the Spot + time-slicing + bin-packing combination does to your actual GPU bill: [book a demo with the Cast AI team](https://cast.ai/get-demo). They run the analysis against your cluster data.

## How to choose for inference vs training

Inference and training have fundamentally different GPU utilization profiles. Training workloads are sustained and compute-continuous – GPU utilization runs near 85–95% for the full job duration, with no meaningful idle periods once a run starts. Inference is continuous but low-intensity – rarely exceeding 40% per GPU for most model sizes, with significant idle periods between requests. LLM inference serving is the exception: large language model endpoints keep GPU memory fully allocated even between requests, creating a different cost profile than smaller model inference. The optimization strategy diverges here.

| Decision | Inference workloads | Training workloads |
|---|---|---|
| GPU sharing (MIG / time-slicing) | High priority – inference rarely saturates a full GPU; MIG or time-slicing dramatically improves utilization | Low priority for distributed training (needs full GPU); consider for single-GPU fine-tuning jobs with appropriate profiles (see MIG row below) |
| Spot GPU instances | Use with caution – stateless inference servers tolerate interruption well; stateful model servers need fallback logic | Use for fault-tolerant training jobs (PyTorch elastic training, Ray); avoid for jobs without checkpointing |
| Node lifecycle (scale-to-zero) | Critical – inference nodes can go idle between low-traffic windows; terminate aggressively | Critical – terminate immediately after job completion; sustained training runs have clear end states |
| GPU instance size | Smaller instances (g5.xlarge, g5.4xlarge) with time-slicing often beat one p4d.24xlarge for inference throughput/cost | Largest available GPU per node; NVLink/NVSwitch topology matters for multi-GPU jobs |
| MIG partitioning | Use 1g.5gb or 2g.10gb profiles for small inference models; 4g.20gb for models that need more VRAM | Smaller profiles (1g.5gb, 2g.10gb) create memory bandwidth bottlenecks for fine-tuning. Use 4g.40gb or 7g.40gb for fine-tuning workloads, or avoid MIG entirely. Avoid MIG for distributed training – partition boundaries block NVLink communication. |
| Primary optimization lever | Sharing + Spot (density × price) | Spot + bin-packing (price × scheduling speed) |
| Recommended toolchain | GPU Operator (time-slicing) + Cast AI (Spot fallback + node lifecycle) + OpenCost (attribution). Inference frameworks like vLLM and NVIDIA TensorRT-LLM handle dynamic batching, continuous batching, and KV cache management to maximize throughput within your GPU allocation – these are complementary to infrastructure-level optimization. | Karpenter or Cast AI (fast provisioning + Spot) + DCGM Exporter (utilization monitoring) |
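
The sharing guidance in this table, plus the latency caveat from the time-slicing section, can be condensed into a small decision helper. This is a simplified sketch of the article's rules of thumb, not a scheduler policy; the function name, parameters, and return labels are all hypothetical.

```python
def recommend_sharing(workload: str, distributed: bool = False,
                      strict_latency_slo: bool = False) -> str:
    """Suggest a GPU sharing mode for a workload.

    workload: "inference" or "training".
    distributed: multi-GPU or multi-node training job.
    strict_latency_slo: the service has a tight p99 latency target.
    """
    if workload == "training":
        if distributed:
            # MIG partition boundaries block NVLink communication.
            return "none"
        # Single-GPU fine-tuning: large MIG profiles only, or no MIG.
        return "mig-large-or-none"
    if workload == "inference":
        # Time-slicing adds context-switch latency; MIG isolates hardware.
        return "mig" if strict_latency_slo else "time-slicing"
    raise ValueError(f"unknown workload type: {workload!r}")


print(recommend_sharing("inference"))
print(recommend_sharing("inference", strict_latency_slo=True))
print(recommend_sharing("training", distributed=True))
```

Treat the output as a starting point and confirm it against DCGM utilization data for the actual workload.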

**Multi-node distributed training: network fabric matters as much as GPU choice.** NVIDIA InfiniBand (200 Gb/s HDR or 400 Gb/s NDR) and AWS EFA (Elastic Fabric Adapter) provide the RDMA bandwidth needed for efficient all-reduce operations across nodes – the collective communication pattern that dominates distributed training. Without high-bandwidth low-latency interconnects, GPUs wait on network I/O and effective utilization collapses even on otherwise well-tuned workloads. Verify your instance type supports EFA (p4d, p5, and trn1 families on AWS) before planning multi-node training runs.
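
A back-of-envelope estimate shows how much the fabric matters. The sketch below assumes a ring all-reduce, in which each node transfers 2(N-1)/N times the payload, and models bandwidth only. It ignores latency, compute overlap, and multiple network adapters per node. The 14 GB payload (fp16 gradients for a hypothetical 7-billion-parameter model) and the 25 Gb/s comparison link are assumptions for illustration.

```python
def ring_allreduce_seconds(payload_gb: float, nodes: int,
                           link_gbit_per_s: float) -> float:
    """Bandwidth-only time estimate for one ring all-reduce."""
    if nodes < 2:
        return 0.0  # nothing to exchange on a single node
    link_gbyte_per_s = link_gbit_per_s / 8.0
    traffic_gb = 2.0 * (nodes - 1) / nodes * payload_gb
    return traffic_gb / link_gbyte_per_s


PAYLOAD_GB = 14.0  # assumption: 7e9 parameters x 2 bytes (fp16)
NODES = 8

for label, gbit in [("400 Gb/s NDR", 400), ("200 Gb/s HDR", 200),
                    ("25 Gb/s Ethernet (hypothetical)", 25)]:
    seconds = ring_allreduce_seconds(PAYLOAD_GB, NODES, gbit)
    print(f"{label}: {seconds:.2f} s per all-reduce")
```

Under these assumptions a single gradient exchange takes about half a second on NDR and several seconds on the slower link, and the GPUs wait for however much of that time is not overlapped with compute.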

AMD’s MI300X and MI325X are emerging alternatives for inference scale, with strong memory bandwidth characteristics; ROCm workloads on Kubernetes are supported via the AMD GPU Operator, which is separate from the NVIDIA GPU Operator.

## FAQ

### What is the average GPU utilization in Kubernetes production clusters?

According to the [Cast AI 2026 State of Kubernetes Optimization Report](https://cast.ai/reports/state-of-kubernetes-optimization/), the average GPU utilization across tens of thousands of production clusters is **5%**. CPU averages 8% and memory 20%. GPU is the worst-utilized resource in the stack by a significant margin, and it’s the most expensive per wasted unit.

### What is the difference between MIG and time-slicing for GPU sharing?

MIG (Multi-Instance GPU) creates hardware-isolated partitions with dedicated memory, compute, and cache per slice. Each MIG partition appears as a separate device. It’s the right choice when workloads need memory isolation, predictable latency, or fault isolation. Time-slicing shares one physical GPU temporally – pods take turns on the same device with no memory protection between them. It’s simpler to configure and supports more replicas (up to 48 per GPU), but any workload can saturate the shared GPU and affect its neighbors. For inference serving, time-slicing is usually sufficient. For shared multi-tenant environments, MIG is safer. See the [NVIDIA MIG User Guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/) for hardware requirements.

### How much can Spot instances save on GPU costs?

AWS Spot saves 60–91% vs on-demand on GPU instance families. GCP Preemptible saves 60–80%; Azure Spot saves 60–90%. An A100 (`p4d.24xlarge`) at ~$4.10/GPU/hr on-demand drops to under $2/hr on Spot at common discount levels. The H100 (`p5.48xlarge`) runs ~$12.29/GPU/hr on-demand. Despite these savings, fewer than 2% of GPUs in the Cast AI dataset run on Spot – the opportunity is almost entirely uncaptured across the industry. The primary barrier is interruption handling, which automated Spot orchestration with on-demand fallback (as in Cast AI and Karpenter) addresses directly.

### Does DCGM Exporter work with GPU time-slicing?

DCGM Exporter works at the GPU device level and exposes metrics to Prometheus, but it does not associate metrics to individual containers when time-slicing is enabled. You’ll see aggregate GPU utilization for the physical device, not per-pod breakdowns. This is a documented limitation of the NVIDIA stack when time-slicing is enabled: the device-level metrics don’t carry container context. For per-workload attribution on time-sliced clusters, you need a cost attribution layer (OpenCost or Cast AI’s built-in attribution) that correlates Kubernetes scheduling data with GPU device metrics.

### What tools are needed to implement the GPU Cost Optimization Loop?

The GPU Cost Optimization Loop – Measure → Allocate → Share → Automate – requires at minimum: DCGM Exporter (measure), NVIDIA GPU Operator with MIG Manager (allocate), NVIDIA GPU Operator time-slicing configuration (share), and an autoscaler with Spot support (automate). Karpenter covers the automate layer for single-cluster AWS deployments. Cast AI covers automate across multi-cloud deployments and also handles MIG and time-slicing ConfigMap lifecycle management, collapsing several manual configuration steps into one control plane. OpenCost handles the cost attribution component that feeds back into the Measure step.

### Is it safe to run training workloads on Spot GPU instances?

Yes, with the right architecture. Spot interruptions give you a 2-minute warning on AWS. Training jobs that implement checkpointing (saving model state every N steps) can resume from the last checkpoint on a new instance – losing at most a few minutes of compute, not the entire run. PyTorch’s elastic training and Ray Train handle this natively. Jobs without checkpointing are high-risk on Spot and should run on on-demand with Spot as a stretch capacity option only. Cast AI’s Spot automation triggers an on-demand fallback automatically on interruption, so the job survives without manual intervention – but you still need checkpointing in your training code to avoid losing training progress.
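
The checkpoint-and-resume pattern is small enough to sketch with the standard library. The example below simulates a training loop that saves state every N steps and resumes from the last checkpoint after an interruption. It stores a toy state as JSON; a real PyTorch job would save model and optimizer state instead. The `interrupt_at` parameter is a stand-in for a Spot reclaim, and all names here are illustrative.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, step: int, state: dict) -> None:
    """Write to a temp file, then rename, so a partial write never wins."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path: str):
    """Return (step, state); start from step 0 if no checkpoint exists."""
    if not os.path.exists(path):
        return 0, {"loss": None}
    with open(path) as f:
        data = json.load(f)
    return data["step"], data["state"]


def train(path: str, total_steps: int, checkpoint_every: int,
          interrupt_at=None) -> int:
    """Resume from the last checkpoint; return the last completed step."""
    step, state = load_checkpoint(path)
    while step < total_steps:
        if interrupt_at is not None and step == interrupt_at:
            return step  # simulated reclaim: unsaved progress is lost
        step += 1
        state = {"loss": 1.0 / step}  # stand-in for a real training step
        if step % checkpoint_every == 0:
            save_checkpoint(path, step, state)
    return step


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "ckpt.json")
        print("stopped at step", train(ckpt, 100, 10, interrupt_at=37))
        print("resuming from step", load_checkpoint(ckpt)[0])
        print("finished at step", train(ckpt, 100, 10))
```

In this run the job is interrupted at step 37, resumes from the checkpoint at step 30, and redoes only 7 steps. The checkpoint interval is the knob that trades save overhead against lost work.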