# SkyPilot Endpoints: Production-Ready Inference on Every Cluster You Own

> Source: <https://blog.skypilot.co/skypilot-endpoints/>
> Published: 2026-06-23 16:00:00+00:00

**SkyPilot Endpoints is a next-gen LLM inference system designed for production-ready inference in multi-cluster environments.** A single YAML deploys the full serving stack - engine, autoscaler, gateway, certificates, metrics - and runs it across any number of Kubernetes clusters under one endpoint URL with a focus on performance and production-readiness.

## Multi-cluster inference made simple

GPU supply is limited, and teams take capacity wherever they can get it - across clouds, regions, and on-prem.

But the Kubernetes-native LLM serving stack today (KServe, llm-d, Dynamo) is single-cluster, and operating it across the resulting fleet compounds both deployment and maintenance cost.

**SkyPilot Endpoints provides the cross-cluster control plane on top.** It sees registered Kubernetes clusters as one pool and handles:

- **Placement.** On deploy, SkyPilot selects a cluster with sufficient GPU capacity for the configured replica count, accounting for preferences (region, cost, availability) declared in the YAML.
- **Scaling.** When autoscaling adds replicas beyond the home cluster’s capacity, additional replicas land on the next cluster with available GPUs.
- **Failure recovery.** On cluster failure, replicas are recreated on healthy clusters. The endpoint URL does not change.
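The placement, spill-over, and recovery behavior described above can be pictured with a small greedy sketch. This is not SkyPilot's implementation: the `Cluster` class, the cluster names, and the assumption that clusters arrive pre-sorted by preference are all hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    # Hypothetical model of a registered cluster; not a SkyPilot type.
    name: str
    free_gpus: int
    healthy: bool = True
    replicas: list = field(default_factory=list)


def place_replicas(clusters, replica_ids, gpus_per_replica):
    """Greedily place replicas, spilling to the next cluster when one is full.

    `clusters` is assumed to be pre-sorted by preference (region, cost, ...).
    Returns the replica IDs that could not be placed anywhere.
    """
    unplaced = []
    for rid in replica_ids:
        for cluster in clusters:
            if cluster.healthy and cluster.free_gpus >= gpus_per_replica:
                cluster.free_gpus -= gpus_per_replica
                cluster.replicas.append(rid)
                break
        else:
            # No healthy cluster had enough free GPUs for this replica.
            unplaced.append(rid)
    return unplaced


def recover_from_failure(clusters, failed_name, gpus_per_replica):
    """Mark a cluster as failed and recreate its replicas on healthy clusters."""
    failed = next(c for c in clusters if c.name == failed_name)
    failed.healthy = False
    lost, failed.replicas = failed.replicas, []
    return place_replicas(clusters, lost, gpus_per_replica)


# Illustrative fleet: names and GPU counts are made up.
fleet = [Cluster("cloud-a", 16), Cluster("cloud-b", 8), Cluster("on-prem", 8)]
place_replicas(fleet, ["r0", "r1", "r2"], gpus_per_replica=8)
# r0 and r1 fill cloud-a; r2 spills over to cloud-b.
still_unplaced = recover_from_failure(fleet, "cloud-a", gpus_per_replica=8)
# r0 moves to on-prem; r1 stays unplaced until more capacity appears.
```

In this toy version the client-facing URL never appears, which mirrors the point in the text: placement changes underneath while the endpoint stays the same.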

Clients see one endpoint URL; the infra team manages one spec across the fleet.

### Deploy once. Place anywhere. Survive cluster failure.

In the interactive demo for this section, SkyPilot autoscales replicas across clusters behind a single endpoint URL. Terminating any cluster causes its replicas to migrate to the remaining clusters.

## One YAML, one dashboard

The components of the modern LLM inference stack are great in isolation: inference engines (vLLM, SGLang, TensorRT-LLM), serving frameworks (KServe, llm-d, Dynamo), autoscaling (KEDA), KV cache-aware routing (Gateway API + Inference Extension), TLS (cert-manager), metrics and tracing (Prometheus, Alloy).

Assembling them in a performant configuration is tedious per-deployment work - engine tuning, autoscaling wired to the right Prometheus query, KV cache-aware routing rules, certificate plumbing - and keeping the stack alive through engine upgrades, CRD migrations, and version-compatibility checks is a recurring tax.

**SkyPilot Endpoints replaces it with a single specification that deploys and manages inference across all your clusters.** Here’s a minimal spec for an endpoint:

```yaml
name: glm-prod
model: zai-org/GLM-5.2
resources:
  accelerators: B200:8
replicas: 2
routing: kv_cache_aware
```

```bash
$ sky endpoint up endpoint.yaml
```

Six lines set up the whole stack from earlier: inference engine, serving framework, autoscaler, inference gateway, intelligent routing, metrics and more. SkyPilot sets up the CRDs, wires inference metrics to Prometheus, installs KEDA when you turn on autoscaling, and gives you a public (or private) URL. It works on every cluster you own.

Optional fields cover production knobs:

- `engine:` choose between vLLM, SGLang and more. Passthrough for all engine flags (`max_model_len`, `enforce_eager`, …), or override the entrypoint for custom engines.
- `routing:` KV cache-aware routing using Gateway API Inference Extension or P2C.
- `prefill:` prefill/decode disaggregation (heterogeneous GPU types supported).
- `volumes:` attach a shared model cache across replicas for faster cold starts.
- `autoscaling:` scale on `kv_cache_utilization`, `queue_depth` or custom PromQL metrics with tunable up/down delays. Scale-to-zero supported.
- Rolling updates, auth/TLS, gated-model auth and more.
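For readers unfamiliar with P2C (power of two choices), one of the routing options named above, here is a minimal sketch of the general technique. It is not the gateway's code: the in-flight request count used as the load signal and the replica names are assumptions for illustration.

```python
import random


def p2c_pick(loads, rng=random):
    """Pick a replica using power-of-two-choices (P2C).

    `loads` maps replica name -> current load (here: an assumed in-flight
    request count). Two distinct replicas are sampled at random and the
    less loaded of the two receives the request.
    """
    names = list(loads)
    if len(names) == 1:
        return names[0]
    a, b = rng.sample(names, 2)
    return a if loads[a] <= loads[b] else b


def simulate(num_replicas=8, num_requests=10_000, seed=0):
    """Route requests with P2C and return the per-replica request totals."""
    rng = random.Random(seed)
    loads = {f"replica-{i}": 0 for i in range(num_replicas)}
    for _ in range(num_requests):
        loads[p2c_pick(loads, rng)] += 1
    return loads


totals = simulate()
spread = max(totals.values()) - min(totals.values())
```

P2C needs only two load lookups per request yet keeps `spread` small, which is why it is a common fallback when full KV cache-aware scoring is not available.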

The underlying stack builds on battle-tested open-source frameworks: KServe and llm-d. vLLM works out of the box; support for more inference engines is coming soon.

**YAML in, dashboard out.** One dashboard for the whole fleet — not one per cluster:

- **Overview**: pod health and replica spread across clusters.
- **Serving metrics**: latency (TTFT, TPOT, end-to-end at p50/p95/p99), throughput (output tok/s, req/s), saturation (KV-cache util, queue depth, GPU util).
- **Logs**: per-pod engine logs, including sidecars and init containers.
- **Chat playground**: sanity-check the deployed model from your browser.
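As a reference for the latency terms above, the sketch below derives TTFT (time to first token), TPOT (time per output token) and end-to-end latency from per-token timestamps, then summarizes a batch at p50/p95/p99. It uses common definitions of these metrics. How the dashboard actually computes them is not stated in this post, and the function names are hypothetical.

```python
import statistics


def request_latency(sent_at, token_times):
    """Compute latency metrics for one request from timestamps in seconds.

    TTFT: delay from sending the request to the first output token.
    TPOT: mean gap between output tokens after the first one.
    """
    ttft = token_times[0] - sent_at
    e2e = token_times[-1] - sent_at
    n = len(token_times)
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return {"ttft": ttft, "tpot": tpot, "e2e": e2e}


def percentiles(values):
    """Return p50/p95/p99 of `values` (needs at least two samples)."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# One request sent at t=0.0 whose three tokens arrive at 0.5s, 0.75s and 1.0s.
metrics = request_latency(0.0, [0.5, 0.75, 1.0])
```

Production systems typically estimate percentiles from histogram buckets rather than raw samples, so exact values can differ from this direct calculation.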

## Inference by day. Training by night.

The economics of running your own GPUs only work if you keep them busy. Unfortunately, [most organizations don’t](https://venturebeat.com/infrastructure/5-gpu-utilization-the-401-billion-ai-infrastructure-problem-enterprises-cant-keep-ignoring).

AI teams today split their GPU fleet into two fixed partitions: one for inference, one for training. However, inference demand is inherently spiky: it surges during business hours and drops at night.

But those GPUs earmarked for inference can’t be touched by training, even when they’re sitting idle at 3 AM. Meanwhile, the training partition can’t expand to absorb that idle capacity. You’re paying full price for hardware you can’t fully use.

**SkyPilot improves your GPU utilization by providing a unified interface for both training and inference workloads.** With [Managed Jobs](https://docs.skypilot.co/en/latest/examples/managed-jobs.html) for training and the new SkyPilot Endpoints for inference, you manage both through the same system - and SkyPilot handles the dynamic GPU allocation automatically.

The key insight that drives utilization: **training workloads can be preemptible; inference workloads are latency-sensitive.**

- **Inference replicas get high priority.** When they need to scale up, they get GPUs immediately.
- **Training jobs can be low priority.** They use all available GPUs, but gracefully yield when inference needs more.
- **SkyPilot handles the job recovery.** When a training job is preempted, SkyPilot automatically restarts it from its last checkpoint.
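Restarting from the last checkpoint presumes the training job writes checkpoints it can resume from. The toy loop below sketches that pattern under stated assumptions: the JSON checkpoint format, the `preempt_at` knob, and the function names are hypothetical and exist only to simulate a preemption locally.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write atomically so a preemption mid-write cannot corrupt the file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss": None}


def train(path, total_steps, checkpoint_every=10, preempt_at=None):
    """Toy training loop that resumes from `path`.

    Returns the last step completed in this run. If `preempt_at` is set,
    the run stops just before that step to simulate a preemption.
    """
    state = load_checkpoint(path)
    for step in range(state["step"] + 1, total_steps + 1):
        if step == preempt_at:
            return state["step"]  # progress since the last checkpoint is lost
        state = {"step": step, "loss": 1.0 / step}  # stand-in for real work
        if step % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["step"]


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "ckpt.json")
    train(ckpt, total_steps=100, preempt_at=37)   # preempted before step 37
    resumed_from = load_checkpoint(ckpt)["step"]  # last checkpoint, step 30
    final_step = train(ckpt, total_steps=100)     # resumes at step 31
```

The checkpoint interval sets how much work a preemption can cost: here at most nine steps are repeated after a restart.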

### Static Partitioning vs SkyPilot

Here are the two approaches on the same 16-H100 cluster:

- The static side keeps 8 GPUs walled off for inference and 8 for training, no sharing.
- SkyPilot’s unified pool treats all 16 as one shared resource - inference borrows training’s GPUs when load rises, and gives them back when it falls.

As the request rate climbs, the static side caps out at 8 GPUs and starts dropping queries, while the unified pool absorbs the burst.
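The difference between the two approaches comes down to simple arithmetic on the 16-GPU cluster. The sketch below assumes inference demand is expressed directly in GPUs and that training can always use whatever is left over; both are simplifications, and the function names are hypothetical.

```python
TOTAL_GPUS = 16
STATIC_INFERENCE_GPUS = 8  # the static split from the comparison above


def static_partition(inference_demand):
    """Fixed split: inference is capped at its partition, training keeps its own."""
    inference = min(inference_demand, STATIC_INFERENCE_GPUS)
    return {
        "inference": inference,
        "training": TOTAL_GPUS - STATIC_INFERENCE_GPUS,
        "idle": STATIC_INFERENCE_GPUS - inference,
        "unmet": inference_demand - inference,
    }


def unified_pool(inference_demand):
    """Shared pool: inference is served first, training takes the remainder."""
    inference = min(inference_demand, TOTAL_GPUS)
    return {
        "inference": inference,
        "training": TOTAL_GPUS - inference,
        "idle": 0,
        "unmet": inference_demand - inference,
    }


# Quiet period, exactly-at-partition load, and a burst beyond the partition.
for demand in (2, 8, 12):
    print(demand, static_partition(demand), unified_pool(demand))
```

At a demand of 2 GPUs the static split leaves 6 GPUs idle, and at a demand of 12 it leaves 4 GPUs of demand unmet. The unified pool has neither problem until demand exceeds all 16 GPUs.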

### In action: Sharing 16 H100s between inference and training

Below is a real trace from a SkyPilot deployment sharing 16 H100s between **inference** and **training**.

- Both workloads start together: 14 GPUs running training jobs, 2 GPUs for inference.
- Then we hit the endpoint with bursty traffic.

In the trace visualization, each tile is one H100 GPU.

**SkyPilot makes sure no GPUs sit idle.** When inference needed capacity, it got it instantly. When it didn’t, training used every available GPU. Training jobs that were preempted automatically resumed from their last checkpoint - without any manual intervention from researchers or infra teams.

## Designed for performance, reliability and observability

SkyPilot Endpoints is already deployed in production for serving frontier models by top AI teams.

They choose it for its multi-cluster capabilities, but keep it for the performance, reliability and observability features that come with the stack:

- **High performance.** KV cache-aware routing, prefill/decode disaggregation, speculative decoding and KV offloading available. LoRA support. Model caching cuts 235B-class cold starts to **under 1 min**.
- **Reliability.** Automatic failure recovery. Autoscaling on KV-cache util, queue depth, RPS or your own metric. Scale-to-zero support. Rolling updates with deployment versioning. Builds on battle-tested vLLM + llm-d + KServe stack.
- **Observability.** Replicas, traffic, KV-cache util, request latency, TTFT, TPOT, live logs + tracing, OTel/Datadog/Fluentbit/Promtail support, an in-browser playground - all in one dashboard, across every cluster you own.
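To illustrate metric-driven autoscaling with separate up and down delays and scale-to-zero, here is a minimal decision loop. It is a sketch, not the autoscaler SkyPilot ships: the `Autoscaler` class, its field names, and the default delays are assumptions, and the metric is taken to be a fleet-wide total such as queue depth.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Autoscaler:
    # All names and defaults here are hypothetical.
    target_per_replica: float        # e.g. target queue depth per replica
    min_replicas: int = 0            # 0 allows scale-to-zero
    max_replicas: int = 8
    upscale_delay_s: float = 60.0
    downscale_delay_s: float = 300.0
    replicas: int = 1
    _pending: Optional[int] = None   # replica count waiting out its delay
    _pending_since: float = 0.0

    def desired(self, metric_total):
        """Replica count that would bring the per-replica metric to target."""
        want = math.ceil(metric_total / self.target_per_replica)
        return max(self.min_replicas, min(self.max_replicas, want))

    def step(self, now, metric_total):
        """Feed one metric sample at time `now`; return the replica count to run."""
        want = self.desired(metric_total)
        if want == self.replicas:
            self._pending = None  # load is back on target: cancel pending change
            return self.replicas
        scaling_up = want > self.replicas
        # Start (or restart) the timer when a change begins or flips direction.
        if self._pending is None or (self._pending > self.replicas) != scaling_up:
            self._pending_since = now
        self._pending = want
        delay = self.upscale_delay_s if scaling_up else self.downscale_delay_s
        if now - self._pending_since >= delay:
            self.replicas = want
            self._pending = None
        return self.replicas


scaler = Autoscaler(target_per_replica=10)
history = [scaler.step(t, load) for t, load in
           [(0, 35), (30, 35), (60, 35), (70, 0), (370, 0)]]
# The scale-up waits 60s of sustained load; the scale-down to zero waits 300s.
```

A shorter up delay than down delay is a common choice: it reacts quickly to bursts while avoiding replica churn when traffic dips briefly.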

## Early access

Want to try SkyPilot Endpoints? [Request access](https://docs.google.com/forms/d/e/1FAIpQLSe7Vr2K7g9uOHG4cMaQGU_dFobUU4vsm5b-iPWsu91U6e5WHg/viewform?usp=sharing&ouid=116243591910305721661) and we’ll be in touch.
