I Stopped Paying for Idle GPUs - Scale-to-Zero AI Inference on OKE with KEDA


Source: dev.to · Published: 2026-06-17T14:37Z · Topic: artificial-intelligence


An engineer on Oracle Cloud Infrastructure (OCI) built a scale-to-zero AI inference system on Oracle Kubernetes Engine (OKE) using KEDA to eliminate costs from idle GPUs. The system scales GPU pods down to zero when there is no traffic and spins them up on demand, reducing monthly GPU costs from over $2,000 to near zero for low-traffic environments. A lightweight proxy queues requests during cold starts, which can take 2-3 minutes for GPU pod provisioning.


A single A10 GPU on OCI costs $1.52/hr. Running 24/7, that's $1,094/month. For a production inference service with steady traffic, that's fine. But I had a staging environment and a couple of internal tools that got maybe 20 requests per day. I was paying over $2,000/month for GPUs that sat idle 95% of the time.

The obvious solution: scale to zero when there's no traffic, spin up when a request comes in. KEDA does this on Kubernetes, but getting it to work properly with GPU pods took some figuring out.

With normal HTTP services, KEDA watches a metric (HTTP requests, queue depth, whatever), and Kubernetes can spin up a new pod in seconds. The user barely notices.

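GPU pods are different: if no GPU node is free, one has to be provisioned first (3-5 minutes on OKE), and even on a node that is already running, the pod needs around 90 seconds to load the model before it can serve requests.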

So you can't just scale to zero and expect sub-second response times when traffic returns. The trade-off is cost savings vs. cold start latency. For my use case (internal tools, staging), a 2-3 minute cold start was acceptable.

helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda \
  --namespace keda-system \
  --create-namespace

I'm using the nginx ingress controller's Prometheus metrics to track request rate. If you're using OCI's native load balancer, you'd use OCI Monitoring metrics instead.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
  namespace: inference
spec:
  scaleTargetRef:
    name: vllm-inference
  minReplicaCount: 0          # scale to zero
  maxReplicaCount: 3
  cooldownPeriod: 300          # wait 5 min of no traffic before scaling down
  pollingInterval: 15

  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        metricName: http_requests_total  # deprecated since KEDA 2.10; the query below is what defines the metric
        query: |
          sum(rate(nginx_ingress_controller_requests{
            namespace="inference",
            service="vllm-inference"
          }[2m]))
        threshold: "1"         # target of about 1 req/sec per replica (2 min average)
        activationThreshold: "0.1"  # scale up from zero once the rate exceeds 0.1 req/sec
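To make the interaction between the two thresholds concrete, here is a rough Go model of the decisions made from the query result. It is a simplified sketch rather than KEDA's actual code: `isActive` and `desiredReplicas` are hypothetical helper names, and it assumes that the workload is activated when the metric is strictly above `activationThreshold` and that, once active, the autoscaler aims for about `threshold` units of metric per replica.

```go
package main

import (
	"fmt"
	"math"
)

// isActive models the 0 -> 1 decision: the workload is woken up only when
// the metric is strictly above activationThreshold (simplifying assumption).
func isActive(metric, activationThreshold float64) bool {
	return metric > activationThreshold
}

// desiredReplicas models the 1 -> N decision: aim for roughly `threshold`
// units of metric per replica, never fewer than 1 and never above maxReplicas.
func desiredReplicas(metric, threshold float64, maxReplicas int) int {
	n := int(math.Ceil(metric / threshold))
	if n < 1 {
		n = 1
	}
	if n > maxReplicas {
		n = maxReplicas
	}
	return n
}

func main() {
	// One request inside the 2-minute rate window is roughly 1/120 req/sec.
	single := 1.0 / 120.0
	fmt.Printf("single request: %.4f req/sec, active=%v\n", single, isActive(single, 0.1))

	// 24 requests inside the same window are 0.2 req/sec.
	burst := 24.0 / 120.0
	fmt.Printf("burst: %.4f req/sec, active=%v\n", burst, isActive(burst, 0.1))

	fmt.Println("replicas at 2.5 req/sec:", desiredReplicas(2.5, 1, 3))
}
```

Under this model, one isolated request averages out to roughly 0.008 req/sec over the 2m window, which is below an activation threshold of 0.1. If a single request has to be enough to wake the pod, a lower `activationThreshold` is worth testing.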

The key settings:

minReplicaCount: 0 - this is what enables scale-to-zero.

cooldownPeriod: 300 - 5 minutes of no traffic before scaling down to zero (prevents flapping).

activationThreshold: "0.1" - the request rate above which KEDA scales up from zero. Note that a single request averaged over the 2m rate window is only about 0.008 req/sec, so lower this value if one request on its own must wake the pod.

When the pod scales from zero, there's a gap. The request that triggered the scale-up needs to wait for the pod to be ready. I handle this with a simple queue pattern:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-proxy
  namespace: inference
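spec:
  replicas: 1    # always running, tiny resource footprint
  selector:      # required by apps/v1; the app label is an illustrative choice
    matchLabels:
      app: inference-proxy
  template:
    metadata:
      labels:
        app: inference-proxy
    spec: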
      containers:
        - name: proxy
          image: iad.ocir.io/mytenancy/inference-proxy:v1
          ports:
            - containerPort: 8080
          env:
            - name: BACKEND_URL
              value: "http://vllm-inference:8000"
            - name: TIMEOUT_SECONDS
              value: "180"    # wait up to 3 min for backend
          resources:
            requests:
              cpu: 50m
              memory: 64Mi

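The proxy is a tiny Go service (always running, costs almost nothing) that forwards each request to the backend, retries with exponential backoff (2 seconds, doubling up to 15) while the GPU pod is starting, and returns a 503 if the backend is still unavailable when the timeout expires: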

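// proxyHandler forwards a request to the backend, retrying with exponential
// backoff while the GPU pod is still starting. cloneRequest and copyResponse
// are small helpers not shown here; cloneRequest has to buffer the request
// body so that it can be resent on every attempt.
func proxyHandler(w http.ResponseWriter, r *http.Request) {
    backendURL := os.Getenv("BACKEND_URL")
    timeout, err := strconv.Atoi(os.Getenv("TIMEOUT_SECONDS"))
    if err != nil {
        timeout = 180 // unset or invalid: fall back to the 3-minute default
    }

    deadline := time.Now().Add(time.Duration(timeout) * time.Second)
    backoff := 2 * time.Second

    for time.Now().Before(deadline) {
        resp, err := http.DefaultClient.Do(cloneRequest(r, backendURL))
        if err == nil {
            defer resp.Body.Close()
            copyResponse(w, resp)
            return
        }
        // Wait before the next attempt, but stop if the client disconnects.
        select {
        case <-r.Context().Done():
            return
        case <-time.After(backoff):
        }
        backoff = min(backoff*2, 15*time.Second) // built-in min needs Go 1.21+
    }

    http.Error(w, "inference backend unavailable, try again shortly", http.StatusServiceUnavailable)
}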
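For readers who want something they can run locally, the following is a self-contained sketch of the same retry idea, exercised against a fake backend. It is not the author's proxy: `retryProxy` and `demo` are hypothetical names, the `/v1/completions` path is only an illustrative request path, and treating a 503 from the backend as "not ready yet" is an assumption added here (the handler above retries only on connection errors).

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// retryProxy forwards each request to backendURL and keeps retrying, with
// exponential backoff, until the backend answers or the timeout expires.
func retryProxy(backendURL string, timeout, firstBackoff, maxBackoff time.Duration) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Buffer the body once so that every retry can resend it.
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "could not read request body", http.StatusBadRequest)
			return
		}
		deadline := time.Now().Add(timeout)
		backoff := firstBackoff
		for time.Now().Before(deadline) {
			req, err := http.NewRequestWithContext(r.Context(), r.Method,
				backendURL+r.URL.RequestURI(), bytes.NewReader(body))
			if err != nil {
				http.Error(w, "invalid backend URL", http.StatusInternalServerError)
				return
			}
			req.Header = r.Header.Clone()
			resp, err := http.DefaultClient.Do(req)
			if err == nil && resp.StatusCode != http.StatusServiceUnavailable {
				defer resp.Body.Close()
				for name, values := range resp.Header {
					for _, v := range values {
						w.Header().Add(name, v)
					}
				}
				w.WriteHeader(resp.StatusCode)
				io.Copy(w, resp.Body)
				return
			}
			if err == nil {
				resp.Body.Close() // reachable but not ready: treat like a failure
			}
			select {
			case <-r.Context().Done():
				return // the caller went away, stop retrying
			case <-time.After(backoff):
			}
			backoff *= 2
			if backoff > maxBackoff {
				backoff = maxBackoff
			}
		}
		http.Error(w, "inference backend unavailable, try again shortly",
			http.StatusServiceUnavailable)
	})
}

// demo runs the proxy against a fake backend that is "cold" (503) for its
// first two requests and reports what the caller finally sees.
func demo() (status int, body string, attempts int) {
	var calls atomic.Int32
	backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if calls.Add(1) < 3 {
			http.Error(w, "warming up", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprint(w, "ok")
	}))
	defer backend.Close()

	proxy := retryProxy(backend.URL, 2*time.Second, 10*time.Millisecond, 50*time.Millisecond)
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/v1/completions", bytes.NewReader([]byte(`{}`)))
	proxy.ServeHTTP(rec, req)
	return rec.Code, rec.Body.String(), int(calls.Load())
}

func main() {
	status, body, attempts := demo()
	fmt.Println(status, body, attempts)
}
```

The short backoff values in `demo` only keep the example fast. With the settings from the Deployment above, the equivalent call would use a 180-second timeout, a 2-second first backoff and a 15-second cap.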

The slowest part of cold start isn't model loading. It's waiting for OKE to provision a GPU node when none exist. This takes 3-5 minutes.

My workaround: keep one GPU node always available, but let the inference pods on it scale to zero. The node costs money even when idle, but it's a single node vs. multiple. And when traffic comes in, the pod starts in ~90 seconds (model loading) instead of 5+ minutes (node provisioning + model loading).

oci ce node-pool update \
  --node-pool-id $GPU_NODE_POOL_ID \
  --node-config-details '{
    "size": 1,
    "placementConfigs": [...]
  }'

For staging environments where the 5-minute cold start is acceptable, I set the node pool to autoscale from 0 to 2 nodes and let OKE handle it.
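One thing worth checking for each of these two setups is whether the proxy's wait budget covers the cold start. The sketch below uses only figures quoted in this article (about 90 seconds to start a pod on a warm node, 3-5 minutes to provision a node, and the 180-second TIMEOUT_SECONDS from the proxy Deployment). The function name `coveredByTimeout` is hypothetical, and the model ignores metric polling delay.

```go
package main

import "fmt"

// coveredByTimeout reports whether a request that arrives at a cold backend
// would still be answered before the proxy gives up. All values in seconds.
func coveredByTimeout(coldStartSec, proxyTimeoutSec int) bool {
	return coldStartSec <= proxyTimeoutSec
}

func main() {
	const (
		proxyTimeout  = 180 // TIMEOUT_SECONDS in the proxy Deployment
		modelLoad     = 90  // pod start on a warm node
		nodeProvision = 180 // lower bound of the 3-5 minute range
	)
	fmt.Println("warm node covered:", coveredByTimeout(modelLoad, proxyTimeout))
	fmt.Println("cold node covered:", coveredByTimeout(nodeProvision+modelLoad, proxyTimeout))
}
```

With these numbers the warm-node path fits inside the timeout, while the autoscaled node pool does not. In that setup, either a larger TIMEOUT_SECONDS or callers that retry after the 503 would be needed.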

My three GPU workloads (staging vLLM, internal summarizer, internal code review tool) were running 24/7 on three A10 instances:

	Before	After
Setup	3x A10 always-on	1x A10 warm node + scale-to-zero pods
Cost	$3,282/month	~$1,094/month (warm node) + ~$50 (burst usage)
Total	$3,282/month	~$1,144/month

65% savings. The internal tools scale up when someone uses them (a few times a day) and scale back down after 5 minutes of idle. The warm node means cold starts are 90 seconds, which is fine for internal users.
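The arithmetic behind these figures can be reproduced in a few lines of Go. This is a small sketch under one assumption: a 30-day month (720 hours), which is what the $1,094 figure at $1.52/hr implies. The helper names `monthlyCost` and `savings` are hypothetical.

```go
package main

import "fmt"

// hoursPerMonth assumes a 30-day month, matching the article's $1,094 figure.
const hoursPerMonth = 24 * 30

// monthlyCost returns the always-on monthly cost of n instances at hourlyRate.
func monthlyCost(hourlyRate float64, n int) float64 {
	return hourlyRate * hoursPerMonth * float64(n)
}

// savings returns the fractional cost reduction going from before to after.
func savings(before, after float64) float64 {
	return (before - after) / before
}

func main() {
	before := monthlyCost(1.52, 3)     // three always-on A10s
	after := monthlyCost(1.52, 1) + 50 // one warm node plus ~$50 of burst usage
	fmt.Printf("before=$%.0f after=$%.0f savings=%.0f%%\n",
		before, after, savings(before, after)*100)
}
```

The "before" total comes out about a dollar higher than the table because the table rounds the per-node cost to $1,094 before multiplying. The savings still round to 65%.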

This works for internal tools, batch endpoints, staging environments, and anything where "please wait a moment" is an okay response.

Pavan Madduri - Oracle ACE Associate, CNCF Golden Kubestronaut. I'm also building keda-gpu-scaler for GPU-aware autoscaling.


Read original on dev.to → dev.to/pavan_madduri/i-stopped-paying-for-idle-g…
