{"slug": "i-stopped-paying-for-idle-gpus-scale-to-zero-ai-inference-on-oke-with-keda", "title": "I Stopped Paying for Idle GPUs - Scale-to-Zero AI Inference on OKE with KEDA", "summary": "An engineer on Oracle Cloud Infrastructure (OCI) built a scale-to-zero AI inference system on Oracle Kubernetes Engine (OKE) using KEDA to eliminate costs from idle GPUs. The system scales GPU pods down to zero when there is no traffic and spins them up on demand, reducing monthly GPU costs from $3,282 to about $1,144 (roughly 65%) for three low-traffic workloads. A lightweight proxy holds requests during cold starts, which can take 2-3 minutes for GPU pod provisioning.", "body_md": "A single A10 GPU on OCI costs $1.52/hr. Running 24/7, that's $1,094/month. For a production inference service with steady traffic, that's fine. But I had a staging environment and a couple of internal tools that got maybe 20 requests per day. I was paying over $2,000/month for GPUs that sat idle 95% of the time.\n\nThe obvious solution: scale to zero when there's no traffic, spin up when a request comes in. KEDA does this on Kubernetes, but getting it to work properly with GPU pods took some figuring out.\n\nWith normal HTTP services, KEDA watches a metric (HTTP request rate, queue depth, and so on), and Kubernetes can spin up a new pod in seconds. The user barely notices.\n\nGPU pods are different: if no GPU node is available, OKE has to provision one, which takes 3-5 minutes, and the model then needs about 90 seconds to load.\n\nSo you can't just scale to zero and expect sub-second response times when traffic returns. The trade-off is cost savings vs. cold start latency. For my use case (internal tools, staging), a 2-3 minute cold start was acceptable.\n\n```\nhelm repo add kedacore https://kedacore.github.io/charts\nhelm install keda kedacore/keda \\\n  --namespace keda-system \\\n  --create-namespace\n```\n\nI'm using the nginx ingress controller's Prometheus metrics to track request rate. 
If you're using OCI's native load balancer, you'd use OCI Monitoring metrics instead.\n\n```\n# prometheus-scaledobject.yaml\napiVersion: keda.sh/v1alpha1\nkind: ScaledObject\nmetadata:\n  name: vllm-scaler\n  namespace: inference\nspec:\n  scaleTargetRef:\n    name: vllm-inference\n  minReplicaCount: 0          # scale to zero\n  maxReplicaCount: 3\n  cooldownPeriod: 300          # wait 5 min of no traffic before scaling down\n  pollingInterval: 15\n\n  triggers:\n    - type: prometheus\n      metadata:\n        serverAddress: http://prometheus.monitoring:9090\n        metricName: http_requests_total\n        query: |\n          sum(rate(nginx_ingress_controller_requests{\n            namespace=\"inference\",\n            service=\"vllm-inference\"\n          }[2m]))\n        threshold: \"1\"         # target ~1 req/sec per replica (2-min rate)\n        activationThreshold: \"0.1\"  # activate from zero once the rate exceeds 0.1 req/sec\n```\n\nThe key settings:\n\n- `minReplicaCount: 0`: this is what enables scale-to-zero\n- `cooldownPeriod: 300`: 5 minutes of no traffic before scaling down (prevents flapping)\n- `activationThreshold: \"0.1\"`: even a trickle of traffic triggers scale-up from zero\n\nWhen the pod scales from zero, there's a gap. The request that triggered the scale-up needs to wait for the pod to be ready. 
I handle this with a simple queue pattern:\n\n```\n# queue-proxy.yaml: lightweight proxy that holds requests during cold start\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: inference-proxy\n  namespace: inference\nspec:\n  replicas: 1    # always running, tiny resource footprint\n  selector:\n    matchLabels:\n      app: inference-proxy    # label name assumed; must match the template labels\n  template:\n    metadata:\n      labels:\n        app: inference-proxy\n    spec:\n      containers:\n        - name: proxy\n          image: iad.ocir.io/mytenancy/inference-proxy:v1\n          ports:\n            - containerPort: 8080\n          env:\n            - name: BACKEND_URL\n              value: \"http://vllm-inference:8000\"\n            - name: TIMEOUT_SECONDS\n              value: \"180\"    # wait up to 3 min for backend\n          resources:\n            requests:\n              cpu: 50m\n              memory: 64Mi\n```\n\nThe proxy is a tiny Go service (always running, costs almost nothing) that retries the backend with exponential backoff until it responds or the timeout expires:\n\n```\n// Excerpt. Imports: bytes, io, net/http, os, strconv, time.\n// cloneRequest and copyResponse are helpers not shown here.\nfunc proxyHandler(w http.ResponseWriter, r *http.Request) {\n    backendURL := os.Getenv(\"BACKEND_URL\")\n    timeout, err := strconv.Atoi(os.Getenv(\"TIMEOUT_SECONDS\"))\n    if err != nil || timeout <= 0 {\n        timeout = 180 // fall back to the 3-minute value from the Deployment\n    }\n\n    // Buffer the body once so every retry can resend it.\n    body, err := io.ReadAll(r.Body)\n    if err != nil {\n        http.Error(w, \"could not read request body\", http.StatusBadRequest)\n        return\n    }\n\n    deadline := time.Now().Add(time.Duration(timeout) * time.Second)\n    backoff := 2 * time.Second\n\n    for time.Now().Before(deadline) {\n        r.Body = io.NopCloser(bytes.NewReader(body))\n        resp, err := http.DefaultClient.Do(cloneRequest(r, backendURL))\n        if err == nil {\n            defer resp.Body.Close()\n            copyResponse(w, resp)\n            return\n        }\n        time.Sleep(backoff)\n        backoff = min(backoff*2, 15*time.Second) // built-in min needs Go 1.21+\n    }\n\n    http.Error(w, \"inference backend unavailable, try again shortly\", http.StatusServiceUnavailable)\n}\n```\n\nThe slowest part of cold start isn't model loading; it's waiting for OKE to provision a GPU node when none exists. This takes 3-5 minutes.\n\nMy workaround: keep one GPU node always available, but let the inference pods on it scale to zero. The node costs money even when idle, but it's a single node vs. multiple. 
And when traffic comes in, the pod starts in ~90 seconds (model loading) instead of 5+ minutes (node provisioning + model loading).\n\n```\n# GPU node pool with min 1 node (always warm)\noci ce node-pool update \\\n  --node-pool-id \"$GPU_NODE_POOL_ID\" \\\n  --node-config-details '{\n    \"size\": 1,\n    \"placementConfigs\": [...]\n  }'\n```\n\nFor staging environments where the 5-minute cold start is acceptable, I set the node pool to autoscale from 0 to 2 nodes and let OKE handle it.\n\nMy three GPU workloads (staging vLLM, internal summarizer, internal code review tool) were running 24/7 on three A10 instances:\n\n| Before | After |\n|---|---|\n| 3x A10 always-on | 1x A10 warm node + scale-to-zero pods |\n| $3,282/month | ~$1,094/month (warm node) + ~$50 (burst usage) |\n| Total: $3,282/month | Total: ~$1,144/month |\n\nThat's a 65% saving. The internal tools scale up when someone uses them (a few times a day) and scale back down after 5 minutes of idle. The warm node means cold starts are 90 seconds, which is fine for internal users.\n\nThis works for internal tools, batch endpoints, staging environments, and anything where \"please wait a moment\" is an okay response.\n\n*Pavan Madduri, Oracle ACE Associate, CNCF Golden Kubestronaut. I'm also building keda-gpu-scaler for GPU-aware autoscaling. 
GitHub | LinkedIn | Website | Google Scholar | ResearchGate*", "url": "https://wpnews.pro/news/i-stopped-paying-for-idle-gpus-scale-to-zero-ai-inference-on-oke-with-keda", "canonical_source": "https://dev.to/pavan_madduri/i-stopped-paying-for-idle-gpus-scale-to-zero-ai-inference-on-oke-with-keda-3oen", "published_at": "2026-06-17 14:37:11+00:00", "updated_at": "2026-06-17 14:51:26.030041+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "developer-tools", "ai-infrastructure", "mlops"], "entities": ["Oracle Cloud Infrastructure", "KEDA", "Oracle Kubernetes Engine", "NVIDIA A10", "Prometheus", "nginx", "Go"], "alternates": {"html": "https://wpnews.pro/news/i-stopped-paying-for-idle-gpus-scale-to-zero-ai-inference-on-oke-with-keda", "markdown": "https://wpnews.pro/news/i-stopped-paying-for-idle-gpus-scale-to-zero-ai-inference-on-oke-with-keda.md", "text": "https://wpnews.pro/news/i-stopped-paying-for-idle-gpus-scale-to-zero-ai-inference-on-oke-with-keda.txt", "jsonld": "https://wpnews.pro/news/i-stopped-paying-for-idle-gpus-scale-to-zero-ai-inference-on-oke-with-keda.jsonld"}}