{"slug": "deploying-vllm-on-oke-with-nvidia-a10-gpus-the-20-minute-setup-nobody-talks", "title": "Deploying vLLM on OKE with NVIDIA A10 GPUs: The 20-Minute Setup Nobody Talks About", "summary": "A developer deployed a Llama 3 inference endpoint on Oracle Cloud Infrastructure's OKE cluster using vLLM and NVIDIA A10 GPUs in about 20 minutes. The setup cost $1.52/hr on-demand or $0.46/hr preemptible, significantly cheaper than AWS or Azure equivalents. The process involved creating an OKE cluster, a GPU node pool with the A10 shape, installing the NVIDIA device plugin, and deploying vLLM via a Kubernetes deployment.", "body_md": "Last month I needed to stand up a Llama 3 inference endpoint for an internal tool. The requirements were simple: OpenAI-compatible API, auto-scaling, and it couldn't cost more than the team's coffee budget. AWS wanted $3.06/hr for a `g5.xlarge`. Azure quoted something similar.\n\nThen I looked at OCI's GPU shapes. `VM.GPU.A10.1` (a single NVIDIA A10 with 24GB VRAM) at $1.52/hr on-demand. Half the price. And on preemptible? $0.46/hr. That's a latte.\n\nHere's how I got vLLM running on OKE in about 20 minutes.\n\nIf you already have an OKE cluster, skip ahead. If not, this is the fastest path:\n\n```\n# Create a VCN (or use an existing one)\noci network vcn create \\\n  --compartment-id $COMPARTMENT_ID \\\n  --cidr-blocks '[\"10.0.0.0/16\"]' \\\n  --display-name \"ai-inference-vcn\"\n\n# Create the OKE cluster\n# Assumes $VCN_ID and $PUBLIC_SUBNET_ID are already set.\n# The subnet list must be a JSON array of quoted OCIDs.\noci ce cluster create \\\n  --compartment-id $COMPARTMENT_ID \\\n  --name \"inference-cluster\" \\\n  --vcn-id $VCN_ID \\\n  --kubernetes-version \"v1.30.1\" \\\n  --service-lb-subnet-ids '[\"'\"$PUBLIC_SUBNET_ID\"'\"]'\n```\n\nThe key part is the GPU node pool. 
OCI has several GPU shapes, but for inference the A10 is the sweet spot:\n\n| Shape | GPU | VRAM | $/hr (on-demand) | $/hr (preemptible) |\n|---|---|---|---|---|\n| VM.GPU.A10.1 | 1x A10 | 24 GB | ~$1.52 | ~$0.46 |\n| VM.GPU.A10.2 | 2x A10 | 48 GB | ~$3.04 | ~$0.91 |\n| BM.GPU.A100-v2.8 | 8x A100 | 640 GB | ~$26.52 | N/A |\n\nA 7B parameter model needs about 14 GB for FP16 weights, so a single A10 is plenty. For 70B, the FP16 weights alone are about 140 GB, so 2x A10 (48 GB) at best fits a heavily quantized build (around 4-bit) split across both GPUs; otherwise you'd want the A100 bare metal.\n\n```\n# Create the GPU node pool\noci ce node-pool create \\\n  --cluster-id $CLUSTER_ID \\\n  --compartment-id $COMPARTMENT_ID \\\n  --name \"gpu-a10-pool\" \\\n  --node-shape \"VM.GPU.A10.1\" \\\n  --size 1 \\\n  --node-config-details \\\n    '{\"size\": 1, \"placementConfigs\": [{\"availabilityDomain\": \"'\"$AD\"'\", \"subnetId\": \"'\"$WORKER_SUBNET_ID\"'\"}]}' \\\n  --node-source-details \\\n    '{\"sourceType\": \"IMAGE\", \"imageId\": \"'\"$GPU_IMAGE_ID\"'\"}'\n```\n\nMake sure you use the **OKE GPU image**: it comes with NVIDIA drivers and `nvidia-container-toolkit` pre-installed. You don't want to deal with driver installation yourself. 
Trust me.\n\nOKE's GPU images already include the drivers, but Kubernetes needs the device plugin to expose GPUs as a schedulable resource:\n\n```\n# nvidia-device-plugin.yaml\napiVersion: apps/v1\nkind: DaemonSet\nmetadata:\n  name: nvidia-device-plugin-daemonset\n  namespace: kube-system\nspec:\n  selector:\n    matchLabels:\n      name: nvidia-device-plugin-ds\n  template:\n    metadata:\n      labels:\n        name: nvidia-device-plugin-ds\n    spec:\n      tolerations:\n      - key: nvidia.com/gpu\n        operator: Exists\n        effect: NoSchedule\n      containers:\n      - image: nvcr.io/nvidia/k8s-device-plugin:v0.16.1\n        name: nvidia-device-plugin-ctr\n        env:\n        - name: FAIL_ON_INIT_ERROR\n          value: \"false\"\n        securityContext:\n          allowPrivilegeEscalation: false\n          capabilities:\n            drop: [\"ALL\"]\n        volumeMounts:\n        - name: device-plugin\n          mountPath: /var/lib/kubelet/device-plugins\n      volumes:\n      - name: device-plugin\n        hostPath:\n          path: /var/lib/kubelet/device-plugins\n```\n\nApply it:\n\n```\nkubectl apply -f nvidia-device-plugin.yaml\n```\n\nVerify GPUs show up:\n\n```\nkubectl get nodes -o json | jq '.items[].status.capacity[\"nvidia.com/gpu\"]'\n# \"1\"\n```\n\nIf that says `\"1\"`, you're golden.\n\nvLLM's Docker image is the easiest way to run it. 
No pip installs, no dependency conflicts, no wondering why PyTorch can't find CUDA.\n\n```\n# vllm-deployment.yaml\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: vllm-llama3\n  labels:\n    app: vllm-inference\nspec:\n  replicas: 1\n  selector:\n    matchLabels:\n      app: vllm-inference\n  template:\n    metadata:\n      labels:\n        app: vllm-inference\n    spec:\n      containers:\n      - name: vllm\n        image: vllm/vllm-openai:v0.6.4\n        args:\n        - \"--model\"\n        - \"meta-llama/Llama-3.1-8B-Instruct\"\n        - \"--max-model-len\"\n        - \"4096\"\n        - \"--gpu-memory-utilization\"\n        - \"0.90\"\n        - \"--dtype\"\n        - \"auto\"\n        ports:\n        - containerPort: 8000\n          name: http\n        resources:\n          limits:\n            nvidia.com/gpu: 1\n          requests:\n            nvidia.com/gpu: 1\n            memory: \"24Gi\"\n            cpu: \"4\"\n        env:\n        - name: HUGGING_FACE_HUB_TOKEN\n          valueFrom:\n            secretKeyRef:\n              name: hf-token\n              key: token\n        readinessProbe:\n          httpGet:\n            path: /health\n            port: 8000\n          initialDelaySeconds: 120\n          periodSeconds: 10\n        livenessProbe:\n          httpGet:\n            path: /health\n            port: 8000\n          initialDelaySeconds: 180\n          periodSeconds: 30\n      tolerations:\n      - key: nvidia.com/gpu\n        operator: Exists\n        effect: NoSchedule\n---\napiVersion: v1\nkind: Service\nmetadata:\n  name: vllm-service\nspec:\n  selector:\n    app: vllm-inference\n  ports:\n  - port: 8000\n    targetPort: 8000\n  type: ClusterIP\n```\n\nCreate the HuggingFace token secret first:\n\n```\nkubectl create secret generic hf-token \\\n  --from-literal=token=$HF_TOKEN\n```\n\nThen deploy:\n\n```\nkubectl apply -f vllm-deployment.yaml\n```\n\nThe model download takes a few minutes depending on the model size. 
Watch the logs:\n\n```\nkubectl logs -f deployment/vllm-llama3\n```\n\nYou'll see it load the model weights, capture CUDA graphs, and eventually:\n\n```\nINFO:     Uvicorn running on http://0.0.0.0:8000\n```\n\nPort-forward and hit it with curl:\n\n```\nkubectl port-forward svc/vllm-service 8000:8000\n\n# port-forward stays in the foreground, so run curl from a second terminal\ncurl http://localhost:8000/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"meta-llama/Llama-3.1-8B-Instruct\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"Explain Kubernetes in one sentence\"}],\n    \"max_tokens\": 100\n  }'\n```\n\nThe API is OpenAI-compatible. Your existing code that talks to `gpt-4` just needs a new base URL and model name.\n\nA few things that bit me:\n\n**Model download speed**: OKE nodes have good bandwidth to the internet, but the first pull of a 16GB model takes time. I ended up baking the model into a custom Docker image so pod restarts don't re-download. That's a separate blog post.\n\n**Memory headroom**: `--gpu-memory-utilization 0.90` caps vLLM at 90% of GPU memory for weights, activations, and the KV cache, and leaves the other 10% as headroom for the CUDA context and anything else on the GPU. Don't set this to 0.99 thinking you're being efficient. With no headroom, vLLM can hit CUDA out-of-memory errors under burst traffic.\n\n**Probe timing**: `initialDelaySeconds: 120` seems high, but model loading legitimately takes 60-90 seconds on an A10. A readiness probe that fails early only keeps the pod out of the Service. The liveness probe is the one that restarts the container, which is why it waits 180 seconds here: if it fires before the model has loaded, Kubernetes will restart the pod in a loop.\n\n**Preemptible instances**: At $0.46/hr they're incredible for dev/staging. For production, use on-demand and set up a second preemptible pool as overflow. I'll cover that in a future post about cost optimization.\n\nRunning Llama 3.1 8B on different clouds (single GPU, on-demand):\n\n| Cloud | Shape | $/hr | $/month (24/7) |\n|---|---|---|---|\n| OCI | VM.GPU.A10.1 | $1.52 | ~$1,094 |\n| AWS | g5.xlarge | $3.06 | ~$2,203 |\n| Azure | NC24ads_A100_v4 | $3.67 | ~$2,642 |\n| GCP | g2-standard-8 | $2.86 | ~$2,059 |\n\nOCI is roughly half the price of the AWS A10G instance; the Azure (A100) and GCP (L4) rows use different GPUs, so treat those as rough comparisons. 
And the preemptible pricing makes the gap even more dramatic for non-production workloads.\n\nThis is the simplest possible setup: one model, one GPU, one replica. I'll build on it in the next posts.\n\nThe full YAML files are on my GitHub. If you're running inference on OCI, I'd love to hear what shapes you're using.\n\n*Pavan Madduri, CNCF Golden Kubestronaut, building GPU/AI infrastructure tools.*", "url": "https://wpnews.pro/news/deploying-vllm-on-oke-with-nvidia-a10-gpus-the-20-minute-setup-nobody-talks", "canonical_source": "https://dev.to/pavan_madduri/deploying-vllm-on-oke-with-nvidia-a10-gpus-the-20-minute-setup-nobody-talks-about-3je7", "published_at": "2026-06-16 19:42:21+00:00", "updated_at": "2026-06-16 20:17:22.602778+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "developer-tools"], "entities": ["Oracle Cloud Infrastructure", "OKE", "NVIDIA A10", "vLLM", "Llama 3", "Kubernetes", "AWS", "Azure"], "alternates": {"html": "https://wpnews.pro/news/deploying-vllm-on-oke-with-nvidia-a10-gpus-the-20-minute-setup-nobody-talks", "markdown": "https://wpnews.pro/news/deploying-vllm-on-oke-with-nvidia-a10-gpus-the-20-minute-setup-nobody-talks.md", "text": "https://wpnews.pro/news/deploying-vllm-on-oke-with-nvidia-a10-gpus-the-20-minute-setup-nobody-talks.txt", "jsonld": "https://wpnews.pro/news/deploying-vllm-on-oke-with-nvidia-a10-gpus-the-20-minute-setup-nobody-talks.jsonld"}}