Deploying vLLM on OKE with NVIDIA A10 GPUs: The 20-Minute Setup Nobody Talks About

A developer deployed a Llama 3 inference endpoint on Oracle Cloud Infrastructure's OKE cluster using vLLM and NVIDIA A10 GPUs in about 20 minutes. The setup cost $1.52/hr on-demand or $0.46/hr preemptible, significantly cheaper than AWS or Azure equivalents. The process involved creating an OKE cluster, a GPU node pool with the A10 shape, installing the NVIDIA device plugin, and deploying vLLM via a Kubernetes deployment.

Last month I needed to stand up a Llama 3 inference endpoint for an internal tool. The requirements were simple: an OpenAI-compatible API, auto-scaling, and a bill no bigger than the team's coffee budget. AWS wanted $3.06/hr for a g5.xlarge. Azure quoted something similar.

Then I looked at OCI's GPU shapes. VM.GPU.A10.1, a single NVIDIA A10 with 24 GB of VRAM, runs $1.52/hr on-demand. Half the price. And on preemptible? $0.46/hr. That's a latte.

Here's how I got vLLM running on OKE in about 20 minutes.

If you already have an OKE cluster, skip ahead. If not, this is the fastest path:

    # Create a VCN (or use an existing one)
    oci network vcn create \
      --compartment-id $COMPARTMENT_ID \
      --cidr-blocks '["10.0.0.0/16"]' \
      --display-name "ai-inference-vcn"

    # Create the OKE cluster
    oci ce cluster create \
      --compartment-id $COMPARTMENT_ID \
      --name "inference-cluster" \
      --vcn-id $VCN_ID \
      --kubernetes-version "v1.30.1" \
      --service-lb-subnet-ids "[\"$PUBLIC_SUBNET_ID\"]"

The key part is the GPU node pool. OCI has several GPU shapes, but for inference the A10 is the sweet spot:

| Shape | GPU | VRAM | $/hr on-demand | $/hr preemptible |
|---|---|---|---|---|
| VM.GPU.A10.1 | 1x A10 | 24 GB | ~$1.52 | ~$0.46 |
| VM.GPU.A10.2 | 2x A10 | 48 GB | ~$3.04 | ~$0.91 |
| BM.GPU.A100-v2.8 | 8x A100 | 640 GB | ~$26.52 | N/A |

For a 7B parameter model, a single A10 is plenty. For 70B, FP16 weights alone are about 140 GB, so you'd want the A100 bare metal, or the 2x A10 shape with a 4-bit quantized model.

    # Create the GPU node pool
    oci ce node-pool create \
      --cluster-id $CLUSTER_ID \
      --compartment-id $COMPARTMENT_ID \
      --name "gpu-a10-pool" \
      --node-shape "VM.GPU.A10.1" \
      --size 1 \
      --node-config-details \
        '{"size": 1, "placementConfigs": [{"availabilityDomain": "'"$AD"'", "subnetId": "'"$WORKER_SUBNET_ID"'"}]}' \
      --node-source-details \
        '{"sourceType": "IMAGE", "imageId": "'"$GPU_IMAGE_ID"'"}'

Make sure you use the OKE GPU image: it comes with NVIDIA drivers and nvidia-container-toolkit pre-installed. You don't want to deal with driver installation yourself. Trust me.
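As a rough check on the shape sizing above, here is a minimal sketch that estimates weight memory as parameter count times bytes per parameter. It covers weights only; the KV cache and activations need room on top. The helper names (`weight_gb`, `fits`) and the 0.90 usable fraction are illustrative assumptions, not part of any vLLM or OCI API.

```python
# Back-of-the-envelope VRAM check (weights only; KV cache needs extra room).
# All names here are hypothetical helpers for illustration.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, dtype: str = "fp16") -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[dtype]


def fits(params_billion: float, vram_gb: float,
         dtype: str = "fp16", usable_fraction: float = 0.90) -> bool:
    """True if the weights alone fit inside the usable share of VRAM."""
    return weight_gb(params_billion, dtype) <= vram_gb * usable_fraction


if __name__ == "__main__":
    cases = [
        ("8B fp16 on VM.GPU.A10.1", 8, 24, "fp16"),
        ("70B fp16 on VM.GPU.A10.2", 70, 48, "fp16"),
        ("70B int4 on VM.GPU.A10.2", 70, 48, "int4"),
        ("70B fp16 on BM.GPU.A100-v2.8", 70, 640, "fp16"),
    ]
    for label, params, vram, dtype in cases:
        size = weight_gb(params, dtype)
        verdict = "fits" if fits(params, vram, dtype) else "does not fit"
        print(f"{label}: ~{size:.0f} GB of weights, {verdict}")
```

An 8B model in FP16 works out to about 16 GB of weights, which leaves a few GB of a 24 GB A10 for the KV cache. That is why a modest `--max-model-len` helps on this shape.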
OKE's GPU images already include the drivers, but Kubernetes needs the device plugin to expose GPUs as a schedulable resource:

    # nvidia-device-plugin.yaml
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: nvidia-device-plugin-daemonset
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          name: nvidia-device-plugin-ds
      template:
        metadata:
          labels:
            name: nvidia-device-plugin-ds
        spec:
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
          containers:
            - image: nvcr.io/nvidia/k8s-device-plugin:v0.16.1
              name: nvidia-device-plugin-ctr
              env:
                - name: FAIL_ON_INIT_ERROR
                  value: "false"
              securityContext:
                allowPrivilegeEscalation: false
                capabilities:
                  drop: ["ALL"]
              volumeMounts:
                - name: device-plugin
                  mountPath: /var/lib/kubelet/device-plugins
          volumes:
            - name: device-plugin
              hostPath:
                path: /var/lib/kubelet/device-plugins

Apply it:

    kubectl apply -f nvidia-device-plugin.yaml

Verify GPUs show up:

    kubectl get nodes -o json | jq '.items[].status.capacity["nvidia.com/gpu"]'
    # "1"

If that says "1", you're golden.

vLLM's Docker image is the easiest way to run it. No pip installs, no dependency conflicts, no wondering why PyTorch can't find CUDA.
    # vllm-deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-llama3
      labels:
        app: vllm-inference
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: vllm-inference
      template:
        metadata:
          labels:
            app: vllm-inference
        spec:
          containers:
            - name: vllm
              image: vllm/vllm-openai:v0.6.4
              args:
                - "--model"
                - "meta-llama/Llama-3.1-8B-Instruct"
                - "--max-model-len"
                - "4096"
                - "--gpu-memory-utilization"
                - "0.90"
                - "--dtype"
                - "auto"
              ports:
                - containerPort: 8000
                  name: http
              resources:
                limits:
                  nvidia.com/gpu: 1
                requests:
                  nvidia.com/gpu: 1
                  memory: "24Gi"
                  cpu: "4"
              env:
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-token
                      key: token
              readinessProbe:
                httpGet:
                  path: /health
                  port: 8000
                initialDelaySeconds: 120
                periodSeconds: 10
              livenessProbe:
                httpGet:
                  path: /health
                  port: 8000
                initialDelaySeconds: 180
                periodSeconds: 30
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: vllm-service
    spec:
      selector:
        app: vllm-inference
      ports:
        - port: 8000
          targetPort: 8000
      type: ClusterIP

Create the HuggingFace token secret first:

    kubectl create secret generic hf-token \
      --from-literal=token=$HF_TOKEN

Then deploy:

    kubectl apply -f vllm-deployment.yaml

The model download takes a few minutes depending on the model size. Watch the logs:

    kubectl logs -f deployment/vllm-llama3

You'll see it load the model weights, compile the CUDA kernels, and eventually:

    INFO:     Uvicorn running on http://0.0.0.0:8000

Port-forward and hit it with curl:

    kubectl port-forward svc/vllm-service 8000:8000

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Explain Kubernetes in one sentence"}],
        "max_tokens": 100
      }'

The API is OpenAI-compatible. Your existing code that talks to gpt-4 just needs a new base URL and model name.
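To make the base URL point concrete, here is a minimal sketch of the same chat completion call from Python, using only the standard library. It assumes the port-forward above is still running on `localhost:8000`; the helper names `build_chat_request` and `chat` are made up for illustration and are not part of vLLM or any SDK.

```python
import json
import urllib.request

# Assumption: `kubectl port-forward svc/vllm-service 8000:8000` is running.
BASE_URL = "http://localhost:8000/v1"
MODEL = "meta-llama/Llama-3.1-8B-Instruct"


def build_chat_request(prompt: str, base_url: str = BASE_URL,
                       model: str = MODEL,
                       max_tokens: int = 100) -> urllib.request.Request:
    """Build a POST request in the OpenAI chat completions format."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the prompt and return the first choice's message text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the port-forward active, `print(chat("Explain Kubernetes in one sentence"))` should print the model's reply. If you already use an OpenAI client library, pointing its base URL at the same `/v1` path and passing the model name served by vLLM is the equivalent change.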
A few things that bit me:

- Model download speed: OKE nodes have good bandwidth to the internet, but the first pull of a 16GB model takes time. I ended up baking the model into a custom Docker image so pod restarts don't re-download. That's a separate blog post.
- Memory headroom: `--gpu-memory-utilization 0.90` lets vLLM claim 90% of the GPU's memory for weights and KV cache and leaves the remaining 10% as headroom. Don't set this to 0.99 thinking you're being efficient. vLLM will OOM during burst traffic.
- Readiness probe timing: `initialDelaySeconds: 120` seems high, but model loading legitimately takes 60-90 seconds on an A10. If the liveness probe starts failing before the model is up, Kubernetes will restart the pod in a loop, which is why its delay is set even higher at 180 seconds.
- Preemptible instances: at $0.46/hr they're incredible for dev/staging. For production, use on-demand and set up a second preemptible pool as overflow. I'll cover that in a future post about cost optimization.

Running Llama 3.1 8B on different clouds (single GPU, on-demand):

| Cloud | Shape | $/hr | $/month (24/7) |
|---|---|---|---|
| OCI | VM.GPU.A10.1 | $1.52 | ~$1,094 |
| AWS | g5.xlarge | $3.06 | ~$2,203 |
| Azure | NC24ads A100 v4 | $3.67 | ~$2,642 |
| GCP | g2-standard-8 | $2.86 | ~$2,059 |

OCI comes in at roughly half the price of the others, though the GPUs are not identical across rows (the Azure shape, for example, is an A100). And the preemptible pricing makes it even more dramatic for non-production workloads.

This is the simplest possible setup: one model, one GPU, one replica. I'll build on it in the next posts.

The full YAML files are on my GitHub. If you're running inference on OCI, I'd love to hear what shapes you're using.

Pavan Madduri, CNCF Golden Kubestronaut, building GPU/AI infrastructure tools.

GitHub | LinkedIn | Website | Google Scholar | ResearchGate