Deploying vLLM on OKE with NVIDIA A10 GPUs: The 20-Minute Setup Nobody Talks About

Last month I needed to stand up a Llama 3 inference endpoint for an internal tool. The requirements were simple: OpenAI-compatible API, auto-scaling, and it couldn't cost more than the team's coffee budget. AWS wanted $3.06/hr for a g5.xlarge. Azure quoted something similar.

Then I looked at OCI's GPU shapes. VM.GPU.A10.1, a single NVIDIA A10 with 24GB VRAM, runs $1.52/hr on-demand. Half the price. And on preemptible? $0.46/hr. That's a latte.

Here's how I got vLLM running on OKE in about 20 minutes.

If you already have an OKE cluster, skip ahead. If not, this is the fastest path:

# Create the VCN, then export the returned OCID as VCN_ID
oci network vcn create \
  --compartment-id "$COMPARTMENT_ID" \
  --cidr-blocks '["10.0.0.0/16"]' \
  --display-name "ai-inference-vcn"

# Create the cluster; the subnet list must be a valid JSON array of quoted OCIDs
oci ce cluster create \
  --compartment-id "$COMPARTMENT_ID" \
  --name "inference-cluster" \
  --vcn-id "$VCN_ID" \
  --kubernetes-version "v1.30.1" \
  --service-lb-subnet-ids '["'"$PUBLIC_SUBNET_ID"'"]'

The key part is the GPU node pool. OCI has several GPU shapes, but for inference the A10 is the sweet spot:

Shape	GPU	VRAM	$/hr (on-demand)	$/hr (preemptible)
VM.GPU.A10.1	1x A10	24 GB	~$1.52	~$0.46
VM.GPU.A10.2	2x A10	48 GB	~$3.04	~$0.91
BM.GPU.A100-v2.8	8x A100	640 GB	~$26.52	N/A

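For a 7B or 8B parameter model, a single A10 is plenty. For 70B, the FP16 weights alone are about 140 GB, so you'd want the A100 bare metal; 2x A10 (48 GB) only works with roughly 4-bit quantization.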
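The sizing rule of thumb here is simple arithmetic: weights take roughly parameter count times bytes per parameter. Here is a minimal sketch of that estimate. The function names and the dtype table are my own for illustration, and the numbers ignore KV cache, activations, and CUDA overhead, so treat them as a lower bound rather than a guarantee.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Ignores KV cache, activations, and CUDA overhead (lower bound only).

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, dtype: str = "fp16") -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return params_billion * BYTES_PER_PARAM[dtype]


def fits(params_billion: float, vram_gb: float, dtype: str = "fp16") -> bool:
    """True if the weights alone fit in the given amount of VRAM."""
    return weight_gb(params_billion, dtype) < vram_gb


if __name__ == "__main__":
    cases = [
        ("8B fp16 on 1x A10", 8, 24, "fp16"),
        ("70B fp16 on 2x A10", 70, 48, "fp16"),
        ("70B int4 on 2x A10", 70, 48, "int4"),
        ("70B fp16 on 8x A100", 70, 640, "fp16"),
    ]
    for label, size, vram, dtype in cases:
        verdict = "fits" if fits(size, vram, dtype) else "does not fit"
        print(f"{label}: {weight_gb(size, dtype):.0f} GB of weights, {verdict}")
```

Even when the weights fit, you still need several GB left over for the KV cache, so a model that only barely fits will not serve much traffic.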

oci ce node-pool create \
  --cluster-id $CLUSTER_ID \
  --compartment-id $COMPARTMENT_ID \
  --name "gpu-a10-pool" \
  --node-shape "VM.GPU.A10.1" \
  --size 1 \
  --node-config-details \
    '{"size": 1, "placementConfigs": [{"availabilityDomain": "'"$AD"'", "subnetId": "'"$WORKER_SUBNET_ID"'"}]}' \
  --node-source-details \
    '{"sourceType": "IMAGE", "imageId": "'"$GPU_IMAGE_ID"'"}'

Make sure you use the OKE GPU image. It comes with NVIDIA drivers and nvidia-container-toolkit pre-installed. You don't want to deal with driver installation yourself. Trust me.

OKE's GPU images already include the drivers, but Kubernetes needs the device plugin to expose GPUs as a schedulable resource:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.16.1
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins

Save this as nvidia-device-plugin.yaml and apply it:

kubectl apply -f nvidia-device-plugin.yaml

Verify GPUs show up:

kubectl get nodes -o json | jq '.items[].status.capacity["nvidia.com/gpu"]'

If that says "1", you're golden.
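On a mixed cluster the jq one-liner prints null for every node without a GPU, which gets noisy. As an alternative, here is a small sketch that reads the same `kubectl get nodes -o json` output and returns a GPU count per node. The helper names `gpu_capacity` and `current_gpu_capacity` are hypothetical, not part of kubectl or any library.

```python
import json
import subprocess


def gpu_capacity(nodes_json: str) -> dict[str, int]:
    """Map node name -> nvidia.com/gpu count from status.capacity (0 if absent)."""
    nodes = json.loads(nodes_json)["items"]
    return {
        node["metadata"]["name"]: int(
            node["status"]["capacity"].get("nvidia.com/gpu", "0")
        )
        for node in nodes
    }


def current_gpu_capacity() -> dict[str, int]:
    """Run kubectl against the current context and parse its JSON output."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return gpu_capacity(result.stdout)
```

Calling `current_gpu_capacity()` needs kubectl on your PATH and a working kubeconfig. A GPU node that reports 0 usually means the device plugin pod is not running on it yet.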

vLLM's Docker image is the easiest way to run it. No pip installs, no dependency conflicts, no wondering why PyTorch can't find CUDA.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3
  labels:
    app: vllm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.6.4
        args:
        - "--model"
        - "meta-llama/Llama-3.1-8B-Instruct"
        - "--max-model-len"
        - "4096"
        - "--gpu-memory-utilization"
        - "0.90"
        - "--dtype"
        - "auto"
        ports:
        - containerPort: 8000
          name: http
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
            memory: "24Gi"
            cpu: "4"
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 180
          periodSeconds: 30
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-inference
  ports:
  - port: 8000
    targetPort: 8000
  type: ClusterIP

Create the HuggingFace token secret first:

kubectl create secret generic hf-token \
  --from-literal=token="$HF_TOKEN"

Then deploy:

kubectl apply -f vllm-deployment.yaml

The model download takes a few minutes depending on the model size. Watch the logs:

kubectl logs -f deployment/vllm-llama3

You'll see it load the model weights, compile the CUDA kernels, and eventually:

INFO:     Uvicorn running on http://0.0.0.0:8000
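If you'd rather script the wait than tail logs, one option is to poll the same /health endpoint the probes use. This is a minimal sketch, not anything vLLM ships: `http_ok` and `wait_until_healthy` are names I made up, and the URL in the usage note assumes a port-forward of vllm-service to localhost:8000 is running.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """True if a GET to url returns HTTP 200; any connection or HTTP error is False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError, HTTPError, and socket timeouts all derive from OSError
        return False


def wait_until_healthy(
    check: Callable[[], bool],
    deadline_s: float = 300.0,
    period_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() every period_s seconds until it passes or deadline_s is used up."""
    attempts = int(deadline_s // period_s) + 1
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(period_s)
    return False
```

Usage would look like `wait_until_healthy(lambda: http_ok("http://localhost:8000/health"))`. Passing the check in as a callable keeps the retry loop testable without a live server.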

Port-forward and hit it with curl:

kubectl port-forward svc/vllm-service 8000:8000

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain Kubernetes in one sentence"}],
    "max_tokens": 100
  }'

The API is OpenAI-compatible. Your existing code that talks to gpt-4 just needs a new base URL and model name.
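To make that concrete, here is a sketch of the same chat request from Python using only the standard library. It assumes the port-forward from the previous step is running on localhost:8000; `build_chat_request` and `chat` are illustrative helpers, not part of vLLM or any SDK.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: the port-forwarded vLLM service
MODEL = "meta-llama/Llama-3.1-8B-Instruct"


def build_chat_request(
    prompt: str,
    model: str = MODEL,
    base_url: str = BASE_URL,
    max_tokens: int = 100,
) -> urllib.request.Request:
    """Build a POST to the OpenAI-style /chat/completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, model: str = MODEL, base_url: str = BASE_URL) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, model, base_url)
    with urllib.request.urlopen(request, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

If you already use the official openai Python client, the equivalent change is passing `base_url="http://localhost:8000/v1"` (plus any placeholder API key) when constructing the client and swapping the model name.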

A few things that bit me:

Model download speed: OKE nodes have good bandwidth to the internet, but the first pull of a 16GB model takes time. I ended up baking the model into a custom Docker image so pod restarts don't re-download. That's a separate blog post.

Memory headroom: --gpu-memory-utilization 0.90 lets vLLM claim 90% of VRAM for weights, activations, and the KV cache, leaving 10% of headroom for the CUDA context and anything else on the GPU. Don't set this to 0.99 thinking you're being efficient. In my experience vLLM will OOM during burst traffic.

Probe timing: initialDelaySeconds: 120 seems high, but model loading legitimately takes 60-90 seconds on an A10. A readiness probe that fires early only keeps the pod out of the Service, but if the liveness probe fires too early, Kubernetes will restart the pod in a loop.

Preemptible instances: At $0.46/hr they're incredible for dev/staging. For production, use on-demand and set up a second preemptible pool as overflow. I'll cover that in a future post about cost optimization.
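On the memory headroom point, it helps to see how little room is left after the weights load. The sketch below is a back-of-the-envelope budget, assuming FP16 weights and KV cache and the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128). It ignores activations and CUDA graph memory, so the real number vLLM reports at startup will be lower.

```python
# Rough VRAM budget for Llama 3.1 8B on a 24 GB A10 (upper bound on KV cache size).

VRAM_GB = 24.0
GPU_MEMORY_UTILIZATION = 0.90
WEIGHTS_GB = 8e9 * 2 / 1e9  # 8B params x 2 bytes (FP16) = 16 GB

# Assumed model shape: Llama 3.1 8B uses grouped-query attention with 8 KV heads
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 32, 8, 128, 2


def kv_bytes_per_token() -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_cache_tokens(utilization: float = GPU_MEMORY_UTILIZATION) -> int:
    """Tokens of KV cache that fit once the weights are loaded."""
    budget_bytes = (VRAM_GB * utilization - WEIGHTS_GB) * 1e9
    return int(budget_bytes // kv_bytes_per_token())


if __name__ == "__main__":
    tokens = kv_cache_tokens()
    print(f"KV cache per token: {kv_bytes_per_token() // 1024} KiB")
    print(f"Approximate KV cache capacity: {tokens} tokens")
    print(f"Full-length 4096-token sequences: {tokens // 4096}")
```

At 0.90 that works out to roughly 5.6 GB of cache, or about ten concurrent sequences at the 4096-token limit used in the deployment, which is why the setting matters more than it looks.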

Running Llama 3.1 8B on different clouds (single GPU, on-demand):

Cloud	Shape	$/hr	$/month (24/7)
OCI	VM.GPU.A10.1	$1.52	~$1,094
AWS	g5.xlarge	$3.06	~$2,203
Azure	NC24ads_A100_v4	$3.67	~$2,642
GCP	g2-standard-8	$2.86	~$2,059

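OCI is roughly half the price of the other single-GPU options in this table. The Azure shape carries an A100 rather than an A10-class GPU, so that row is not a like-for-like comparison. And the preemptible pricing makes it even more dramatic for non-production workloads.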
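The monthly column is the hourly rate times 720 hours (24 hours x 30 days). A short sketch to reproduce it and to plug in your own rates; the prices are the ones quoted in this post, and `monthly_cost` is a helper name I chose.

```python
HOURS_PER_MONTH = 24 * 30  # 720 hours, the convention the monthly figures follow

# Hourly on-demand prices as quoted in this post (USD); check current pricing pages
HOURLY_USD = {
    "OCI VM.GPU.A10.1": 1.52,
    "AWS g5.xlarge": 3.06,
    "Azure NC24ads_A100_v4": 3.67,
    "GCP g2-standard-8": 2.86,
    "OCI VM.GPU.A10.1 (preemptible)": 0.46,
}


def monthly_cost(hourly_usd: float, hours: int = HOURS_PER_MONTH) -> float:
    """Cost of running one instance around the clock for a month."""
    return round(hourly_usd * hours, 2)


if __name__ == "__main__":
    for shape, rate in HOURLY_USD.items():
        print(f"{shape}: ${monthly_cost(rate):,.2f}/month")
```

The same function gives about $331/month for the preemptible A10 shape, which is the number to keep in mind for dev and staging.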

This is the simplest possible setup: one model, one GPU, one replica. In the next posts I'll cover baking the model into a custom image and cost optimization with a preemptible overflow pool.

The full YAML files are on my GitHub. If you're running inference on OCI, I'd love to hear what shapes you're using.

Pavan Madduri, CNCF Golden Kubestronaut, building GPU/AI infrastructure tools.

Source: original article on dev.to