# Deploying vLLM on OKE with NVIDIA A10 GPUs: The 20-Minute Setup Nobody Talks About

> Source: <https://dev.to/pavan_madduri/deploying-vllm-on-oke-with-nvidia-a10-gpus-the-20-minute-setup-nobody-talks-about-3je7>
> Published: 2026-06-16 19:42:21+00:00

Last month I needed to stand up a Llama 3 inference endpoint for an internal tool. The requirements were simple: OpenAI-compatible API, auto-scaling, and it couldn't cost more than the team's coffee budget. AWS wanted $3.06/hr for a `g5.xlarge`. Azure quoted something similar.

Then I looked at OCI's GPU shapes. `VM.GPU.A10.1` (a single NVIDIA A10 with 24GB VRAM) at $1.52/hr on-demand. Half the price. And on preemptible? $0.46/hr. That's a latte.

Here's how I got vLLM running on OKE in about 20 minutes.

If you already have an OKE cluster, skip ahead. If not, this is the fastest path:

```
# Create a VCN (or use an existing one) and capture its OCID
VCN_ID=$(oci network vcn create \
  --compartment-id "$COMPARTMENT_ID" \
  --cidr-blocks '["10.0.0.0/16"]' \
  --display-name "ai-inference-vcn" \
  --query 'data.id' --raw-output)

# Create the OKE cluster
# (assumes $PUBLIC_SUBNET_ID holds the OCID of a public subnet in that VCN)
oci ce cluster create \
  --compartment-id "$COMPARTMENT_ID" \
  --name "inference-cluster" \
  --vcn-id "$VCN_ID" \
  --kubernetes-version "v1.30.1" \
  --service-lb-subnet-ids '["'"$PUBLIC_SUBNET_ID"'"]'
```

The key part is the GPU node pool. OCI has several GPU shapes, but for inference the A10 is the sweet spot:

| Shape | GPU | VRAM | $/hr (on-demand) | $/hr (preemptible) |
|---|---|---|---|---|
| VM.GPU.A10.1 | 1x A10 | 24 GB | ~$1.52 | ~$0.46 |
| VM.GPU.A10.2 | 2x A10 | 48 GB | ~$3.04 | ~$0.91 |
| BM.GPU.A100-v2.8 | 8x A100 | 640 GB | ~$26.52 | N/A |

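For a 7B or 8B parameter model, a single A10 is plenty. A 70B model needs roughly 140 GB for FP16 weights alone, so it won't fit on 2x A10 (48 GB) without aggressive quantization; realistically you'd want the A100 bare metal shape.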
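A rough way to sanity-check shape sizing is to estimate weight memory as parameter count times bytes per parameter. The sketch below is back-of-the-envelope only: the function names are made up for illustration, and it ignores KV cache and activation memory, so treat "fits" as necessary rather than sufficient.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory for model weights alone, in GB (1e9 bytes).

    bytes_per_param: 2.0 for FP16/BF16, 1.0 for 8-bit, 0.5 for 4-bit quantization.
    """
    return params_billion * bytes_per_param


def weights_fit(params_billion: float, vram_gb: float,
                bytes_per_param: float = 2.0, utilization: float = 0.90) -> bool:
    """True if the weights fit in the share of VRAM given by `utilization`.

    Ignores KV cache and activations, so this is a lower bound on real needs.
    """
    return weight_memory_gb(params_billion, bytes_per_param) < vram_gb * utilization


# Shapes and VRAM sizes taken from the table above
for label, params, vram in [
    ("8B on VM.GPU.A10.1", 8, 24),
    ("70B on VM.GPU.A10.2", 70, 48),
    ("70B on BM.GPU.A100-v2.8", 70, 640),
]:
    gb = weight_memory_gb(params)
    print(f"{label}: ~{gb:.0f} GB of FP16 weights, fits={weights_fit(params, vram)}")
```

Under these assumptions, FP16 weights for a 70B model come to about 140 GB, well past the 48 GB on the 2x A10 shape. Passing `bytes_per_param=0.5` (4-bit) brings that down to about 35 GB, which is the only way it squeezes in.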

```
# Create the GPU node pool
oci ce node-pool create \
  --cluster-id $CLUSTER_ID \
  --compartment-id $COMPARTMENT_ID \
  --name "gpu-a10-pool" \
  --node-shape "VM.GPU.A10.1" \
  --size 1 \
  --node-config-details \
    '{"size": 1, "placementConfigs": [{"availabilityDomain": "'"$AD"'", "subnetId": "'"$WORKER_SUBNET_ID"'"}]}' \
  --node-source-details \
    '{"sourceType": "IMAGE", "imageId": "'"$GPU_IMAGE_ID"'"}'
```

Make sure you use the **OKE GPU image**. It comes with NVIDIA drivers and `nvidia-container-toolkit` pre-installed. You don't want to deal with driver installation yourself. Trust me.

OKE's GPU images already include the drivers, but Kubernetes needs the device plugin to expose GPUs as a schedulable resource:

```
# nvidia-device-plugin.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.16.1
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
```

Apply it:

```
kubectl apply -f nvidia-device-plugin.yaml
```

Verify GPUs show up:

```
kubectl get nodes -o json | jq '.items[].status.capacity["nvidia.com/gpu"]'
# "1"
```

If that says `"1"`, you're golden.
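On a cluster with mixed node pools, the `jq` one-liner prints a bare `null` for every non-GPU node, which gets hard to read. Here is a small sketch that does the same check but keys the result by node name. The helper names `gpu_capacity` and `fetch_nodes` are my own, and it assumes `kubectl` is on your `PATH` and pointed at the cluster.

```python
import json
import subprocess


def gpu_capacity(node_list: dict) -> dict:
    """Map node name -> advertised nvidia.com/gpu count (0 if the key is absent)."""
    counts = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        capacity = node.get("status", {}).get("capacity", {})
        # Kubernetes reports capacities as strings, e.g. "1"
        counts[name] = int(capacity.get("nvidia.com/gpu", "0"))
    return counts


def fetch_nodes() -> dict:
    """Run `kubectl get nodes -o json` and parse its output."""
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)
```

Calling `gpu_capacity(fetch_nodes())` returns a dict of node names to GPU counts. Any node in the GPU pool that shows `0` means the device plugin hasn't registered there yet.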

vLLM's Docker image is the easiest way to run it. No pip installs, no dependency conflicts, no wondering why PyTorch can't find CUDA.

```
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3
  labels:
    app: vllm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.6.4
        args:
        - "--model"
        - "meta-llama/Llama-3.1-8B-Instruct"
        - "--max-model-len"
        - "4096"
        - "--gpu-memory-utilization"
        - "0.90"
        - "--dtype"
        - "auto"
        ports:
        - containerPort: 8000
          name: http
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
            memory: "24Gi"
            cpu: "4"
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 180
          periodSeconds: 30
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-inference
  ports:
  - port: 8000
    targetPort: 8000
  type: ClusterIP
```

Create the HuggingFace token secret first:

```
kubectl create secret generic hf-token \
  --from-literal=token=$HF_TOKEN
```

Then deploy:

```
kubectl apply -f vllm-deployment.yaml
```

The model download takes a few minutes depending on the model size. Watch the logs:

```
kubectl logs -f deployment/vllm-llama3
```

You'll see it load the model weights, compile the CUDA kernels, and eventually:

```
INFO:     Uvicorn running on http://0.0.0.0:8000
```
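If you'd rather not watch logs by hand, a script can poll the same `/health` endpoint the probes use. This is a minimal sketch, not anything vLLM ships: `http_ok` and `wait_until_healthy` are hypothetical helpers, the default timings are guesses, and it assumes the service is reachable on `localhost:8000` (for example through the port-forward in the next step).

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, connection refused and timeouts all derive from OSError
        return False


def wait_until_healthy(url: str, deadline_s: float = 300.0, interval_s: float = 5.0,
                       probe: Callable[[str], bool] = http_ok,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `url` until `probe` succeeds or `deadline_s` seconds of sleeping have passed."""
    waited = 0.0
    while True:
        if probe(url):
            return True
        if waited >= deadline_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

Usage would look like `wait_until_healthy("http://localhost:8000/health")`. The `probe` and `sleep` parameters exist only so the loop can be exercised without a live server.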

Port-forward and hit it with curl:

```
kubectl port-forward svc/vllm-service 8000:8000

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain Kubernetes in one sentence"}],
    "max_tokens": 100
  }'
```

The API is OpenAI-compatible. Your existing code that talks to `gpt-4` just needs a base URL change.
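To make the "base URL change" concrete without pulling in any SDK, here is a standard-library sketch of the same request the curl command sends. The helper names are mine, and the base URL assumes the port-forward above is still running. With the official `openai` Python package the equivalent change is passing `base_url="http://localhost:8000/v1"` when constructing the client.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 100) -> urllib.request.Request:
    """Build a POST to the OpenAI-style /chat/completions route under `base_url`."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the first choice's message text."""
    req = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

For example: `chat("http://localhost:8000/v1", "meta-llama/Llama-3.1-8B-Instruct", "Explain Kubernetes in one sentence")`. Only the base URL and model name differ from a call against a hosted OpenAI-style endpoint.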

A few things that bit me:

**Model download speed:** OKE nodes have good bandwidth to the internet, but the first pull of a 16GB model takes time. I ended up baking the model into a custom Docker image so pod restarts don't re-download. That's a separate blog post.

**Memory headroom:** `--gpu-memory-utilization 0.90` caps vLLM (weights, activations, and KV cache together) at 90% of VRAM and leaves the other 10% as headroom for anything outside that budget, such as the CUDA context. Don't set this to 0.99 thinking you're being efficient. vLLM will OOM during burst traffic.

**Readiness probe timing:** `initialDelaySeconds: 120` seems high, but model loading legitimately takes 60-90 seconds on an A10. A readiness probe that fires early only keeps the pod out of the Service. The liveness probe (180 seconds here) is the one that matters for restarts: if it starts failing before the model is loaded, Kubernetes will restart the pod in a loop.

**Preemptible instances:** At $0.46/hr they're incredible for dev/staging. For production, use on-demand and set up a second preemptible pool as overflow. I'll cover that in a future post about cost optimization.
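To put a number on the memory headroom point, here is a rough estimate of how many tokens of KV cache are left once the weights are loaded. This is a sketch under stated assumptions, not what vLLM computes internally. I'm assuming the Llama 3.1 8B shape is 32 layers, 8 KV heads, and a head dimension of 128, with an FP16 cache and about 16 GB of weights. The function names are made up, and activations are ignored, so the real figure will be lower.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_token_budget(vram_gb: float, utilization: float, weights_gb: float,
                    per_token_bytes: int) -> int:
    """Tokens of KV cache that fit in (vram * utilization - weights); 0 if negative."""
    budget_bytes = (vram_gb * utilization - weights_gb) * 1e9
    return max(0, int(budget_bytes // per_token_bytes))


# Assumed model shape (see lead-in): 32 layers, 8 KV heads, head_dim 128, FP16
per_token = kv_bytes_per_token(32, 8, 128)          # 131072 bytes, i.e. 128 KiB
tokens = kv_token_budget(24, 0.90, 16, per_token)   # A10 at 0.90 utilization
print(f"{per_token} bytes/token, ~{tokens} tokens, "
      f"~{tokens // 4096} full sequences at --max-model-len 4096")
```

With these inputs the budget works out to roughly 42,000 tokens, or about ten full-length 4096-token sequences. That's why a single A10 is comfortable for an internal tool but won't absorb heavy concurrency.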

Running Llama 3.1 8B on different clouds (single GPU, on-demand):

| Cloud | Shape | $/hr | $/month (24/7) |
|---|---|---|---|
| OCI | VM.GPU.A10.1 | $1.52 | ~$1,094 |
| AWS | g5.xlarge | $3.06 | ~$2,203 |
| Azure | NC24ads_A100_v4 | $3.67 | ~$2,642 |
| GCP | g2-standard-8 | $2.86 | ~$2,059 |

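OCI is roughly half the price of the other single-GPU options. The comparison isn't strictly like-for-like, though: the AWS shape carries an A10G, the Azure shape an A100, and the GCP shape an L4. And the preemptible pricing makes it even more dramatic for non-production workloads.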
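The monthly column is just the hourly rate times a 720-hour month (30 days of 24/7 uptime). Here is a small sketch if you want to plug in your own rates. The hourly figures are the ones from the table above and `monthly_cost` is an illustrative helper.

```python
HOURS_PER_MONTH = 24 * 30  # 720 hours, the convention the table's monthly figures match


def monthly_cost(hourly_usd: float, hours: int = HOURS_PER_MONTH) -> float:
    """Cost of running one instance continuously for `hours` hours."""
    return hourly_usd * hours


# On-demand hourly rates as quoted in the table above
on_demand = {
    "OCI VM.GPU.A10.1": 1.52,
    "AWS g5.xlarge": 3.06,
    "Azure NC24ads_A100_v4": 3.67,
    "GCP g2-standard-8": 2.86,
}

for name, rate in on_demand.items():
    print(f"{name}: ${monthly_cost(rate):,.0f}/month")

# Preemptible A10 at the quoted $0.46/hr
print(f"OCI preemptible: ${monthly_cost(0.46):,.0f}/month")
```

The same function with the preemptible rate gives about $331 a month for an always-on dev endpoint.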

This is the simplest possible setup: one model, one GPU, one replica. Later posts will build on it.

The full YAML files are on my GitHub. If you're running inference on OCI, I'd love to hear what shapes you're using.

*Pavan Madduri, CNCF Golden Kubestronaut, building GPU/AI infrastructure tools.*
