Last month I needed to stand up a Llama 3 inference endpoint for an internal tool. The requirements were simple: OpenAI-compatible API, auto-scaling, and it couldn't cost more than the team's coffee budget. AWS wanted $3.06/hr for a g5.xlarge. Azure quoted something similar.
Then I looked at OCI's GPU shapes. VM.GPU.A10.1, a single NVIDIA A10 with 24GB VRAM, at $1.52/hr on-demand. Half the price. And on preemptible? $0.46/hr. That's a latte.
Here's how I got vLLM running on OKE in about 20 minutes.
If you already have an OKE cluster, skip ahead. If not, this is the fastest path:
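oci network vcn create \
  --compartment-id $COMPARTMENT_ID \
  --cidr-blocks '["10.0.0.0/16"]' \
  --display-name "ai-inference-vcn"
# VCN_ID and PUBLIC_SUBNET_ID must hold the OCIDs of the VCN and of a public subnet inside it.
# The subnet list has to be valid JSON, so the OCID is wrapped in double quotes.
oci ce cluster create \
  --compartment-id $COMPARTMENT_ID \
  --name "inference-cluster" \
  --vcn-id $VCN_ID \
  --kubernetes-version "v1.30.1" \
  --service-lb-subnet-ids '["'"$PUBLIC_SUBNET_ID"'"]'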
The key part is the GPU node pool. OCI has several GPU shapes, but for inference the A10 is the sweet spot:
| Shape | GPU | VRAM | $/hr (on-demand) | $/hr (preemptible) |
|---|---|---|---|---|
| VM.GPU.A10.1 | 1x A10 | 24 GB | ~$1.52 | ~$0.46 |
| VM.GPU.A10.2 | 2x A10 | 48 GB | ~$3.04 | ~$0.91 |
| BM.GPU.A100-v2.8 | 8x A100 | 640 GB | ~$26.52 | N/A |
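For a 7B or 8B parameter model, a single A10 is plenty: FP16 weights take roughly 2 bytes per parameter, so about 14 to 16 GB of the 24 GB. A 70B model needs about 140 GB at FP16, so 2x A10 (48 GB) only works with aggressive quantization. Otherwise you'd want the A100 bare metal.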
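As a rough check on that sizing, here is a minimal sketch that estimates weight memory from parameter count. It assumes 2 bytes per parameter for FP16/BF16 and ignores KV cache and activation memory, so the numbers are a lower bound on what a model needs. `weights_gb` and `fits` are hypothetical helper names, not part of vLLM or the OCI tooling.

```python
def weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    # (params_billion * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billion * bytes_per_param


def fits(params_billion: float, vram_gb: float,
         bytes_per_param: float = 2.0, usable_fraction: float = 0.90) -> bool:
    """True if the weights alone fit in the usable share of VRAM.

    usable_fraction mirrors the 0.90 used for --gpu-memory-utilization;
    KV cache and activations still need room on top of the weights.
    """
    return weights_gb(params_billion, bytes_per_param) <= vram_gb * usable_fraction


if __name__ == "__main__":
    print("8B FP16 weights (GB):", weights_gb(8))
    print("70B FP16 weights (GB):", weights_gb(70))
    print("8B FP16 on 1x A10 (24 GB):", fits(8, 24))
    print("70B FP16 on 2x A10 (48 GB):", fits(70, 48))
    print("70B 4-bit on 2x A10 (48 GB):", fits(70, 48, bytes_per_param=0.5))
```

By this estimate an 8B model fits a single A10 with room to spare, while a 70B model only squeezes onto two A10s if the weights are quantized to around 4 bits.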
oci ce node-pool create \
  --cluster-id $CLUSTER_ID \
  --compartment-id $COMPARTMENT_ID \
  --name "gpu-a10-pool" \
  --node-shape "VM.GPU.A10.1" \
  --size 1 \
  --node-config-details \
    '{"size": 1, "placementConfigs": [{"availabilityDomain": "'"$AD"'", "subnetId": "'"$WORKER_SUBNET_ID"'"}]}' \
  --node-source-details \
    '{"sourceType": "IMAGE", "imageId": "'"$GPU_IMAGE_ID"'"}'
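The shell quoting around those JSON arguments is easy to get wrong. One option, sketched here, is to build the JSON strings in Python and pass them to the CLI. The keys are the ones used in the command above; the function names and the placeholder OCIDs are hypothetical.

```python
import json


def node_config_details(availability_domain: str, subnet_id: str, size: int = 1) -> str:
    """Build the JSON string passed to --node-config-details."""
    config = {
        "size": size,
        "placementConfigs": [
            {"availabilityDomain": availability_domain, "subnetId": subnet_id}
        ],
    }
    return json.dumps(config)


def node_source_details(image_id: str) -> str:
    """Build the JSON string passed to --node-source-details."""
    return json.dumps({"sourceType": "IMAGE", "imageId": image_id})


if __name__ == "__main__":
    # Placeholder values for illustration only, not real identifiers.
    print(node_config_details("example-AD-1", "ocid1.subnet.oc1..example"))
    print(node_source_details("ocid1.image.oc1..example"))
```

Because `json.dumps` handles the escaping, the output can be dropped into a single-quoted shell argument without the nested quote juggling.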
Make sure you use the OKE GPU image: it comes with NVIDIA drivers and nvidia-container-toolkit pre-installed. You don't want to deal with driver installation yourself. Trust me.
OKE's GPU images already include the drivers, but Kubernetes needs the device plugin to expose GPUs as a schedulable resource:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.16.1
          name: nvidia-device-plugin-ctr
          env:
            - name: FAIL_ON_INIT_ERROR
              value: "false"
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
kubectl apply -f nvidia-device-plugin.yaml
Verify GPUs show up:
kubectl get nodes -o json | jq '.items[].status.capacity["nvidia.com/gpu"]'
If that says "1", you're golden.
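On a cluster with more than one node, the bare jq output does not say which node has the GPU. A minimal sketch that maps node names to GPU counts from the same `kubectl get nodes -o json` output is shown here. The `SAMPLE` document is a trimmed, made-up stand-in for real output, and `gpu_capacity` is a hypothetical helper.

```python
import json


def gpu_capacity(nodes_json: str) -> dict:
    """Map node name -> GPU count from `kubectl get nodes -o json` output."""
    result = {}
    for node in json.loads(nodes_json).get("items", []):
        name = node["metadata"]["name"]
        capacity = node.get("status", {}).get("capacity", {})
        # Resource quantities are reported as strings; a missing key means
        # the device plugin has not exposed any GPUs on that node.
        result[name] = int(capacity.get("nvidia.com/gpu", "0"))
    return result


# Trimmed, made-up example of the JSON structure (assumption for illustration).
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "gpu-node-0"},
         "status": {"capacity": {"cpu": "30", "nvidia.com/gpu": "1"}}},
        {"metadata": {"name": "cpu-node-0"},
         "status": {"capacity": {"cpu": "4"}}},
    ]
})

if __name__ == "__main__":
    for name, count in gpu_capacity(SAMPLE).items():
        print(name, count)
```

To run it against a live cluster, replace `SAMPLE` with the text read from standard input and pipe the kubectl output into the script.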
vLLM's Docker image is the easiest way to run it. No pip installs, no dependency conflicts, no wondering why PyTorch can't find CUDA.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3
  labels:
    app: vllm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.4
          args:
            - "--model"
            - "meta-llama/Llama-3.1-8B-Instruct"
            - "--max-model-len"
            - "4096"
            - "--gpu-memory-utilization"
            - "0.90"
            - "--dtype"
            - "auto"
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
              memory: "24Gi"
              cpu: "4"
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-inference
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP
Create the HuggingFace token secret first:
kubectl create secret generic hf-token \
  --from-literal=token=$HF_TOKEN
Then deploy:
kubectl apply -f vllm-deployment.yaml
The model download takes a few minutes depending on the model size. Watch the logs:
kubectl logs -f deployment/vllm-llama3
You'll see it load the model weights, capture the CUDA graphs, and eventually:
INFO: Uvicorn running on http://0.0.0.0:8000
Port-forward and hit it with curl:
kubectl port-forward svc/vllm-service 8000:8000
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain Kubernetes in one sentence"}],
    "max_tokens": 100
  }'
The API is OpenAI-compatible. Your existing code that talks to gpt-4 just needs a new base URL and model name.
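For reference, the same request as the curl call can be made from Python with only the standard library. This is a minimal sketch that assumes the port-forward above is still running; `build_chat_request` and `chat` are hypothetical helper names, and the response parsing assumes the standard OpenAI chat completions shape.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 100) -> urllib.request.Request:
    """Build a POST request for the /chat/completions endpoint under base_url."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the first choice's message text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the port-forward active, calling `chat("http://localhost:8000/v1", "meta-llama/Llama-3.1-8B-Instruct", "Explain Kubernetes in one sentence")` should return the model's answer as a string. Pointing the same code at another OpenAI-compatible server only means changing the first two arguments.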
A few things that bit me:
Model download speed: OKE nodes have good bandwidth to the internet, but the first pull of a 16GB model takes time. I ended up baking the model into a custom Docker image so pod restarts don't re-download. That's a separate blog post.
Memory headroom: --gpu-memory-utilization 0.90 lets vLLM claim 90% of the GPU's memory for weights, activations, and KV cache, and leaves the remaining 10% as headroom for everything outside that budget. Don't set this to 0.99 thinking you're being efficient. vLLM will OOM during burst traffic.
Readiness probe timing: initialDelaySeconds: 120 seems high, but model loading legitimately takes 60-90 seconds on an A10. A failing readiness probe only keeps the pod out of the Service, but if the liveness probe starts failing before the model is up, Kubernetes will restart the pod in a loop.
Preemptible instances: At $0.46/hr they're incredible for dev/staging. For production, use on-demand and set up a second preemptible pool as overflow. I'll cover that in a future post about cost optimization.
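To put a number on the memory headroom point, here is a back-of-the-envelope sketch of how many tokens of KV cache fit next to the weights. It assumes the publicly documented Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128), FP16 cache entries, and about 16 GB of weights. It ignores activation memory, so the result is an optimistic upper bound. The function names are hypothetical.

```python
def kv_bytes_per_token(layers: int = 32, kv_heads: int = 8,
                       head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_token_budget(vram_gb: float = 24.0, utilization: float = 0.90,
                    weights_gb: float = 16.0) -> int:
    """Rough count of tokens whose KV cache fits after the weights are loaded."""
    budget_bytes = (vram_gb * utilization - weights_gb) * 1e9
    return max(0, int(budget_bytes // kv_bytes_per_token()))


if __name__ == "__main__":
    tokens = kv_token_budget()
    print("KV bytes per token:", kv_bytes_per_token())
    print("Token budget at 0.90:", tokens)
    # With --max-model-len 4096, this is the number of full-length sequences.
    print("Full 4096-token sequences:", tokens // 4096)
```

Under these assumptions the budget at 0.90 is on the order of ten full-length 4096-token sequences, which gives a feel for how quickly burst traffic can exhaust the cache.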
Running Llama 3.1 8B on different clouds (single GPU, on-demand):
| Cloud | Shape | $/hr | $/month (24/7) |
|---|---|---|---|
| OCI | VM.GPU.A10.1 | $1.52 | ~$1,094 |
| AWS | g5.xlarge | $3.06 | ~$2,203 |
| Azure | NC24ads_A100_v4 | $3.67 | ~$2,642 |
| GCP | g2-standard-8 | $2.86 | ~$2,059 |
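OCI is roughly half the price of the AWS and GCP shapes, which are also single 24 GB GPUs. The Azure shape is a larger A100, so that row isn't a like-for-like comparison. And the preemptible pricing makes the gap even wider for non-production workloads.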
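The monthly figures in the table are the hourly rate multiplied by 720 hours (24 hours for 30 days). A small sketch that reproduces them, using the hourly rates quoted in this post:

```python
HOURS_PER_MONTH = 24 * 30  # 720 hours, the basis used for the table above

# Hourly on-demand rates as quoted in this post (USD).
RATES = {
    "OCI VM.GPU.A10.1": 1.52,
    "AWS g5.xlarge": 3.06,
    "Azure NC24ads_A100_v4": 3.67,
    "GCP g2-standard-8": 2.86,
}


def monthly_cost(hourly_usd: float, hours: int = HOURS_PER_MONTH) -> int:
    """Cost of running one instance around the clock, rounded to whole dollars."""
    return round(hourly_usd * hours)


if __name__ == "__main__":
    for shape, rate in RATES.items():
        print(f"{shape}: ${monthly_cost(rate):,}/month")
    print(f"OCI preemptible at $0.46/hr: ${monthly_cost(0.46):,}/month")
```

The same function makes it easy to plug in current list prices, which change over time and vary by region.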
This is the simplest possible setup: one model, one GPU, one replica. In the next posts I'll cover baking the model into the image and cost optimization with a preemptible overflow pool.
The full YAML files are on my GitHub. If you're running inference on OCI, I'd love to hear what shapes you're using.
Pavan Madduri, CNCF Golden Kubestronaut, building GPU/AI infrastructure tools.