Serve an Open-Source LLM at Scale with vLLM on a Rented GPU Instance

wpnews.pro

Go from a bare cloud VM to a production-ready, OpenAI-compatible inference server in under an hour, using vLLM's continuous batching to reach well over a thousand output tokens per second on a single GPU.

Priya Nair

What You'll Build #

You'll deploy Llama 3.1 8B behind vLLM's OpenAI-compatible API on a rented GPU instance, then verify that continuous batching actually delivers an order-of-magnitude throughput advantage over sequential serving.

Prerequisites #

- A cloud GPU instance with at least one NVIDIA A10G (24 GB VRAM). Lambda Labs (~$0.80/hr for A10), RunPod, and Vast.ai all work. An A100 40/80 GB handles larger models or higher concurrency.
- Ubuntu 22.04, CUDA 12.1+ (standard on ML-optimized images). Confirm with `nvidia-smi`.
- Python 3.10 or 3.11. vLLM 0.6.x is not fully validated on 3.12 yet.
- A Hugging Face account with the Meta Llama 3.1 license accepted at hf.co/meta-llama/Meta-Llama-3.1-8B-Instruct, plus a `read`-scoped access token.
- SSH access to the instance.

GPU driver installation and VPC networking are out of scope here.

1. Prepare the VM #

Verify the GPU and CUDA runtime are visible before touching Python:

nvidia-smi
python3 --version
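The same preflight can be scripted if you provision instances repeatedly. The sketch below assumes `nvidia-smi` accepts the `--query-gpu=name,memory.total --format=csv,noheader,nounits` query form; the helper names and the 22,000 MiB threshold are illustrative choices, not part of vLLM.

```python
import subprocess
import sys

# Illustrative threshold: cards sold as 24 GB often report slightly less usable memory.
MIN_VRAM_MIB = 22_000


def python_supported(version=sys.version_info):
    """True for the Python 3.10 and 3.11 releases this guide targets."""
    return (version[0], version[1]) in {(3, 10), (3, 11)}


def parse_gpu_line(line):
    """Parse one 'name, memory_mib' CSV row into (name, memory_mib)."""
    name, memory = line.rsplit(",", 1)
    return name.strip(), int(memory.strip())


def query_gpus():
    """Ask nvidia-smi for each GPU's name and total memory in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [parse_gpu_line(row) for row in out.splitlines() if row.strip()]


def preflight():
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if not python_supported():
        problems.append("Python 3.10 or 3.11 is expected")
    for name, mib in query_gpus():
        if mib < MIN_VRAM_MIB:
            problems.append(f"{name} reports only {mib} MiB of VRAM")
    return problems
```

Calling `preflight()` on the instance returns an empty list when both checks pass; it raises if `nvidia-smi` is missing, which is itself a useful signal.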

Create an isolated environment:

python3 -m venv ~/vllm-env
source ~/vllm-env/bin/activate
pip install --upgrade pip

2. Install vLLM #

pip install vllm

This pulls PyTorch 2.4+ compiled for CUDA 12.1 along with vLLM's paged-attention kernels. Expect 3-5 minutes and roughly 4 GB of downloads. Confirm the install:

python -c "import vllm; print(vllm.__version__)"

If your image has CUDA 11.8 (uncommon on current offerings), the default vLLM 0.6.x wheel from PyPI is built against CUDA 12.1 and will not work as-is. The simplest fix is the official vLLM Docker image: `docker pull vllm/vllm-openai:latest`.

3. Authenticate with Hugging Face #

export HF_TOKEN="hf_YOUR_TOKEN_HERE"

Add this to `~/.profile` for persistence, or store it in your cloud provider's secrets manager. Never commit tokens to source control or bake them into Docker layers.

4. Start the Inference Server #

vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --served-model-name llama3

Key flags at a glance:

| Flag | Effect |
| --- | --- |
| `--tensor-parallel-size` | Shards the model across N GPUs. Set to `2` on a dual-A10G node; throughput can scale close to linearly. |
| `--gpu-memory-utilization` | Fraction of VRAM vLLM may claim for weights, activations, and the KV cache combined. Leave headroom; 0.90 works well for most cases. |
| `--max-model-len` | Caps total sequence length (prompt plus generated tokens). Llama 3.1 supports 128k natively, but a KV cache that size will not fit next to the BF16 weights on 24 GB. |
| `--served-model-name` | The model ID clients send in requests; decouples your API surface from the HF repo path. |
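The `--max-model-len` trade-off is easier to reason about with numbers. The sketch below estimates KV cache size per sequence; the layer count, KV head count, and head dimension are assumptions taken from the published Llama 3.1 8B configuration, and the function names are illustrative.

```python
# Architecture values assumed from the published Llama 3.1 8B config.
LAYERS = 32
KV_HEADS = 8        # grouped-query attention: fewer KV heads than query heads
HEAD_DIM = 128
BF16_BYTES = 2


def kv_bytes_per_token(layers=LAYERS, kv_heads=KV_HEADS,
                       head_dim=HEAD_DIM, dtype_bytes=BF16_BYTES):
    """Bytes of KV cache one token occupies (the factor 2 is key plus value)."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_cache_gib(tokens):
    """KV cache size in GiB for a single sequence of the given length."""
    return tokens * kv_bytes_per_token() / 2**30


if __name__ == "__main__":
    for tokens in (8_192, 131_072):
        print(f"{tokens:>7} tokens -> {kv_cache_gib(tokens):.1f} GiB")
```

Under these assumptions an 8,192-token sequence needs about 1 GiB of cache and a full 131,072-token sequence about 16 GiB, which is why the long context cannot share a 24 GB card with roughly 16 GB of weights.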

Weights download on first run (~16 GB for BF16). Subsequent starts read from `~/.cache/huggingface`. The server is ready when you see:

INFO:     Application startup complete.
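Scripts that launch the server and then send traffic cannot watch for that log line, so they usually poll instead. A minimal sketch, assuming the `/v1/models` route answers with HTTP 200 once startup has finished; `http_ok` and `wait_until_ready` are illustrative names, and the retry counts are arbitrary.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2.0):
    """Return True if a GET to url succeeds with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP errors all mean "not ready yet".
        return False


def wait_until_ready(url, probe=http_ok, attempts=60, delay=5.0, sleep=time.sleep):
    """Poll url until probe succeeds.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed. probe and sleep are injectable so the loop can be
    tested without a running server.
    """
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

For example, `wait_until_ready("http://localhost:8000/v1/models")` waits up to about five minutes with the defaults, which leaves room for the first-run weight download on a fast link but may need raising on a slow one.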

5. Send Your First Request #

From a second terminal (no venv needed for curl):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "What is PagedAttention?"}],
    "max_tokens": 150,
    "temperature": 0.7
  }'

The response schema is identical to OpenAI's. Point any OpenAI SDK at the server by changing `base_url`:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="ignored")
response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Write a haiku about GPU memory."}],
    max_tokens=60,
)
print(response.choices[0].message.content)

`api_key` can be any non-empty string by default. See step 7 for enforcing real authentication.
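If you would rather not add the SDK as a dependency, the same call can be made with the standard library. This is a sketch under the assumption that the server follows the OpenAI chat-completions request and response shape described above; the helper names are illustrative.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_text, max_tokens=60, api_key="ignored"):
    """Build a POST request for the chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def extract_text(body):
    """Pull the assistant message out of a parsed chat completion response."""
    return body["choices"][0]["message"]["content"]


def chat(base_url, model, user_text):
    """Send one chat request and return the generated text."""
    request = build_chat_request(base_url, model, user_text)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return extract_text(json.load(resp))
```

With the server from step 4 running, `chat("http://localhost:8000/v1", "llama3", "Write a haiku about GPU memory.")` returns the generated text as a string.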

6. Load Test: Concurrent Requests #

vLLM's continuous batching combines in-flight requests into a single forward pass on every scheduling step, rather than waiting to fill a static batch. The throughput difference is dramatic. Run this to see it:

import asyncio, time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="ignored")
PROMPT = "Summarize the history of neural networks in three sentences."
CONCURRENCY, TOTAL = 50, 200

async def one_request(sem):
    async with sem:
        t0 = time.monotonic()
        r = await client.chat.completions.create(
            model="llama3",
            messages=[{"role": "user", "content": PROMPT}],
            max_tokens=100,
        )
        return time.monotonic() - t0, r.usage.completion_tokens

async def main():
    sem = asyncio.Semaphore(CONCURRENCY)
    t_start = time.monotonic()
    results = await asyncio.gather(*[one_request(sem) for _ in range(TOTAL)])
    elapsed = time.monotonic() - t_start
    total_tok = sum(t for _, t in results)
    print(f"Requests: {TOTAL}  |  Wall time: {elapsed:.1f}s")
    print(f"Aggregate throughput: {total_tok / elapsed:.0f} tokens/sec")
    print(f"Avg latency: {sum(l for l, _ in results) / TOTAL:.2f}s")

asyncio.run(main())

Save the script as bench.py, install the client library in the venv, and run it:

pip install openai
python bench.py
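The benchmark measures the effect; a toy model helps show the mechanism. The sketch below is not vLLM's scheduler. It only counts decode steps under two hypothetical policies, treating each request as a number of tokens still to generate: a static batch holds its slots until the longest request finishes, while a continuous batch refills a slot as soon as any request completes.

```python
def static_batch_steps(lengths, batch_size):
    """Steps needed when each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batch_steps(lengths, batch_size):
    """Steps needed when freed slots are refilled before every step."""
    waiting = list(lengths)   # FIFO queue of tokens-remaining per request
    active = []
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots.
        while waiting and len(active) < batch_size:
            active.append(waiting.pop(0))
        # One forward pass produces one token for every active request.
        steps += 1
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


if __name__ == "__main__":
    lengths = [10, 2, 2, 2]
    print("static:    ", static_batch_steps(lengths, batch_size=2))
    print("continuous:", continuous_batch_steps(lengths, batch_size=2))
```

With one long request and three short ones, the static policy leaves a slot empty for most of the first batch, while the continuous policy finishes all short requests in the shadow of the long one. Real workloads have far more variance in output length, so the gap widens.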

A single A10G running Llama 3.1 8B in BF16 typically reaches 1,200-2,000 aggregate output tokens/sec under this load. Sequential, one-at-a-time serving on the same hardware delivers roughly 35-45 tokens/sec. Single-stream decoding is memory-bandwidth-bound: every generated token requires reading the full ~16 GB of weights, so most of the GPU's compute sits unused until batching lets many sequences share each weight read.
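The sequential figure can be sanity-checked with a back-of-the-envelope memory-bandwidth estimate. The sketch assumes roughly 600 GB/s of memory bandwidth for the A10G (taken from NVIDIA's published spec, not measured here) and the ~16 GB BF16 weight size quoted earlier; it ignores KV cache reads and kernel overhead, so treat the result as a rough ceiling.

```python
# Assumption: A10G memory bandwidth from NVIDIA's published spec sheet.
A10G_BANDWIDTH_GB_S = 600
# BF16 weight size for Llama 3.1 8B, as quoted in this guide.
WEIGHT_GB = 16


def single_stream_ceiling(bandwidth_gb_s=A10G_BANDWIDTH_GB_S, weight_gb=WEIGHT_GB):
    """Upper bound on tokens/sec when each token needs one full weight read."""
    return bandwidth_gb_s / weight_gb


def batching_speedup(batched_tps, sequential_tps):
    """Ratio of aggregate batched throughput to single-stream throughput."""
    return batched_tps / sequential_tps


if __name__ == "__main__":
    print(f"single-stream ceiling: {single_stream_ceiling():.1f} tokens/sec")
    print(f"speedup range: {batching_speedup(1200, 45):.0f}x "
          f"to {batching_speedup(2000, 35):.0f}x")
```

The estimated ceiling of 37.5 tokens/sec lands inside the 35-45 range above, which suggests the single-stream number reflects a bandwidth limit rather than anything specific to vLLM.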

7. Production Hardening #

Run vLLM as a systemd service so it survives SSH disconnects. Save this unit as /etc/systemd/system/vllm.service:

[Unit]
Description=vLLM inference server
After=network.target

[Service]
Type=simple
User=ubuntu
Environment="HF_TOKEN=hf_YOUR_TOKEN_HERE"
Environment="PATH=/home/ubuntu/vllm-env/bin:/usr/bin:/bin"
ExecStart=/home/ubuntu/vllm-env/bin/vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --host 127.0.0.1 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --served-model-name llama3
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target

The server binds to `127.0.0.1` here, not `0.0.0.0`. Never expose port 8000 directly to the internet without authentication. The simplest secure access pattern from a dev machine:

ssh -L 8000:localhost:8000 ubuntu@your-gpu-host

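For production, put nginx in front with TLS termination and `auth_basic` or a JWT validation block, or use Caddy with an API key middleware. As a baseline, vLLM's own `--api-key` flag makes the server reject requests that lack the matching bearer token.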

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable --now vllm
sudo journalctl -fu vllm

Verify It Works #

Query the models endpoint to confirm the server registered correctly:

curl http://localhost:8000/v1/models | python3 -m json.tool

Expected output contains `"id": "llama3"` and `"object": "model"`. The scheduler logs in `journalctl` show live KV cache utilization and the number of running and pending requests. Watch `nvidia-smi dmon` in a separate pane to confirm the GPU is saturated during the load test.
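For a deploy script, the same check can be made programmatic. A minimal sketch, assuming the `/v1/models` response follows the OpenAI list shape (a `data` array of entries carrying `id` and `object`); the function names are illustrative.

```python
import json


def served_model_ids(payload):
    """Return the IDs of all entries tagged as models in a /v1/models payload."""
    return [
        entry["id"]
        for entry in payload.get("data", [])
        if entry.get("object") == "model"
    ]


def check_model_served(raw_json, expected="llama3"):
    """Raise ValueError unless the expected model ID is being served."""
    ids = served_model_ids(json.loads(raw_json))
    if expected not in ids:
        raise ValueError(f"{expected!r} not served; found {ids}")
    return ids
```

Feed it the body returned by the curl command above, for example by piping the output into a small script that calls `check_model_served(sys.stdin.read())`.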

Troubleshooting #

**`torch.cuda.OutOfMemoryError` or `CUDA out of memory` on startup.** The weights plus the KV cache allocation exceeded available VRAM. Lower `--gpu-memory-utilization` to `0.80`, or reduce `--max-model-len` (halving it halves the KV cache a single full-length sequence needs). A 128k context at BF16 needs roughly 16 GiB of KV cache for one sequence, which cannot fit next to ~16 GB of weights on a 24 GB card.

**High latency at low concurrency.** Continuous batching optimizes for aggregate throughput, not time-to-first-token on isolated requests. For latency-sensitive single-request workloads, set `--max-num-seqs 1` to disable multi-request batching.

**`OSError: You are trying to access a gated repo.`** Either `HF_TOKEN` is not set in the current shell (`echo $HF_TOKEN` to verify), or your account has not accepted the Llama 3.1 license on Hugging Face. The acceptance must be done on the model page, not just in account settings.

**Garbled or truncated outputs.** Check that `--max-model-len` is larger than your prompt token count plus `max_tokens`. By default the OpenAI-compatible server should reject an over-long request with an HTTP 400 rather than truncate it; the prompt is cut from the left only when the request sets `truncate_prompt_tokens`. If responses stop mid-sentence, check whether `finish_reason` is `length`, which means generation hit `max_tokens`.
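A client-side budget check catches over-long requests before they reach the server. The sketch below assumes you already have a prompt token count (for instance from `usage.prompt_tokens` on an earlier response, or from the model's tokenizer); the function names and the 8192 default mirror the `--max-model-len` used in this guide and are otherwise illustrative.

```python
def context_budget(prompt_tokens, max_tokens, max_model_len=8192):
    """Tokens left over after prompt plus completion; negative means too long."""
    return max_model_len - (prompt_tokens + max_tokens)


def clamp_max_tokens(prompt_tokens, max_tokens, max_model_len=8192):
    """Shrink max_tokens so the request fits; 0 if the prompt alone is too long."""
    room = max_model_len - prompt_tokens
    return max(0, min(max_tokens, room))
```

A request with an 8,100-token prompt and `max_tokens=150` is 58 tokens over budget at the default limit; `clamp_max_tokens(8100, 150)` trims the completion to 92 tokens so the request still fits.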

Next Steps #

- **Quantization:** `--quantization fp8` runs on-the-fly FP8 quantization (vLLM 0.5+) and fits larger models into the same VRAM. For AWQ, you need a pre-quantized checkpoint from a community hub like Hugging Face; then pass `--quantization awq` and point the model argument at that repo.
- **Multi-GPU tensor parallelism:** `--tensor-parallel-size 2` on a dual-A100 node can give near-linear throughput scaling with no code changes.
- **Structured outputs:** `--guided-decoding-backend outlines` selects the backend that enforces JSON Schema constraints on generation, useful for tool-calling pipelines.
- **Metrics:** vLLM exposes a Prometheus-compatible endpoint at `GET /metrics`. Scrape it with a Grafana agent to track KV cache hit rates, queue depth, and inter-token latency percentiles in real time.
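Before wiring up a full scraper, it can help to read the endpoint by hand. The sketch below parses Prometheus text exposition format with the standard library. The metric names in the usage note (`vllm:num_requests_running`, `vllm:num_requests_waiting`, `vllm:gpu_cache_usage_perc`) are assumptions based on vLLM 0.6.x and may differ in your version, so list the keys first.

```python
def parse_metrics(text):
    """Parse Prometheus text format into {metric_name: value}.

    Labels are dropped, and samples that share a name are summed, which is
    adequate for a single-model server but loses per-label detail.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        try:
            samples[name] = samples.get(name, 0.0) + float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples
```

Fetch the body with `curl http://localhost:8000/metrics`, pass it to `parse_metrics`, and look up the names above to see running requests, queued requests, and KV cache usage during the load test.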

Priya Nair · AI & Developer Experience Writer

Priya covers AI frameworks, developer productivity tooling, and the startup ecosystem across South and Southeast Asia, bringing a researcher's rigour and a practitioner's empathy to every story. She is deeply sceptical of benchmarks and asks hard questions so her readers don't have to.


source & further reading

devclubhouse.com (original article)