# Serve an Open-Source LLM at Scale with vLLM on a Rented GPU Instance

> Source: <https://www.devclubhouse.com/a/serve-an-open-source-llm-at-scale-with-vllm-on-a-rented-gpu-instance>
> Published: 2026-06-23 17:42:49+00:00

Go from a bare cloud VM to a production-ready, OpenAI-compatible inference server in under an hour, using vLLM's continuous batching to hit thousands of output tokens per second on a single GPU.

[Priya Nair](https://www.devclubhouse.com/u/priya_nair)

## What You'll Build

You'll deploy Llama 3.1 8B behind vLLM's OpenAI-compatible API on a rented GPU instance, then verify that continuous batching actually delivers an order-of-magnitude throughput advantage over sequential serving.

## Prerequisites

- A cloud GPU instance with at least one **NVIDIA A10G (24 GB VRAM)**. Lambda Labs (~$0.80/hr for A10), RunPod, and Vast.ai all work. An A100 40/80 GB handles larger models or higher concurrency.
- **Ubuntu 22.04**, CUDA 12.1+ (standard on ML-optimized images). Confirm with `nvidia-smi`.
- **Python 3.10 or 3.11**. vLLM 0.6.x is not fully validated on 3.12 yet.
- A Hugging Face account with the **Meta Llama 3.1 license accepted** at `hf.co/meta-llama/Meta-Llama-3.1-8B-Instruct`, plus a `read`-scoped access token.
- SSH access to the instance.

GPU driver installation and VPC networking are out of scope here.

## 1. Prepare the VM

Verify the GPU and CUDA runtime are visible before touching Python:

```
nvidia-smi
python3 --version
```
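The same checks can be scripted so a provisioning job fails early on the wrong image. This is a minimal sketch: the function names are hypothetical, and it only confirms that `nvidia-smi` runs and that the interpreter is one of the versions listed in the prerequisites.

``` python
import shutil
import subprocess
import sys

SUPPORTED_MINORS = {(3, 10), (3, 11)}  # versions named in the prerequisites

def python_supported(version=sys.version_info):
    """True when the interpreter is Python 3.10 or 3.11."""
    return (version[0], version[1]) in SUPPORTED_MINORS

def gpu_names():
    """Return GPU names reported by nvidia-smi, or [] if it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=30,
        ).stdout
    except (subprocess.SubprocessError, OSError):
        return []
    return [line.strip() for line in out.splitlines() if line.strip()]

if __name__ == "__main__":
    print("Python OK:", python_supported())
    print("GPUs:", gpu_names() or "none detected")
```

An empty GPU list usually points at a missing driver, which is outside the scope of this guide.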

Create an isolated environment:

```
python3 -m venv ~/vllm-env
source ~/vllm-env/bin/activate
pip install --upgrade pip
```

## 2. Install vLLM

```
pip install vllm
```

This pulls PyTorch 2.4+ compiled for CUDA 12.1 along with vLLM's paged-attention kernels. Expect 3-5 minutes and roughly 4 GB of downloads. Confirm the install:

```
python -c "import vllm; print(vllm.__version__)"
```

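If your image ships CUDA 11.8 (uncommon on current offerings), the default vLLM 0.6.x wheel will not work, because it is built against CUDA 12.1. Use the official vLLM Docker image instead: `docker pull vllm/vllm-openai:latest`. The container bundles its own CUDA runtime, but the host driver must still be recent enough for CUDA 12.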

## 3. Authenticate with Hugging Face

```
export HF_TOKEN="hf_YOUR_TOKEN_HERE"
```

Add this to `~/.profile` for persistence, or store it in your cloud provider's secrets manager. Never commit tokens to source control or bake them into Docker layers.
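Before starting a long download, it can help to confirm the token is visible to the process without echoing the secret. The helper below is a hypothetical sketch; the `hf_` prefix check assumes the current Hugging Face user-token format and is only a sanity check, not validation against the Hub.

``` python
import os

def token_status(env=os.environ):
    """Describe HF_TOKEN without printing the secret itself."""
    token = env.get("HF_TOKEN", "")
    if not token:
        return "HF_TOKEN is not set"
    if not token.startswith("hf_"):
        return "HF_TOKEN is set but does not start with 'hf_'"
    # Show only the last four characters so the message is safe to log
    return f"HF_TOKEN is set (ends with ...{token[-4:]})"

print(token_status())
```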

## 4. Start the Inference Server

```
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --served-model-name llama3
```

Key flags at a glance:

| Flag | Effect |
|---|---|
| `--tensor-parallel-size` | Shards the model across N GPUs. Set to `2` on a dual-A10G node for near-linear throughput scaling. |
| `--gpu-memory-utilization` | Fraction of each GPU's VRAM that vLLM may use for weights, activations, and the KV cache combined; whatever remains after the weights load becomes KV cache. Leave headroom; 0.90 works well for most cases. |
| `--max-model-len` | Caps total sequence length (prompt plus output). Llama 3.1 supports 128k natively, but one 128k-token sequence needs roughly 16 GiB of KV cache at BF16, which does not fit next to ~16 GB of weights on a 24 GB card. |
| `--served-model-name` | The model ID clients send in requests; decouples your API surface from the HF repo path. |
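A rough memory budget makes the `--max-model-len` trade-off concrete. The sketch below assumes the Llama 3.1 8B shape from the model's published config (32 layers, 8 KV heads, head dimension 128, about 8.03 billion parameters); check those values against `config.json` before relying on them. It ignores activations and CUDA overhead and treats the card as a full 24 GiB, so the result is an optimistic upper bound.

``` python
GIB = 1024 ** 3

def weight_bytes(num_params, bytes_per_param=2):
    """Approximate size of the weights (2 bytes per parameter for BF16)."""
    return num_params * bytes_per_param

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """KV cache per token: one K and one V vector per KV head per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def kv_token_capacity(vram_gib, utilization, weights, per_token):
    """Tokens of KV cache that fit after the weights, ignoring activations."""
    budget = vram_gib * GIB * utilization
    return max(0, int((budget - weights) // per_token))

# Assumed Llama 3.1 8B shape; verify against the model's config.json
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
weights = weight_bytes(8.03e9)

print(per_token // 1024, "KiB of KV cache per token")            # 128
print(131072 * per_token // GIB, "GiB for one 128k-token sequence")  # 16
print(kv_token_capacity(24, 0.90, weights, per_token), "tokens fit at 0.90 utilization")
```

Under these assumptions the whole KV cache holds on the order of 50,000 tokens, shared by every in-flight request, which is why an 8192-token cap leaves room for several concurrent sequences while a 128k cap cannot be satisfied at all.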

Weights download on first run (~16 GB for BF16). Subsequent starts read from `~/.cache/huggingface`. The server is ready when you see:

```
INFO:     Application startup complete.
```
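In a deploy script, watching the log by eye is impractical, so readiness is usually polled. The sketch below assumes the server's `/health` route on the port used above; the function names and the timing defaults are illustrative, not part of vLLM.

``` python
import time
import urllib.error
import urllib.request

def server_healthy(url="http://localhost:8000/health", timeout=2.0):
    """True when the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait_until_ready(probe=server_healthy, deadline_s=600.0, interval_s=5.0):
    """Call probe() until it returns True or deadline_s seconds have passed."""
    stop_at = time.monotonic() + deadline_s
    while True:
        if probe():
            return True
        if time.monotonic() >= stop_at:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_ready()` blocks for up to ten minutes, which leaves room for the first-run weight download on a typical connection. Passing a different `probe` keeps the loop testable without a running server.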

## 5. Send Your First Request

From a second terminal (no venv needed for curl):

```
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "What is PagedAttention?"}],
    "max_tokens": 150,
    "temperature": 0.7
  }'
```

The response schema is identical to OpenAI's. Point any OpenAI SDK at the server by changing `base_url`:

``` python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="ignored")
response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Write a haiku about GPU memory."}],
    max_tokens=60,
)
print(response.choices[0].message.content)
```

`api_key` can be any non-empty string by default. See step 7 for enforcing real authentication.
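Where installing the OpenAI SDK is not an option, the same call can be made with the standard library alone. This is a sketch with hypothetical helper names; it mirrors the curl request above and adds a bearer header only when a key is supplied.

``` python
import json
import urllib.request

def build_chat_request(prompt, model="llama3",
                       base_url="http://localhost:8000/v1",
                       max_tokens=150, api_key=None):
    """Build (but do not send) a POST request for /chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )

def send_chat(request, timeout=60.0):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

With the server running, `print(send_chat(build_chat_request("What is PagedAttention?")))` prints the reply.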

## 6. Load Test: Concurrent Requests

vLLM's continuous batching combines in-flight requests into a single forward pass on every scheduling step, rather than waiting to fill a static batch. The throughput difference is dramatic. Run this to see it:

``` python
# bench.py
import asyncio, time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="ignored")
PROMPT = "Summarize the history of neural networks in three sentences."
CONCURRENCY, TOTAL = 50, 200

async def one_request(sem):
    async with sem:
        t0 = time.monotonic()
        r = await client.chat.completions.create(
            model="llama3",
            messages=[{"role": "user", "content": PROMPT}],
            max_tokens=100,
        )
        return time.monotonic() - t0, r.usage.completion_tokens

async def main():
    sem = asyncio.Semaphore(CONCURRENCY)
    t_start = time.monotonic()
    results = await asyncio.gather(*[one_request(sem) for _ in range(TOTAL)])
    elapsed = time.monotonic() - t_start
    total_tok = sum(t for _, t in results)
    print(f"Requests: {TOTAL}  |  Wall time: {elapsed:.1f}s")
    print(f"Aggregate throughput: {total_tok / elapsed:.0f} tokens/sec")
    print(f"Avg latency: {sum(l for l, _ in results) / TOTAL:.2f}s")

asyncio.run(main())

# To run from the shell: pip install openai && python bench.py
```

A single A10G running Llama 3.1 8B in BF16 typically reaches **1,200-2,000 aggregate output tokens/sec** under this load. Sequential, one-at-a-time serving on the same hardware delivers roughly **35-45 tokens/sec**, because every decode step has to stream the full ~16 GB of weights from VRAM to produce a single token. Batching shares that read across all in-flight sequences, which is where the gain comes from.
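The scheduling idea can be illustrated with a toy model that counts forward passes. This is not vLLM's scheduler, only a sketch: it assumes every forward pass costs the same regardless of batch size and that each request is described solely by how many tokens it generates.

``` python
from collections import deque

def sequential_steps(lengths):
    """One request at a time: every token costs its own forward pass."""
    return sum(lengths)

def static_batch_steps(lengths, batch_size):
    """Fixed batches: each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batch_steps(lengths, batch_size):
    """Refill free slots from the queue before every forward pass."""
    queue = deque(lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Each running request emits one token; finished ones free their slot
        running = [left - 1 for left in running if left > 1]
    return steps

lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(sequential_steps(lengths))           # 260
print(static_batch_steps(lengths, 4))      # 200
print(continuous_batch_steps(lengths, 4))  # 110
```

In the static case the short requests hold their slots idle until the 100-token request finishes; in the continuous case those slots are reused immediately, which is why mixed-length traffic benefits most.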

## 7. Production Hardening

Run vLLM as a systemd service so it survives SSH disconnects:

```
# /etc/systemd/system/vllm.service
[Unit]
Description=vLLM inference server
After=network.target

[Service]
Type=simple
User=ubuntu
Environment="HF_TOKEN=hf_YOUR_TOKEN_HERE"
Environment="PATH=/home/ubuntu/vllm-env/bin:/usr/bin:/bin"
ExecStart=/home/ubuntu/vllm-env/bin/vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --host 127.0.0.1 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --served-model-name llama3
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
```

The server binds to `127.0.0.1` here, not `0.0.0.0`. Never expose port 8000 directly to the internet without authentication. The simplest secure access pattern from a dev machine:

```
ssh -L 8000:localhost:8000 ubuntu@your-gpu-host
```

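For production, put nginx in front with TLS termination and `auth_basic` or a JWT validation block, or use Caddy with an API key middleware. vLLM can also enforce a shared secret itself: start the server with `--api-key <secret>` and it rejects API requests that lack a matching `Authorization: Bearer` header.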

Enable and start:

```
sudo systemctl daemon-reload
sudo systemctl enable --now vllm
sudo journalctl -fu vllm
```

## Verify It Works

Query the models endpoint to confirm the server registered correctly:

```
curl http://localhost:8000/v1/models | python3 -m json.tool
```

Expected output contains `"id": "llama3"` and `"object": "model"`. The periodic stats lines in `journalctl` show throughput, the number of running and waiting requests, and GPU KV cache usage. Watch `nvidia-smi dmon` in a separate pane to confirm the GPU is saturated during the load test.
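For an automated smoke test, the same check can be expressed in code. The sketch below parses a `/v1/models` response body; the `sample` string is a trimmed, hypothetical payload that follows the OpenAI list schema, and real responses carry more fields.

``` python
import json

def served_model_ids(payload):
    """Extract model IDs from a /v1/models response body (str, bytes, or dict)."""
    if isinstance(payload, (str, bytes)):
        payload = json.loads(payload)
    return [m["id"] for m in payload.get("data", []) if m.get("object") == "model"]

# Trimmed, hypothetical response body for illustration
sample = '{"object": "list", "data": [{"id": "llama3", "object": "model"}]}'
assert "llama3" in served_model_ids(sample)
```

Feeding it the bytes returned by `urllib.request.urlopen("http://localhost:8000/v1/models").read()` runs the check against the live server.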

## Troubleshooting

**`torch.cuda.OutOfMemoryError` or `CUDA out of memory` on startup.** The weights plus the KV cache allocation exceeded available VRAM. Lower `--gpu-memory-utilization` to `0.80`, or reduce `--max-model-len` (halving it halves the KV cache that a single full-length sequence needs). A single 128k-token sequence at BF16 needs roughly 16 GiB of KV cache, which cannot fit beside the weights on a 24 GB card.

**High latency at low concurrency.** Continuous batching optimizes for aggregate throughput, not time-to-first-token on isolated requests. For latency-sensitive single-request workloads, set `--max-num-seqs 1` to disable multi-request batching.

**`OSError: You are trying to access a gated repo.`** Either `HF_TOKEN` is not set in the current shell (run `echo $HF_TOKEN` to verify), or your account has not accepted the Llama 3.1 license on Hugging Face. The acceptance must be done on the model page, not just in account settings.

**Garbled or truncated outputs.** Check that `--max-model-len` is larger than your prompt token count plus `max_tokens`. By default vLLM rejects a request that exceeds the limit with an HTTP 400 error instead of truncating the prompt, so output that stops mid-sentence usually means generation hit `max_tokens`; a `finish_reason` of `length` in the response confirms it.
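The length check in the last item is simple arithmetic, so a client can apply it before sending a request. The helper below is a hypothetical sketch; it assumes you already know the prompt's token count, for example from `usage.prompt_tokens` on an earlier response to the same prompt, and that the server runs with `--max-model-len 8192` as configured above.

``` python
def clamp_max_tokens(prompt_tokens, requested, max_model_len=8192):
    """Largest max_tokens that keeps prompt + output inside the context window."""
    room = max_model_len - prompt_tokens
    if room <= 0:
        raise ValueError(
            f"prompt uses {prompt_tokens} tokens, limit is {max_model_len}"
        )
    return min(requested, room)

print(clamp_max_tokens(8000, 500))  # 192
print(clamp_max_tokens(1000, 500))  # 500
```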

## Next Steps

- **Quantization:** `--quantization fp8` runs on-the-fly FP8 quantization (vLLM 0.5+) and fits larger models into the same VRAM. For AWQ, you need a pre-quantized checkpoint from a community hub like Hugging Face; then pass `--quantization awq` pointing at that repo.
- **Multi-GPU tensor parallelism:** `--tensor-parallel-size 2` on a dual-A100 node gives near-linear throughput scaling with no code changes.
- **Structured outputs:** `--guided-decoding-backend outlines` enforces JSON Schema constraints on generation, useful for tool-calling pipelines.
- **Metrics:** vLLM exposes a Prometheus-compatible endpoint at `GET /metrics`. Scrape it with a Grafana agent to track KV cache hit rates, queue depth, and inter-token latency percentiles in real time.
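Before wiring up a full scraper, the metrics endpoint can be inspected by hand. The sketch below parses the Prometheus text format into a dictionary; it assumes samples carry no trailing timestamp, and the metric names in `SAMPLE` are illustrative and may differ between vLLM versions.

``` python
def parse_metrics(text):
    """Parse Prometheus text exposition into {sample name with labels: float}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # Split on the last space: label values may contain spaces, the number cannot
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples

# Illustrative excerpt; real output has many more series
SAMPLE = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="llama3"} 12.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="llama3"} 0.43
"""

for name, value in parse_metrics(SAMPLE).items():
    print(name, value)
```

Passing the decoded body of `GET /metrics` instead of `SAMPLE` lists every series the running server exports.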

[Priya Nair](https://www.devclubhouse.com/u/priya_nair) · AI & Developer Experience Writer

Priya covers AI frameworks, developer productivity tooling, and the startup ecosystem across South and Southeast Asia, bringing a researcher's rigour and a practitioner's empathy to every story. She is deeply sceptical of benchmarks and asks hard questions so her readers don't have to.

