Serve an Open-Source LLM at Scale with vLLM on a Rented GPU Instance

Developers can deploy Llama 3.1 8B behind vLLM's OpenAI-compatible API on a rented GPU instance, achieving thousands of output tokens per second through continuous batching. The tutorial covers installing vLLM, authenticating with Hugging Face, starting the inference server, and sending requests, enabling production-ready LLM serving in under an hour.

Go from a bare cloud VM to a production-ready, OpenAI-compatible inference server in under an hour, using vLLM's continuous batching to hit thousands of output tokens per second on a single GPU.

Priya Nair (https://www.devclubhouse.com/u/priya_nair)

What You'll Build

You'll deploy Llama 3.1 8B behind vLLM's OpenAI-compatible API on a rented GPU instance, then verify that continuous batching actually delivers an order-of-magnitude throughput advantage over sequential serving.

Prerequisites

- A cloud GPU instance with at least one NVIDIA A10G (24 GB VRAM). Lambda Labs (~$0.80/hr for an A10), RunPod, and Vast.ai all work. An A100 (40/80 GB) handles larger models or higher concurrency.
- Ubuntu 22.04 with CUDA 12.1+ (standard on ML-optimized images). Confirm with nvidia-smi.
- Python 3.10 or 3.11. vLLM 0.6.x is not fully validated on 3.12 yet.
- A Hugging Face account with the Meta Llama 3.1 license accepted at hf.co/meta-llama/Meta-Llama-3.1-8B-Instruct, plus a read-scoped access token.
- SSH access to the instance. GPU driver installation and VPC networking are out of scope here.

1. Prepare the VM

Verify the GPU and CUDA runtime are visible before touching Python:

    nvidia-smi
    python3 --version

Create an isolated environment:

    python3 -m venv ~/vllm-env
    source ~/vllm-env/bin/activate
    pip install --upgrade pip

2. Install vLLM

    pip install vllm

This pulls PyTorch 2.4+ compiled for CUDA 12.1 along with vLLM's paged-attention kernels. Expect 3-5 minutes and roughly 4 GB of downloads. Confirm the install:

    python -c "import vllm; print(vllm.__version__)"

If your image has CUDA 11.8 (uncommon on current offerings), the default vLLM 0.6.x wheel from PyPI is built against CUDA 12.1 and is unlikely to run there. The simplest workaround is the official vLLM Docker image, which bundles its own CUDA runtime (the host NVIDIA driver must still be recent enough for CUDA 12):

    docker pull vllm/vllm-openai:latest

3. Authenticate with Hugging Face

    export HF_TOKEN="hf_YOUR_TOKEN_HERE"

Add this to ~/.profile for persistence, or store it in your cloud provider's secrets manager.
Never commit tokens to source control or bake them into Docker layers.

4. Start the Inference Server

    vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
      --host 0.0.0.0 \
      --port 8000 \
      --tensor-parallel-size 1 \
      --gpu-memory-utilization 0.90 \
      --max-model-len 8192 \
      --served-model-name llama3

Key flags at a glance:

| Flag | Effect |
|---|---|
| --tensor-parallel-size | Shards the model across N GPUs. Set to 2 on a dual-A10G node; throughput often scales close to linearly, depending on the GPU interconnect. |
| --gpu-memory-utilization | Fraction of each GPU's memory that vLLM may use in total (weights, activations, and KV cache). Whatever is left after the weights load becomes KV cache. Leave headroom; 0.90 works well for most cases. |
| --max-model-len | Caps total sequence length (prompt plus generated tokens). Llama 3.1 supports 128k natively, but one 128k-token sequence needs roughly 16 GiB of BF16 KV cache, which cannot fit next to ~16 GB of weights on a 24 GB card. |
| --served-model-name | The model ID clients send in requests; decouples your API surface from the HF repo path. |

Weights download on first run (~16 GB for BF16). Subsequent starts read from ~/.cache/huggingface. The server is ready when you see:

    INFO:     Application startup complete.

5. Send Your First Request

From a second terminal (no venv needed for curl):

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "llama3",
        "messages": [{"role": "user", "content": "What is PagedAttention?"}],
        "max_tokens": 150,
        "temperature": 0.7
      }'

The response follows OpenAI's chat completions schema. Point any OpenAI SDK at the server by changing base_url:

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="ignored")
    response = client.chat.completions.create(
        model="llama3",
        messages=[{"role": "user", "content": "Write a haiku about GPU memory."}],
        max_tokens=60,
    )
    print(response.choices[0].message.content)

api_key can be any non-empty string by default. See step 7 for enforcing real authentication.

6. Load Test: Concurrent Requests

vLLM's continuous batching combines in-flight requests into a single forward pass on every scheduling step, rather than waiting to fill a static batch.
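
To see why refilling the batch on every step matters, the toy model below counts forward passes under the two policies. It is a sketch, not vLLM's scheduler: the function names and request lengths are made up for illustration, prefill cost is ignored, and every step is assumed to cost the same no matter how many sequences it carries.

```python
from collections import deque


def static_batch_steps(lengths, batch_size):
    """Forward passes needed when each batch runs until its longest sequence ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # The whole batch occupies the GPU until its slowest member finishes.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batch_steps(lengths, batch_size):
    """Forward passes needed when freed slots are refilled on the next step."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each in-flight sequence
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the forward pass.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One forward pass emits one token per running sequence;
        # sequences that just produced their last token leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Hypothetical output lengths (in tokens): a few long requests among short ones.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]

print("static:    ", static_batch_steps(lengths, batch_size=4))
print("continuous:", continuous_batch_steps(lengths, batch_size=4))
```

With these made-up lengths, static batching needs 200 steps because each batch waits for its longest sequence, while the continuous policy finishes in 110 by handing freed slots to waiting requests straight away.
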
The throughput difference is dramatic. Run this to see it:

    # bench.py
    import asyncio
    import time

    from openai import AsyncOpenAI

    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="ignored")
    PROMPT = "Summarize the history of neural networks in three sentences."
    CONCURRENCY, TOTAL = 50, 200

    async def one_request(sem):
        # The semaphore caps how many requests are in flight at once.
        async with sem:
            t0 = time.monotonic()
            r = await client.chat.completions.create(
                model="llama3",
                messages=[{"role": "user", "content": PROMPT}],
                max_tokens=100,
            )
            return time.monotonic() - t0, r.usage.completion_tokens

    async def main():
        sem = asyncio.Semaphore(CONCURRENCY)
        t_start = time.monotonic()
        results = await asyncio.gather(*[one_request(sem) for _ in range(TOTAL)])
        elapsed = time.monotonic() - t_start
        total_tok = sum(t for _, t in results)
        print(f"Requests: {TOTAL} | Wall time: {elapsed:.1f}s")
        print(f"Aggregate throughput: {total_tok / elapsed:.0f} tokens/sec")
        print(f"Avg latency: {sum(l for l, _ in results) / TOTAL:.2f}s")

    asyncio.run(main())

Install the client library and run the benchmark:

    pip install openai
    python bench.py

A single A10G running Llama 3.1 8B in BF16 typically reaches 1,200-2,000 aggregate output tokens/sec under this load. Sequential, one-at-a-time serving on the same hardware delivers roughly 35-45 tokens/sec, because single-stream decoding is memory-bandwidth bound: every generated token requires reading all ~16 GB of weights from VRAM, and the A10G moves about 600 GB/s. Batching amortizes that one read across every sequence in the batch.

7. Production Hardening

Run vLLM as a systemd service so it survives SSH disconnects:

    # /etc/systemd/system/vllm.service
    [Unit]
    Description=vLLM inference server
    After=network.target

    [Service]
    Type=simple
    User=ubuntu
    Environment="HF_TOKEN=hf_YOUR_TOKEN_HERE"
    Environment="PATH=/home/ubuntu/vllm-env/bin:/usr/bin:/bin"
    ExecStart=/home/ubuntu/vllm-env/bin/vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
      --host 127.0.0.1 \
      --port 8000 \
      --tensor-parallel-size 1 \
      --gpu-memory-utilization 0.90 \
      --max-model-len 8192 \
      --served-model-name llama3
    Restart=on-failure
    RestartSec=10

    [Install]
    WantedBy=multi-user.target

The unit file holds your token in plain text, so restrict its permissions to root, or move the variable into a separate file referenced with EnvironmentFile=. The server binds to 127.0.0.1 here, not 0.0.0.0.
Never expose port 8000 directly to the internet without authentication. The simplest secure access pattern from a dev machine:

    ssh -L 8000:localhost:8000 ubuntu@your-gpu-host

For production, put nginx in front with TLS termination and auth_basic or a JWT validation block, or use Caddy with an API key middleware. vLLM's own --api-key flag adds a shared bearer-token check on top of that. Enable and start:

    sudo systemctl daemon-reload
    sudo systemctl enable --now vllm
    sudo journalctl -fu vllm

Verify It Works

Query the models endpoint to confirm the server registered correctly:

    curl http://localhost:8000/v1/models | python3 -m json.tool

Expected output contains "id": "llama3" and "object": "model". The periodic stats lines in journalctl report running and waiting request counts along with GPU KV cache usage. Watch nvidia-smi dmon in a separate pane to confirm the GPU is saturated during the load test.

Troubleshooting

torch.cuda.OutOfMemoryError or CUDA out of memory on startup. The weights plus the KV cache allocation exceeded available VRAM. Lower --gpu-memory-utilization to 0.80 if other processes share the GPU, or reduce --max-model-len (halving it halves the KV cache a single full-length sequence needs). Together with ~16 GB of weights, a 128k context at BF16 needs well over 24 GB.

High latency at low concurrency. Continuous batching optimizes for aggregate throughput, not time-to-first-token on isolated requests. For latency-sensitive single-request workloads, set --max-num-seqs 1 to disable multi-request batching.

OSError: You are trying to access a gated repo. Either HF_TOKEN is not set in the current shell (echo $HF_TOKEN to verify), or your account has not accepted the Llama 3.1 license on Hugging Face. The acceptance must be done on the model page, not just in account settings.

Garbled or truncated outputs. Check that --max-model-len is larger than your prompt token count plus max_tokens. By default, vLLM's OpenAI-compatible server rejects a request whose combined length exceeds the limit with an HTTP 400 error rather than truncating it, so a reply that stops mid-sentence usually means generation hit max_tokens (finish_reason will be "length").

Next Steps

- Quantization: --quantization fp8 runs on-the-fly FP8 quantization (vLLM 0.5+) and fits larger models into the same VRAM. Native FP8 compute requires Ada Lovelace or Hopper GPUs, so check vLLM's hardware support notes before relying on it on Ampere cards such as the A10G.
For AWQ, you need a pre-quantized checkpoint, for example one published on Hugging Face; serve that repo instead of the original and pass --quantization awq.
- Multi-GPU tensor parallelism: --tensor-parallel-size 2 on a dual-A100 node shards the model across both GPUs with no code changes, and throughput often scales close to linearly.
- Structured outputs: --guided-decoding-backend outlines selects the backend vLLM uses to enforce the JSON Schema constraints that clients attach to individual requests, which is useful for tool-calling pipelines.
- Metrics: vLLM exposes a Prometheus-compatible endpoint at GET /metrics. Scrape it with a Grafana agent to track KV cache usage, queue depth, and inter-token latency percentiles in real time.
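
The /metrics endpoint mentioned under Next Steps can also be sampled by hand before a full Prometheus setup exists. The sketch below uses only the Python standard library. The helper names, the sample text, and the metric names inside it are illustrative assumptions: the sample is hand-written rather than captured from a server, metric names vary between vLLM versions, and the parser assumes lines carry no trailing timestamp.

```python
import urllib.request


def parse_prometheus(text):
    """Parse Prometheus text-format lines into a {series: value} dict."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Split from the right so spaces inside label values are preserved.
        series, _, value = line.rpartition(" ")
        try:
            samples[series] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples


def fetch_metrics(url="http://localhost:8000/metrics", timeout=5.0):
    """Fetch and parse metrics from a running server (URL is an assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_prometheus(resp.read().decode("utf-8"))


# Hand-written sample in Prometheus text format, for illustration only.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="llama3"} 12.0
vllm:num_requests_waiting{model_name="llama3"} 3.0
vllm:gpu_cache_usage_perc{model_name="llama3"} 0.41
"""

for series, value in parse_prometheus(SAMPLE).items():
    print(f"{series} = {value}")
```

Against a live server, fetch_metrics() returns the same kind of dictionary, which is enough to eyeball queue depth and cache usage during the load test from step 6.
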