{"slug": "serve-an-open-source-llm-at-scale-with-vllm-on-a-rented-gpu-instance", "title": "Serve an Open-Source LLM at Scale with vLLM on a Rented GPU Instance", "summary": "Developers can deploy Llama 3.1 8B behind vLLM's OpenAI-compatible API on a rented GPU instance, achieving thousands of output tokens per second through continuous batching. The tutorial covers installing vLLM, authenticating with Hugging Face, starting the inference server, and sending requests, enabling production-ready LLM serving in under an hour.", "body_md": "# Serve an Open-Source LLM at Scale with vLLM on a Rented GPU Instance\n\nGo from a bare cloud VM to a production-ready, OpenAI-compatible inference server in under an hour, using vLLM's continuous batching to hit thousands of output tokens per second on a single GPU.\n\n[Priya Nair](https://www.devclubhouse.com/u/priya_nair)\n\n## What You'll Build\n\nYou'll deploy Llama 3.1 8B behind vLLM's OpenAI-compatible API on a rented GPU instance, then verify that continuous batching actually delivers an order-of-magnitude throughput advantage over sequential serving.\n\n## Prerequisites\n\n- A cloud GPU instance with at least one **NVIDIA A10G (24 GB VRAM)**. Lambda Labs (~$0.80/hr for A10), RunPod, and Vast.ai all work. An A100 40/80 GB handles larger models or higher concurrency.\n- **Ubuntu 22.04**, CUDA 12.1+ (standard on ML-optimized images). Confirm with `nvidia-smi`.\n- **Python 3.10 or 3.11**. vLLM 0.6.x is not fully validated on 3.12 yet.\n- A Hugging Face account with the **Meta Llama 3.1 license accepted** at `hf.co/meta-llama/Meta-Llama-3.1-8B-Instruct`, plus a `read`-scoped access token.\n- SSH access to the instance.\n\nGPU driver installation and VPC networking are out of scope here.\n\n## 1. 
Prepare the VM\n\nVerify the GPU and CUDA runtime are visible before touching Python:\n\n```\nnvidia-smi\npython3 --version\n```\n\nCreate an isolated environment:\n\n```\npython3 -m venv ~/vllm-env\nsource ~/vllm-env/bin/activate\npip install --upgrade pip\n```\n\n## 2. Install vLLM\n\n```\npip install vllm\n```\n\nThis pulls PyTorch 2.4+ compiled for CUDA 12.1 along with vLLM's paged-attention kernels. Expect 3-5 minutes and roughly 4 GB of downloads. Confirm the install:\n\n```\npython -c \"import vllm; print(vllm.__version__)\"\n```\n\nIf your image has CUDA 11.8 (uncommon on current offerings), the default vLLM 0.6.x wheel on PyPI, which targets CUDA 12.1, will not work. Use the official vLLM Docker image instead: `docker pull vllm/vllm-openai:latest`.\n\n## 3. Authenticate with Hugging Face\n\n```\nexport HF_TOKEN=\"hf_YOUR_TOKEN_HERE\"\n```\n\nAdd this to `~/.profile` for persistence, or store it in your cloud provider's secrets manager. Never commit tokens to source control or bake them into Docker layers.\n\n## 4. Start the Inference Server\n\n```\nvllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \\\n  --host 0.0.0.0 \\\n  --port 8000 \\\n  --tensor-parallel-size 1 \\\n  --gpu-memory-utilization 0.90 \\\n  --max-model-len 8192 \\\n  --served-model-name llama3\n```\n\nKey flags at a glance:\n\n| Flag | Effect |\n|---|---|\n| `--tensor-parallel-size` | Shards the model across N GPUs. Set to `2` on a dual-A10G node for higher throughput; how close scaling gets to linear depends on the GPU interconnect. |\n| `--gpu-memory-utilization` | Fraction of VRAM vLLM may use in total (weights, activations, and KV cache). Leave headroom; 0.90 works well for most cases. |\n| `--max-model-len` | Caps total sequence length (prompt plus output). Llama 3.1 supports 128k natively, but a 128k KV cache at BF16 (about 16 GiB for the 8B model) cannot fit next to the weights on 24 GB. |\n| `--served-model-name` | The model ID clients send in requests; decouples your API surface from the HF repo path. |\n\nWeights download on first run (~16 GB for BF16). Subsequent starts read from `~/.cache/huggingface`. 
The server is ready when you see:\n\n```\nINFO:     Application startup complete.\n```\n\n## 5. Send Your First Request\n\nFrom a second terminal (no venv needed for curl):\n\n```\ncurl http://localhost:8000/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"llama3\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"What is PagedAttention?\"}],\n    \"max_tokens\": 150,\n    \"temperature\": 0.7\n  }'\n```\n\nThe response schema is identical to OpenAI's. Point any OpenAI SDK at the server by changing `base_url`:\n\n``` python\nfrom openai import OpenAI\n\nclient = OpenAI(base_url=\"http://localhost:8000/v1\", api_key=\"ignored\")\nresponse = client.chat.completions.create(\n    model=\"llama3\",\n    messages=[{\"role\": \"user\", \"content\": \"Write a haiku about GPU memory.\"}],\n    max_tokens=60,\n)\nprint(response.choices[0].message.content)\n```\n\n`api_key` can be any non-empty string by default. See step 7 for enforcing real authentication.\n\n## 6. Load Test: Concurrent Requests\n\nvLLM's continuous batching combines in-flight requests into a single forward pass on every scheduling step, rather than waiting to fill a static batch. The throughput difference is dramatic. 
Run this to see it:\n\n``` python\n# bench.py\nimport asyncio, time\nfrom openai import AsyncOpenAI\n\nclient = AsyncOpenAI(base_url=\"http://localhost:8000/v1\", api_key=\"ignored\")\nPROMPT = \"Summarize the history of neural networks in three sentences.\"\n# At most CONCURRENCY requests in flight at once; TOTAL requests overall.\nCONCURRENCY, TOTAL = 50, 200\n\nasync def one_request(sem):\n    # Returns (latency in seconds, number of generated tokens).\n    async with sem:\n        t0 = time.monotonic()\n        r = await client.chat.completions.create(\n            model=\"llama3\",\n            messages=[{\"role\": \"user\", \"content\": PROMPT}],\n            max_tokens=100,\n        )\n        return time.monotonic() - t0, r.usage.completion_tokens\n\nasync def main():\n    sem = asyncio.Semaphore(CONCURRENCY)\n    t_start = time.monotonic()\n    results = await asyncio.gather(*[one_request(sem) for _ in range(TOTAL)])\n    elapsed = time.monotonic() - t_start\n    total_tok = sum(t for _, t in results)\n    print(f\"Requests: {TOTAL}  |  Wall time: {elapsed:.1f}s\")\n    print(f\"Aggregate throughput: {total_tok / elapsed:.0f} tokens/sec\")\n    print(f\"Avg latency: {sum(l for l, _ in results) / TOTAL:.2f}s\")\n\nasyncio.run(main())\n```\n\nInstall the client library inside the venv and run the benchmark:\n\n```\npip install openai\npython bench.py\n```\n\nA single A10G running Llama 3.1 8B in BF16 typically reaches **1,200-2,000 aggregate output tokens/sec** under this load. Sequential, one-at-a-time serving on the same hardware delivers roughly **35-45 tokens/sec**, because every decode step has to read the full ~16 GB of weights from VRAM to produce a single token, so a lone request is limited by memory bandwidth while most of the GPU's compute sits idle. Batching amortizes each weight read across all in-flight sequences.\n\n## 7. 
Production Hardening\n\nRun vLLM as a systemd service so it survives SSH disconnects:\n\n```\n# /etc/systemd/system/vllm.service\n[Unit]\nDescription=vLLM inference server\nAfter=network.target\n\n[Service]\nType=simple\nUser=ubuntu\nEnvironment=\"HF_TOKEN=hf_YOUR_TOKEN_HERE\"\nEnvironment=\"PATH=/home/ubuntu/vllm-env/bin:/usr/bin:/bin\"\nExecStart=/home/ubuntu/vllm-env/bin/vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \\\n  --host 127.0.0.1 \\\n  --port 8000 \\\n  --tensor-parallel-size 1 \\\n  --gpu-memory-utilization 0.90 \\\n  --max-model-len 8192 \\\n  --served-model-name llama3\nRestart=on-failure\nRestartSec=10\n\n[Install]\nWantedBy=multi-user.target\n```\n\nThe inline `HF_TOKEN` keeps the example short; for anything long-lived, move it into a root-readable file and load it with systemd's `EnvironmentFile=` directive so the secret is not stored in the unit itself.\n\nThe server binds to `127.0.0.1` here, not `0.0.0.0`. Never expose port 8000 directly to the internet without authentication. The simplest secure access pattern from a dev machine:\n\n```\nssh -L 8000:localhost:8000 ubuntu@your-gpu-host\n```\n\nFor production, put nginx in front with TLS termination and `auth_basic` or a JWT validation block, or use Caddy with an API key middleware. vLLM can also require a bearer token itself through the `--api-key` flag, which is a useful second layer behind the proxy.\n\nEnable and start:\n\n```\nsudo systemctl daemon-reload\nsudo systemctl enable --now vllm\nsudo journalctl -fu vllm\n```\n\n## Verify It Works\n\nQuery the models endpoint to confirm the server registered correctly:\n\n```\ncurl -s http://localhost:8000/v1/models | python3 -m json.tool\n```\n\nExpected output contains `\"id\": \"llama3\"` and `\"object\": \"model\"`. The periodic stats lines in `journalctl` show live KV cache usage along with the number of running and pending requests. Watch `nvidia-smi dmon` in a separate pane to confirm the GPU is saturated during the load test.\n\n## Troubleshooting\n\n**`torch.cuda.OutOfMemoryError` or `CUDA out of memory` on startup.** The KV cache allocation exceeded available VRAM. Lower `--gpu-memory-utilization` to `0.80`, or reduce `--max-model-len` (halving it halves the KV cache a single full-length sequence needs). 
A 128k context at BF16 needs about 16 GiB of KV cache on top of roughly 16 GB of weights, which is more than a 24 GB card holds.\n\n**High latency at low concurrency.** Continuous batching optimizes for aggregate throughput, not time-to-first-token on isolated requests. For latency-sensitive single-request workloads, set `--max-num-seqs 1` to disable multi-request batching.\n\n**`OSError: You are trying to access a gated repo.`** Either `HF_TOKEN` is not set in the current shell (`echo $HF_TOKEN` to verify), or your account has not accepted the Llama 3.1 license on Hugging Face. The acceptance must be done on the model page, not just in account settings.\n\n**Garbled or truncated outputs.** Check that `--max-model-len` is larger than your prompt token count plus `max_tokens`. vLLM does not silently truncate the prompt: a request whose combined length exceeds the configured limit is rejected with an HTTP 400 error. If a response stops mid-sentence, inspect `finish_reason` in the response; a value of `length` means generation hit `max_tokens`.\n\n## Next Steps\n\n- **Quantization:** `--quantization fp8` runs on-the-fly FP8 quantization (vLLM 0.5+) and fits larger models into the same VRAM. For AWQ, you need a pre-quantized checkpoint from a community hub like Hugging Face; point `vllm serve` at that repo and pass `--quantization awq`.\n- **Multi-GPU tensor parallelism:** `--tensor-parallel-size 2` on a dual-A100 node gives near-linear throughput scaling with no code changes.\n- **Structured outputs:** `--guided-decoding-backend outlines` selects the Outlines backend for guided decoding, which enforces JSON Schema constraints supplied with each request, useful for tool-calling pipelines.\n- **Metrics:** vLLM exposes a Prometheus-compatible endpoint at `GET /metrics`. Scrape it with a Grafana agent to track KV cache usage, queue depth, and inter-token latency percentiles in real time.\n\n[Priya Nair](https://www.devclubhouse.com/u/priya_nair) · AI & Developer Experience Writer\n\nPriya covers AI frameworks, developer productivity tooling, and the startup ecosystem across South and Southeast Asia, bringing a researcher's rigour and a practitioner's empathy to every story. 
She is deeply sceptical of benchmarks and asks hard questions so her readers don't have to.", "url": "https://wpnews.pro/news/serve-an-open-source-llm-at-scale-with-vllm-on-a-rented-gpu-instance", "canonical_source": "https://www.devclubhouse.com/a/serve-an-open-source-llm-at-scale-with-vllm-on-a-rented-gpu-instance", "published_at": "2026-06-23 17:42:49+00:00", "updated_at": "2026-06-24 00:14:32.018825+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-tools", "ai-products", "developer-tools"], "entities": ["vLLM", "Llama 3.1", "Meta", "Hugging Face", "NVIDIA", "Lambda Labs", "RunPod", "Vast.ai"], "alternates": {"html": "https://wpnews.pro/news/serve-an-open-source-llm-at-scale-with-vllm-on-a-rented-gpu-instance", "markdown": "https://wpnews.pro/news/serve-an-open-source-llm-at-scale-with-vllm-on-a-rented-gpu-instance.md", "text": "https://wpnews.pro/news/serve-an-open-source-llm-at-scale-with-vllm-on-a-rented-gpu-instance.txt", "jsonld": "https://wpnews.pro/news/serve-an-open-source-llm-at-scale-with-vllm-on-a-rented-gpu-instance.jsonld"}}