{"slug": "run-a-vllm-server-on-hf-jobs-in-one-command", "title": "Run a vLLM Server on HF Jobs in One Command", "summary": "Hugging Face launched a one-command method to run a vLLM server on its Jobs infrastructure, enabling users to quickly deploy models for testing, evaluation, or batch generation. The feature uses the official vLLM Docker image, exposes a public proxy URL gated by Hugging Face tokens, and bills per minute of hardware usage.", "body_md": "# Run a vLLM Server on HF Jobs in One Command\n\nIt's the quickest way to stand up a model for tests, evals, or batch generation. (If you're after a managed, production-ready service instead, that's what [Inference Endpoints](https://huggingface.co/docs/inference-endpoints) are for, with [more on when to pick which](#hf-jobs-or-inference-endpoints) at the end.)\n\nHere's the whole thing end to end.\n\n## Prerequisites\n\n- A payment method or a positive prepaid credit balance (Jobs is billed per minute by hardware usage).\n- `huggingface_hub >= 1.20.0`: `pip install -U \"huggingface_hub>=1.20.0\"`.\n- Logged in locally: `hf auth login`.\n\n## Launch the server\n\n`hf jobs run` is `docker run` for HF infrastructure. We use the official `vllm/vllm-openai` image, ask for a GPU with `--flavor`, and expose vLLM's port with `--expose`:\n\n```\nhf jobs run --flavor a10g-large --expose 8000 --timeout 2h \\\n  vllm/vllm-openai:latest \\\n  vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000\n```\n\n`--expose 8000` routes the container's port through HF's public jobs proxy (see the [Serve Models guide](https://huggingface.co/docs/hub/jobs-serving) for the full reference). 
The command prints the URL your server is reachable at:\n\n```\n✓ Job started\n  id: 6a381ca1953ed90bfb947332\n  url: https://huggingface.co/jobs/qgallouedec/6a381ca1953ed90bfb947332\nHint: Exposed ports are reachable at (requires an HF token with read access to the job):\n  https://6a381ca1953ed90bfb947332--8000.hf.jobs\n```\n\n`6a381ca1953ed90bfb947332` is your job ID. Keep track of it; we'll need it. We'll use `<job_id>` as a placeholder for it in the rest of the post.\n\nGive it a couple of minutes to download weights and boot. When the logs show `Application startup complete`, you're live.\n\n## Query it from anywhere\n\nvLLM speaks the OpenAI API, and every request just needs your HF token as a bearer token. The quickest way to hit it is curl:\n\n```\ncurl https://<job_id>--8000.hf.jobs/v1/chat/completions \\\n  -H \"Authorization: Bearer $(hf auth token)\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"Qwen/Qwen3-4B\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}],\n    \"chat_template_kwargs\": {\"enable_thinking\": false}\n  }'\n```\n\nwhich returns the usual OpenAI-style JSON, with `choices[0].message.content` holding `\"Hello! How can I assist you today? 😊\"`.\n\nOr, from Python, point the OpenAI client at the exposed URL and pass the token as the API key:\n\n``` python\nfrom huggingface_hub import get_token\nfrom openai import OpenAI\n\nclient = OpenAI(\n    base_url=\"https://<job_id>--8000.hf.jobs/v1\",\n    api_key=get_token(),\n)\nresp = client.chat.completions.create(\n    model=\"Qwen/Qwen3-4B\",\n    messages=[{\"role\": \"user\", \"content\": \"Hello!\"}],\n    extra_body={\"chat_template_kwargs\": {\"enable_thinking\": False}},\n)\nprint(resp.choices[0].message.content)\n# Hello! How can I assist you today? 
😊\n```\n\nQuick health check before you start: `curl https://<job_id>--8000.hf.jobs/v1/models -H \"Authorization: Bearer $(hf auth token)\"` should list the model.\n\n🔐 The endpoint is gated, not public. Every request must carry an HF token with read access to the job's namespace. A plain browser visit will be rejected. In effect, the jobs proxy is your API gate: access is scoped to you (and your org). That's fine for private use, but treat the URL accordingly: don't share it expecting it to be open, and don't paste your token into untrusted places. If you need finer-grained or public access, put a proper gateway in front instead, or see [HF Jobs or Inference Endpoints?](#hf-jobs-or-inference-endpoints) below.\n\n## Clean up\n\nJobs are billed per second, so stop the server when you're done:\n\n```\nhf jobs cancel <job_id>\n```\n\nThe `--timeout` you set is a safety net (it'll auto-stop), but cancelling explicitly is cheaper. An `a10g-large` runs at $1.50/hour; check `hf jobs hardware` for the full price list and pick the smallest flavor that fits your model.\n\n## Going further: bigger models\n\nThe same command scales to much larger models: pick a beefier `--flavor` and tell vLLM to shard the model across the GPUs with `--tensor-parallel-size`. For example, the 122B Qwen3.5 mixture-of-experts model on 2× H200:\n\n```\nhf jobs run --flavor h200x2 --expose 8000 --timeout 2h \\\n  vllm/vllm-openai:latest \\\n  vllm serve Qwen/Qwen3.5-122B-A10B \\\n  --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 \\\n  --max-model-len 32768 --max-num-seqs 256\n```\n\n`--tensor-parallel-size` should match the number of GPUs in the flavor (`h200x2` → 2, `h200x8` → 8). Run `hf jobs hardware` to see what's available, and give bigger models a longer `--timeout`, since they take longer to download and load. 
For large models, H200 flavors are usually the best value.\n\nThe `--max-model-len 32768 --max-num-seqs 256` flags are specific to this model: Qwen3.5-122B is a hybrid Mamba/attention architecture with a 256K-token default context, which doesn't leave enough memory for vLLM's default batch settings. Capping the context length and concurrent-sequence count keeps it within the GPUs' memory. If a model fails to start with an out-of-memory or cache-block error, dialing these two down is the first thing to try. Everything else (the exposed URL, the OpenAI client, the token auth) stays exactly the same.\n\n## Going further: Chat with it in a UI\n\nPrefer a chat window over curl? A few lines of [Gradio](https://www.gradio.app/) point at the same endpoint. Add `--reasoning-parser deepseek_r1` to the `vllm serve` command so Qwen3's thinking comes back as a separate field (not necessary, but helpful), then run this code locally (you'll just need the job ID):\n\n``` python\nimport gradio as gr\nfrom gradio import ChatMessage\nfrom huggingface_hub import get_token\nfrom openai import OpenAI\n\nclient = OpenAI(base_url=\"https://<job_id>--8000.hf.jobs/v1\", api_key=get_token())\n\ndef chat(message, history):\n    # Drop earlier thinking panels (they carry metadata) so only real turns are resent\n    messages = [{\"role\": m[\"role\"], \"content\": m[\"content\"]} for m in history if not m.get(\"metadata\")]\n    messages.append({\"role\": \"user\", \"content\": message})\n    stream = client.chat.completions.create(model=\"Qwen/Qwen3-4B\", messages=messages, stream=True)\n\n    thinking, answer = \"\", \"\"\n    for chunk in stream:\n        delta = chunk.choices[0].delta\n        # The reasoning field is a vLLM extra and may be absent or null on a chunk\n        thinking += (delta.model_extra or {}).get(\"reasoning\") or \"\"\n        answer += delta.content or \"\"\n        out = []\n        if thinking.strip():\n            status = \"done\" if answer.strip() else \"pending\"\n            out.append(ChatMessage(role=\"assistant\", content=thinking, metadata={\"title\": \"💭 Thinking\", \"status\": status}))\n        if answer.strip():\n          
  out.append(ChatMessage(role=\"assistant\", content=answer))\n        yield out\n\ngr.ChatInterface(chat).launch()\n```\n\nRun it, open `http://127.0.0.1:7860`, and chat: reasoning streams into the collapsible panel, with the answer below.\n\n## Going further: SSH into the running server\n\nNeed to debug a startup failure, watch GPU memory, or tail logs interactively? You can open a shell straight into the running job. Launch it with `--ssh` and make sure your public key is registered at [huggingface.co/settings/keys](https://huggingface.co/settings/keys):\n\n```\nhf jobs run --flavor a10g-large --expose 8000 --timeout 2h --ssh \\\n  vllm/vllm-openai:latest \\\n  vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000\n```\n\nthen connect with the job ID:\n\n```\nhf jobs ssh <job_id>\n```\n\nYou're now inside the container, where you can run `nvidia-smi`, inspect the process, or poke at the model directly, which makes debugging and monitoring much easier than reading logs from the outside. SSH support requires `huggingface_hub >= 1.20.0`.\n\n## Going further: Use it as a coding-agent backend with Pi\n\nThe same endpoint can back a terminal coding agent. [Pi](https://pi.dev) is a provider-agnostic agent harness. Point it at the job and you get a Read/Write/Edit/Bash agent running on your own self-hosted model.\n\nOne thing to set up first: agents drive the model through tool calls, and vLLM only accepts those if the server is launched with tool calling enabled. So relaunch with `--enable-auto-tool-choice` and a `--tool-call-parser` matching the model family (`hermes` for Qwen3). 
Agents also benefit from a stronger model, so this is a good place to bring in the bigger one:\n\n```\nhf jobs run --flavor h200x2 --expose 8000 --timeout 2h \\\n  vllm/vllm-openai:latest \\\n  vllm serve Qwen/Qwen3.5-122B-A10B \\\n  --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 \\\n  --max-model-len 32768 --max-num-seqs 256 \\\n  --reasoning-parser deepseek_r1 \\\n  --enable-auto-tool-choice --tool-call-parser hermes\n```\n\nThen add the job as a custom provider in `~/.pi/agent/models.json`:\n\n```\n{\n  \"providers\": {\n    \"hf-jobs\": {\n      \"baseUrl\": \"https://<job_id>--8000.hf.jobs/v1\",\n      \"api\": \"openai-completions\",\n      \"apiKey\": \"!hf auth token\",\n      \"models\": [\n        { \"id\": \"Qwen/Qwen3.5-122B-A10B\" }\n      ]\n    }\n  }\n}\n```\n\nThen launch the agent against it:\n\n```\npi\n```\n\nThe model you spun up a couple of commands ago is now driving an interactive coding agent in your terminal.\n\n## HF Jobs or Inference Endpoints?\n\nHF Jobs isn't the only way to serve a model on Hugging Face. [Inference Endpoints](https://huggingface.co/docs/inference-endpoints) are our managed product for the same job, and which one fits depends on what you're after.\n\nReach for **HF Jobs** when you want maximum flexibility and control: it's just `docker run` on HF infrastructure, so you pick the image, the exact `vllm serve` flags, and the hardware, and you pay per second for as long as the job runs. That makes it a great fit for experiments, one-off evals, batch generation, or kicking the tires on a model before committing to anything.\n\nReach for **Inference Endpoints** when you want something more production-ready. They add the operational niceties a long-lived service needs: finer-grained access control (an endpoint can be public, protected, or private) and scale-to-zero, so you're not billed during periods of inactivity. 
If you're standing up a durable endpoint rather than running a job, that's the tool to grab.\n\n## Further reading\n\nThis post sticks to vLLM, but the same expose-a-port pattern works with any OpenAI-compatible server. To serve GGUFs with llama.cpp or run SGLang instead, see the [Serve Models on Jobs guide](https://huggingface.co/docs/hub/jobs-serving), which walks through those backends.", "url": "https://wpnews.pro/news/run-a-vllm-server-on-hf-jobs-in-one-command", "canonical_source": "https://huggingface.co/blog/vllm-jobs", "published_at": "2026-06-25 20:42:58.519244+00:00", "updated_at": "2026-06-25 20:42:58.519244+00:00", "lang": "en", "topics": ["ai-infrastructure", "developer-tools", "large-language-models"], "entities": ["Hugging Face", "vLLM", "Qwen", "OpenAI", "Inference Endpoints"], "alternates": {"html": "https://wpnews.pro/news/run-a-vllm-server-on-hf-jobs-in-one-command", "markdown": "https://wpnews.pro/news/run-a-vllm-server-on-hf-jobs-in-one-command.md", "text": "https://wpnews.pro/news/run-a-vllm-server-on-hf-jobs-in-one-command.txt", "jsonld": "https://wpnews.pro/news/run-a-vllm-server-on-hf-jobs-in-one-command.jsonld"}}