{"slug": "connecting-opencode-to-a-self-hosted-llm-vllm-nemotron-3-super", "title": "Connecting OpenCode to a Self-Hosted LLM (vLLM + Nemotron 3 Super)", "summary": "OpenCode, an open-source terminal-first coding agent, now supports self-hosted large language models via OpenAI-compatible APIs, enabling users to connect it to a vLLM server running NVIDIA's Nemotron-3-Super-120B-A12B model. This setup avoids vendor lock-in and translation proxies required by tools like Claude Code, though proper tool-call and reasoning parsers are critical for agent functionality.", "body_md": "# Connecting OpenCode to a Self-Hosted LLM (vLLM + Nemotron 3 Super)\n\nCoding agents like Claude Code and Codex are excellent, but both are wired to a specific vendor's API. If you run your own inference stack (for cost control, data residency, or because you have GPUs sitting idle), you want an agent you can point at *your* endpoint. [OpenCode](https://opencode.ai/?ref=corti.com) is the cleanest fit: it's terminal-first, open source, and talks to any OpenAI-compatible API without a translation layer.\n\nThis post walks through connecting OpenCode's CLI to a self-hosted [vLLM](https://docs.vllm.ai/?ref=corti.com) server, using NVIDIA's `Nemotron-3-Super-120B-A12B` as the worked example. This is the model [I'm currently hosting on my two-node NVIDIA DGX Spark cluster](https://corti.com/serving-nemotron-super-120b-with-a-1m-token-context-on-a-2-node-dgx-spark-cluster/).\n\nThe model choice matters: it's a *reasoning* model with a hybrid Mamba/MoE architecture, which surfaces a few gotchas that a vanilla chat model wouldn't.\n\nEverything here generalizes to any OpenAI-compatible endpoint; substitute your own model and host.\n\n## The one thing that decides everything: API shape\n\nThere are two API \"shapes\" in the coding-agent world:\n\n- **OpenAI Chat Completions** (`POST /v1/chat/completions`): what vLLM, Ollama, LM Studio, and most self-hosted runtimes speak.\n- **Anthropic 
Messages** (`POST /v1/messages`): what Claude Code speaks.\n\nThis is the whole ballgame. **Claude Code cannot talk to a vLLM endpoint directly**; it needs a translation proxy (e.g. LiteLLM) that accepts Anthropic requests and re-emits them as OpenAI. **OpenCode speaks OpenAI natively**, so there's no proxy: you add a provider block and you're done. That single fact is why OpenCode is the lower-friction choice for a self-hosted setup.\n\n## Prerequisites\n\n- A vLLM server exposing an OpenAI-compatible endpoint **with tool calling enabled** (the agent loop is dead without it).\n- The OpenCode CLI installed (`brew install opencode`, `npm i -g opencode-ai`, or the install script from opencode.ai).\n- `curl` and `jq` for validation.\n\n## Step 1 — Serve the model with the *right* parsers\n\nFor agentic coding, two server-side parsers do the heavy lifting:\n\n- A **tool-call parser** that extracts structured `tool_calls` from the model's raw output.\n- A **reasoning parser** that separates chain-of-thought from the user-facing answer (only relevant for reasoning models).\n\nGet either wrong and the agent breaks in confusing ways: reasoning text leaks into tool arguments, or tool calls never get parsed at all.\n\nFor Nemotron 3 Super, NVIDIA specifies the `qwen3_coder` tool parser (yes, even though this isn't a Qwen model) and a `super_v3` / `nemotron_v3` reasoning parser. 
A representative single-node serve command:\n\n```\nvllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \\\n  --served-model-name nvidia/nemotron-3-super \\\n  --host 0.0.0.0 --port 8000 \\\n  --trust-remote-code \\\n  --kv-cache-dtype fp8 \\\n  --max-model-len 262144 \\\n  --gpu-memory-utilization 0.85 \\\n  --enable-chunked-prefill \\\n  --enable-auto-tool-choice \\\n  --tool-call-parser qwen3_coder \\\n  --reasoning-parser nemotron_v3\n```\n\nAuthoritative flags live on the model card. Tensor-parallel size, quantization, MoE backend, and the exact reasoning-parser invocation are model- and hardware-specific. For Nemotron, the HF model card and the vLLM recipes are the source of truth. Don't copy a serve command from a blog (including this one) without checking it against the card for your checkpoint and GPU.\n\nA note on **quantization**: pre-quantized NVFP4/FP8 checkpoints carry their own quant config, and vLLM auto-detects it. Forcing `--quantization fp4` is at best redundant and at worst selects a different kernel path; prefer auto-detection unless the card tells you otherwise.\n\n## Step 2 — Store the credential\n\nIf your server enforces an API key (vLLM does this when `VLLM_API_KEY` is set in its environment), OpenCode needs that key. Store it without putting it in a config file:\n\n```\nopencode auth login\n# → scroll to \"Other\"\n# → provider ID: pulsar        (you'll reuse this exact ID in config)\n# → paste your API key\n```\n\nThis writes only the credential to `~/.local/share/opencode/auth.json`. 
You still have to add the provider block in Step 3.\n\n## Step 3 — Add the provider block\n\nEdit `~/.config/opencode/opencode.json` (global) or a project-local `opencode.json`:\n\n```\n{\n  \"$schema\": \"https://opencode.ai/config.json\",\n  \"provider\": {\n    \"pulsar\": {\n      \"npm\": \"@ai-sdk/openai-compatible\",\n      \"name\": \"Self-Hosted vLLM\",\n      \"options\": {\n        \"baseURL\": \"https://llm.example.internal/v1\",\n        \"apiKey\": \"{env:VLLM_API_KEY}\"\n      },\n      \"models\": {\n        \"nvidia/nemotron-3-super\": {\n          \"name\": \"Nemotron-3-Super-120B\",\n          \"limit\": { \"context\": 262144, \"output\": 32768 }\n        }\n      }\n    }\n  },\n  \"model\": \"pulsar/nvidia/nemotron-3-super\"\n}\n```\n\nField-by-field:\n\n- **npm: \"@ai-sdk/openai-compatible\"**: the adapter for any `/v1/chat/completions` endpoint. If a model is served via `/v1/responses` instead, use `@ai-sdk/openai`.\n- **options.baseURL**: ends at `/v1`, **not** the full `/v1/chat/completions` path. The adapter appends the rest.\n- **options.apiKey**: `{env:VAR}` reads from the environment at launch; `{file:~/.secrets/key}` reads from a file. Either beats a hardcoded literal. (If you used `opencode auth login`, you can omit this.)\n- `models` keys: must match **exactly** what your server returns as the model ID, i.e. your `--served-model-name`. Verify with the `/v1/models` call below. 
OpenCode tolerates `/` in model IDs, so `nvidia/nemotron-3-super` works as a key, a case Claude Code can't handle.\n- `limit.context`: see the [best practices](#best-practices); do **not** blindly set this to your `--max-model-len`.\n- **model**: sets the default; the runtime form is `providerID/modelID`, so a slashed model ID yields two slashes in total: `pulsar/nvidia/nemotron-3-super`.\n\n## Step 4 — Validate the endpoint before trusting it\n\nWire-checking the endpoint by hand saves you from debugging \"why is my agent weird\" later. Do it in four escalating steps.\n\n### 4a. Can I even reach the model list?\n\n```\ncurl -s https://llm.example.internal/v1/models \\\n  -H \"Authorization: Bearer $VLLM_API_KEY\" | jq '.data[].id'\n```\n\nThis should print your served model ID. If you get:\n\n```\njq: error (at <stdin>:0): Cannot iterate over null (null)\n```\n\n…that is **not** a model problem. It means the endpoint returned valid JSON with no `data` field, almost always a `{\"error\": ...}` body from a **401**, because the request was missing or had the wrong `Authorization` header. (If the body were unparseable HTML you'd get a *parse* error instead.) Add the header. To prove it's the server and not your reverse proxy, hit the node directly, bypassing TLS/nginx:\n\n```\ncurl -s http://localhost:8000/v1/models -H \"Authorization: Bearer $VLLM_API_KEY\" | jq .\n```\n\n### 4b. One-shot tool-call smoke test\n\nA model that lists fine can still emit malformed tool calls. This test sends a trivial `get_weather` tool and a prompt that explicitly asks for a call. 
Point it at your *public* endpoint (not localhost) so it also exercises your reverse proxy's handling of POST bodies, the exact path the agent will use.\n\n```\ncurl -s https://llm.example.internal/v1/chat/completions \\\n  -H \"Authorization: Bearer $VLLM_API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d @- <<'JSON' | jq .\n{\n  \"model\": \"nvidia/nemotron-3-super\",\n  \"temperature\": 1.0,\n  \"top_p\": 0.95,\n  \"max_tokens\": 1024,\n  \"tool_choice\": \"auto\",\n  \"messages\": [\n    {\"role\": \"user\", \"content\": \"What is the current weather in Zurich? Call the get_weather tool to find out.\"}\n  ],\n  \"tools\": [\n    {\n      \"type\": \"function\",\n      \"function\": {\n        \"name\": \"get_weather\",\n        \"description\": \"Get the current weather for a city.\",\n        \"parameters\": {\n          \"type\": \"object\",\n          \"properties\": {\n            \"location\": {\"type\": \"string\", \"description\": \"City name, e.g. Zurich\"},\n            \"unit\": {\"type\": \"string\", \"enum\": [\"celsius\", \"fahrenheit\"]}\n          },\n          \"required\": [\"location\"]\n        }\n      }\n    }\n  ]\n}\nJSON\n```\n\nSampling is set to NVIDIA's recommended `temperature 1.0 / top_p 0.95`, which Nemotron's card prescribes for *all* tasks: reasoning, tool calling, and chat alike. 
Test under the same conditions your agent will run.\n\n**What a healthy response looks like:**\n\n```\n{\n  \"choices\": [\n    {\n      \"message\": {\n        \"role\": \"assistant\",\n        \"content\": null,\n        \"tool_calls\": [\n          {\n            \"id\": \"chatcmpl-tool-...\",\n            \"type\": \"function\",\n            \"function\": {\n              \"name\": \"get_weather\",\n              \"arguments\": \"{\\\"location\\\": \\\"Zurich\\\"}\"\n            }\n          }\n        ],\n        \"reasoning\": \"I need to get the current weather in Zurich...\"\n      },\n      \"finish_reason\": \"tool_calls\"\n    }\n  ],\n  \"system_fingerprint\": \"vllm-0.21.0+...-tp2-...\"\n}\n```\n\nThree things to read off this:\n\n- `finish_reason: \"tool_calls\"` and a well-formed `tool_calls[0]`.\n- `content: null` with the chain-of-thought isolated in a separate `reasoning` field. **This is the success signal for a reasoning model**: it proves the reasoning parser kept the thinking out of `content` and out of the tool arguments. When that separation fails, reasoning text contaminates the arguments and the agent loop breaks.\n- A `tp2` (or similar) tag in `system_fingerprint` confirms your tensor-parallel topology is actually live, which is useful when you're serving across a multi-node cluster and want to be sure it didn't silently fall back to one node.\n\n### 4c. Pass/fail in one line\n\nThe check that actually matters is that `function.arguments` is a **parseable JSON string**; malformed arguments are the classic tool-parser failure. 
The `fromjson` step below throws (→ FAIL) if they aren't valid JSON:\n\n```\ncurl -s https://llm.example.internal/v1/chat/completions \\\n  -H \"Authorization: Bearer $VLLM_API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d @- <<'JSON' | jq -e '\n    .choices[0] as $c\n    | ($c.finish_reason == \"tool_calls\")\n      and ($c.message.tool_calls | type == \"array\")\n      and ($c.message.tool_calls[0].function.name == \"get_weather\")\n      and ($c.message.tool_calls[0].function.arguments | fromjson | type == \"object\")\n  ' >/dev/null && echo \"PASS: tool_calls well-formed\" || echo \"FAIL: inspect raw response\"\n{\n  \"model\": \"nvidia/nemotron-3-super\",\n  \"temperature\": 1.0, \"top_p\": 0.95, \"max_tokens\": 1024, \"tool_choice\": \"auto\",\n  \"messages\": [{\"role\": \"user\", \"content\": \"What is the current weather in Zurich? Call the get_weather tool to find out.\"}],\n  \"tools\": [{\"type\": \"function\", \"function\": {\"name\": \"get_weather\", \"description\": \"Get the current weather for a city.\", \"parameters\": {\"type\": \"object\", \"properties\": {\"location\": {\"type\": \"string\"}, \"unit\": {\"type\": \"string\", \"enum\": [\"celsius\", \"fahrenheit\"]}}, \"required\": [\"location\"]}}}]\n}\nJSON\n```\n\n### 4d. Multi-turn round-trip (the one people skip)\n\nA single call passing does **not** guarantee the parser handles the *tool-result* turn, where you feed the function's output back and the model continues. Agents do this on every step, so test it. 
Take the `id` from the tool call in 4b and echo it back in a `role: \"tool\"` message:\n\n```\ncurl -s https://llm.example.internal/v1/chat/completions \\\n  -H \"Authorization: Bearer $VLLM_API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d @- <<'JSON' | jq '.choices[0] | {finish_reason, content: .message.content}'\n{\n  \"model\": \"nvidia/nemotron-3-super\",\n  \"temperature\": 1.0,\n  \"top_p\": 0.95,\n  \"max_tokens\": 1024,\n  \"tools\": [\n    {\"type\": \"function\", \"function\": {\"name\": \"get_weather\", \"description\": \"Get the current weather for a city.\", \"parameters\": {\"type\": \"object\", \"properties\": {\"location\": {\"type\": \"string\"}, \"unit\": {\"type\": \"string\", \"enum\": [\"celsius\", \"fahrenheit\"]}}, \"required\": [\"location\"]}}}\n  ],\n  \"messages\": [\n    {\"role\": \"user\", \"content\": \"What is the current weather in Zurich? Call the get_weather tool.\"},\n    {\"role\": \"assistant\", \"content\": null, \"tool_calls\": [\n      {\"id\": \"chatcmpl-tool-REPLACE_WITH_REAL_ID\", \"type\": \"function\", \"function\": {\"name\": \"get_weather\", \"arguments\": \"{\\\"location\\\": \\\"Zurich\\\"}\"}}\n    ]},\n    {\"role\": \"tool\", \"tool_call_id\": \"chatcmpl-tool-REPLACE_WITH_REAL_ID\", \"content\": \"{\\\"location\\\": \\\"Zurich\\\", \\\"temp_c\\\": 12, \\\"condition\\\": \\\"cloudy\\\"}\"}\n  ]\n}\nJSON\n```\n\nA healthy result has `finish_reason: \"stop\"` and a natural-language `content` that uses the 12°C / cloudy data you handed back. If it loops (calling `get_weather` again instead of answering), the model isn't correctly consuming the tool result, which will manifest in OpenCode as an agent that repeats actions. 
Note: echo the assistant turn back **without** its `reasoning` field; only `content` and `tool_calls` are required.\n\nOnce 4a–4d pass, point OpenCode at it: it'll use the default model from your config, or you can run `/models` and select `pulsar/nvidia/nemotron-3-super`.\n\n## OpenCode in Action\n\nOnce everything is set up, using OpenCode is straightforward.\n\nIf you also install OpenCode desktop, the same settings you configured for the OpenCode CLI apply.\n\nWatching the cluster with [nvtop](https://github.com/Syllo/nvtop?ref=corti.com) shows the model is using both nodes' GPUs while coding.\n\n## Best practices\n\n**Set `limit.context` below `--max-model-len`, not equal to it.** A model that *advertises* 1M context won't *fit* 1M tokens of KV cache at a conservative `--gpu-memory-utilization` on memory-constrained hardware. OpenCode uses `limit.context` to decide when to compact the conversation; if you tell it the theoretical max, it will pack prompts the server then rejects mid-session. Set it to a value you've verified fits end-to-end, with margin.\n\n**Give reasoning models a generous output budget.** Reasoning tokens are generated *before* the tool call and count against `max_tokens`. In testing, a one-argument tool call burned ~160 completion tokens, almost all of it reasoning. Real agentic steps reason far more. A stingy output limit causes `finish_reason: \"length\"` truncation *before* the tool call is ever emitted, which looks like a parser failure but isn't.\n\n**Pin sampling to the model card's recommendation.** Don't let the agent's defaults override what the model was tuned for. For Nemotron that's `temperature 1.0 / top_p 0.95` across the board.\n\n**Keep your secret in one place.** With `VLLM_API_KEY` enforced server-side and `{env:VLLM_API_KEY}` (or `auth.json`) client-side, that's a single shared secret. 
Rotating it means updating both the server environment and the client, so script the rotation and they never drift.\n\n**Pin your runtime version.** Tool-call and reasoning parsers evolve fast across vLLM releases. Record the `system_fingerprint` from a known-good run; if behavior changes after an image bump, that's your first diff.\n\n**Harden the host if you serve large models on shared boxes.** A model that exhausts memory can take SSH down with it (ICMP still replies, `sshd` doesn't: the worst kind of \"is it up?\"). Protect the essentials:\n\n```\n# Keep sshd from being OOM-killed\nsudo systemctl edit ssh   # add: [Service]\\nOOMScoreAdjust=-1000\n\n# Userspace OOM killer that acts before the kernel's does\nsudo apt install earlyoom && sudo systemctl enable --now earlyoom\n```\n\nPair that with an external watchdog (a separate machine curling `/health` and power-cycling on N consecutive failures) so a wedged node recovers without a desk visit.\n\n## Gotchas, condensed\n\n| Symptom | Cause | Fix |\n|---|---|---|\n| `jq: Cannot iterate over null` on `/v1/models` | 401: missing/wrong `Authorization`; server returned `{\"error\": ...}` with no `data` | Add `-H \"Authorization: Bearer $VLLM_API_KEY\"` |\n| Model not found / wrong model in OpenCode | Config `models` key ≠ `--served-model-name` | Match exactly; confirm via `/v1/models` |\n| `/` in model ID rejected | You're on Claude Code, not OpenCode | OpenCode handles slashes; for Claude Code, alias the served name without `/` |\n| `finish_reason: \"length\"`, no tool call | Reasoning ate the output budget | Raise `max_tokens` (2048–4096) |\n| Tool call described in prose, `tool_calls` null | Tool parser not active or wrong | Verify `--enable-auto-tool-choice` + correct `--tool-call-parser` in startup logs |\n| Reasoning text inside tool arguments | Reasoning parser misconfigured | Use the model's prescribed reasoning parser; confirm `content` / `reasoning` are separate |\n| `arguments` not parseable JSON | Genuine 
parser/model mismatch | Re-run; if persistent, file upstream |\n| Agent repeats the same tool call | Tool-result turn not consumed | Run the multi-turn test (4d); check `tool_call_id` echo |\n| Quant/kernel error at startup | Forced `--quantization` fighting the checkpoint | Drop it; let vLLM auto-detect |\n| OpenCode `NotFoundError`, empty options | Older OpenCode bug not forwarding provider options | Update OpenCode; ensure the provider `name` field is present |\n| Endpoint reachable on localhost, not via domain | Reverse proxy not forwarding `/v1/*` or the POST body | Test through the proxy explicitly; fix the `location` block |\n\n## Wrap-up\n\nThe hard part of running a coding agent on your own iron isn't the agent; it's proving the *endpoint* behaves like a real OpenAI-compatible tool-calling server before you trust an autonomous loop to it. OpenCode keeps the agent side trivial: one provider block, native OpenAI, no proxy. Spend your effort on the four-step validation (model list, single tool call, JSON-valid arguments, and the multi-turn round-trip) and the rest is just `opencode`.", "url": "https://wpnews.pro/news/connecting-opencode-to-a-self-hosted-llm-vllm-nemotron-3-super", "canonical_source": "https://corti.com/connecting-opencode-to-a-self-hosted-llm-vllm-nemotron-3-super/", "published_at": "2026-06-19 08:04:59+00:00", "updated_at": "2026-06-26 12:03:13.979186+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "ai-infrastructure", "ai-agents", "developer-tools"], "entities": ["OpenCode", "vLLM", "NVIDIA", "Nemotron-3-Super-120B-A12B", "Claude Code", "Codex", "LiteLLM", "DGX Spark"], "alternates": {"html": "https://wpnews.pro/news/connecting-opencode-to-a-self-hosted-llm-vllm-nemotron-3-super", "markdown": "https://wpnews.pro/news/connecting-opencode-to-a-self-hosted-llm-vllm-nemotron-3-super.md", "text": "https://wpnews.pro/news/connecting-opencode-to-a-self-hosted-llm-vllm-nemotron-3-super.txt", "jsonld": 
"https://wpnews.pro/news/connecting-opencode-to-a-self-hosted-llm-vllm-nemotron-3-super.jsonld"}}