Connecting OpenCode to a Self-Hosted LLM (vLLM + Nemotron 3 Super) OpenCode, an open-source terminal-first coding agent, now supports self-hosted large language models via OpenAI-compatible APIs, enabling users to connect it to a vLLM server running NVIDIA's Nemotron-3-Super-120B-A12B model. This setup avoids vendor lock-in and translation proxies required by tools like Claude Code, though proper tool-call and reasoning parsers are critical for agent functionality. Connecting OpenCode to a Self-Hosted LLM vLLM + Nemotron 3 Super Coding agents like Claude Code and Codex are excellent, but both are wired to a specific vendor's API. If you run your own inference stack — for cost control, data residency, or because you have GPUs sitting idle — you want an agent you can point at your endpoint. OpenCode https://opencode.ai/?ref=corti.com is the cleanest fit: it's terminal-first, open source, and talks to any OpenAI-compatible API without a translation layer. This post walks through connecting OpenCode's CLI to a self-hosted vLLM https://docs.vllm.ai/?ref=corti.com server, using NVIDIA's Nemotron-3-Super-120B-A12B as the worked example. This is the model I'm currently hosting on my 2 NVIDIA DGX Spark node cluster https://corti.com/serving-nemotron-super-120b-with-a-1m-token-context-on-a-2-node-dgx-spark-cluster/ . The model choice matters: it's a reasoning model with a hybrid Mamba/MoE architecture, which surfaces a few gotchas that a vanilla chat model wouldn't. Everything here generalizes to any OpenAI-compatible endpoint — substitute your own model and host. The one thing that decides everything: API shape There are two API "shapes" in the coding-agent world: OpenAI Chat Completions POST /v1/chat/completions — what vLLM, Ollama, LM Studio, and most self-hosted runtimes speak. Anthropic Messages POST /v1/messages — what Claude Code speaks. This is the whole ballgame. Claude Code cannot talk to a vLLM endpoint directly — it needs a translation proxy e.g. LiteLLM that accepts Anthropic requests and re-emits them as OpenAI. OpenCode speaks OpenAI natively , so there's no proxy: you add a provider block and you're done. That single fact is why OpenCode is the lower-friction choice for a self-hosted setup. Prerequisites - A vLLM server exposing an OpenAI-compatible endpoint with tool calling enabled the agent loop is dead without it . - The OpenCode CLI installed brew install opencode , npm i -g opencode , or the install script from opencode.ai . curl and jq for validation. Step 1 — Serve the model with the right parsers For agentic coding, two server-side parsers do the heavy lifting: - A tool-call parser that extracts structured tool calls from the model's raw output. - A reasoning parser that separates chain-of-thought from the user-facing answer only relevant for reasoning models . Get either wrong and the agent breaks in confusing ways — reasoning text leaks into tool arguments, or tool calls never get parsed at all. For Nemotron 3 Super, NVIDIA specifies the qwen3 coder tool parser yes, even though this isn't a Qwen model and a super v3 / nemotron v3 reasoning parser. A representative single-node serve command: vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \ --served-model-name nvidia/nemotron-3-super \ --host 0.0.0.0 --port 8000 \ --trust-remote-code \ --kv-cache-dtype fp8 \ --max-model-len 262144 \ --gpu-memory-utilization 0.85 \ --enable-chunked-prefill \ --enable-auto-tool-choice \ --tool-call-parser qwen3 coder \ --reasoning-parser nemotron v3 Authoritative flags live on the model card.Tensor-parallel size, quantization, MoE backend, and the exact reasoning-parser invocation are model- and hardware-specific. For Nemotron the HF model card and vLLM recipes are the source of truth. Don't copy a serve command from a blog including this one without checking it against the card for your checkpoint and GPU. A note on quantization : pre-quantized NVFP4/FP8 checkpoints carry their own quant config, and vLLM auto-detects it. Forcing --quantization fp4 is at best redundant and at worst selects a different kernel path — prefer auto-detection unless the card tells you otherwise. Step 2 — Store the credential If your server enforces an API key vLLM does this when VLLM API KEY is set in its environment , OpenCode needs that key. Store it without putting it in a config file: opencode auth login → scroll to "Other" → provider ID: myserver you'll reuse this exact ID in config → paste your API key This writes only the credential to ~/.local/share/opencode/auth.json . You still have to add the provider block in Step 3. Step 3 — Add the provider block Edit ~/.config/opencode/opencode.json global or a project-local opencode.json : { "$schema": "https://opencode.ai/config.json", "provider": { "pulsar": { "npm": "@ai-sdk/openai-compatible", "name": "Self-Hosted vLLM", "options": { "baseURL": "https://llm.example.internal/v1", "apiKey": "{env:VLLM API KEY}" }, "models": { "nvidia/nemotron-3-super": { "name": "Nemotron-3-Super-120B", "limit": { "context": 262144, "output": 32768 } } } } }, "model": "myserver/nvidia/nemotron-3-super" } Field-by-field: — the adapter for any npm: "@ai-sdk/openai-compatible" /v1/chat/completions endpoint. If a model is served via /v1/responses instead, use @ai-sdk/openai . — ends at options.baseURL /v1 , not the full /v1/chat/completions path. The adapter appends the rest. — options.apiKey {env:VAR} reads from the environment at launch; {file:~/.secrets/key} reads from a file. Either beats a hardcoded literal. If you used opencode auth login , you can omit this. — must match models keys exactly what your server returns as the model ID, i.e. your --served-model-name . Verify with the /v1/models call below. OpenCode tolerates / in model IDs, so nvidia/nemotron-3-super works as a key — a case Claude Code can't handle.— see the limit.context best practices https://claude.ai/chat/f5290b9d-8d3a-47cd-8636-df1c0552c84e?ref=corti.com best-practices ; do not blindly set this to your --max-model-len . — sets the default; the runtime form is model providerID/modelID , so with a slashed model ID you get the double slash pulsar/nvidia/nemotron-3-super . Step 4 — Validate the endpoint before trusting it Wire-checking the endpoint by hand saves you from debugging "why is my agent weird" later. Do it in three escalating steps. 4a. Can I even reach the model list? curl -s https://llm.example.internal/v1/models \ -H "Authorization: Bearer $VLLM API KEY" | jq '.data .id' This should print your served model ID. If you get: jq: error at