{"slug": "pairing-claude-code-with-local-models", "title": "Pairing Claude Code with Local Models", "summary": "Ollama, LM Studio, and llama.cpp now support native Anthropic Messages API connections, enabling Claude Code to run entirely on local models for code completion, refactoring, and debugging tasks. By setting environment variables like `ANTHROPIC_BASE_URL` and model tier mappings, developers can redirect Claude Code's API calls to local inference servers, eliminating per-token costs and rate limits. The setup requires at least 16 GB RAM and Ollama v0.14.0 or later for Anthropic API compatibility.", "body_md": "# Pairing Claude Code with Local Models\n\n## # Introduction\n\nAgentic coding sessions are expensive. A single Claude Code session (reading files, writing code, running tests, iterating) can burn 10–50x more tokens than a plain chat conversation. At scale, that adds up fast. Add rate limits that can interrupt a long-running workflow mid-session, and the dependency on a third-party API that can change pricing, enforce stricter policies, or go down at any point, and the case for local inference becomes straightforward.\n\nLocal models in 2026 are good enough. For the tasks Claude Code handles daily (code completion, refactoring, debugging, codebase explanation), a well-chosen quantized model running locally covers the vast majority of real use cases at zero per-token cost and with no rate limits. 
This article covers three inference backends (**Ollama**, **[LM Studio](https://lmstudio.ai/)**, and **[llama.cpp](https://github.com/ggml-org/llama.cpp)**), the exact environment variables and configuration files to wire each one to Claude Code, a curated table of models worth running, and the troubleshooting fixes for the issues you will actually hit.\n\n## # How Claude Code Connects to Any Local Model\n\nThe mechanism is simpler than most guides make it look. Claude Code sends requests in the Anthropic Messages API format. By default those requests go to Anthropic's servers. Setting `ANTHROPIC_BASE_URL` redirects them to any server that speaks the same format, which now includes Ollama, LM Studio, and llama.cpp natively.\n\nAccording to the official Claude Code environment variables documentation, the variables that matter for this setup are:\n\n- `ANTHROPIC_BASE_URL`: redirects all API calls from Anthropic's servers to whatever URL you set. Set this to your local inference server address.\n- `ANTHROPIC_API_KEY`: the API key sent in the request header. Local servers typically ignore authentication, so this is usually set to a placeholder string like \"local\" or \"ollama\".\n- `ANTHROPIC_AUTH_TOKEN`: an alternative auth header. Some local servers check for this instead of the API key. Set it to the same placeholder.\n- `ANTHROPIC_DEFAULT_SONNET_MODEL`, `ANTHROPIC_DEFAULT_HAIKU_MODEL`, and `ANTHROPIC_DEFAULT_OPUS_MODEL`: Claude Code internally requests different model tiers depending on the task. These three variables map each tier to your local model's name. Without them, Claude Code sends requests for `claude-sonnet-4-20250514` to your local server, which will reject the request because no such model exists locally.\n\nIn January 2026, Ollama added native support for the Anthropic Messages API, which was the technical change that made this workflow practical without translation proxies. 
LM Studio added a native `/v1/messages` endpoint in version 0.4.1. llama.cpp has had direct Anthropic API support for longer. All three now speak Claude Code's native protocol.\n\nA clean architecture diagram showing Claude Code, Ollama, LM Studio, and llama.cpp | Image by Author\n\n## # Backend 1: Ollama\n\nOllama is the right starting point. It handles all the complexity of model management (downloading weights, quantization, GPU and CPU allocation, and serving) behind a simple command-line interface (CLI). One command to install, one command to pull a model, a few environment variables to configure. It runs as a background service after install, so there is no manual server start required.\n\n**Prerequisites**\n\n- macOS, Linux, or Windows (WSL2 recommended on Windows)\n- At least 16 GB RAM for practical use (32 GB recommended)\n- GPU with 8+ GB VRAM for GPU inference, or CPU-only with enough RAM\n- Ollama v0.14.0 or later required for Anthropic Messages API support\n\nInstall Ollama:\n\n```\n# macOS and Linux -- one command install\ncurl -fsSL https://ollama.com/install.sh | sh\n\n# Verify the version -- must be 0.14.0+ for Claude Code compatibility\nollama --version\n# Expected: ollama version is 0.14.x or higher\n\n# Windows: download the installer from https://ollama.com\n# Native Windows support has improved significantly in recent releases\n```\n\nAfter installation, Ollama starts automatically as a background service on port **11434**. 
You can verify it is running:\n\n```\n# Check the Ollama server is live\ncurl http://localhost:11434\n\n# Expected response:\n# Ollama is running\n```\n\nPull a coding model:\n\n```\n# GLM-4.7-Flash -- recommended starting point\n# Strong tool calling, 128K context, fits on 8 GB VRAM\n# Apache 2.0 license\nollama pull glm-4.7-flash:latest\n\n# Qwen3-Coder -- strong code generation and instruction following\n# Requires 20+ GB VRAM for the full model\nollama pull qwen3-coder\n\n# Devstral-Small -- specifically designed for agentic coding workflows\n# Community-tested for Claude Code compatibility\n# 24B, requires 16+ GB VRAM\nollama pull devstral-small-2:24b\n\n# Verify the model is downloaded and ready\nollama list\n# Shows all pulled models with their sizes and modification dates\n```\n\n#### // Configuring Claude Code to Use Ollama\n\n**Option 1: Shell export (current terminal session only)**\n\n```\n# Redirect Claude Code to your local Ollama server\nexport ANTHROPIC_BASE_URL=\"http://localhost:11434\"\n\n# Local servers do not require real authentication\n# Set these to any non-empty string -- Ollama ignores the value\nexport ANTHROPIC_API_KEY=\"ollama\"\nexport ANTHROPIC_AUTH_TOKEN=\"ollama\"\n\n# Map Claude Code's model tier requests to your local model name\n# Claude Code internally requests sonnet/haiku/opus -- these variables\n# translate those tier names to whatever model you have pulled locally\nexport ANTHROPIC_DEFAULT_SONNET_MODEL=\"glm-4.7-flash:latest\"\nexport ANTHROPIC_DEFAULT_HAIKU_MODEL=\"glm-4.7-flash:latest\"\nexport ANTHROPIC_DEFAULT_OPUS_MODEL=\"glm-4.7-flash:latest\"\n\n# Launch Claude Code -- it will now use Ollama instead of the Anthropic API\nclaude\n```\n\n**Option 2: ~/.claude/settings.json (permanent, applies to all sessions)**\n\nThis approach survives terminal restarts and applies every time you launch Claude Code. 
Claude Code reads environment variables from `settings.json` at startup so they take effect no matter how `claude` was launched.\n\nCreate or edit `~/.claude/settings.json`:\n\n```\n{\n  \"env\": {\n    \"ANTHROPIC_BASE_URL\": \"http://localhost:11434\",\n    \"ANTHROPIC_API_KEY\": \"ollama\",\n    \"ANTHROPIC_AUTH_TOKEN\": \"ollama\",\n    \"ANTHROPIC_DEFAULT_SONNET_MODEL\": \"glm-4.7-flash:latest\",\n    \"ANTHROPIC_DEFAULT_HAIKU_MODEL\": \"glm-4.7-flash:latest\",\n    \"ANTHROPIC_DEFAULT_OPUS_MODEL\": \"glm-4.7-flash:latest\"\n  }\n}\n```\n\n**Option 3: .env file in project directory (per-project override)**\n\nIf you want a specific project to use a different model while keeping your global settings on the Anthropic API:\n\n```\n# .env in your project root -- loaded automatically by Claude Code\nANTHROPIC_BASE_URL=http://localhost:11434\nANTHROPIC_API_KEY=ollama\nANTHROPIC_AUTH_TOKEN=ollama\nANTHROPIC_DEFAULT_SONNET_MODEL=qwen3-coder\nANTHROPIC_DEFAULT_HAIKU_MODEL=qwen3-coder\nANTHROPIC_DEFAULT_OPUS_MODEL=qwen3-coder\n```\n\nVerify the connection:\n\n```\n# Launch Claude Code with a simple test\nclaude\n\n# Inside Claude Code, run a basic prompt:\n# > What model are you running?\n# A local model should respond without making any Anthropic API calls.\n\n# To confirm no external calls are being made, run with verbose logging:\nclaude --verbose\n\n# Look for lines showing requests going to localhost:11434\n# rather than api.anthropic.com\n```\n\nFull working sequence from scratch:\n\n```\ncurl -fsSL https://ollama.com/install.sh | sh          # 1. Install Ollama\nollama pull glm-4.7-flash:latest                       # 2. Pull model (~4 GB)\nexport ANTHROPIC_BASE_URL=\"http://localhost:11434\"     # 3. Redirect Claude Code\nexport ANTHROPIC_API_KEY=\"ollama\"                      # 4. 
Set placeholder auth\nexport ANTHROPIC_AUTH_TOKEN=\"ollama\"\nexport ANTHROPIC_DEFAULT_SONNET_MODEL=\"glm-4.7-flash:latest\"\nexport ANTHROPIC_DEFAULT_HAIKU_MODEL=\"glm-4.7-flash:latest\"\nexport ANTHROPIC_DEFAULT_OPUS_MODEL=\"glm-4.7-flash:latest\"\nclaude                                                  # 5. Launch\n```\n\n## # Backend 2: LM Studio\n\nLM Studio is the right choice if you want a graphical interface for browsing and managing models rather than working entirely in the terminal. Since version 0.4.1, it includes a native Anthropic-compatible **/v1/messages** endpoint, the same path Claude Code expects, so no translation layer or proxy is needed.\n\n**Prerequisites:**\n\n- macOS, Windows, or Linux\n- GPU with 6+ GB VRAM recommended (CPU-only is possible but slow)\n- Download from lmstudio.ai or use the CLI installer for headless servers\n\nInstall and configure LM Studio:\n\n```\n# On a server or VM without a GUI -- CLI installer\ncurl -fsSL https://releases.lmstudio.ai/cli/install.sh | bash\n\n# Or download the desktop app from https://lmstudio.ai for GUI use\n```\n\nGUI setup steps:\n\n- Open LM Studio and search for a coding model (search \"qwen coder\" or \"devstral\").\n- Download the model. LM Studio handles quantization selection automatically.\n- Go to the **Local Server** tab (the `<>` icon in the left sidebar).\n- Set the context size. LM Studio recommends starting with at least 25,000 tokens and increasing for better results.\n- Click **Start Server**.\n- Note the port (default: 1234) and copy the model name exactly as shown.\n\nNote: Copy the model identifier exactly. LM Studio displays the exact string you need to pass to `ANTHROPIC_DEFAULT_SONNET_MODEL`. 
A mismatch here is the most common failure mode.\n\nConfigure Claude Code:\n\n```\n# Set the base URL to LM Studio's local server\nexport ANTHROPIC_BASE_URL=\"http://localhost:1234\"\nexport ANTHROPIC_API_KEY=\"lm-studio\"\nexport ANTHROPIC_AUTH_TOKEN=\"lm-studio\"\n\n# Replace the model name with what LM Studio shows for your loaded model\n# Copy it exactly -- including any version suffix or quantization tag\nexport ANTHROPIC_DEFAULT_SONNET_MODEL=\"qwen2.5-coder-32b-instruct\"\nexport ANTHROPIC_DEFAULT_HAIKU_MODEL=\"qwen2.5-coder-32b-instruct\"\nexport ANTHROPIC_DEFAULT_OPUS_MODEL=\"qwen2.5-coder-32b-instruct\"\n```\n\nOr persistently in `~/.claude/settings.json`:\n\n```\n{\n  \"env\": {\n    \"ANTHROPIC_BASE_URL\": \"http://localhost:1234\",\n    \"ANTHROPIC_API_KEY\": \"lm-studio\",\n    \"ANTHROPIC_AUTH_TOKEN\": \"lm-studio\",\n    \"ANTHROPIC_DEFAULT_SONNET_MODEL\": \"qwen2.5-coder-32b-instruct\",\n    \"ANTHROPIC_DEFAULT_HAIKU_MODEL\": \"qwen2.5-coder-32b-instruct\",\n    \"ANTHROPIC_DEFAULT_OPUS_MODEL\": \"qwen2.5-coder-32b-instruct\"\n  }\n}\n```\n\nHow to run:\n\n```\n# 1. Start the LM Studio server from the GUI (Local Server tab > Start Server)\n# 2. Set environment variables\nexport ANTHROPIC_BASE_URL=\"http://localhost:1234\"\nexport ANTHROPIC_API_KEY=\"lm-studio\"\nexport ANTHROPIC_AUTH_TOKEN=\"lm-studio\"\nexport ANTHROPIC_DEFAULT_SONNET_MODEL=\"your-model-name-here\"\nexport ANTHROPIC_DEFAULT_HAIKU_MODEL=\"your-model-name-here\"\nexport ANTHROPIC_DEFAULT_OPUS_MODEL=\"your-model-name-here\"\n# 3. Launch\nclaude\n```\n\n## # Backend 3: llama.cpp\n\n**llama.cpp** is the right choice when you need direct control over inference parameters (quantization type, KV cache configuration, batch size, thread count) or when you are running on a server and want the lowest overhead. 
It has native Anthropic Messages API support, so no proxy or translation layer is needed.\n\n**Prerequisites:**\n\n- A GGUF-format model file (download from Hugging Face; search for \"GGUF\" versions of any model)\n- CUDA-capable GPU for GPU inference, or CPU-only for slower inference\n- CMake and a C++ compiler for source builds (on Linux/CUDA, source is recommended)\n\nInstall llama.cpp:\n\n```\n# macOS -- Homebrew is simplest\nbrew install llama.cpp\n\n# Linux with CUDA -- build from source for best GPU performance\ngit clone https://github.com/ggml-org/llama.cpp\ncd llama.cpp\ncmake -B build -DGGML_CUDA=ON          # Enable CUDA acceleration\ncmake --build build --config Release   # Build\n# Binaries in ./build/bin/\n\n# Linux CPU-only build\ncmake -B build\ncmake --build build --config Release\n\n# Windows -- pre-built binaries available at:\n# https://github.com/ggml-org/llama.cpp/releases\n# Download the CUDA or CPU variant matching your hardware\n```\n\nDownload a GGUF model:\n\n```\n# Install the Hugging Face CLI if you do not have it\npip install huggingface-hub\n\n# Download GLM-4.7-Flash in Q4_K_XL quantization (~4.5 GB)\n# This quantization offers a good size/quality balance for coding\nhuggingface-cli download unsloth/GLM-4.7-Flash-GGUF \\\n  GLM-4.7-Flash-UD-Q4_K_XL.gguf \\\n  --local-dir ./models/\n\n# Or download Qwen3-Coder in Q4 quantization (~15 GB for 32B)\nhuggingface-cli download Qwen/Qwen3-Coder-32B-Instruct-GGUF \\\n  qwen3-coder-32b-instruct-q4_k_m.gguf \\\n  --local-dir ./models/\n```\n\nStart the llama.cpp server:\n\n```\n# Start llama-server with Anthropic API support and a 128K context window.\n# Comments sit above the command: a comment placed after a trailing\n# backslash would break the line continuation.\n#   --alias         the name that goes in ANTHROPIC_DEFAULT_SONNET_MODEL\n#   --ctx-size      131072 = 128K context, important for large codebases\n#   --flash-attn    memory-efficient attention, improves speed\n#   --n-gpu-layers  99 offloads all layers to GPU; remove for CPU-only\nllama-server \\\n  --model ./models/GLM-4.7-Flash-UD-Q4_K_XL.gguf \\\n  --alias \"glm-4.7-flash\" \\\n  --port 8001 \\\n  --ctx-size 131072 \\\n  --flash-attn \\\n  --n-gpu-layers 99\n\n# For CPU-only inference (no GPU):\n#   --ctx-size  reduced on CPU to keep memory manageable\n#   --threads   match your CPU core count\nllama-server \\\n  --model ./models/GLM-4.7-Flash-UD-Q4_K_XL.gguf \\\n  --alias \"glm-4.7-flash\" \\\n  --port 8001 \\\n  --ctx-size 32768 \\\n  --threads 8\n```\n\nKey flags explained:\n\n- `--alias`: the model name string Claude Code will send in requests. Set `ANTHROPIC_DEFAULT_SONNET_MODEL` to match this exactly.\n- `--ctx-size`: context window in tokens (**131072 = 128K**). Larger is better for codebase analysis but uses more VRAM. Reduce if you get out-of-memory errors.\n- `--flash-attn`: Flash Attention reduces peak VRAM by processing attention in smaller blocks. Enable it whenever your build supports it.\n- `--n-gpu-layers 99`: offloads all transformer layers to the GPU. The server automatically uses fewer layers if VRAM is tight.\n\nConfigure Claude Code:\n\n```\nexport ANTHROPIC_BASE_URL=\"http://localhost:8001\"\nexport ANTHROPIC_API_KEY=\"llama-cpp\"\nexport ANTHROPIC_AUTH_TOKEN=\"llama-cpp\"\n\n# Must match the --alias you passed to llama-server exactly\nexport ANTHROPIC_DEFAULT_SONNET_MODEL=\"glm-4.7-flash\"\nexport ANTHROPIC_DEFAULT_HAIKU_MODEL=\"glm-4.7-flash\"\nexport ANTHROPIC_DEFAULT_OPUS_MODEL=\"glm-4.7-flash\"\n```\n\nHow to run:\n\n```\n# Terminal 1: start the llama.cpp server\nllama-server \\\n  --model ./models/GLM-4.7-Flash-UD-Q4_K_XL.gguf \\\n  --alias \"glm-4.7-flash\" \\\n  --port 8001 \\\n  --ctx-size 131072 \\\n  --flash-attn \\\n  --n-gpu-layers 99\n\n# Terminal 2: configure and launch Claude Code\nexport ANTHROPIC_BASE_URL=\"http://localhost:8001\"\nexport ANTHROPIC_API_KEY=\"llama-cpp\"\nexport ANTHROPIC_AUTH_TOKEN=\"llama-cpp\"\nexport ANTHROPIC_DEFAULT_SONNET_MODEL=\"glm-4.7-flash\"\nexport ANTHROPIC_DEFAULT_HAIKU_MODEL=\"glm-4.7-flash\"\nexport 
ANTHROPIC_DEFAULT_OPUS_MODEL=\"glm-4.7-flash\"\nclaude\n```\n\n## # The Complete `settings.json`\n\nEnvironment variable exports last only as long as the terminal session. For a durable configuration, use `~/.claude/settings.json`. Claude Code reads variables from this file at startup so they apply no matter how Claude was launched: from the terminal, from a VS Code task, or from a script.\n\nHere is a production-ready `settings.json` with every variable this setup uses:\n\n```\n{\n  \"env\": {\n    \"ANTHROPIC_BASE_URL\": \"http://localhost:11434\",\n\n    \"ANTHROPIC_API_KEY\": \"ollama\",\n    \"ANTHROPIC_AUTH_TOKEN\": \"ollama\",\n\n    \"ANTHROPIC_DEFAULT_SONNET_MODEL\": \"glm-4.7-flash:latest\",\n    \"ANTHROPIC_DEFAULT_HAIKU_MODEL\": \"glm-4.7-flash:latest\",\n    \"ANTHROPIC_DEFAULT_OPUS_MODEL\": \"glm-4.7-flash:latest\",\n\n    \"CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS\": \"1\"\n  }\n}\n```\n\n**Why CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS: \"1\" matters:**\n\nWhen using Claude Code through non-Anthropic backends, Claude Code adds Anthropic-specific experimental beta flags to request headers, and third-party and local servers do not recognize those flags. This causes `Error: Unexpected value(s) for the anthropic-beta header` on most local inference servers. 
Setting this variable to `\"1\"` strips those headers before the request goes out, which eliminates the error without affecting any core Claude Code functionality.\n\n**Switching between backends:**\n\nIf you work with multiple backends (Ollama for daily use, the Anthropic API for complex tasks), the cleanest approach is maintaining separate shell scripts rather than editing `settings.json` back and forth:\n\n```\n# use-local.sh -- switch to Ollama\nexport ANTHROPIC_BASE_URL=\"http://localhost:11434\"\nexport ANTHROPIC_API_KEY=\"ollama\"\nexport ANTHROPIC_AUTH_TOKEN=\"ollama\"\nexport ANTHROPIC_DEFAULT_SONNET_MODEL=\"glm-4.7-flash:latest\"\nexport ANTHROPIC_DEFAULT_HAIKU_MODEL=\"glm-4.7-flash:latest\"\nexport ANTHROPIC_DEFAULT_OPUS_MODEL=\"glm-4.7-flash:latest\"\necho \"Claude Code → local Ollama (glm-4.7-flash)\"\n\n# use-anthropic.sh -- switch back to the Anthropic API\nunset ANTHROPIC_BASE_URL\nunset ANTHROPIC_AUTH_TOKEN\nunset ANTHROPIC_DEFAULT_SONNET_MODEL\nunset ANTHROPIC_DEFAULT_HAIKU_MODEL\nunset ANTHROPIC_DEFAULT_OPUS_MODEL\n# use-local.sh overwrote ANTHROPIC_API_KEY with a placeholder in this session,\n# so restore your real key (for example by re-sourcing your rc file)\necho \"Claude Code → Anthropic API\"\n```\n\nSource either script in your current session:\n\n```\nsource ./use-local.sh\nclaude\n\n# When you need the real API for a complex task:\nsource ./use-anthropic.sh\nclaude\n```\n\n## # Best Local Models for Claude Code in 2026\n\nHardware is the main constraint. For Claude Code with local models to be genuinely usable for coding tasks rather than just a demo, aim for 32 GB of RAM (Apple Silicon unified memory or PC RAM). 
16 GB is viable with smaller quantized models and CPU offload, but generation speed will be noticeably slower on multi-step agentic tasks.\n\n| Model | Pull Command |\n|---|---|\n| glm-4.7-flash | `ollama pull glm-4.7-flash` |\n| [devstral-small-2:24b](https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512) | `ollama pull devstral-small-2:24b` |\n| [qwen3-coder](https://huggingface.co/collections/Qwen/qwen3-coder) | `ollama pull qwen3-coder` |\n| [qwen3.5:27b](https://huggingface.co/Qwen/Qwen3.5-27B) | `ollama pull qwen3.5:27b` |\n| [gemma4:26b](https://huggingface.co/google/gemma-4-26B-A4B) | `ollama pull gemma4:26b` |\n\n## # Troubleshooting Common Issues\n\n**Connection refused when launching Claude Code:** The inference server is not running. This is the most common issue and the easiest to diagnose.\n\n```\n# Check if Ollama is running\ncurl http://localhost:11434\n# Expected: \"Ollama is running\"\n\n# Check if LM Studio server is running\ncurl http://localhost:1234/v1/models\n# Should return a JSON list of loaded models\n\n# Check if llama-server is running\ncurl http://localhost:8001/health\n# Should return {\"status\":\"ok\"}\n\n# If not running -- start the server first, then launch Claude Code\nollama serve          # Ollama\n# LM Studio: use the GUI Local Server tab\n# llama.cpp: run the llama-server command from the Backend 3 section\n```\n\n**Model not found or unknown model error:** The model name in your `ANTHROPIC_DEFAULT_SONNET_MODEL` does not match what the server knows.\n\n```\n# List all models Ollama has available\nollama list\n\n# The model name in ANTHROPIC_DEFAULT_SONNET_MODEL must match EXACTLY\n# including the tag -- \"glm-4.7-flash:latest\" not \"glm-4.7-flash\"\n\n# Verify with a direct API call to confirm what the server sees\ncurl http://localhost:11434/v1/models\n```\n\n**Tool calls failing or returning errors:** For streaming tool calls, which Claude Code uses when executing functions or 
scripts, Ollama version 0.14.3-rc1 or later is required. Earlier versions in the 0.14.x series had incomplete streaming tool call support.\n\n```\n# Check your Ollama version\nollama --version\n\n# If below 0.14.3, update Ollama\ncurl -fsSL https://ollama.com/install.sh | sh\n```\n\n**`anthropic-beta` header error:** You will see `Error: Unexpected value(s) for the anthropic-beta header`. This happens because Claude Code adds Anthropic-specific experimental beta flags that local servers do not recognize. Fix it by adding this to your `settings.json` env block:\n\n```\n\"CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS\": \"1\"\n```\n\n**Reverting to the Anthropic API:**\n\n```\n# Shell session -- unset the redirect variables\nunset ANTHROPIC_BASE_URL\nunset ANTHROPIC_AUTH_TOKEN\nunset ANTHROPIC_DEFAULT_SONNET_MODEL\nunset ANTHROPIC_DEFAULT_HAIKU_MODEL\nunset ANTHROPIC_DEFAULT_OPUS_MODEL\n\n# Then make sure your real API key is set\necho $ANTHROPIC_API_KEY\n# Should show your sk-ant-... key, not a placeholder\n\n# If you used settings.json -- remove the env block (JSON has no comments)\n# and restart Claude Code\n```\n\n**Slow generation speed:** For agentic Claude Code tasks, generation speed matters because each tool call is a round trip. If speed is inadequate:\n\n- Switch to a smaller or more aggressively quantized model (Q4_K_M instead of Q8).\n- Enable `--flash-attn` in llama.cpp if not already set.\n- Reduce context size (`--ctx-size`); larger contexts are slower to prefill.\n- On Ollama, raise the `num_gpu` model parameter (for example `PARAMETER num_gpu 99` in a Modelfile) to force maximum GPU offload.\n\n## # Conclusion\n\nWhat used to require fragile adapters and hacks is now a five-step process. Install the inference backend, pull a model, set a handful of environment variables, and Claude Code routes to your local machine instead of Anthropic's API. 
The configuration takes under five minutes once you have the model downloaded.\n\nThe practical result is a coding assistant that costs nothing to run after setup, has no rate limits, keeps your code entirely on your machine, and covers the vast majority of real coding use cases at quality levels that were not available in local models a year ago. Start with Ollama and `glm-4.7-flash`: it has the lowest hardware requirement, the most consistent tool-calling support, and the fastest path to a working setup. Once that is running, scale up the model based on your hardware and the quality level you actually need.\n\n[Shittu Olumide](https://www.linkedin.com/in/olumide-shittu/) is a software engineer and technical writer passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts.", "url": "https://wpnews.pro/news/pairing-claude-code-with-local-models", "canonical_source": "https://www.kdnuggets.com/pairing-claude-code-with-local-models", "published_at": "2026-06-12 14:00:59+00:00", "updated_at": "2026-06-12 14:55:59.954940+00:00", "lang": "en", "topics": ["ai-tools", "large-language-models", "ai-infrastructure", "ai-products", "artificial-intelligence"], "entities": ["Claude Code", "Ollama", "LM Studio", "llama.cpp"], "alternates": {"html": "https://wpnews.pro/news/pairing-claude-code-with-local-models", "markdown": "https://wpnews.pro/news/pairing-claude-code-with-local-models.md", "text": "https://wpnews.pro/news/pairing-claude-code-with-local-models.txt", "jsonld": "https://wpnews.pro/news/pairing-claude-code-with-local-models.jsonld"}}