{"slug": "save-claude-code-tokens-with-smart-routing", "title": "Save Claude Code Tokens with Smart Routing", "summary": "Brick, a Mixture-of-Models routing gateway, reads each prompt's capability and complexity to route queries to the best backend LLM, matching top model quality at lower cost. It offers a drop-in OpenAI-compatible endpoint for tools like Claude Code and Codex, cutting costs without losing quality.", "body_md": "Brick is a **Mixture-of-Models (MoM) routing gateway**. It reads each prompt's\n**capability** and **complexity**, then routes it to the best backend in a pool of\nopen- and closed-weight LLMs, matching the strongest single model's quality at a\nfraction of its cost. No cascades. No wasted calls. Drop-in `model: \"brick\"`.\n\n[When to use Brick](#-when-can-i-use-brick) · [Quickstart](#-quickstart) · [Why Brick](#-why-brick) · [Claude Code](#-brick--claude-code) · [Codex](#-use-it-on-codex) · [FAQ](#-faq) · [Benchmarks](#-results-dataset-a-n5504) · [How it works](#-how-it-works) · [Paper](#-paper)\n\nBrick is for anyone running against more than one model, or paying a flat rate for a single strong one. Three common cases:\n\n- **You have a pool of models and want each query to reach the right one.** Cheap prompts should not burn your most expensive model, and hard prompts should not be starved on a small one. Brick reads capability and complexity per query and dispatches accordingly, so the pool works as one graded system instead of a manual pick.\n- **You want to cut Claude Code / Codex costs without losing quality.** Put Brick in front of your coding agent and every request is routed to the cheapest model that can actually do the job, escalating only when the task needs it. You keep the same UX and pay for the hard turns, not the easy ones.\n- **You want to unify different models behind one tool.** Use OpenAI models, GLM, DeepSeek, Kimi, Qwen and others from inside Claude Code or Codex through a single OpenAI-compatible endpoint. 
Define the pool once in `config.yaml` and call `model: \"brick\"` everywhere.\n\nThe fastest working path today is the CLI, which self-hosts the router and wires it into\n**Claude Code** for you. Requires Node >= 18 and Docker.\n\n```\ngit clone https://github.com/regolo-ai/brick-SR1.git\ncd brick-SR1/apps/cli && npm install && npm run build && npm link\n\nbrick claude on     # starts the router + wires ANTHROPIC_BASE_URL in ~/.claude/settings.json\n```\n\nThen open a **new** Claude Code session and pick **brick-claude** in the `/model` picker.\nEvery request now routes to haiku / sonnet / opus by capability and complexity. See\n[Brick + Claude Code](#-brick--claude-code) for modes, the effort picker, and the live\n`brick claude status` dashboard.\n\n**Prefer a raw OpenAI-compatible gateway (no CLI)?**\n\nOnce the Docker image is published (see [Distribution channels](#-develop)), you'll\nbe able to run the gateway directly:\n\n```\ndocker run --rm -p 18000:18000 \\\n  -e REGOLO_API_KEY=$REGOLO_API_KEY \\\n  ghcr.io/regolo-ai/brick:latest      # published at the next v2.1.0 tag\n```\n\nThen call it like any OpenAI endpoint; just set `\"model\": \"brick\"`:\n\n```\ncurl http://localhost:18000/v1/chat/completions \\\n  -H \"Authorization: Bearer $REGOLO_API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\":\"brick\",\"messages\":[{\"role\":\"user\",\"content\":\"Prove that sqrt(2) is irrational\"}]}'\n```\n\nThe `x-selected-model` response header tells you which backend Brick picked.\nThat math prompt routes to a reasoning model; `\"Hello\"` routes to the cheapest one.\n\nUntil then, `brick serve` (from the CLI above) runs the same router locally from source.\n\n| | Single model | RouteLLM | FrugalGPT / Cascade | Brick |\n|---|---|---|---|---|\n| One call per query (no cascade waste) | ✅ | ✅ | ❌ | ✅ |\n| Capability-aware (6 dimensions) | n/a | ❌ binary | ❌ | ✅ |\n| Complexity-aware | n/a | partial | ✅ | ✅ |\n| Pool of N 
open + closed models | n/a | 2 | few | ✅ |\n| Continuous cost ↔ quality knob | ❌ | ❌ | threshold | ✅ `r ∈ [-1, 1]` |\n| Native multimodal (image / audio) | varies | ❌ | ❌ | ✅ |\n| Drop-in OpenAI-compatible | n/a | n/a | n/a | ✅ |\n\nCascade routers (FrugalGPT, Cascade Routing) call models one after another until a\nconfidence check passes, paying for every miss in tokens and latency. Brick makes a\n**single forward decision** per query, so there is nothing to waste.\n\nPut one OpenAI/Anthropic-compatible endpoint in front of Claude Code, and Brick routes every request to **haiku**, **sonnet**, or **opus** based on capability and complexity. You keep the Claude Code UX; Brick picks the cheapest model that can do the job.\n\n```\nbrick claude on     # wires ANTHROPIC_BASE_URL in ~/.claude/settings.json, auto-starts the router\n```\n\nThen:\n\n- Open a **new** Claude Code session (your current session is unaffected).\n- In the `/model` picker, select **brick-claude** (it sits alongside the built-in opus/sonnet/haiku aliases, which it does not replace).\n\nTo revert:\n\n```\nbrick claude off    # restores ANTHROPIC_BASE_URL, optionally stops the router\n```\n\nUse `brick claude on --no-start` to require an already-healthy router instead of auto-starting one, and `brick claude off --stop` / `--keep` to control the router without a prompt.\n\nA mode is how you tell Brick how much to spend. Each one maps easy/medium/hard queries to a model tier, from cheapest (`eco`, always haiku) to strongest (`max`, always opus), with `lite`, `mid` and `pro` in between. Pick one and Brick handles the per-query routing inside it.\n\nYou switch mode straight from the **thinking effort** slider in Claude Code's `/model` picker: low picks `eco`, medium `lite`, high `mid`, xhigh `pro`, and max `max`. So the effort control does not set a thinking budget; it selects the model tier. 
You can also switch explicitly with `brick claude mode` or `brick claude <mode>`.\n\n`mid` is the default. On 1M-context requests the map shifts up since Haiku has no 1M variant: easy and medium resolve to sonnet, hard to opus.\n\nOnce you have picked the tier, how hard to think is decided **autonomously per request** from the router's own signals (query difficulty plus the chosen model's headroom).\n\nSelecting **opus**, **sonnet**, or **haiku** explicitly in the picker skips Brick entirely: the request is forwarded verbatim to that exact model, with no skill routing and no effort override. Only **brick-claude** runs the router.\n\n```\nbrick claude status         # live dashboard (default in an interactive terminal)\nbrick claude status --once  # static one-shot view\n```\n\nThe dashboard reports, since the last router restart:\n\n- **Routed by model**: count and percent per model.\n- **Per-model effort distribution**: how reasoning effort spread out within each model.\n- **Difficulty mix**: the classifier's easy/medium/hard verdicts across routed requests.\n- **Economy**: an estimated `saved ~X% vs all-opus` over the routed request count (a relative estimate from request mix, excluding real token counts and caching).\n\nIt also shows connection/wiring state, classifier latency (avg, p50, p95), and fallback rate.\n\nBrick routing is per request. 
In Claude Code workflows and subagents, each agent's call is routed **independently** as long as that agent uses **brick-claude**, so a cheap subagent task can land on haiku while a hard one escalates to opus in the same run.\n\nThe same idea applies to OpenAI Codex: Brick sits in front of Codex and routes each request across your model pool, so you cut cost on easy turns and can drive Codex with non-OpenAI models through one OpenAI-compatible endpoint.\n\n```\nbrick codex on      # sets model/model_provider to brick in ~/.codex/config.toml, auto-starts the router\n```\n\nThis materializes a dedicated Codex profile (the OpenAI-pool skill router) and adds a managed provider pointing at the local router. Start a new Codex session and it now routes through Brick.\n\nTo revert:\n\n```\nbrick codex off     # restores your previous Codex model/provider\n```\n\nCodex exposes the same 5 modes and status view as Claude Code:\n\n```\nbrick codex mode           # or: brick codex eco | lite | mid | pro | max\nbrick codex status         # live routing dashboard\n```\n\nUse `brick codex on --no-start` to require an already-healthy router instead of auto-starting one. The Claude and Codex router stacks share host port 8000, so only one can serve at a time; stop the other before wiring.\n\nYou do not need a coding agent. 
Brick is a plain OpenAI-compatible gateway you can call from any client, script, or app.\n\n```\nbrick serve                       # docker compose up on http://localhost:18000\nbrick chat                        # TUI chat against the local router\nbrick route \"what is 2+2?\"        # print the routing decision for a prompt, no call made\n```\n\nCall it like any OpenAI endpoint; just set `\"model\": \"brick\"`:\n\n```\ncurl http://localhost:18000/v1/chat/completions \\\n  -H \"Authorization: Bearer $REGOLO_API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\":\"brick\",\"messages\":[{\"role\":\"user\",\"content\":\"Prove that sqrt(2) is irrational\"}]}'\n```\n\nThe `x-selected-model` response header tells you which backend Brick picked. That math prompt routes to a reasoning model; `\"Hello\"` routes to the cheapest one.\n\nEverything Brick decides comes from `config.yaml`. The core block is `skill_router`, where you declare the pool, each model's skill vector, and its cost weight:\n\n```\nskill_router:\n  enabled: true\n  capabilities:                 # the 6 dimensions every query and model live in\n    - coding\n    - creative_synthesis\n    - instruction_following\n    - math_reasoning\n    - planning_agentic\n    - world_knowledge\n\n  models:\n    - model: \"qwen3.5-9b\"\n      skill_vector: [0.71, 0.51, 0.81, 0.91, 0.58, 0.18]   # capability per dimension\n      use_reasoning: false\n      cost_weight: 0.10                                     # relative price, drives the cost bias\n    - model: \"deepseek-v4-flash\"\n      skill_vector: [0.82, 0.66, 0.86, 0.93, 0.62, 0.49]\n      use_reasoning: false\n      cost_weight: 0.40\n    - model: \"kimi2.6\"\n      skill_vector: [0.90, 0.75, 0.87, 0.94, 0.64, 0.34]\n      use_reasoning: true\n      reasoning_effort: \"medium\"\n      cost_weight: 0.60\n```\n\nAdd or swap any OpenAI-compatible backend here; the backends themselves are declared under `provider_profiles` / 
`model_config` (the shipped config points them all at Regolo). Two more blocks let you nudge routing without touching the math:\n\n```\n  keyword_rules:\n    - name: \"force_coder\"       # hard override: send these prompts to a specific model\n      mode: \"override\"\n      model: \"kimi2.6\"\n      operator: \"OR\"\n      keywords: [\"debug\", \"refactor\", \"compile\", \"write a function\"]\n    - name: \"coding_bias\"       # soft nudge: push one capability dimension up\n      mode: \"bias\"\n      capability: \"coding\"\n      operator: \"OR\"\n      keywords: [\"python\", \"rust\", \"sql\", \"async\"]\n```\n\nOther useful sections: `brick` (multimodal preprocessing: STT, OCR, vision), the `r` preference knob in `r ∈ [-1, 1]` (max-saving to max-quality), and the classifier endpoints. The CLI can edit most of this for you (`brick add model`, `brick config edit`), or you can edit the YAML directly. Full field reference: [apps/router/README.md](/regolo-ai/brick-SR1/blob/main/apps/router/README.md).\n\nA monorepo to **run**, **use**, and **reproduce** every result in the Brick paper.\n\n| Component | Path | Purpose |\n|---|---|---|\n| Router (Go + Rust) | `apps/router/` | |\n| **CLI** (`brick`) | `apps/cli/` | |\n| **Training** | `packages/training/` | |\n| **Evaluation** | `packages/evals/` | |\n| **Baselines** | `packages/evals/baselines/` | |\n| **Paper** | `docs/paper/` | |\n\n## Full directory tree\n\n```\nbrick-SR1/\n├── apps/\n│   ├── router/                 # Go + Rust gateway (was vLLM Spatial Router fork)\n│   │   ├── src/spatial-router/ #   Go (HTTP proxy, routing pipeline)\n│   │   ├── candle-binding/     #   Rust (ML embeddings via candle)\n│   │   ├── ml-binding/         #   Rust (Linfa classical ML)\n│   │   ├── nlp-binding/        #   Rust (BM25 + n-gram)\n│   │   └── Dockerfile\n│   └── cli/                    # @regolo-ai/brick CLI (TypeScript + oclif + ink)\n├── packages/\n│   ├── training/               # Dataset B pipeline + ModernBERT/complexity training\n│   ├── 
evals/                  # Dataset A graders + 00..140 pipeline + baselines/\n│   └── datasets/               # HF download recipes (no data in git)\n├── docs/\n│   ├── paper/                  # paper.tex + figures + compiled PDF\n│   └── quickstart/             # quick.md, serve.md, eval.md\n├── deploy/                     # docker-compose, addons, Windows installer\n├── config.yaml                 # router runtime config\n├── package.json / pyproject.toml  # npm + uv workspace roots\n└── Makefile                    # build / test / lint / docker-build / release\n```\n\n```\nmake install   # npm install (apps/cli) + uv sync (packages/*)\nmake build     # CLI + router Docker image\nmake test      # Go tests + Python pytest + CLI vitest\nmake lint      # pre-commit run --all-files\n```\n\nPer-component docs: [router](/regolo-ai/brick-SR1/blob/main/apps/router/README.md) · [CLI](/regolo-ai/brick-SR1/blob/main/apps/cli/README.md) · [training](/regolo-ai/brick-SR1/blob/main/packages/training/README.md) · [evals](/regolo-ai/brick-SR1/blob/main/packages/evals/README.md) · [datasets](/regolo-ai/brick-SR1/blob/main/packages/datasets/README.md) · [baselines](/regolo-ai/brick-SR1/blob/main/packages/evals/baselines/README.md).\n\n## Distribution channels (work in progress)\n\n| Channel | Status |\n|---|---|\n| Source clone + `npm link` | available |\n| Docker GHCR (`ghcr.io/regolo-ai/brick`) | pending first push (tag `v2.1.0`) |\n| npm (`@regolo-ai/brick`) | pending `NPM_TOKEN` secret |\n| Docker Hub mirror (`docker.io/regolo/brick`) | pending Docker Hub secrets |\n\n**How is Brick different from a cascade router like FrugalGPT?**\n\nA cascade calls models in sequence (cheap first, escalate on low confidence) and pays for every miss in tokens and latency. Brick makes a single forward decision per query from a capability vector and a complexity score, so there is no wasted call. 
See [Why Brick](#-why-brick).\n\n**Which backend did Brick pick for my request?**\n\nRead the `x-selected-model` response header. Every `/v1/chat/completions` and `/v1/messages` response carries it.\n\n**How do I trade cost against quality?**\n\nSlide the `r` knob in `r ∈ [-1, 1]`. At `r = -1` Brick favors the cheapest capable model (max-saving); at `r = 1` it favors the strongest (max-quality). For Claude Code the same idea is exposed as 5 named modes; see [the 5 modes](#the-5-modes).\n\n**Do I need GPUs to run the gateway?**\n\nNo. The router and both classifiers run on CPU. GPUs only matter if you self-host the backend LLMs; with a hosted pool (Regolo, Anthropic, etc.) a CPU box is enough.\n\n**Can I use my own model pool?**\n\nYes. The pool, per-model skill vectors, costs, and the `model_map` live in `config.yaml` (`skill_router.models`). Add or swap any OpenAI-compatible backend. See [apps/router/README.md](/regolo-ai/brick-SR1/blob/main/apps/router/README.md).\n\n**Why is the upstream for the OpenAI-compatible endpoint failing with 401/insufficient_quota?**\n\nThat error comes from the backend provider, not Brick. Check the credential you forward (`REGOLO_API_KEY` or your own key); Brick passes Authorization through unchanged.\n\nContributions are welcome. The short loop:\n\n```\nmake install   # deps for CLI + Python workspaces\nmake test      # Go + pytest + vitest, run before opening a PR\nmake lint      # pre-commit run --all-files\n```\n\n- Open an [issue](https://github.com/regolo-ai/brick-SR1/issues) to discuss non-trivial changes first.\n- Branch from `main`, keep commits focused, follow the existing style of the files you touch.\n- Make sure `make test` and `make lint` pass.\n
- Open a PR with a clear description of the what and the why.\n\nFor architecture and per-component conventions, start from [What's in the repo](#-whats-in-the-repo) and the component READMEs linked under [Develop](#-develop).\n\nEverything below reproduces the research behind Brick: the benchmark numbers, the routing algorithm, the datasets and models, and the paper itself.\n\nBrick sits on the **Pareto frontier** of cost vs quality, dominating single-model baselines and prior routers (RouteLLM, FrugalGPT, Cascade Routing) and approaching the oracle ceiling.\n\n| Setting | Accuracy | Cost (× cheapest) | Latency (avg) |\n|---|---|---|---|\n| Always Qwen3.5-9b | 65.4% | 1.0× | 8.1 s |\n| Always DeepSeek-v4-flash | 71.2% | 4.0× | 14.7 s |\n| Always Kimi2.6 | 75.02% | 6.0× | 51.2 s |\n| Brick (max-quality) | 76.98% | 1.5× | 22.8 s |\n| Brick (max-saving) | 72.4% | 1.0× | 9.4 s |\n| Oracle bound (3-model pool) | 83.25% | n/a | n/a |\n\n**Brick beats always-Kimi at ~4× lower cost and roughly half the latency.** Inter-rater agreement on the 3-judge eval panel: κ = 0.761. Full per-dimension breakdown and baseline reproduction in [packages/evals/baselines/RESULTS.md](/regolo-ai/brick-SR1/blob/main/packages/evals/baselines/RESULTS.md).\n\nFor every request the router computes a **capability vector** and a **complexity score**, then picks the model whose skill profile is closest to what the query needs.\n\n```mermaid\nflowchart LR\n  Q([Query]) --> C[Capability classifier<br/>ModernBERT → p&#40;x&#41; ∈ Δ⁶]\n  Q --> X[Complexity classifier<br/>Qwen3.5-0.8B + LoRA → τ]\n  C --> R{{Skill-distance argmin<br/>Jₘ = Dₘ + β·aₘ}}\n  X --> R\n  R --> M1[qwen3.5-9b]\n  R --> M2[deepseek-v4-flash]\n  R --> M3[kimi2.6]\n```\n\nThe query and each model live as vectors in the same capability space. 
The winner is the model whose skill vector is nearest to the query's needs, biased by a cost term:\n\n- **Capability** `p(x) ∈ Δ⁶`: soft assignment over `coding`, `creative_synthesis`, `instruction_following`, `math_reasoning`, `planning_agentic`, `world_knowledge` (`brick-modernbert-capability-classifier`).\n- **Complexity** `τ ∈ {easy, medium, hard}` (`brick-complexity-2-eco`, Qwen3.5-0.8B + LoRA).\n- **Objective** per model: `Jₘ = Dₘ + β·aₘ`, distance `Dₘ = ‖p(x) − sₘ‖` plus normalized cost `aₘ`.\n- **Argmin** over the pool → selected backend. The `r` knob slides the whole pool from max-saving to max-quality.\n\nMultimodal inputs are preprocessed (OCR, Whisper-compatible STT) then routed as text, or forwarded directly to a vision model. Details in [apps/router/README.md](/regolo-ai/brick-SR1/blob/main/apps/router/README.md) and the [paper](/regolo-ai/brick-SR1/blob/main/docs/paper/paper.pdf) §3.\n\n## Full evaluation pipeline (Dataset A, 5,504 queries)\n\n```\ngit clone https://github.com/regolo-ai/brick-SR1 && cd brick-SR1\n\nuv sync                                                  # Python workspaces\ncd apps/cli && npm install && cd ../..                   
# CLI\n\n# Download HF artifacts (datasets + models)\npython packages/datasets/scripts/download_dataset_a.py --out ./data/dataset_a\npython packages/datasets/scripts/download_models.py     --out ./models\n\n# Inference + grading\npython packages/evals/scripts/100_run_inference.py  --config packages/evals/configs/protocols.yaml\npython packages/evals/scripts/110_grade_inference.py\npython packages/evals/scripts/130_aggregate_results.py | tee results.txt\n\n# Expected: Brick max-quality ≈ 76.98% accuracy, oracle bound ≈ 83.25%\n```\n\nFull pipeline (judges, baselines, cost/Pareto analysis): [docs/quickstart/eval.md](/regolo-ai/brick-SR1/blob/main/docs/quickstart/eval.md).\n\n| Artifact | HF Repo | Type | Notes |\n|---|---|---|---|\n| Dataset A (eval) | `regolo/brick-dataset-A-routing-eval` | | |\n| | `massaindustries/dataset-B-modernbert-train` | | |\n| | `regolo/brick-modernbert-capability-classifier` | | |\n| | `regolo/brick-complexity-2-eco` | | |\n\nDownload recipes: [packages/datasets/](/regolo-ai/brick-SR1/blob/main/packages/datasets).\n\n**Brick and the Mixture-of-Models (MoM) Paradigm: Bridging Open- and Closed-Weight LLM Pools**\n\nFrancesco Massa, Marco Cristofanilli (2026) · Built at [Regolo.ai](Seeweb)\n\nPre-built PDF: [docs/paper/paper.pdf](/regolo-ai/brick-SR1/blob/main/docs/paper/paper.pdf) · compile with `cd docs/paper && latexmk -pdf paper.tex`.\n\n```\n@misc{massa2026brick,\n  title  = {Brick and the Mixture-of-Models ({MoM}) Paradigm:\n            Bridging Open- and Closed-Weight {LLM} Pools},\n  author = {Massa, Francesco and Cristofanilli, Marco},\n  year   = {2026},\n  url    = {https://github.com/regolo-ai/brick-SR1}\n}\n```\n\n", "url": "https://wpnews.pro/news/save-claude-code-tokens-with-smart-routing", "canonical_source": "https://github.com/regolo-ai/brick-SR1", "published_at": "2026-07-03 22:39:24+00:00", "updated_at": "2026-07-03 22:49:30.566037+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-tools", "developer-tools"], 
"entities": ["Brick", "Claude Code", "Codex", "OpenAI", "Anthropic", "Regolo AI"], "alternates": {"html": "https://wpnews.pro/news/save-claude-code-tokens-with-smart-routing", "markdown": "https://wpnews.pro/news/save-claude-code-tokens-with-smart-routing.md", "text": "https://wpnews.pro/news/save-claude-code-tokens-with-smart-routing.txt", "jsonld": "https://wpnews.pro/news/save-claude-code-tokens-with-smart-routing.jsonld"}}