{"slug": "maslul-smart-llm-router-one-call-the-right-model", "title": "Maslul – Smart LLM router – one call, the right model", "summary": "Maslul, a new open-source Python library, provides smart LLM routing and provider normalization across Anthropic, Gemini, Grok, and OpenAI, allowing developers to route each request to the right model tier by difficulty without hardcoding model choices or rewriting plumbing for each provider. The library is async, fully typed, and embeddable as a library rather than a gateway, with features including difficulty-based routing, tool use, structured output, web search, and caching.", "body_md": "**Smart LLM router — one call, the right model.**\n\nAsync and fully typed, across Anthropic, Gemini, xAI Grok, and OpenAI — routing each request to the right model tier by difficulty. Stop hardcoding model choices and stop re-writing the tool-use / structured-output / web-search / retry plumbing for every provider.\n\n`maslul` (Hebrew *מסלול*, \"route / lane\") is a small library that does exactly two things:\n**routing** (pick a model tier per request, or pin one) and **provider normalization** (one\n`Request`/`Response` shape for every SDK). No server, no CLI, no heavy ML deps — providers\nlive behind extras, and the core is stdlib-only.\n\n``` python\nimport asyncio\nfrom maslul import Router, Request, Message\n\nrouter = Router.from_toml(\"maslul.toml\")           # tiers + classifier + providers, from config\n\nasync def main() -> None:\n    resp = await router.complete(Request(messages=[Message(role=\"user\", content=\"Hello!\")]))\n    print(resp.text, \"·\", resp.level_used, \"·\", resp.usage.output_tokens, \"tokens\")\n\nasyncio.run(main())\n```\n\n``` bash\npip install \"maslul[anthropic,gemini,grok]\"     # or just the providers you use\n```\n\nEach provider's SDK lives behind an extra, so `import maslul` pulls in **none** of them — you\nonly install what you route to. 
`maslul[anthropic]` → `anthropic`; `maslul[gemini]` → `google-genai`; `maslul[grok]` → `xai-sdk`; `maslul[openai]` → `openai`.\n\nmaslul is **a library, not a gateway** — you embed the routing brain in your app, you don't run a\nproxy in front of it.\n\n| | maslul | RouteLLM | LiteLLM |\n|---|---|---|---|\n| Shape | async library you embed (no server) | research framework / trained router | unified SDK + proxy server |\n| Routing | difficulty tiers + swappable strategies (`route_default` / `classify` / `classify_and_answer` / `verify_cascade`) + injectable `bypass` / `classifier` / `verifier` hooks | a trained strong-vs-weak router | manual config / fallback lists, load-balancing |\n| Providers | Anthropic · Gemini · Grok · OpenAI, normalized | model-agnostic (you wire models) | 100+ providers |\n| Tools / structured / vision | one normalized loop for all | — | per-provider |\n| Web search | one flag, every provider → `Response.sources` | — | per-provider |\n| Caching | exact + semantic (in-process) | — | exact + semantic (proxy) |\n| Typing / footprint | fully typed, `py.typed`; stdlib core, SDKs behind extras | research code | larger; server to operate |\n\n**Choose maslul when** you want a typed async library you embed — difficulty routing with your own\nstrategy + hooks, and one `Request`/`Response` over several providers (tools, structured output,\nvision, **web search**, retries, cost cache) — *without* standing up a gateway. 
Reach for **LiteLLM**\nwhen you want a provider proxy across 100+ models, or **RouteLLM** when you specifically want a\ntrained router.\n\n``` mermaid\nflowchart LR\n    R[\"complete(req)\"] --> M{\"model= pin?\"}\n    M -- yes --> RUN[\"run that model\"]\n    M -- no --> L{\"level= pin?\"}\n    L -- yes --> RUN\n    L -- no --> B{\"bypass_predicate?\"}\n    B -- \"tier\" --> RUN\n    B -- \"None\" --> H{\"hard_signal?<br/>(media · code · long · intent verbs)\"}\n    H -- \"yes\" --> HARD[\"HARD tier\"] --> RUN\n    H -- \"no\" --> S[\"strategy<br/>route_default · classify ·<br/>classify_and_answer · verify_cascade\"] --> RUN\n    RUN --> X[\"tool loop · web search ·<br/>retry / fallback · usage breakdown\"]\n```\n\nDifficulty is **not** readable from surface features — a short prompt can be very hard, a long\npaste trivial — so maslul never applies a `short ⇒ simple` rule. You choose how each request is\nrouted, in this precedence order:\n\n``` python\nfrom maslul import Level\n\nawait router.complete(req, model=\"anthropic:claude-opus-4-8\")  # 0. pin an exact model\nawait router.complete(req, level=Level.HARD)                   # 1. pin a difficulty tier\nawait router.complete(req)                                     # 2-4. let the router decide\n```\n\nWhen you don't pin, the **routing brain** runs: a deterministic **bypass** (your fast-path, e.g.\ngreetings → SIMPLE) → a **hard-signal** detector (intent verbs, code, attachments, long context →\nHARD, *up-only*) → the configured **strategy** for the ambiguous middle:\n\n| Strategy | Cost for the middle | What it does |\n|---|---|---|\n| `ROUTE_DEFAULT` | 0 calls | Default-to-capable (`default_level`). Best for low volume. |\n| `CLASSIFY` | 1 classify + 1 answer | A cheap dedicated classifier model labels the level (cached + budget-guarded), then dispatch. |\n| `CLASSIFY_AND_ANSWER` | 1 call | The classifier model answers directly, or emits an escalation sentinel to bump to a stronger tier. |\n| `VERIFY_CASCADE` | 1 cheap + verify | Answer cheap, run your verifier, escalate if it rejects — catches silent under-escalation. |\n\nAll three injection points are yours to supply:\n\n``` python\ndef my_classifier(req):      # your own difficulty call (sync or async); None defers to the strategy\n    return Level.SIMPLE if is_trivial(req) else None\n\ndef my_verifier(req, resp):  # VERIFY_CASCADE: True keeps the cheap answer, False escalates\n    return \"I don't know\" not in resp.text\n\nrouter = Router.from_toml(\"maslul.toml\", classifier=my_classifier, verifier=my_verifier)\n```\n\nThe same `Request`/`Response` works across every provider:\n\n``` python\nfrom maslul import Request, Message, ToolDef, ToolCall, MediaPart\n\n# Tools — the router runs a provider-agnostic tool-use loop\nasync def get_weather(call: ToolCall) -> str:\n    return f\"18°C in {call.input['city']}\"\n\nreq = Request(\n    messages=[Message(role=\"user\", content=\"Weather in Paris?\")],\n    tools=[ToolDef(name=\"get_weather\", description=\"Current weather for a city.\",\n                   input_schema={\"type\": \"object\", \"properties\": {\"city\": {\"type\": \"string\"}},\n                                 \"required\": [\"city\"]})],\n    tool_executor=get_weather,\n)\n\n# Structured output — response_format → resp.structured (parsed)\nreq = Request(messages=[Message(role=\"user\", content=\"Extract name + age\")],\n              response_format={\"type\": \"object\", \"properties\": {\"name\": {\"type\": \"string\"},\n                                                                \"age\": {\"type\": \"integer\"}}})\n\n# Vision — images / PDFs\nreq = Request(messages=[Message(role=\"user\", content=\"What's in this image?\")],\n              media=[MediaPart(mime_type=\"image/png\", data=png_bytes)])\n\n# Web search — one flag, grounded on ANY provider (Anthropic web_search / Gemini Google Search /\n# Grok Agent Tools); citations land in resp.sources regardless of which model answers.\nreq = Request(messages=[Message(role=\"user\", content=\"Latest news on X?\")], web_search=True)\n```\n\n``` python\ndef on_usage(resp):                         # per-model token breakdown for monitoring\n    for rec in resp.usage_records:\n        metrics.incr(f\"{rec.provider}:{rec.model}\", rec.usage.output_tokens)\n\nrouter = Router.from_toml(\"maslul.toml\", on_complete=on_usage)\n```\n\nTransient errors (`RateLimited`, `Timeout`) retry with exponential backoff; on persistent failure\nthe request **falls back to the next-higher tier** — which may be a different provider, giving you\ncross-provider failover for free. `AuthError` fails fast. Hooks: `on_route` (the `RoutingDecision`),\n`on_complete` (the final `Response` with `usage_records`), `on_error` (each failed attempt).\n\nBuild a router with `missing_provider=\"degrade\"` and any tier whose provider isn't configured\n(e.g. a Grok tier with no `XAI_API_KEY`) **falls back to the nearest available tier** instead of\nerroring — so one config runs across deploys that have different keys.\n\nA `[maslul.cache]` config returns a prior `Response` instead of calling a model — `exact` (identical\nrequest) or `semantic` (nearest request above a cosine threshold, using an embedder you inject, since\nmaslul ships no embeddings). A hit comes back with `cached=True` and **zeroed usage**, so monitoring\nsees the saving. 
Tool-using requests are never cached.\n\n``` toml\n[maslul.cache]\nmode = \"semantic\"          # off | exact | semantic\nmax_entries = 1000\nttl_seconds = 86400\nsimilarity_threshold = 0.95\n```\n\n``` python\nrouter = Router.from_toml(\"maslul.toml\", embed=my_async_embed)   # embed only needed for semantic\n```\n\nA TOML file (or a plain `dict` — `Router(config={...})`):\n\n``` toml\n[maslul]\nstrategy = \"route_default\"        # route_default | classify | classify_and_answer | verify_cascade\ndefault_level = \"hard\"            # default-to-capable for the ambiguous middle\nmin_tokens_to_classify = 40       # CLASSIFY budget guard\nrequest_timeout = 60              # per-call seconds (optional)\nmax_retries = 2\nfallback = true                   # escalate to a higher tier on persistent failure\n\n[maslul.tiers.simple]\nprovider = \"gemini\"\nmodel = \"gemini-2.5-flash-lite\"\n[maslul.tiers.medium]\nmodel = \"anthropic:claude-haiku-4-5\"   # or the provider:model shorthand\n[maslul.tiers.hard]\nmodel = \"anthropic:claude-sonnet-4-6\"\n\n[maslul.classifier]               # required for the classify strategies\nmodel = \"anthropic:claude-haiku-4-5\"\n\n[maslul.providers.anthropic]\napi_key_env = \"ANTHROPIC_API_KEY\"      # secrets by env-var name, never inlined\n[maslul.providers.gemini]\nvertex_project = \"my-gcp-project\"      # Vertex AI + Application Default Credentials (no key)\nvertex_location = \"global\"\n[maslul.providers.grok]\napi_key_env = \"XAI_API_KEY\"\n```\n\nPointing a capability at a different model or provider is a one-line config change — no code\ndeploy. 
Providers can also be injected directly (`Router(config, providers={...})`) for tests or\ncustom wiring.\n\n| Provider | SDK (extra) | Auth |\n|---|---|---|\n| `anthropic` | `anthropic` | `ANTHROPIC_API_KEY` |\n| `gemini` | `google-genai` | Vertex AI + ADC (`vertex_project`), or a Gemini Developer API key |\n| `grok` | `xai-sdk` | `XAI_API_KEY` |\n| `openai` | `openai` | `OPENAI_API_KEY` |\n\nBeta (`0.2.x`), fully typed (`py.typed`), async-first. Routing, tool use, structured output,\nvision, **web search across all three providers** (`web_search=True`), the four strategies, and\nretry/fallback resilience are implemented and exercised against live APIs.\n\n[MIT](/iliatankelevich/maslul/blob/main/LICENSE) © Ilia Tankelevich", "url": "https://wpnews.pro/news/maslul-smart-llm-router-one-call-the-right-model", "canonical_source": "https://github.com/iliatankelevich/maslul", "published_at": "2026-06-18 05:17:29+00:00", "updated_at": "2026-06-18 05:52:56.473106+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "developer-tools"], "entities": ["Maslul", "Anthropic", "Gemini", "Grok", "OpenAI", "RouteLLM", "LiteLLM"], "alternates": {"html": "https://wpnews.pro/news/maslul-smart-llm-router-one-call-the-right-model", "markdown": "https://wpnews.pro/news/maslul-smart-llm-router-one-call-the-right-model.md", "text": "https://wpnews.pro/news/maslul-smart-llm-router-one-call-the-right-model.txt", "jsonld": "https://wpnews.pro/news/maslul-smart-llm-router-one-call-the-right-model.jsonld"}}