Maslul – Smart LLM router – one call, the right model

wpnews.pro

Smart LLM router — one call, the right model.

Async and fully typed, across Anthropic, Gemini, xAI Grok, and OpenAI — routing each request to the right model tier by difficulty. Stop hardcoding model choices and stop re-writing the tool-use / structured-output / web-search / retry plumbing for every provider.

maslul

(Hebrew מסלול, "route / lane") is a small library that does exactly two things: routing (pick a model tier per request, or pin one) and provider normalization (one Request

/Response

shape for every SDK). No server, no CLI, no heavy ML deps — providers live behind extras, and the core is stdlib-only.

import asyncio
from maslul import Router, Request, Message

router = Router.from_toml("maslul.toml")           # tiers + classifier + providers, from config

async def main() -> None:
    resp = await router.complete(Request(messages=[Message(role="user", content="Hello!")]))
    print(resp.text, "·", resp.level_used, "·", resp.usage.output_tokens, "tokens")

asyncio.run(main())
pip install "maslul[anthropic,gemini,grok]"     # or just the providers you use

Each provider's SDK lives behind an extra, so import maslul

pulls in none of them — you only install what you route to. maslul[anthropic]

→ anthropic

; maslul[gemini]

→ google-genai

; maslul[grok]

→ xai-sdk

; maslul[openai]

→ openai

.

maslul is a library, not a gateway — you embed the routing brain in your app, you don't run a proxy in front of it.

maslul | RouteLLM | LiteLLM | | |---|---|---|---| | Shape | async library you embed (no server) | research framework / trained router | unified SDK + proxy server | | Routing | difficulty tiers + swappable strategies (route_default / classify / classify_and_answer / verify_cascade ) + injectable bypass / classifier / verifier hooks | a trained strong-vs-weak router | manual config / fallback lists, load-balancing | | Providers | Anthropic · Gemini · Grok · OpenAI, normalized | model-agnostic (you wire models) | 100+ providers | | Tools / structured / vision | one normalized loop for all | — | per-provider | Web search | one flag, every provider → Response.sources | — | per-provider | | Caching | exact + semantic (in-process) | — | exact + semantic (proxy) | | Typing / footprint | fully typed, py.typed ; stdlib core, SDKs behind extras | research code | larger; server to operate |

Choose maslul when you want a typed async library you embed — difficulty routing with your own strategy + hooks, and one Request

/Response

over several providers (tools, structured output, vision, web search, retries, cost cache) — without standing up a gateway. Reach for LiteLLM when you want a provider proxy across 100+ models, or RouteLLM when you specifically want a trained router.

flowchart LR
    R["complete(req)"] --> M{"model= pin?"}
    M -- yes --> RUN["run that model"]
    M -- no --> L{"level= pin?"}
    L -- yes --> RUN
    L -- no --> B{"bypass_predicate?"}
    B -- "tier" --> RUN
    B -- "None" --> H{"hard_signal?<br/>(media · code · long · intent verbs)"}
    H -- "yes" --> HARD["HARD tier"] --> RUN
    H -- "no" --> S["strategy<br/>route_default · classify ·<br/>classify_and_answer · verify_cascade"] --> RUN
    RUN --> X["tool loop · web search ·<br/>retry / fallback · usage breakdown"]

Difficulty is not readable from surface features — a short prompt can be very hard, a long paste trivial — so maslul never applies a short ⇒ simple

rule. You choose how each request is routed, in this precedence order:

from maslul import Level

await router.complete(req, model="anthropic:claude-opus-4-8")  # 0. pin an exact model
await router.complete(req, level=Level.HARD)                   # 1. pin a difficulty tier
await router.complete(req)                                     # 2-4. let the router decide

When you don't pin, the routing brain runs: a deterministic bypass (your fast-path, e.g. greetings → SIMPLE) → a hard-signal detector (intent verbs, code, attachments, long context → HARD, up-only) → the configured strategy for the ambiguous middle:

Strategy	Cost for the middle	What it does
`ROUTE_DEFAULT`
0 calls	Default-to-capable (`default_level` ). Best for low volume.
`CLASSIFY`
1 classify + 1 answer	A cheap dedicated classifier model labels the level (cached + budget-guarded), then dispatch.
`CLASSIFY_AND_ANSWER`
1 call	The classifier model answers directly, or emits an escalation sentinel to bump to a stronger tier.
`VERIFY_CASCADE`
1 cheap + verify	Answer cheap, run your verifier, escalate if it rejects — catches silent under-escalation.

All three injection points are yours to supply:

def my_classifier(req):      # your own difficulty call (sync or async); None defers to the strategy
    return Level.SIMPLE if is_trivial(req) else None

def my_verifier(req, resp):  # VERIFY_CASCADE: True keeps the cheap answer, False escalates
    return "I don't know" not in resp.text

router = Router.from_toml("maslul.toml", classifier=my_classifier, verifier=my_verifier)

The same Request

/Response

works across all three providers:

from maslul import Request, Message, ToolDef, ToolCall, MediaPart

async def get_weather(call: ToolCall) -> str:
    return f"18°C in {call.input['city']}"

req = Request(
    messages=[Message(role="user", content="Weather in Paris?")],
    tools=[ToolDef(name="get_weather", description="Current weather for a city.",
                   input_schema={"type": "object", "properties": {"city": {"type": "string"}},
                                 "required": ["city"]})],
    tool_executor=get_weather,
)

req = Request(messages=[Message(role="user", content="Extract name + age")],
              response_format={"type": "object", "properties": {"name": {"type": "string"},
                                                                "age": {"type": "integer"}}})

req = Request(messages=[Message(role="user", content="What's in this image?")],
              media=[MediaPart(mime_type="image/png", data=png_bytes)])

req = Request(messages=[Message(role="user", content="Latest news on X?")], web_search=True)
python
def on_usage(resp):                         # per-model token breakdown for monitoring
    for rec in resp.usage_records:
        metrics.incr(f"{rec.provider}:{rec.model}", rec.usage.output_tokens)

router = Router.from_toml("maslul.toml", on_complete=on_usage)

Transient errors (RateLimited

, Timeout

) retry with exponential backoff; on persistent failure the request falls back to the next-higher tier — which may be a different provider, giving you cross-provider failover for free. AuthError

fails fast. Hooks: on_route

(the RoutingDecision

), on_complete

(the final Response

with usage_records

), on_error

(each failed attempt).

Build a router with missing_provider="degrade"

and any tier whose provider isn't configured (e.g. a Grok tier with no XAI_API_KEY

) falls back to the nearest available tier instead of erroring — so one config runs across deploys that have different keys.

A [maslul.cache]

config returns a prior Response

instead of calling a model — exact

(identical request) or semantic

(nearest request above a cosine threshold, using an embedder you inject, since maslul ships no embeddings). A hit comes back with cached=True

and zeroed usage, so monitoring sees the saving. Tool-using requests are never cached.

[maslul.cache]
mode = "semantic"          # off | exact | semantic
max_entries = 1000
ttl_seconds = 86400
similarity_threshold = 0.95
router = Router.from_toml("maslul.toml", embed=my_async_embed)   # embed only needed for semantic

A TOML file (or a plain dict

— Router(config={...})

):

[maslul]
strategy = "route_default"        # route_default | classify | classify_and_answer | verify_cascade
default_level = "hard"            # default-to-capable for the ambiguous middle
min_tokens_to_classify = 40       # CLASSIFY budget guard
request_timeout = 60              # per-call seconds (optional)
max_retries = 2
fallback = true                   # escalate to a higher tier on persistent failure

[maslul.tiers.simple]
provider = "gemini"
model = "gemini-2.5-flash-lite"
[maslul.tiers.medium]
model = "anthropic:claude-haiku-4-5"   # or the provider:model shorthand
[maslul.tiers.hard]
model = "anthropic:claude-sonnet-4-6"

[maslul.classifier]               # required for the classify strategies
model = "anthropic:claude-haiku-4-5"

[maslul.providers.anthropic]
api_key_env = "ANTHROPIC_API_KEY"      # secrets by env-var name, never inlined
[maslul.providers.gemini]
vertex_project = "my-gcp-project"      # Vertex AI + Application Default Credentials (no key)
vertex_location = "global"
[maslul.providers.grok]
api_key_env = "XAI_API_KEY"

Pointing a capability at a different model or provider is a one-line config change — no code deploy. Providers can also be injected directly (Router(config, providers={...})

) for tests or custom wiring.

Provider	SDK (extra)	Auth
`anthropic`
`anthropic`
`ANTHROPIC_API_KEY`
`gemini`
`google-genai`
Vertex AI + ADC (`vertex_project` ), or a Gemini Developer API key
`grok`
`xai-sdk`
`XAI_API_KEY`
`openai`
`openai`
`OPENAI_API_KEY`

Beta (0.2.x

), fully typed (py.typed

), async-first. Routing, tool use, structured output, vision, web search across all three providers (web_search=True

), the four strategies, and retry/fallback resilience are implemented and exercised against live APIs.

source & further reading

github.com — original article

Maslul – Smart LLM router – one call, the right model

Run your AI side-project on zahid.host