If you run automations that summarize, classify, or pull fields out of text at volume, the LLM step is where per-token pricing turns budgeting into a guessing game: one batch of long inputs and the bill spikes. For these bounded-output jobs, a flat price per call fits better than a per-token frontier model. Here is how I wire it into n8n / Make, and when not to.
Automation runs are repetitive and high-volume, and the outputs are short by nature: a summary, a label, a few extracted fields. I route them through Modelis, an OpenAI-compatible gateway that auto-routes each request to a fitting model and charges a flat price per call with output capped at ~1024 tokens. Because the price is per call and the output is bounded, every run costs the same, so your monthly total is simply the number of calls times the per-call price, no matter the input size.
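To make the budgeting argument concrete, here is a minimal sketch of the arithmetic. Every price in it is a hypothetical placeholder I picked for illustration, not a real Modelis or frontier-model rate, and the function names are mine.

```python
# Hypothetical prices for illustration only; substitute your real rates.
FLAT_PRICE_PER_CALL = 0.002             # USD per call (assumed)
INPUT_PRICE_PER_TOKEN = 3.00 / 1e6      # USD per input token (assumed)
OUTPUT_PRICE_PER_TOKEN = 15.00 / 1e6    # USD per output token (assumed)


def flat_cost(calls: int) -> float:
    """Flat billing: the cost depends only on how many calls you make."""
    return calls * FLAT_PRICE_PER_CALL


def per_token_cost(input_tokens: list[int], output_tokens: int = 100) -> float:
    """Per-token billing: the cost grows with the size of every input."""
    return sum(
        n * INPUT_PRICE_PER_TOKEN + output_tokens * OUTPUT_PRICE_PER_TOKEN
        for n in input_tokens
    )


# Two batches with the same number of runs but very different input sizes.
short_batch = [500] * 10_000       # 10k runs, 500 input tokens each
long_batch = [20_000] * 10_000     # 10k runs, 20k input tokens each

print("flat:     ", flat_cost(len(short_batch)), flat_cost(len(long_batch)))
print("per-token:", per_token_cost(short_batch), per_token_cost(long_batch))
```

The flat figure is identical for both batches, while the per-token figure scales with input length, which is the spike described above.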
It is a standard OpenAI-compatible POST /v1/chat/completions. Use an HTTP Request node:
Method: POST
URL: https://modelis-auto-chat.p.rapidapi.com/v1/chat/completions
Headers: x-rapidapi-host: modelis-auto-chat.p.rapidapi.com, x-rapidapi-key: YOUR_KEY, content-type: application/json
Body: {"model":"modelis-auto","messages":[{"role":"user","content":"Label sentiment (positive/negative/neutral): {{ $json.text }}"}]}
The curl equivalent:
curl --request POST \
--url https://modelis-auto-chat.p.rapidapi.com/v1/chat/completions \
--header 'content-type: application/json' \
--header 'x-rapidapi-host: modelis-auto-chat.p.rapidapi.com' \
--header 'x-rapidapi-key: YOUR_KEY' \
--data '{"model":"modelis-auto","messages":[{"role":"user","content":"Summarize in 2 sentences: ..."}]}'
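If you want to test the call from a script before wiring it into a workflow, the same request can be sketched in Python with only the standard library. The function names `build_request` and `ask` are mine, and the response parsing assumes the gateway returns the usual OpenAI-style `choices[0].message.content` shape, which I would verify against a real reply.

```python
import json
import urllib.request

API_HOST = "modelis-auto-chat.p.rapidapi.com"
API_URL = f"https://{API_HOST}/v1/chat/completions"


def build_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build the same POST as the curl command, without sending it."""
    payload = {
        "model": "modelis-auto",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "content-type": "application/json",
            "x-rapidapi-host": API_HOST,
            "x-rapidapi-key": api_key,
        },
        method="POST",
    )


def ask(prompt: str, api_key: str, timeout: float = 30.0) -> str:
    """Send the request and return the reply text."""
    with urllib.request.urlopen(build_request(prompt, api_key), timeout=timeout) as resp:
        body = json.load(resp)
    # Assumed: standard OpenAI-compatible response layout.
    return body["choices"][0]["message"]["content"]


# Usage (needs a real key and network access):
# print(ask("Summarize in 2 sentences: ...", "YOUR_KEY"))
```

Keeping `build_request` separate from `ask` lets you inspect the headers and body without spending a call.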
If you would rather use a built-in OpenAI node that expects an Authorization: Bearer key and a custom base URL, run the tiny open-source adapter next to your workflow runner:
npx modelis-openai # local proxy on 127.0.0.1:8787, MIT, ~120 lines
Then point the node at http://127.0.0.1:8787/v1 with model modelis-auto.
Prompts that fit this pattern:
Summarize in 2 sentences: ...
Label sentiment (positive/negative/neutral): ...
Return JSON with {name, email, company} from: ...
All three produce short outputs, so the flat per-call price keeps high-volume runs cheap to reason about.
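For the extraction prompt, it helps to validate the reply before passing it downstream, since a model may wrap the JSON in extra text. Below is a minimal sketch of that check; `parse_extraction` and `REQUIRED_FIELDS` are hypothetical names of mine, and the sample reply is invented to show the shape rather than taken from a real run.

```python
import json

REQUIRED_FIELDS = ("name", "email", "company")


def parse_extraction(content: str) -> dict:
    """Pull the JSON object out of a model reply and check its fields.

    Raises ValueError if no valid object with all required fields is found.
    """
    # Tolerate surrounding prose or code fences by slicing from the
    # first "{" to the last "}".
    start, end = content.find("{"), content.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object in model reply")
    try:
        data = json.loads(content[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON in model reply: {exc}") from exc
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Drop any extra keys the model may have added.
    return {field: data[field] for field in REQUIRED_FIELDS}


# Invented sample reply, for illustration only.
reply = 'Here it is: {"name": "Ada", "email": "ada@example.com", "company": "Acme"}'
print(parse_extraction(reply))
```

In a workflow, a raised `ValueError` is a natural place to branch to a retry or a manual-review queue.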
Long-form generation (articles, whole files, large code) will hit the ~1024-token cap and get truncated. Keep a high-output model for those. Use this for the short, structured outputs that automations actually need.
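One way to catch truncation in a workflow is to check the `finish_reason` field, which in the OpenAI chat completions format is `"length"` when generation stopped at the token limit. This sketch assumes the gateway passes that field through unchanged, which I have not confirmed here; `is_truncated` is a hypothetical helper and both sample responses are invented.

```python
def is_truncated(response: dict) -> bool:
    """Return True if the first choice stopped because it hit the token limit."""
    choices = response.get("choices") or []
    # Assumed: OpenAI-style finish_reason is preserved by the gateway.
    return bool(choices) and choices[0].get("finish_reason") == "length"


# Invented sample responses, shaped like OpenAI-compatible replies.
complete = {"choices": [{"message": {"content": "positive"}, "finish_reason": "stop"}]}
cut_off = {"choices": [{"message": {"content": "Chapter 1 ..."}, "finish_reason": "length"}]}

for name, resp in (("complete", complete), ("cut_off", cut_off)):
    print(name, "-> truncated" if is_truncated(resp) else "-> ok")
```

If a task trips this check regularly, that is a sign it belongs on the high-output model instead.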
I built the adapter. I am most curious which extraction and classification tasks the routing handles well versus badly. If you point an automation at it, I would love to hear how it routed.