This is the concrete, runnable walkthrough for Post 1 of the Portway series. The goal: stand up a single model behind an OpenAI-compatible endpoint on hardware you already own, call it from the official OpenAI SDK, and internalize the stateless contract. Everything here runs locally for $0.
demo.py
script with two blocks:
usage
object.prompt_tokens
values are printed alongside an explanation of the delta.Apple Silicon Mac, 48 GB unified memory, Ollama already installed. The demo uses Ollama's OpenAI-compatible endpoint at http://localhost:11434/v1
and the gpt-oss:20b
model (~14 GB).
The wider Portway series uses
llama.cpp
on Mac (Ollama is called out as problematic for Qwen3.5 in Post 2). For Post 1 β one model, prove the contract β Ollama is fine and already on the box.
The demo script works with any Ollama-served model β just substitute the model name in demo.py
. The table below covers machines from 9 GB unified memory upward.
| Model | Pull command | Approx size | Min RAM | Notes |
|---|---|---|---|---|
llama3.2:3b |
||||
ollama pull llama3.2:3b |
||||
| ~2 GB | 8 GB | Fastest; good for testing the contract | ||
gemma3:4b |
||||
ollama pull gemma3:4b |
||||
| ~3 GB | 8 GB | Google; solid instruction-following | ||
mistral:7b |
||||
ollama pull mistral:7b |
||||
| ~4.1 GB | 8 GB | Classic 7B baseline | ||
llama3.1:8b |
||||
ollama pull llama3.1:8b |
||||
| ~4.7 GB | 9 GB | Best quality under 10 GB | ||
qwen2.5:7b |
||||
ollama pull qwen2.5:7b |
||||
| ~4.4 GB | 9 GB | Strong at instruction + reasoning | ||
gpt-oss:20b |
||||
ollama pull gpt-oss:20b |
||||
| ~14 GB | 24 GB | Used in this post's sample output |
On a 9 GB machine, replace gpt-oss:20b
in demo.py
with llama3.1:8b
or qwen2.5:7b
β the contract demonstration is identical.
curl -s http://localhost:11434/api/tags
should return JSON)uv --version
)gpt-oss:20b
(requires ~24 GB RAM); see Model options by available RAM for lighter alternatives on 9 GB+ machines.
ollama pull llama3.2:3b
From the repo root:
uv sync # creates .venv at root, installs deps
uv run --project 1-local-first python 1-local-first/demo.py
A real run on this machine (M4-class Mac, 48 GB, gpt-oss:20b
via Ollama). Numbers will differ with smaller models β prompt_tokens
for the same input stays deterministic regardless of model:
content: Toronto, Vancouver, Montreal. usage: CompletionUsage(completion_tokens=43, prompt_tokens=72, total_tokens=115, ...)
1-turn response: The capital of Canada is Ottawa. 5-turn response: The capital of Canada is Ottawa, located in the province of Ontario.
1-turn prompt_tokens: 75 5-turn prompt_tokens: 139 delta: 64
Why the delta exists: the server holds NO conversation state between requests. The 5-turn call's prompt_tokens is higher only because the client re-sent the full history in the request body. Each call is evaluated from scratch β history is the client's responsibility.
`completion_tokens`
and the response text will vary run-to-run (sampling is non-deterministic at default temperature). `prompt_tokens`
for the same input is deterministic β 75 and 139 should reproduce.
Notice how the 5-turn response picks up the road-trip context ("located in the province of Ontario") while the 1-turn answer riffs on the bare "Driving." in its prompt β same model, different framing in the client-supplied messages.
This is the most important concept in the series. Every request to an LLM API β local or cloud β is evaluated from scratch. The server has no memory of previous turns. When you send a multi-turn conversation, **you** are the one re-sending the full history in the request body. The model sees it all at once.
The server's only "memory" between requests is the **prefix cache** (a compute optimisation that avoids re-evaluating tokens it has seen before), never conversation state. The cache is invisible to you β from the API contract's perspective, each call is stateless.
Understanding this is the foundation for everything that follows in the series:
`usage`
requires an explicit opt-in (`stream_options.include_usage`
)`localhost`
β Block 1 prints a real `content`
and a `usage`
object.`prompt_tokens`
while the server remembers nothing β Block 2 prints both numbers and the one-paragraph explanation.**Context size eats RAM/VRAM.** Ollama's default context window is conservative for most models; raising it (e.g. `ollama run llama3.2:3b`
β `/set parameter num_ctx 32768`
) costs unified memory. It was not changed for this post.
**gpt-oss emits a reasoning channel** (Harmony format). The engine applies the template; you still get a normal `message.content`
. The reasoning channel will be segregated at the gateway in Post 3.
**No streaming yet.** Post 5 covers the streaming `usage`
trap β you must opt in via `stream_options.include_usage`
, otherwise `usage`
is `null`
in streamed responses.
Post 2 moves from a single model to running multiple models simultaneously and routing requests between them β the first step toward a real local gateway.
The full series and all demo code live in the [Portway repository](https://github.com/dalenguyen/portway).