cd /news/large-language-models/local-first-a-model-on-your-own-mach… Β· home β€Ί topics β€Ί large-language-models β€Ί article
[ARTICLE Β· art-18694] src=dev.to pub= topic=large-language-models verified=true sentiment=↑ positive

Local-first: a Model on Your Own Machine, Zero Cloud

A developer has published a runnable walkthrough demonstrating how to run a large language model locally on personal hardware using Ollama's OpenAI-compatible endpoint, with zero cloud costs. The demo script, part of the Portway series, shows how to call models like `gpt-oss:20b` or `llama3.1:8b` from the official OpenAI SDK while proving the stateless contractβ€”where each request is evaluated from scratch and conversation history must be managed by the client. The walkthrough includes a compatibility table for machines with 8 GB to 48 GB of unified memory, with `prompt_tokens` remaining deterministic for identical inputs regardless of model size.

read4 min publishedMay 30, 2026

This is the concrete, runnable walkthrough for Post 1 of the Portway series. The goal: stand up a single model behind an OpenAI-compatible endpoint on hardware you already own, call it from the official OpenAI SDK, and internalize the stateless contract. Everything here runs locally for $0.

demo.py

script with two blocks: usage

object.prompt_tokens

values are printed alongside an explanation of the delta.Apple Silicon Mac, 48 GB unified memory, Ollama already installed. The demo uses Ollama's OpenAI-compatible endpoint at http://localhost:11434/v1

and the gpt-oss:20b

model (~14 GB).

The wider Portway series uses

llama.cpp

on Mac (Ollama is called out as problematic for Qwen3.5 in Post 2). For Post 1 β€” one model, prove the contract β€” Ollama is fine and already on the box.

The demo script works with any Ollama-served model β€” just substitute the model name in demo.py

. The table below covers machines from 9 GB unified memory upward.

Model Pull command Approx size Min RAM Notes
llama3.2:3b
ollama pull llama3.2:3b
~2 GB 8 GB Fastest; good for testing the contract
gemma3:4b
ollama pull gemma3:4b
~3 GB 8 GB Google; solid instruction-following
mistral:7b
ollama pull mistral:7b
~4.1 GB 8 GB Classic 7B baseline
llama3.1:8b
ollama pull llama3.1:8b
~4.7 GB 9 GB Best quality under 10 GB
qwen2.5:7b
ollama pull qwen2.5:7b
~4.4 GB 9 GB Strong at instruction + reasoning
gpt-oss:20b
ollama pull gpt-oss:20b
~14 GB 24 GB Used in this post's sample output

On a 9 GB machine, replace gpt-oss:20b

in demo.py

with llama3.1:8b

or qwen2.5:7b

β€” the contract demonstration is identical.

curl -s http://localhost:11434/api/tags

should return JSON)uv --version

)gpt-oss:20b

(requires ~24 GB RAM); see Model options by available RAM for lighter alternatives on 9 GB+ machines.

ollama pull llama3.2:3b

From the repo root:

uv sync                                  # creates .venv at root, installs deps
uv run --project 1-local-first python 1-local-first/demo.py

A real run on this machine (M4-class Mac, 48 GB, gpt-oss:20b

via Ollama). Numbers will differ with smaller models β€” prompt_tokens

for the same input stays deterministic regardless of model:

content: Toronto, Vancouver, Montreal. usage: CompletionUsage(completion_tokens=43, prompt_tokens=72, total_tokens=115, ...)

1-turn response: The capital of Canada is Ottawa. 5-turn response: The capital of Canada is Ottawa, located in the province of Ontario.

1-turn prompt_tokens: 75 5-turn prompt_tokens: 139 delta: 64

Why the delta exists: the server holds NO conversation state between requests. The 5-turn call's prompt_tokens is higher only because the client re-sent the full history in the request body. Each call is evaluated from scratch β€” history is the client's responsibility.


`completion_tokens`

and the response text will vary run-to-run (sampling is non-deterministic at default temperature). `prompt_tokens`

for the same input is deterministic β€” 75 and 139 should reproduce.

Notice how the 5-turn response picks up the road-trip context ("located in the province of Ontario") while the 1-turn answer riffs on the bare "Driving." in its prompt β€” same model, different framing in the client-supplied messages.

This is the most important concept in the series. Every request to an LLM API β€” local or cloud β€” is evaluated from scratch. The server has no memory of previous turns. When you send a multi-turn conversation, **you** are the one re-sending the full history in the request body. The model sees it all at once.

The server's only "memory" between requests is the **prefix cache** (a compute optimisation that avoids re-evaluating tokens it has seen before), never conversation state. The cache is invisible to you β€” from the API contract's perspective, each call is stateless.

Understanding this is the foundation for everything that follows in the series:

`usage`

requires an explicit opt-in (`stream_options.include_usage`

)`localhost`

β€” Block 1 prints a real `content`

and a `usage`

object.`prompt_tokens`

while the server remembers nothing β€” Block 2 prints both numbers and the one-paragraph explanation.**Context size eats RAM/VRAM.** Ollama's default context window is conservative for most models; raising it (e.g. `ollama run llama3.2:3b`

β†’ `/set parameter num_ctx 32768`

) costs unified memory. It was not changed for this post.

**gpt-oss emits a reasoning channel** (Harmony format). The engine applies the template; you still get a normal `message.content`

. The reasoning channel will be segregated at the gateway in Post 3.

**No streaming yet.** Post 5 covers the streaming `usage`

trap β€” you must opt in via `stream_options.include_usage`

, otherwise `usage`

is `null`

in streamed responses.

Post 2 moves from a single model to running multiple models simultaneously and routing requests between them β€” the first step toward a real local gateway.

The full series and all demo code live in the [Portway repository](https://github.com/dalenguyen/portway).
── more in #large-language-models 4 stories Β· sorted by recency
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain β€” perfect for shipping the agent you just read about.

$git push zahid main
β†’ Live at https://your-agent.zahid.host βœ“
Get free account β†’ Pricing
from €0/mo Β· no card required
LIVE [news/local-first-a-model-…] indexed:0 read:4min 2026-05-30 Β· β€”