Local-first: a Model on Your Own Machine, Zero Cloud

A developer has published a runnable walkthrough demonstrating how to run a large language model locally on personal hardware using Ollama's OpenAI-compatible endpoint, with zero cloud costs. The demo script, part of the Portway series, shows how to call models like `gpt-oss:20b` or `llama3.1:8b` from the official OpenAI SDK while proving the stateless contract—where each request is evaluated from scratch and conversation history must be managed by the client. The walkthrough includes a compatibility table for machines with 8 GB to 48 GB of unified memory, with `prompt_tokens` remaining deterministic for identical inputs regardless of model size.

This is the concrete, runnable walkthrough for Post 1 of the Portway series https://github.com/dalenguyen/portway . The goal: stand up a single model behind an OpenAI-compatible endpoint on hardware you already own, call it from the official OpenAI SDK, and internalize the stateless contract. Everything here runs locally for $0. demo.py script with two blocks: usage object. prompt tokens values are printed alongside an explanation of the delta.Apple Silicon Mac, 48 GB unified memory, Ollama already installed. The demo uses Ollama's OpenAI-compatible endpoint at http://localhost:11434/v1 and the gpt-oss:20b model ~14 GB . The wider Portway series uses llama.cpp on Mac Ollama is called out as problematic for Qwen3.5 in Post 2 . For Post 1 — one model, prove the contract — Ollama is fine and already on the box. The demo script works with any Ollama-served model — just substitute the model name in demo.py . The table below covers machines from 9 GB unified memory upward. | Model | Pull command | Approx size | Min RAM | Notes | |---|---|---|---|---| llama3.2:3b | ollama pull llama3.2:3b | ~2 GB | 8 GB | Fastest; good for testing the contract | gemma3:4b | ollama pull gemma3:4b | ~3 GB | 8 GB | Google; solid instruction-following | mistral:7b | ollama pull mistral:7b | ~4.1 GB | 8 GB | Classic 7B baseline | llama3.1:8b | ollama pull llama3.1:8b | ~4.7 GB | 9 GB | Best quality under 10 GB | qwen2.5:7b | ollama pull qwen2.5:7b | ~4.4 GB | 9 GB | Strong at instruction + reasoning | gpt-oss:20b | ollama pull gpt-oss:20b | ~14 GB | 24 GB | Used in this post's sample output | On a 9 GB machine, replace gpt-oss:20b in demo.py with llama3.1:8b or qwen2.5:7b — the contract demonstration is identical. curl -s http://localhost:11434/api/tags should return JSON uv --version gpt-oss:20b requires ~24 GB RAM ; see Model options by available RAM for lighter alternatives on 9 GB+ machines. ollama pull llama3.2:3b From the repo root: uv sync creates .venv at root, installs deps uv run --project 1-local-first python 1-local-first/demo.py A real run on this machine M4-class Mac, 48 GB, gpt-oss:20b via Ollama . Numbers will differ with smaller models — prompt tokens for the same input stays deterministic regardless of model: ============================================================ Block 1 — round-trip via OpenAI SDK against localhost ============================================================ content: Toronto, Vancouver, Montreal. usage: CompletionUsage completion tokens=43, prompt tokens=72, total tokens=115, ... ============================================================ Block 2 — same final question, 1-turn vs 5-turn history ============================================================ 1-turn response: The capital of Canada is Ottawa . 5-turn response: The capital of Canada is Ottawa , located in the province of Ontario. 1-turn prompt tokens: 75 5-turn prompt tokens: 139 delta: 64 Why the delta exists: the server holds NO conversation state between requests. The 5-turn call's prompt tokens is higher only because the client re-sent the full history in the request body. Each call is evaluated from scratch — history is the client's responsibility. completion tokens and the response text will vary run-to-run sampling is non-deterministic at default temperature . prompt tokens for the same input is deterministic — 75 and 139 should reproduce. Notice how the 5-turn response picks up the road-trip context "located in the province of Ontario" while the 1-turn answer riffs on the bare "Driving." in its prompt — same model, different framing in the client-supplied messages. This is the most important concept in the series. Every request to an LLM API — local or cloud — is evaluated from scratch. The server has no memory of previous turns. When you send a multi-turn conversation, you are the one re-sending the full history in the request body. The model sees it all at once. The server's only "memory" between requests is the prefix cache a compute optimisation that avoids re-evaluating tokens it has seen before , never conversation state. The cache is invisible to you — from the API contract's perspective, each call is stateless. Understanding this is the foundation for everything that follows in the series: usage requires an explicit opt-in stream options.include usage localhost — Block 1 prints a real content and a usage object. prompt tokens while the server remembers nothing — Block 2 prints both numbers and the one-paragraph explanation. Context size eats RAM/VRAM. Ollama's default context window is conservative for most models; raising it e.g. ollama run llama3.2:3b → /set parameter num ctx 32768 costs unified memory. It was not changed for this post. gpt-oss emits a reasoning channel Harmony format . The engine applies the template; you still get a normal message.content . The reasoning channel will be segregated at the gateway in Post 3. No streaming yet. Post 5 covers the streaming usage trap — you must opt in via stream options.include usage , otherwise usage is null in streamed responses. Post 2 moves from a single model to running multiple models simultaneously and routing requests between them — the first step toward a real local gateway. The full series and all demo code live in the Portway repository https://github.com/dalenguyen/portway .