{"slug": "local-first-a-model-on-your-own-machine-zero-cloud", "title": "Local-first: a Model on Your Own Machine, Zero Cloud", "summary": "A developer has published a runnable walkthrough demonstrating how to run a large language model locally on personal hardware using Ollama's OpenAI-compatible endpoint, with zero cloud costs. The demo script, part of the Portway series, shows how to call models like `gpt-oss:20b` or `llama3.1:8b` from the official OpenAI SDK while proving the stateless contract—where each request is evaluated from scratch and conversation history must be managed by the client. The walkthrough includes a compatibility table for machines with 8 GB to 48 GB of unified memory, with `prompt_tokens` remaining deterministic for identical inputs regardless of model size.", "body_md": "This is the concrete, runnable walkthrough for Post 1 of the [Portway series](https://github.com/dalenguyen/portway). The goal: stand up a single model behind an OpenAI-compatible endpoint on hardware you already own, call it from the official OpenAI SDK, and internalize the stateless contract. Everything here runs locally for $0.\n\nA `demo.py` script with two blocks:\n\n- Block 1: a round-trip via the OpenAI SDK against `localhost` that prints the response `content` and the `usage` object.\n- Block 2: the same final question sent with a 1-turn and a 5-turn history; both `prompt_tokens` values are printed alongside an explanation of the delta.\n\nApple Silicon Mac, 48 GB unified memory, **Ollama** already installed. The demo uses Ollama's OpenAI-compatible endpoint at `http://localhost:11434/v1` and the `gpt-oss:20b` model (~14 GB).\n\nThe wider Portway series uses `llama.cpp` on Mac (Ollama is called out as problematic for Qwen3.5 in Post 2). For Post 1 — one model, prove the contract — Ollama is fine and already on the box.\n\nThe demo script works with any Ollama-served model — just substitute the model name in `demo.py`. 
The table below covers machines from 8 GB unified memory upward.\n\n| Model | Pull command | Approx size | Min RAM | Notes |\n|---|---|---|---|---|\n| `llama3.2:3b` | `ollama pull llama3.2:3b` | ~2 GB | 8 GB | Fastest; good for testing the contract |\n| `gemma3:4b` | `ollama pull gemma3:4b` | ~3 GB | 8 GB | Google; solid instruction-following |\n| `mistral:7b` | `ollama pull mistral:7b` | ~4.1 GB | 8 GB | Classic 7B baseline |\n| `llama3.1:8b` | `ollama pull llama3.1:8b` | ~4.7 GB | 9 GB | Best quality under 10 GB |\n| `qwen2.5:7b` | `ollama pull qwen2.5:7b` | ~4.4 GB | 9 GB | Strong at instruction + reasoning |\n| `gpt-oss:20b` | `ollama pull gpt-oss:20b` | ~14 GB | 24 GB | Used in this post's sample output |\n\nOn a 9 GB machine, replace `gpt-oss:20b` in `demo.py` with `llama3.1:8b` or `qwen2.5:7b` — the contract demonstration is identical.\n\n- Ollama running (`curl -s http://localhost:11434/api/tags` should return JSON)\n- `uv` installed (`uv --version`)\n- `gpt-oss:20b` pulled (requires ~24 GB RAM); see Model options by available RAM for lighter alternatives on 9 GB+ machines.\n\n```\nollama pull gpt-oss:20b\n```\n\nFrom the repo root:\n\n```\nuv sync                                  # creates .venv at root, installs deps\nuv run --project 1-local-first python 1-local-first/demo.py\n```\n\nA real run on this machine (M4-class Mac, 48 GB, `gpt-oss:20b` via Ollama). 
Numbers will differ with other models, but for any given model `prompt_tokens` for the same input is deterministic:\n\n```\n============================================================\nBlock 1 — round-trip via OpenAI SDK against localhost\n============================================================\ncontent: Toronto, Vancouver, Montreal.\nusage:   CompletionUsage(completion_tokens=43, prompt_tokens=72, total_tokens=115, ...)\n\n============================================================\nBlock 2 — same final question, 1-turn vs 5-turn history\n============================================================\n1-turn response: The capital of Canada is **Ottawa**.\n5-turn response: The capital of Canada is **Ottawa**, located in the province of Ontario.\n\n1-turn prompt_tokens: 75\n5-turn prompt_tokens: 139\ndelta:                64\n\nWhy the delta exists: the server holds NO conversation state between\nrequests. The 5-turn call's prompt_tokens is higher only because the\nclient re-sent the full history in the request body. Each call is\nevaluated from scratch — history is the client's responsibility.\n```\n\n`completion_tokens` and the response text will vary run-to-run (sampling is non-deterministic at default temperature). `prompt_tokens` for the same input is deterministic — 75 and 139 should reproduce.\n\nNotice how the 5-turn response picks up the road-trip context (\"located in the province of Ontario\") while the 1-turn answer riffs on the bare \"Driving.\" in its prompt — same model, different framing in the client-supplied messages.\n\nThis is the most important concept in the series. Every request to an LLM API — local or cloud — is evaluated from scratch. The server has no memory of previous turns. When you send a multi-turn conversation, **you** are the one re-sending the full history in the request body. 
The model sees it all at once.\n\nThe server's only \"memory\" between requests is the **prefix cache** (a compute optimisation that avoids re-evaluating tokens it has seen before), never conversation state. The cache is invisible to you — from the API contract's perspective, each call is stateless.\n\nUnderstanding this is the foundation for everything that follows in the series:\n\n- `usage` requires an explicit opt-in (`stream_options.include_usage`)\n\n`localhost` — Block 1 prints a real `content` and a `usage` object.\n\n`prompt_tokens` while the server remembers nothing — Block 2 prints both numbers and the one-paragraph explanation.\n\n**Context size eats RAM/VRAM.** Ollama's default context window is conservative for most models; raising it (e.g. `ollama run llama3.2:3b` → `/set parameter num_ctx 32768`) costs unified memory. It was not changed for this post.\n\n**gpt-oss emits a reasoning channel** (Harmony format). The engine applies the template; you still get a normal `message.content`. 
The reasoning channel will be segregated at the gateway in Post 3.\n\n**No streaming yet.** Post 5 covers the streaming `usage` trap — you must opt in via `stream_options.include_usage`, otherwise `usage` is `null` in streamed responses.\n\nPost 2 moves from a single model to running multiple models simultaneously and routing requests between them — the first step toward a real local gateway.\n\nThe full series and all demo code live in the [Portway repository](https://github.com/dalenguyen/portway).", "url": "https://wpnews.pro/news/local-first-a-model-on-your-own-machine-zero-cloud", "canonical_source": "https://dev.to/dalenguyen/local-first-a-model-on-your-own-machine-zero-cloud-26dh", "published_at": "2026-05-30 18:27:54+00:00", "updated_at": "2026-05-30 18:42:10.679277+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "machine-learning", "ai-tools", "ai-infrastructure"], "entities": ["Portway", "Ollama", "llama.cpp", "OpenAI", "Apple Silicon", "Qwen3.5", "llama3.2:3b", "gemma3:4b"], "alternates": {"html": "https://wpnews.pro/news/local-first-a-model-on-your-own-machine-zero-cloud", "markdown": "https://wpnews.pro/news/local-first-a-model-on-your-own-machine-zero-cloud.md", "text": "https://wpnews.pro/news/local-first-a-model-on-your-own-machine-zero-cloud.txt", "jsonld": "https://wpnews.pro/news/local-first-a-model-on-your-own-machine-zero-cloud.jsonld"}}