Local-first: a Model on Your Own Machine, Zero Cloud

wpnews.pro

cd /news/large-language-models/local-first-a-model-on-your-own-mach… · home › topics › large-language-models › article

[ARTICLE · art-18694] src=dev.to ↗ pub=2026-05-30T18:27Z topic=large-language-models verified=true sentiment=↑ positive

Local-first: a Model on Your Own Machine, Zero Cloud

A developer has published a runnable walkthrough demonstrating how to run a large language model locally on personal hardware using Ollama's OpenAI-compatible endpoint, with zero cloud costs. The demo script, part of the Portway series, shows how to call models like `gpt-oss:20b` or `llama3.1:8b` from the official OpenAI SDK while proving the stateless contract—where each request is evaluated from scratch and conversation history must be managed by the client. The walkthrough includes a compatibility table for machines with 8 GB to 48 GB of unified memory, with `prompt_tokens` remaining deterministic for identical inputs regardless of model size.

read4 min views19 publishedMay 30, 2026

This is the concrete, runnable walkthrough for Post 1 of the Portway series. The goal: stand up a single model behind an OpenAI-compatible endpoint on hardware you already own, call it from the official OpenAI SDK, and internalize the stateless contract. Everything here runs locally for $0.

demo.py

script with two blocks: usage

object.prompt_tokens

values are printed alongside an explanation of the delta.Apple Silicon Mac, 48 GB unified memory, Ollama already installed. The demo uses Ollama's OpenAI-compatible endpoint at http://localhost:11434/v1

and the gpt-oss:20b

model (~14 GB).

The wider Portway series uses

llama.cpp

on Mac (Ollama is called out as problematic for Qwen3.5 in Post 2). For Post 1 — one model, prove the contract — Ollama is fine and already on the box.

The demo script works with any Ollama-served model — just substitute the model name in demo.py

. The table below covers machines from 9 GB unified memory upward.

Model	Pull command	Approx size
`llama3.2:3b`
`ollama pull llama3.2:3b`
~2 GB	8 GB	Fastest; good for testing the contract
`gemma3:4b`
`ollama pull gemma3:4b`
~3 GB	8 GB	Google; solid instruction-following
`mistral:7b`
`ollama pull mistral:7b`
~4.1 GB	8 GB	Classic 7B baseline
`llama3.1:8b`
`ollama pull llama3.1:8b`
~4.7 GB	9 GB	Best quality under 10 GB
`qwen2.5:7b`
`ollama pull qwen2.5:7b`
~4.4 GB	9 GB	Strong at instruction + reasoning
`gpt-oss:20b`
`ollama pull gpt-oss:20b`
~14 GB	24 GB	Used in this post's sample output

On a 9 GB machine, replace gpt-oss:20b

in demo.py

with llama3.1:8b

or qwen2.5:7b

— the contract demonstration is identical.

curl -s http://localhost:11434/api/tags

should return JSON)uv --version

)gpt-oss:20b

(requires ~24 GB RAM); see Model options by available RAM for lighter alternatives on 9 GB+ machines.

ollama pull llama3.2:3b

From the repo root:

uv sync                                  # creates .venv at root, installs deps
uv run --project 1-local-first python 1-local-first/demo.py

A real run on this machine (M4-class Mac, 48 GB, gpt-oss:20b

via Ollama). Numbers will differ with smaller models — prompt_tokens

for the same input stays deterministic regardless of model:

content: Toronto, Vancouver, Montreal. usage: CompletionUsage(completion_tokens=43, prompt_tokens=72, total_tokens=115, ...)

1-turn response: The capital of Canada is Ottawa. 5-turn response: The capital of Canada is Ottawa, located in the province of Ontario.

1-turn prompt_tokens: 75 5-turn prompt_tokens: 139 delta: 64

Why the delta exists: the server holds NO conversation state between requests. The 5-turn call's prompt_tokens is higher only because the client re-sent the full history in the request body. Each call is evaluated from scratch — history is the client's responsibility.


`completion_tokens`

and the response text will vary run-to-run (sampling is non-deterministic at default temperature). `prompt_tokens`

for the same input is deterministic — 75 and 139 should reproduce.

Notice how the 5-turn response picks up the road-trip context ("located in the province of Ontario") while the 1-turn answer riffs on the bare "Driving." in its prompt — same model, different framing in the client-supplied messages.

This is the most important concept in the series. Every request to an LLM API — local or cloud — is evaluated from scratch. The server has no memory of previous turns. When you send a multi-turn conversation, **you** are the one re-sending the full history in the request body. The model sees it all at once.

The server's only "memory" between requests is the **prefix cache** (a compute optimisation that avoids re-evaluating tokens it has seen before), never conversation state. The cache is invisible to you — from the API contract's perspective, each call is stateless.

Understanding this is the foundation for everything that follows in the series:

`usage`

requires an explicit opt-in (`stream_options.include_usage`

)`localhost`

— Block 1 prints a real `content`

and a `usage`

object.`prompt_tokens`

while the server remembers nothing — Block 2 prints both numbers and the one-paragraph explanation.**Context size eats RAM/VRAM.** Ollama's default context window is conservative for most models; raising it (e.g. `ollama run llama3.2:3b`

→ `/set parameter num_ctx 32768`

) costs unified memory. It was not changed for this post.

**gpt-oss emits a reasoning channel** (Harmony format). The engine applies the template; you still get a normal `message.content`

. The reasoning channel will be segregated at the gateway in Post 3.

**No streaming yet.** Post 5 covers the streaming `usage`

trap — you must opt in via `stream_options.include_usage`

, otherwise `usage`

is `null`

in streamed responses.

Post 2 moves from a single model to running multiple models simultaneously and routing requests between them — the first step toward a real local gateway.

The full series and all demo code live in the [Portway repository](https://github.com/dalenguyen/portway).

source & further reading

dev.to — original article I Spent $47 Last Month Testing Every AI API So You Don't Have To Building cross-distro offline voice dictation for Linux (and why it took more than a model) Automatic Standby PDB Instantiation and Standby Redo Log Creation in Oracle AI Database 26ai(23.26.2)

~/api · this article 200

$curl api.wpnews.pro/v1/news/local-first-a-model-on-y…

Read original on dev.to → dev.to/dalenguyen/local-first-a-model-on-your-ow…

mentioned entities

Portway

Ollama

llama.cpp

OpenAI

Apple Silicon

Qwen3.5

llama3.2:3b

gemma3:4b

metadata

sluglocal-first-a-model-on-your-own-machine-zero-cloud

topic#large-language-models

secondary4 topics

sentimentpositive

canonicaldev.to

navigation

← prevOrganizations Adopt AI While Gov…

next →1 AI Stock That Could Turn $500 …

── more in #large-language-models 4 stories · sorted by recency

dev.to · 15 Jul · #large-language-models

I Spent $47 Last Month Testing Every AI API So You Don't Have To

cryptobriefing.com · 15 Jul · #large-language-models

AI chip selloff erases over $1 trillion as custom silicon threatens Nvidia’s dominance

mindstudio.ai · 14 Jul · #large-language-models

Local AI vs Cloud AI for Agents: The Hybrid Routing Strategy That Saves Money

dev.to · 15 Jul · #large-language-models

Building cross-distro offline voice dictation for Linux (and why it took more than a model)

── more on @portway 3 stories trending now

wpnews · 23 May · #artificial-intelligence

AccessLens — a blind person's lanyard, powered by Gemma 4 on-device

wpnews · 27 May · #artificial-intelligence

How I Run Two Claude Accounts as One

wpnews · 21 May · #developer-tools

Antigravity CLI: A Hands-On Guide to Google's Terminal Coding Agent

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required