{"slug": "show-hn-makes-local-llms-faster-and-more-reliable-by-optimizing-for-your-device", "title": "Show HN: Makes local LLMs faster and more reliable by optimizing for your device", "summary": "Autotune, a new open-source tool, optimizes local large language models by automatically right-sizing KV cache buffers, tuning precision, caching system prompts, and managing model keep-alive, freeing over 300 MB of RAM per request and reducing first-word latency by up to 53% without code changes.", "body_md": "autotune sits between your code and Ollama and applies automatic optimizations: right-sized KV buffers, KV precision tuning, system prompt caching, intelligent context management, and model keep-alive. The result: **300+ MB freed** per request, first word up to **53% faster**, and your computer stays responsive. No config changes. Your code stays exactly the same.\n\n```\npip install llm-autotune\nautotune start\n```\n\nHow it works\n\nautotune sits between your code and Ollama as a transparent proxy. Before each request reaches Ollama, autotune calculates the exact memory it needs, watches live RAM usage from your other apps, and adjusts automatically. No config. No changes to your code or output quality.\n\nEvery time Ollama runs your prompt, it must first allocate a block of RAM called the KV cache — it's where it stores the attention state for every token in the context window. By default, Ollama always allocates for 4,096 tokens. For a typical 50-word message, that's allocating 12× more RAM than the message actually needs. autotune measures the real token count, adds a safe headroom buffer, and tells Ollama the exact minimum. 
That freed RAM goes back to your browser, your apps, your system.\n\nBuckets (512, 768, 1024, 1536, 2048…) prevent Ollama from reallocating the Metal buffer on every call — requests with similar lengths reuse the same pre-allocated buffer, eliminating 100–300 ms of KV thrashing overhead per request.\n\nRight-sizing the KV cache at request time is the foundation. But RAM usage on your machine is dynamic: Chrome opens a tab, Xcode compiles, a background process wakes up. autotune reads the OS's RAM utilization percentage before every single request and applies two independent levers — context window size and KV precision — across four fixed tiers, maintaining headroom well before any swap risk develops.\n\nKV precision switching (F16 → Q8) cuts the KV cache's RAM footprint in half instantly — with no meaningful quality impact. Q8 stores each attention value in 1 byte instead of 2; the difference in model output is undetectable in practice. These adjustments happen automatically — you see a brief note in the chat UI when one fires. This is a heuristic tier system based on RAM percentage. autotune also runs a separate exact-math pre-flight check (NoSwapGuard) that computes precise KV bytes using your model's architecture — that system only fires when swap is mathematically certain.\n\nIn any multi-turn chat, Ollama re-processes your entire system prompt from scratch on every message. autotune pins those tokens in the KV cache so they're only ever evaluated once — at the start. Every follow-up turn gets faster because fewer tokens need processing. The savings compound with every turn.\n\nOllama unloads the model after 5 minutes idle — a 1–4 second reload every time you come back to it. autotune keeps the model resident in memory between sessions. The weights were already using that RAM; keeping them there costs nothing extra and eliminates the cold-start delay entirely.\n\nBenchmark numbers use qwen3:8b / llama3.2:3b on Apple M2 16 GB. 
KV savings scale with model size — larger models free more RAM in absolute terms. Generation speed and output quality are unchanged: autotune touches only buffer sizes, precision, and scheduling — never model weights or sampling.\n\nBuilt-in dashboard\n\nautotune ships a full monitoring and control dashboard — no extra install, no external service, nothing sent to the cloud. Run `autotune serve` and open `localhost:8765/dashboard` in any browser. It auto-refreshes every 10 seconds and shows exactly what autotune is doing to your requests in real time.\n\n| Model | Requests | Avg TTFT | Avg tok/s | Avg context |\n|---|---|---|---|---|\n| qwen3:8b | 912 | 0.39 s | 46.1 | 1,536 |\n| qwen2.5-coder:7b | 372 | 0.48 s | 51.7 | 2,048 |\n\n```\nexport AUTOTUNE_ADMIN_KEY=\"your-secret-key\"\nautotune serve\n# → open http://localhost:8765/dashboard\n```\n\n`AUTOTUNE_ADMIN_KEY` — no key, no dashboard.\n\nMeasured results\n\nBenchmarked on Apple M2 16 GB using Ollama's internal Go nanosecond timers — not wall-clock estimates. 3 runs × 5 prompt types, Wilcoxon signed-rank test. Every number here is reproducible with `autotune proof`.\n\n| Model | KV: Before | KV: After | RAM freed | First word |\n|---|---|---|---|---|\n| qwen3:8b | 576 MB | 195 MB | 381 MB | −53% |\n| llama3.2:3b | 448 MB | 155 MB | 293 MB | −35% |\n| gemma3n:e4b | 96 MB | 30 MB | 66 MB | −29% |\n\nTTFT improvement is largest when the model is cold or when RAM is under pressure. Generation speed (tok/s) is Metal GPU-bound and is not affected by autotune. KV savings apply to every single request regardless of hardware.\n\nOn every request you send, Ollama allocates a KV buffer for 4,096 tokens. autotune sizes it to the actual prompt — returning hundreds of MB to your system on every single call, automatically.\n\nThe KV buffer must be initialized before token 1. A smaller buffer initializes faster. 
On qwen3:8b, autotune cuts first-word time from the raw baseline by 53% — every new session, every cold request.\n\nautotune changes only the KV buffer size. Model weights, sampling, and generation speed are identical. `prompt_eval_count` is unchanged — no tokens are dropped or skipped.\n\nMulti-turn & agentic workloads\n\nSingle-prompt benchmarks miss the real problem: **context accumulates**. Each tool call, each reasoning step, each file read appends more tokens. By turn 8, the model is processing 5–8× more tokens than turn 1 — and raw Ollama's fixed 4,096-token window runs out, forcing a full model reload mid-session.\n\nautotune computes a session-ceiling KV window once before the loop starts and locks it for the entire session. No reloads. And because the system prompt is pinned via prefix caching, TTFT actually *falls* as the session grows — not climbs.\n\n| Metric | Raw Ollama | autotune |\n|---|---|---|\n| Session wall time | 74 s | 40 s |\n| Model reloads | 0.5 | 0.5 |\n| TTFT trend per turn | −101 ms/turn | −435 ms/turn |\n| Swap events | 0 | 0 |\n| Context at session end | 3,043 tokens | 1,946 tokens |\n\nThe system prompt is pinned in KV after turn 1 and never re-evaluated. Each new turn only prefills the new tokens — not the full conversation from scratch. By turn 5, autotune is noticeably faster than turn 1. By turn 10, the difference compounds significantly.\n\nautotune computes a KV window for the *full session ceiling* before the first turn, then holds it constant. Raw Ollama's fixed 4,096-token window fills up mid-task and forces a model reload (~1–3 s each). autotune trades a slightly higher turn-1 cost to eliminate all reloads.\n\nBenchmark: code_debugger task, N=2 trials, Apple M2 16 GB, llama3.2:3b balanced profile. Timings from Ollama's internal Go nanosecond timers. 
Full methodology is in `AGENT_BENCHMARK.md`.\n\nVerify it yourself\n\nautotune ships with a built-in benchmark that runs two head-to-head tests on your hardware in about 30 seconds. It uses Ollama's own internal Go nanosecond timers — nothing estimated, nothing made up.\n\nWorks with any model you have installed in Ollama. Picks the smallest installed model automatically if you don't specify one.\n\n```\nautotune proof -m qwen3:8b\n# Runs in ~30 seconds. Uses Ollama's own timers.\n# Saves a proof_qwen3_8b.json you can share.\n```\n\nRun `autotune proof --list-models` to see which Ollama models are available on your machine.\n\nQuickstart\n\nTwo commands. No Ollama setup, no config — autotune handles everything.\n\n`pip install llm-autotune`\n\n`autotune start`\n\n`autotune chat --model qwen3:8b`\n\n`autotune proof -m qwen3:8b`\n\n`pip install \"llm-autotune[mlx]\"`\n\n``` python\nimport autotune\nfrom openai import OpenAI\n\nautotune.start()  # start the optimizing proxy\n\nclient = OpenAI(**autotune.client_kwargs())\n\nresponse = client.chat.completions.create(\n    model=\"qwen3:8b\",\n    messages=[{\"role\": \"user\", \"content\": \"Hello!\"}],\n)\n# Every optimization is automatic.\n\n# Alternative (shell command, not Python): autotune serve\n# → http://localhost:8765/v1\n# Any OpenAI client works automatically.\n```\n\nDocker\n\nThe Docker image bundles Ollama and autotune in a single container. No local install needed — just pull the image, mount a volume for model storage, and your OpenAI-compatible endpoint is ready on port 8765.\n\n```\n# Build once\ndocker build -t autotune .\n\n# Run — autotune on :8765, models cached in a volume\ndocker run -p 8765:8765 \\\n  -v ollama_models:/root/.ollama \\\n  -e OLLAMA_MODEL=qwen3:8b \\\n  autotune\n```\n\n`OLLAMA_MODEL` auto-pulls the model on first start. Models are cached in the named volume and persist across restarts.\n\n`--profile single`: Ollama + autotune in one container. Simplest setup.\n\n`--profile multi`: Separate services. 
Lighter autotune image (~200 MB). Set `AUTOTUNE_OLLAMA_URL=http://ollama:11434`.\n\n`ollama/ollama:latest`: includes CUDA and ROCm layers. Add `--gpus all` for NVIDIA, or mount `/dev/kfd` for AMD.\n\nWhat to run\n\nautotune works with any Ollama model. These are the best options as of June 2026. Run `autotune recommend` to get a hardware-specific recommendation.\n\n| RAM / use case | Model | Size |\n|---|---|---|\n| 8 GB | qwen3.5:4b | ~2.6 GB |\n| 16 GB | qwen3.5:9b | ~5.6 GB |\n| 16 GB | gpt-oss:20b | ~14 GB |\n| 16 GB | gemma4:12b | ~8.1 GB |\n| 24 GB | qwen3.6:27b | ~17 GB |\n| 32 GB | qwen3-coder:30b | ~19 GB |\n| 48 GB+ | gpt-oss:120b | ~65 GB |\n| Coding | devstral:24b | ~14 GB |\n| Reasoning | deepseek-r1:32b | ~20 GB |\n\nOpen source, MIT licensed. Works with whatever Ollama models you already have. The `autotune proof` command will show you the exact improvement on your own hardware.\n\n```\npip install llm-autotune\n```\n\n", "url": "https://wpnews.pro/news/show-hn-makes-local-llms-faster-and-more-reliable-by-optimizing-for-your-device", "canonical_source": "https://www.autotunellm.com/", "published_at": "2026-06-30 18:13:49+00:00", "updated_at": "2026-06-30 18:20:38.494426+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "developer-tools"], "entities": ["Ollama", "autotune", "Apple M2"], "alternates": {"html": "https://wpnews.pro/news/show-hn-makes-local-llms-faster-and-more-reliable-by-optimizing-for-your-device", "markdown": "https://wpnews.pro/news/show-hn-makes-local-llms-faster-and-more-reliable-by-optimizing-for-your-device.md", "text": "https://wpnews.pro/news/show-hn-makes-local-llms-faster-and-more-reliable-by-optimizing-for-your-device.txt", "jsonld": "https://wpnews.pro/news/show-hn-makes-local-llms-faster-and-more-reliable-by-optimizing-for-your-device.jsonld"}}