Show HN: Autotune makes local LLMs faster and more reliable by optimizing for your device

Autotune, a new open-source tool, optimizes local large language models by automatically right-sizing KV cache buffers, tuning precision, caching system prompts, and managing model keep-alive, freeing over 300 MB of RAM per request and reducing first-word latency by up to 53% without code changes.

autotune sits between your code and Ollama and applies automatic optimizations: right-sized KV buffers, KV precision tuning, system prompt caching, intelligent context management, and model keep-alive. The result: 300+ MB freed per request, first word up to 53% faster, and your computer stays responsive. No config changes. Your code stays exactly the same.

    pip install llm-autotune
    autotune start

How it works

autotune sits between your code and Ollama as a transparent proxy. Before each request reaches Ollama, autotune calculates the exact memory it needs, watches live RAM usage from your other apps, and adjusts automatically. No config. No changes to your code or output quality.

Every time Ollama runs your prompt, it must first allocate a block of RAM called the KV cache, where it stores the attention state for every token in the context window. By default, Ollama always allocates for 4,096 tokens. For a typical 50-word message, that is 12× more RAM than the message actually needs. autotune measures the real token count, adds a safe headroom buffer, and tells Ollama the exact minimum. That freed RAM goes back to your browser, your apps, your system.

Sizes snap to buckets (512, 768, 1024, 1536, 2048…), which prevents Ollama from reallocating the Metal buffer on every call: requests with similar lengths reuse the same pre-allocated buffer, eliminating 100–300 ms of KV thrashing overhead per request.

Right-sizing the KV cache at request time is the foundation. But RAM usage on your machine is dynamic: Chrome opens a tab, Xcode compiles, a background process wakes up. autotune reads the OS's RAM utilization percentage before every single request and applies two independent levers, context window size and KV precision, across four fixed tiers, maintaining headroom well before any swap risk develops.

KV precision: switching from F16 to Q8 cuts the KV cache's RAM footprint in half instantly, with no meaningful quality impact.
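The sizing arithmetic described above can be sketched in a few lines of Python. This is only an illustration, not autotune's actual code: the function names, the 256-token headroom, and the fallback to 4,096 tokens for prompts larger than the listed buckets are assumptions, and the architecture numbers (36 layers, 8 KV heads, head dimension 128) are assumed values for an 8B-class model with grouped-query attention. With those values, the F16 cache at 4,096 tokens works out to 576 MiB, and halving the bytes per value halves the footprint.

```python
# Buckets listed in the text. The original list continues past 2048, so this
# sketch falls back to the 4,096-token default for anything larger.
BUCKETS = (512, 768, 1024, 1536, 2048)
DEFAULT_CTX = 4096


def pick_context(prompt_tokens: int, headroom_tokens: int = 256) -> int:
    """Return the smallest bucket that fits the prompt plus headroom.

    headroom_tokens is a hypothetical value chosen for illustration.
    """
    needed = prompt_tokens + headroom_tokens
    for bucket in BUCKETS:
        if needed <= bucket:
            return bucket
    # Larger than every listed bucket: never return less than is needed.
    return max(needed, DEFAULT_CTX)


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_value: int) -> int:
    """One K and one V vector per layer, per KV head, per token slot."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value


if __name__ == "__main__":
    # Assumed architecture for an 8B-class model (not read from autotune).
    arch = dict(n_layers=36, n_kv_heads=8, head_dim=128)
    ctx = pick_context(prompt_tokens=70)
    for label, n_ctx, width in [("default, F16", DEFAULT_CTX, 2),
                                ("right-sized, F16", ctx, 2),
                                ("right-sized, Q8", ctx, 1)]:
        mib = kv_cache_bytes(n_ctx=n_ctx, bytes_per_value=width, **arch) / 2**20
        print(f"{label:>18}: n_ctx={n_ctx:<5} {mib:6.1f} MiB")
```

Rounding up to a bucket costs a little memory compared with the exact minimum, but it means two prompts of similar length request the same buffer size.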
Q8 stores each attention value in 1 byte instead of 2; the difference in model output is undetectable in practice. These adjustments happen automatically, and you see a brief note in the chat UI when one fires.

This is a heuristic tier system based on RAM percentage. autotune also runs a separate exact-math pre-flight check, NoSwapGuard, that computes precise KV bytes using your model's architecture; that system only fires when swap is mathematically certain.

In any multi-turn chat, Ollama re-processes your entire system prompt from scratch on every message. autotune pins those tokens in the KV cache so they are evaluated only once, at the start. Every follow-up turn gets faster because fewer tokens need processing. The savings compound with every turn.

Ollama unloads the model after 5 minutes idle, which means a 1–4 second reload every time you come back to it. autotune keeps the model resident in memory between sessions. The weights were already using that RAM; keeping them there costs nothing extra and eliminates the cold-start delay entirely.

Benchmark numbers use qwen3:8b / llama3.2:3b on Apple M2 16 GB. KV savings scale with model size: larger models free more RAM in absolute terms. Generation speed and output quality are unchanged, because autotune touches only buffer sizes, precision, and scheduling, never model weights or sampling.

Built-in dashboard

autotune ships a full monitoring and control dashboard: no extra install, no external service, nothing sent to the cloud. Run autotune serve and open localhost:8765/dashboard in any browser. It auto-refreshes every 10 seconds and shows exactly what autotune is doing to your requests in real time.

| Model | Requests | Avg TTFT | Avg tok/s | Avg context |
|---|---|---|---|---|
| qwen3:8b | 912 | 0.39s | 46.1 | 1,536 |
| qwen2.5-coder:7b | 372 | 0.48s | 51.7 | 2,048 |

    export AUTOTUNE_ADMIN_KEY="your-secret-key"
    autotune serve
    # → open http://localhost:8765/dashboard

AUTOTUNE_ADMIN_KEY is required: no key, no dashboard.

Measured results

Benchmarked on Apple M2 16 GB using Ollama's internal Go nanosecond timers, not wall-clock estimates. 3 runs × 5 prompt types, Wilcoxon signed-rank test. Every number here is reproducible with autotune proof.

| Model | KV: Before | KV: After | RAM freed | First word |
|---|---|---|---|---|
| qwen3:8b | 576 MB | 195 MB | 381 MB | −53% |
| llama3.2:3b | 448 MB | 155 MB | 293 MB | −35% |
| gemma3n:e4b | 96 MB | 30 MB | 66 MB | −29% |

TTFT improvement is largest when the model is cold or when RAM is under pressure. Generation speed (tok/s) is Metal GPU-bound and is not affected by autotune. KV savings apply to every request regardless of hardware.

On every request you send, Ollama allocates a KV buffer for 4,096 tokens. autotune sizes it to the actual prompt, returning hundreds of MB to your system on every call, automatically.

The KV buffer must be initialized before token 1. A smaller buffer initializes faster. On qwen3:8b, autotune cuts first-word time by 53% relative to the raw baseline, on every new session and every cold request.

autotune changes only the KV buffer size. Model weights, sampling, and generation speed are identical. prompt_eval_count is unchanged: no tokens are dropped or skipped.

Multi-turn & agentic workloads

Single-prompt benchmarks miss the real problem: context accumulates. Each tool call, each reasoning step, each file read appends more tokens. By turn 8, the model is processing 5–8× more tokens than turn 1, and raw Ollama's fixed 4,096-token window runs out, forcing a full model reload mid-session. autotune computes a session-ceiling KV window once before the loop starts and locks it for the entire session. No reloads.
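The text does not say how the session ceiling is computed, so the following is only a minimal sketch of the idea under stated assumptions: the function name, the 10% headroom, and the 256-token rounding step are all hypothetical. The one real identifier is num_ctx, the Ollama request option that sets the context window. The point is that the window is estimated once and the same value is sent on every turn.

```python
def session_ceiling(system_tokens: int, tokens_per_turn: int, max_turns: int,
                    headroom_pct: int = 10, step: int = 256) -> int:
    """Estimate one context window large enough for a whole session.

    All parameters are illustrative; autotune's real heuristic may differ.
    """
    projected = system_tokens + tokens_per_turn * max_turns
    # Integer ceiling of the headroom, to avoid float rounding surprises.
    padded = projected + (projected * headroom_pct + 99) // 100
    # Round up to the next multiple of `step`.
    return ((padded + step - 1) // step) * step


if __name__ == "__main__":
    ceiling = session_ceiling(system_tokens=300, tokens_per_turn=200, max_turns=10)
    # The same options dict is reused on every turn, so the window never changes.
    options = {"num_ctx": ceiling}
    for turn in range(1, 4):
        print(f"turn {turn}: options={options}")
```

A larger fixed window costs more memory on the first turn than a per-request minimum would, which is the trade-off the section describes.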
And because the system prompt is pinned via prefix caching, TTFT actually falls as the session grows rather than climbing.

| Metric | Raw Ollama | autotune |
|---|---|---|
| Session wall time | 74 s | 40 s |
| Model reloads | 0.5 | 0.5 |
| TTFT trend per turn | −101 ms/turn | −435 ms/turn |
| Swap events | 0 | 0 |
| Context at session end | 3,043 tokens | 1,946 tokens |

The system prompt is pinned in KV after turn 1 and never re-evaluated. Each new turn only prefills the new tokens, not the full conversation from scratch. By turn 5, autotune is noticeably faster than turn 1. By turn 10, the difference compounds significantly.

autotune computes a KV window for the full session ceiling before the first turn, then holds it constant. Raw Ollama's fixed 4,096-token window fills up mid-task and forces model reloads (~1–3 s each). autotune trades a slightly higher turn-1 cost to eliminate all reloads.

Benchmark: code debugger task, N=2 trials, Apple M2 16 GB, llama3.2:3b, balanced profile. Timings from Ollama's internal Go nanosecond timers. Full methodology in AGENT_BENCHMARK.md.

Verify it yourself

autotune ships with a built-in benchmark that runs two head-to-head tests on your hardware in about 30 seconds. It uses Ollama's own internal Go nanosecond timers: nothing estimated, nothing made up. It works with any model you have installed in Ollama and picks the smallest installed model automatically if you don't specify one.

    autotune proof -m qwen3:8b

Runs in ~30 seconds. Uses Ollama's own timers. Saves a proof_qwen3_8b.json you can share. Run autotune proof --list-models to see which Ollama models are available on your machine.

Quickstart

Two commands. No Ollama setup, no config: autotune handles everything.

    pip install llm-autotune
    autotune start
    autotune chat --model qwen3:8b
    autotune proof -m qwen3:8b

With MLX support:

    pip install "llm-autotune[mlx]"

From Python:

    import autotune
    from openai import OpenAI

    autotune.start()  # start the optimizing proxy
    client = OpenAI(**autotune.client_kwargs())
    response = client.chat.completions.create(
        model="qwen3:8b",
        messages=[{"role": "user", "content": "Hello"}],
    )

Every optimization is automatic.

As a server:

    autotune serve    # → http://localhost:8765/v1

Any OpenAI client works automatically.

Docker

The Docker image bundles Ollama and autotune in a single container. No local install needed: just pull the image, mount a volume for model storage, and your OpenAI-compatible endpoint is ready on port 8765.

    # Build once
    docker build -t autotune .

    # Run: autotune on :8765, models cached in a volume
    docker run -p 8765:8765 \
      -v ollama_models:/root/.ollama \
      -e OLLAMA_MODEL=qwen3:8b \
      autotune

OLLAMA_MODEL auto-pulls the model on first start. Models are cached in the named volume and persist across restarts.

--profile single: Ollama + autotune in one container. Simplest setup.

--profile multi: Separate services. Lighter autotune image (~200 MB). Set AUTOTUNE_OLLAMA_URL=http://ollama:11434.

ollama/ollama:latest includes CUDA and ROCm layers. Add --gpus all for NVIDIA, or mount /dev/kfd for AMD.

What to run

autotune works with any Ollama model. These are the best options as of June 2026. Run autotune recommend to get a hardware-specific recommendation.

| RAM / use case | Model | Size |
|---|---|---|
| 8 GB | qwen3.5:4b | ~2.6 GB |
| 16 GB | qwen3.5:9b | ~5.6 GB |
| 16 GB | gpt-oss:20b | ~14 GB |
| 16 GB | gemma4:12b | ~8.1 GB |
| 24 GB | qwen3.6:27b | ~17 GB |
| 32 GB | qwen3-coder:30b | ~19 GB |
| 48 GB+ | gpt-oss:120b | ~65 GB |
| Coding | devstral:24b | ~14 GB |
| Reasoning | deepseek-r1:32b | ~20 GB |

Open source, MIT licensed. Works with whatever Ollama models you already have. The autotune proof command will show you the exact improvement on your own hardware.
pip install llm-autotune