Show HN: Makes local LLMs faster and more reliable by optimizing for your device

autotune sits between your code and Ollama and applies automatic optimizations: right-sized KV buffers, KV precision tuning, system prompt caching, intelligent context management, and model keep-alive. The result: 300+ MB freed per request, first word up to 53% faster, and your computer stays responsive. No config changes. Your code stays exactly the same.

pip install llm-autotune
autotune start

How it works

autotune sits between your code and Ollama as a transparent proxy. Before each request reaches Ollama, autotune calculates the exact memory it needs, watches live RAM usage from your other apps, and adjusts automatically. No config. No changes to your code or output quality.

Every time Ollama runs your prompt, it must first allocate a block of RAM called the KV cache — it's where it stores the attention state for every token in the context window. By default, Ollama always allocates for 4,096 tokens. For a typical 50-word message, that's allocating 12× more RAM than the message actually needs. autotune measures the real token count, adds a safe headroom buffer, and tells Ollama the exact minimum. That freed RAM goes back to your browser, your apps, your system.
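The arithmetic behind that allocation can be sketched in a few lines of Python. This is a minimal sketch using the standard KV cache formula (one key and one value vector per layer, per KV head, per token), not autotune's own code. The layer counts, KV head counts, and head dimensions are assumptions taken from the public model configurations, not from this article; with them, the formula gives 576 MiB and 448 MiB at 4,096 tokens, which agrees with the "KV: Before" column in the measured results.

```python
MIB = 1024 * 1024

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_value=2.0):
    """KV cache size: one K and one V vector per layer, per KV head, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return int(per_token * n_ctx)

# Assumed architecture values (not stated in the article).
QWEN3_8B = dict(n_layers=36, n_kv_heads=8, head_dim=128)
LLAMA32_3B = dict(n_layers=28, n_kv_heads=8, head_dim=128)

for name, arch in [("qwen3:8b", QWEN3_8B), ("llama3.2:3b", LLAMA32_3B)]:
    full = kv_cache_bytes(n_ctx=4096, **arch)      # Ollama's default window, F16
    small = kv_cache_bytes(n_ctx=1024, **arch)     # a right-sized window, F16
    print(f"{name}: {full / MIB:.0f} MiB at 4,096 tokens, "
          f"{small / MIB:.0f} MiB at 1,024 tokens")
```

Because the cache grows linearly with the context length, shrinking the window from 4,096 to 1,024 tokens cuts the allocation to a quarter.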

Buckets (512, 768, 1024, 1536, 2048…) prevent Ollama from reallocating the Metal buffer on every call — requests with similar lengths reuse the same pre-allocated buffer, eliminating 100–300 ms of KV thrashing overhead per request.
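A bucketed size picker might look like the sketch below. It is an illustration of the idea, not autotune's implementation: the 256-token headroom is a hypothetical value, and because the article elides the buckets above 2,048, this sketch simply falls back to Ollama's 4,096-token default for anything larger.

```python
import bisect

# Bucket sizes named in the text. Larger buckets are elided in the article,
# so requests above 2,048 fall back to the 4,096 default here (assumption).
BUCKETS = [512, 768, 1024, 1536, 2048]
DEFAULT_CTX = 4096

def pick_bucket(prompt_tokens, headroom=256):
    """Return the smallest bucket that fits the prompt plus headroom.

    `headroom` is a hypothetical allowance for generated tokens.
    """
    needed = prompt_tokens + headroom
    i = bisect.bisect_left(BUCKETS, needed)
    return BUCKETS[i] if i < len(BUCKETS) else DEFAULT_CTX

for tokens in (67, 600, 1800):
    print(tokens, "->", pick_bucket(tokens))
```

Two prompts of 60 and 200 tokens both land in the 512 bucket, so the second request can reuse the buffer allocated for the first.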

Right-sizing the KV cache at request time is the foundation. But RAM usage on your machine is dynamic: Chrome opens a tab, Xcode compiles, a background process wakes up. autotune reads the OS's RAM utilization percentage before every single request and applies two independent levers — context window size and KV precision — across four fixed tiers, maintaining headroom well before any swap risk develops.

KV precision switching (F16 → Q8) cuts the KV cache's RAM footprint in half instantly — with no meaningful quality impact. Q8 stores each attention value in 1 byte instead of 2; the difference in model output is undetectable in practice. These adjustments happen automatically — you see a brief note in the chat UI when one fires. This is a heuristic tier system based on RAM percentage. autotune also runs a separate exact-math pre-flight check (NoSwapGuard) that computes precise KV bytes using your model's architecture — that system only fires when swap is mathematically certain.
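The tier logic described above could be sketched roughly as follows. The article does not publish the tier boundaries or the scaling factors, so every threshold and multiplier here is hypothetical and chosen only to show the shape of a four-tier table with two levers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    max_ram_percent: float  # tier applies while RAM use is at or below this
    ctx_scale: float        # multiplier applied to the requested context window
    kv_type: str            # "f16" (2 bytes per value) or "q8_0" (about 1 byte)

# Hypothetical thresholds and scales, for illustration only.
TIERS = [
    Tier(70.0, 1.00, "f16"),
    Tier(80.0, 1.00, "q8_0"),
    Tier(90.0, 0.75, "q8_0"),
    Tier(100.0, 0.50, "q8_0"),
]

def choose_tier(ram_percent):
    """Pick the first tier whose ceiling covers the current RAM usage."""
    for tier in TIERS:
        if ram_percent <= tier.max_ram_percent:
            return tier
    return TIERS[-1]

def adjust(requested_ctx, ram_percent):
    """Return the (context window, KV precision) pair for this request."""
    tier = choose_tier(ram_percent)
    return int(requested_ctx * tier.ctx_scale), tier.kv_type

print(adjust(2048, 55.0))
print(adjust(2048, 85.0))
```

If psutil is installed, psutil.virtual_memory().percent is one way to obtain the ram_percent input before each request.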

In any multi-turn chat, Ollama re-processes your entire system prompt from scratch on every message. autotune pins those tokens in the KV cache so they're only ever evaluated once — at the start. Every follow-up turn gets faster because fewer tokens need processing. The savings compound with every turn.
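To see why the savings compound, here is a simplified accounting model of the behavior described above. It is a sketch, not a measurement: it ignores conversation history to isolate the system-prompt effect, and the token counts in the example are made-up inputs.

```python
def prefill_tokens(system_tokens, turn_tokens, pinned):
    """Tokens that must be prefilled on each turn.

    Without pinning, every turn re-evaluates the system prompt plus the new
    tokens. With pinning, the system prompt is evaluated only on turn 1.
    """
    counts = []
    for i, new_tokens in enumerate(turn_tokens):
        if pinned and i > 0:
            counts.append(new_tokens)
        else:
            counts.append(system_tokens + new_tokens)
    return counts

turns = [50, 60, 40]  # illustrative per-turn user tokens
unpinned = prefill_tokens(400, turns, pinned=False)
pinned = prefill_tokens(400, turns, pinned=True)
print(unpinned, pinned, sum(unpinned) - sum(pinned))
```

In this model the saving is the system prompt length multiplied by the number of follow-up turns, so longer sessions and longer system prompts benefit the most.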

Ollama unloads the model after 5 minutes idle — a 1–4 second reload every time you come back to it. autotune keeps the model resident in memory between sessions. The weights were already using that RAM; keeping them there costs nothing extra and eliminates the cold-start delay entirely.
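For reference, Ollama's own HTTP API exposes both knobs discussed so far: a keep_alive field on the request and a num_ctx entry under options. The sketch below builds such a request by hand against Ollama's default local address. Whether autotune sets these exact fields internally is an assumption; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local address

def build_chat_request(model, prompt, num_ctx, keep_alive=-1):
    """Build an /api/chat payload with an explicit context size and keep-alive."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "keep_alive": keep_alive,          # negative value: keep the model loaded
        "options": {"num_ctx": num_ctx},   # context window for this request
    }

def send(payload):
    """POST the payload to a running Ollama server and return the parsed JSON."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("qwen3:8b", "Hello!", num_ctx=1024)
print(json.dumps(payload, indent=2))
```

Calling send(payload) requires a running Ollama server with the model installed; building the payload does not.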

Benchmark numbers use qwen3:8b / llama3.2:3b on Apple M2 16 GB. KV savings scale with model size — larger models free more RAM in absolute terms. Generation speed and output quality are unchanged: autotune touches only buffer sizes, precision, and scheduling — never model weights or sampling.

Built-in dashboard

autotune ships a full monitoring and control dashboard: no extra install, no external service, nothing sent to the cloud. Run autotune serve and open localhost:8765/dashboard in any browser. It auto-refreshes every 10 seconds and shows exactly what autotune is doing to your requests in real time.

Model	Requests	Avg TTFT	Avg tok/s	Avg context
qwen3:8b	912	0.39s	46.1	1,536
qwen2.5-coder:7b	372	0.48s	51.7	2,048

export AUTOTUNE_ADMIN_KEY="your-secret-key"
autotune serve

The dashboard requires AUTOTUNE_ADMIN_KEY: no key, no dashboard.

Measured results

Benchmarked on Apple M2 16 GB using Ollama's internal Go nanosecond timers, not wall-clock estimates. 3 runs × 5 prompt types, Wilcoxon signed-rank test. Every number here is reproducible with autotune proof.
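The timers referred to here are the duration fields Ollama returns with each non-streaming response, all in nanoseconds. A rough way to turn them into the two headline metrics is sketched below. Treating load time plus prompt evaluation time as time to first token is an approximation chosen for this sketch, and the sample values are placeholders, not benchmark results.

```python
NS_PER_S = 1_000_000_000

def summarize_timings(resp):
    """Derive approximate TTFT and generation speed from Ollama's timing fields."""
    ttft_ns = resp.get("load_duration", 0) + resp.get("prompt_eval_duration", 0)
    tok_per_s = resp["eval_count"] / resp["eval_duration"] * NS_PER_S
    return {
        "ttft_s": ttft_ns / NS_PER_S,
        "tok_per_s": tok_per_s,
        "prompt_tokens": resp.get("prompt_eval_count", 0),
    }

# Placeholder response fields, shaped like an Ollama /api/chat reply.
sample = {
    "load_duration": 50_000_000,
    "prompt_eval_count": 26,
    "prompt_eval_duration": 250_000_000,
    "eval_count": 100,
    "eval_duration": 2_000_000_000,
}
print(summarize_timings(sample))
```

Comparing prompt_tokens between a raw run and a tuned run is a quick check that no prompt tokens were dropped.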

Model	KV: Before	KV: After	RAM freed	First word
qwen3:8b	576 MB	195 MB	381 MB	−53%
llama3.2:3b	448 MB	155 MB	293 MB	−35%
gemma3n:e4b	96 MB	30 MB	66 MB	−29%

TTFT improvement is largest when the model is cold or when RAM is under pressure. Generation speed (tok/s) is Metal GPU-bound and is not affected by autotune. KV savings apply every single request regardless of hardware.

Every request you send, Ollama allocates a KV buffer for 4,096 tokens. autotune sizes it to the actual prompt — returning hundreds of MB to your system on every single call, automatically.

The KV buffer must be initialized before token 1. A smaller buffer initializes faster. On qwen3:8b, autotune cuts first-word time from the raw baseline by 53% — every new session, every cold request.

autotune changes only the KV buffer size. Model weights, sampling, and generation speed are identical. prompt_eval_count is unchanged: no tokens are dropped or skipped.

Multi-turn & agentic workloads

Single-prompt benchmarks miss the real problem: context accumulates. Each tool call, each reasoning step, each file read appends more tokens. By turn 8, the model is processing 5–8× more tokens than turn 1 — and raw Ollama's fixed 4,096-token window runs out, forcing a full model reload mid-session.

autotune computes a session-ceiling KV window once before the loop starts and locks it for the entire session. No reloads. And because the system prompt is pinned via prefix caching, TTFT actually falls as the session grows — not climbs.

Metric	Raw Ollama	autotune
Session wall time	74 s	40 s
Model reloads	0.5	0.5
TTFT trend per turn	−101 ms/turn	−435 ms/turn
Swap events	0	0
Context at session end	3,043 tokens	1,946 tokens

The system prompt is pinned in KV after turn 1 and never re-evaluated. Each new turn only prefills the new tokens — not the full conversation from scratch. By turn 5, autotune is noticeably faster than turn 1. By turn 10, the difference compounds significantly.

autotune computes a KV window for the full session ceiling before the first turn, then holds it constant. Raw Ollama's fixed 4,096-token window fills up mid-task and forces a model reload (~1–3 s each). autotune trades a slightly higher turn-1 cost to eliminate all reloads.
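One way to picture the session ceiling is as a single up-front projection that is then reused for every turn. The sketch below is an assumption about how such a projection could be made, not autotune's actual formula: the headroom, the rounding step, and the per-turn estimate are all hypothetical parameters.

```python
import math

def session_ceiling(system_tokens, expected_turns, tokens_per_turn,
                    headroom=256, step=512):
    """Project the largest context the session should need, rounded up to `step`.

    All parameters besides the system prompt length are hypothetical estimates
    supplied by the caller.
    """
    projected = system_tokens + expected_turns * tokens_per_turn + headroom
    return math.ceil(projected / step) * step

# Computed once, before the first turn, and reused unchanged afterwards.
ceiling = session_ceiling(system_tokens=400, expected_turns=10, tokens_per_turn=300)
options = {"num_ctx": ceiling}

for turn in range(3):
    # Every turn sends the same options, so the window never changes mid-session.
    print(f"turn {turn + 1}: num_ctx={options['num_ctx']}")
```

Holding the value constant is the point of the design: the turn-1 allocation is larger than a single prompt needs, but no later turn asks for a different window.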

Benchmark: code_debugger task, N=2 trials, Apple M2 16 GB, llama3.2:3b balanced profile. Timings from Ollama's internal Go nanosecond timers. Full methodology in AGENT_BENCHMARK.md.

Verify it yourself

autotune ships with a built-in benchmark that runs two head-to-head tests on your hardware in about 30 seconds. It uses Ollama's own internal Go nanosecond timers — nothing estimated, nothing made up.

Works with any model you have installed in Ollama. Picks the smallest installed model automatically if you don't specify one.

autotune proof -m qwen3:8b

Run autotune proof --list-models to see which Ollama models are available on your machine.

Quickstart

Two commands. No Ollama setup, no config — autotune handles everything.

pip install llm-autotune

autotune start

autotune chat --model qwen3:8b

autotune proof -m qwen3:8b

pip install "llm-autotune[mlx]"

import autotune
from openai import OpenAI

autotune.start()  # start the optimizing proxy

client = OpenAI(**autotune.client_kwargs())

response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello!"}],
)

autotune serve

Docker

The Docker image bundles Ollama and autotune in a single container. No local install needed — just pull the image, mount a volume for model storage, and your OpenAI-compatible endpoint is ready on port 8765.

docker build -t autotune .

docker run -p 8765:8765 \
  -v ollama_models:/root/.ollama \
  -e OLLAMA_MODEL=qwen3:8b \
  autotune

OLLAMA_MODEL auto-pulls the model on first start. Models are cached in the named volume and persist across restarts.
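Once the container is running, any HTTP client can talk to the endpoint on port 8765. The sketch below uses only the Python standard library. The /v1/chat/completions path follows the usual OpenAI convention and is an assumption here, since the article states only that the endpoint is OpenAI-compatible; the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8765/v1"  # "/v1" prefix is an assumption

def chat_request(model, content):
    """Build a POST request in the OpenAI chat-completions format."""
    body = {"model": model, "messages": [{"role": "user", "content": content}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(model, content):
    """Send the request to the running container and return the reply text."""
    with urllib.request.urlopen(chat_request(model, content)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]

req = chat_request("qwen3:8b", "Hello!")
print(req.get_method(), req.full_url)
```

Calling ask("qwen3:8b", "Hello!") needs the container to be up and the model pulled; building the request does not touch the network.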

--profile single: Ollama + autotune in one container. Simplest setup.

--profile multi: Separate services. Lighter autotune image (~200 MB). Set AUTOTUNE_OLLAMA_URL=http://ollama:11434.

ollama/ollama:latest: includes CUDA and ROCm layers. Add --gpus all for NVIDIA, or mount /dev/kfd for AMD.

What autotune does

What to run

autotune works with any Ollama model. These are the best options as of June 2026. Run autotune recommend to get a hardware-specific recommendation.

RAM	Model	Size
8 GB	qwen3.5:4b	~2.6 GB
16 GB	qwen3.5:9b	~5.6 GB
16 GB	gpt-oss:20b	~14 GB
16 GB	gemma4:12b	~8.1 GB
24 GB	qwen3.6:27b	~17 GB
32 GB	qwen3-coder:30b	~19 GB
48 GB+	gpt-oss:120b	~65 GB
Coding	devstral:24b	~14 GB
Reasoning	deepseek-r1:32b	~20 GB

Open source, MIT licensed. Works with whatever Ollama models you already have. The autotune proof command will show you the exact improvement on your own hardware.

pip install llm-autotune

source & further reading

autotunellm.com — original article
