autotune sits between your code and Ollama and applies automatic optimizations: right-sized KV buffers, KV precision tuning, system prompt caching, intelligent context management, and model keep-alive. The result: 300+ MB freed per request, first word up to 53% faster, and your computer stays responsive. No config changes. Your code stays exactly the same.
pip install llm-autotune
autotune start
How it works
autotune sits between your code and Ollama as a transparent proxy. Before each request reaches Ollama, autotune calculates the exact memory it needs, watches live RAM usage from your other apps, and adjusts automatically. No config. No changes to your code or output quality.
Every time Ollama runs your prompt, it must first allocate a block of RAM called the KV cache, which stores the attention state for every token in the context window. By default, Ollama always allocates for 4,096 tokens. For a typical 50-word message, that is 12× more RAM than the message actually needs. autotune measures the real token count, adds a safe headroom buffer, and tells Ollama the exact minimum. The freed RAM goes back to your browser, your apps, your system.
Buckets (512, 768, 1024, 1536, 2048…) prevent Ollama from reallocating the Metal buffer on every call: requests with similar lengths reuse the same pre-allocated buffer, eliminating 100–300 ms of KV thrashing overhead per request.
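The right-sizing and bucketing described above can be sketched roughly as follows. This is a minimal illustration, not autotune's actual code: the function names, the 4-characters-per-token estimate, the 25% headroom, the reply budget, and the bucket sizes beyond 2048 are all assumptions.

```python
import math

# Bucket sizes quoted in the text; the entries past 2048 are assumed.
BUCKETS = (512, 768, 1024, 1536, 2048, 3072, 4096)


def estimate_tokens(text: str) -> int:
    """Rough estimate at ~4 characters per token (a rule of thumb, not a tokenizer)."""
    return max(1, math.ceil(len(text) / 4))


def pick_num_ctx(prompt_tokens: int, max_new_tokens: int = 256,
                 headroom: float = 0.25) -> int:
    """Return the smallest bucket that fits prompt + reply + headroom."""
    needed = math.ceil((prompt_tokens + max_new_tokens) * (1 + headroom))
    for bucket in BUCKETS:
        if needed <= bucket:
            return bucket
    # Nothing fits: fall back to the largest bucket (hypothetical policy).
    return BUCKETS[-1]
```

Two prompts of similar length, for example 60 and 80 tokens, both land in the 512 bucket here, which is the property that lets a pre-allocated buffer be reused between calls.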
Right-sizing the KV cache at request time is the foundation. But RAM usage on your machine is dynamic: Chrome opens a tab, Xcode compiles, a background process wakes up. autotune reads the OS's RAM utilization percentage before every single request and applies two independent levers (context window size and KV precision) across four fixed tiers, maintaining headroom well before any swap risk develops.
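A four-tier policy of this kind might look like the sketch below. The thresholds, scale factors, and names (`Tier`, `select_tier`, `apply_tier`) are hypothetical and chosen only to show the shape of the logic; the source does not state autotune's real tier boundaries. The RAM percentage is passed in as a plain number; `psutil.virtual_memory().percent` is one common way to read it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    max_ram_percent: float  # tier applies while RAM use is at or below this value
    ctx_scale: float        # multiplier applied to the requested context window
    kv_precision: str       # "f16" or "q8_0"


# Hypothetical thresholds and scales, for illustration only.
TIERS = (
    Tier(70.0, 1.00, "f16"),
    Tier(80.0, 1.00, "q8_0"),
    Tier(90.0, 0.75, "q8_0"),
    Tier(100.0, 0.50, "q8_0"),
)


def select_tier(ram_percent: float) -> Tier:
    """Pick the first tier whose ceiling covers the current RAM utilization."""
    for tier in TIERS:
        if ram_percent <= tier.max_ram_percent:
            return tier
    return TIERS[-1]


def apply_tier(num_ctx: int, ram_percent: float, min_ctx: int = 512) -> tuple[int, str]:
    """Apply both levers: shrink the context window and choose the KV precision."""
    tier = select_tier(ram_percent)
    return max(min_ctx, int(num_ctx * tier.ctx_scale)), tier.kv_precision
```

In this sketch, `apply_tier(2048, 50.0)` leaves the request untouched, while `apply_tier(2048, 95.0)` halves the window and switches to Q8.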
KV precision switching (F16 → Q8) cuts the KV cache's RAM footprint in half instantly, with no meaningful quality impact. Q8 stores each attention value in 1 byte instead of 2; the difference in model output is undetectable in practice. These adjustments happen automatically, and you see a brief note in the chat UI when one fires. This is a heuristic tier system based on RAM percentage. autotune also runs a separate exact-math pre-flight check (NoSwapGuard) that computes precise KV bytes using your model's architecture; that system only fires when swap is mathematically certain.
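The KV arithmetic itself is standard: one K tensor and one V tensor per layer, each holding `n_kv_heads × head_dim` values per token. The sketch below applies that formula; it is not NoSwapGuard's code. The architecture numbers (36 layers, 8 KV heads, head dimension 128) are an assumption about an 8B-class model such as qwen3:8b, and the 1-byte Q8 figure follows the simplification in the text, ignoring the small per-block scale overhead that quantized formats carry.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_value: float = 2.0) -> int:
    """Size of the K and V caches for n_ctx tokens."""
    # The leading 2 accounts for one K tensor and one V tensor per layer.
    return int(2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value)


MIB = 1024 ** 2

# Assumed architecture: 36 layers, 8 KV heads, head_dim 128.
f16 = kv_cache_bytes(36, 8, 128, n_ctx=4096, bytes_per_value=2)
q8 = kv_cache_bytes(36, 8, 128, n_ctx=4096, bytes_per_value=1)
print(f16 // MIB, q8 // MIB)  # 576 288
```

Under these assumptions the F16 figure at 4,096 tokens comes out at 576 MiB, and because the formula is linear in `n_ctx`, shrinking the window reduces the buffer proportionally.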
In any multi-turn chat, Ollama re-processes your entire system prompt from scratch on every message. autotune pins those tokens in the KV cache so they're only ever evaluated once, at the start. Every follow-up turn gets faster because fewer tokens need processing. The savings compound with every turn.
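To see why the savings add up, the toy model below counts how many tokens need prefilling on each turn with and without a pinned system prompt. It is a deliberately simplified accounting (the function name and the example sizes are made up, and real caches may reuse more than the system prompt), not a description of autotune internals.

```python
def prefill_tokens(system_tokens: int, turn_tokens: list[int],
                   system_pinned: bool) -> list[int]:
    """Tokens that must be evaluated before the first output token, per turn."""
    costs = []
    history = system_tokens  # tokens in the conversation so far
    for i, new in enumerate(turn_tokens):
        total = history + new
        if system_pinned and i > 0:
            # The system prompt is already in the KV cache: skip it.
            total -= system_tokens
        costs.append(total)
        history += new
    return costs


# A 500-token system prompt followed by three 50-token turns.
print(prefill_tokens(500, [50, 50, 50], system_pinned=False))  # [550, 600, 650]
print(prefill_tokens(500, [50, 50, 50], system_pinned=True))   # [550, 100, 150]
```

The first turn costs the same either way; every later turn skips the system prompt, so the total saved grows with the number of turns.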
Ollama unloads the model after 5 minutes idle, which means a 1–4 second reload every time you come back to it. autotune keeps the model resident in memory between sessions. The weights were already using that RAM; keeping them there costs nothing extra and eliminates the cold-start delay entirely.
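For reference, Ollama's own HTTP API exposes both knobs discussed so far: `keep_alive` on the request and `num_ctx` under `options`. The sketch below builds such a request by hand. Whether autotune sets these fields in exactly this way is an assumption, and the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local address


def build_chat_request(model: str, prompt: str, num_ctx: int, keep_alive=-1) -> dict:
    """Build an /api/chat payload with an explicit context size and keep-alive."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "keep_alive": keep_alive,         # negative value: keep the model loaded indefinitely
        "options": {"num_ctx": num_ctx},  # context window, and therefore KV buffer, size
    }


def send(payload: dict, url: str = OLLAMA_URL) -> dict:
    """POST the payload to a running Ollama server and return the parsed JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling `send(build_chat_request("qwen3:8b", "Hello!", num_ctx=512))` requires a running Ollama instance with that model pulled.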
Benchmark numbers use qwen3:8b / llama3.2:3b on Apple M2 16 GB. KV savings scale with model size: larger models free more RAM in absolute terms. Generation speed and output quality are unchanged: autotune touches only buffer sizes, precision, and scheduling, never model weights or sampling.
Built-in dashboard
autotune ships a full monitoring and control dashboard: no extra install, no external service, nothing sent to the cloud. Run `autotune serve` and open `localhost:8765/dashboard` in any browser. It auto-refreshes every 10 seconds and shows exactly what autotune is doing to your requests in real time.
| Model | Requests | Avg TTFT | Avg tok/s | Avg context |
|---|---|---|---|---|
| qwen3:8b | 912 | 0.39s | 46.1 | 1,536 |
| qwen2.5-coder:7b | 372 | 0.48s | 51.7 | 2,048 |
export AUTOTUNE_ADMIN_KEY="your-secret-key"
autotune serve
`AUTOTUNE_ADMIN_KEY` is required: no key, no dashboard.
Measured results
Benchmarked on Apple M2 16 GB using Ollama's internal Go nanosecond timers, not wall-clock estimates. 3 runs × 5 prompt types, Wilcoxon signed-rank test. Every number here is reproducible with `autotune proof`.
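Ollama reports its internal timers in every non-streaming response as nanosecond fields (`load_duration`, `prompt_eval_duration`, `eval_duration`) alongside `prompt_eval_count` and `eval_count`. The sketch below turns them into the metrics used in this section. Treating time to first token as load time plus prompt evaluation time is an assumption about the definition, and the sample values are illustrative, not measurements.

```python
NS_PER_S = 1_000_000_000


def summarize_timings(resp: dict) -> dict:
    """Convert Ollama's nanosecond timers into seconds and tokens per second."""
    load_s = resp.get("load_duration", 0) / NS_PER_S
    prefill_s = resp.get("prompt_eval_duration", 0) / NS_PER_S
    eval_s = resp.get("eval_duration", 0) / NS_PER_S
    return {
        "ttft_s": load_s + prefill_s,  # assumed definition of time to first token
        "tok_per_s": resp.get("eval_count", 0) / eval_s if eval_s else 0.0,
        "prompt_tokens": resp.get("prompt_eval_count", 0),
    }


# Illustrative response fragment (made-up values).
sample = {
    "load_duration": 100_000_000,
    "prompt_eval_duration": 290_000_000,
    "prompt_eval_count": 26,
    "eval_count": 92,
    "eval_duration": 2_000_000_000,
}
stats = summarize_timings(sample)
```

Comparing `prompt_tokens` between a raw and an optimized run is a quick way to confirm that no prompt tokens were dropped.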
| Model | KV: Before | KV: After | RAM freed | First word |
|---|---|---|---|---|
| qwen3:8b | 576 MB | 195 MB | 381 MB | -53% |
| llama3.2:3b | 448 MB | 155 MB | 293 MB | -35% |
| gemma3n:e4b | 96 MB | 30 MB | 66 MB | -29% |
TTFT improvement is largest when the model is cold or when RAM is under pressure. Generation speed (tok/s) is Metal GPU-bound and is not affected by autotune. KV savings apply to every single request regardless of hardware.
On every request you send, Ollama allocates a KV buffer for 4,096 tokens. autotune sizes it to the actual prompt, returning hundreds of MB to your system on every single call, automatically.
The KV buffer must be initialized before token 1. A smaller buffer initializes faster. On qwen3:8b, autotune cuts first-word time from the raw baseline by 53% on every new session and every cold request.
autotune changes only the KV buffer size. Model weights, sampling, and generation speed are identical. `prompt_eval_count` is unchanged: no tokens are dropped or skipped.
Multi-turn & agentic workloads
Single-prompt benchmarks miss the real problem: context accumulates. Each tool call, each reasoning step, each file read appends more tokens. By turn 8, the model is processing 5–8× more tokens than turn 1, and raw Ollama's fixed 4,096-token window runs out, forcing a full model reload mid-session.
autotune computes a session-ceiling KV window once before the loop starts and locks it for the entire session. No reloads. And because the system prompt is pinned via prefix caching, TTFT actually falls as the session grows instead of climbing.
| Metric | Raw Ollama | autotune |
|---|---|---|
| Session wall time | 74 s | 40 s |
| Model reloads | 0.5 | 0.5 |
| TTFT trend per turn | -101 ms/turn | -435 ms/turn |
| Swap events | 0 | 0 |
| Context at session end | 3,043 tokens | 1,946 tokens |
The system prompt is pinned in KV after turn 1 and never re-evaluated. Each new turn only prefills the new tokens, not the full conversation from scratch. By turn 5, autotune is noticeably faster than turn 1. By turn 10, the difference compounds significantly.
autotune computes a KV window for the full session ceiling before the first turn, then holds it constant. Raw Ollama's fixed 4,096-token window fills up mid-task and forces a model reload (~1–3 s each). autotune trades a slightly higher turn-1 cost to eliminate all reloads.
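One way to picture a session ceiling: project the total tokens the whole session is likely to reach, add headroom, and round up to a bucket once, before turn 1. The sketch below does exactly that. The function name, the per-turn estimate, the 20% headroom, and the bucket sizes above 2048 are assumptions for illustration rather than autotune's actual parameters.

```python
import math

# Bucket sizes up to 2048 come from the text; the larger entries are assumed.
BUCKETS = (512, 768, 1024, 1536, 2048, 3072, 4096, 6144, 8192)


def session_ceiling(system_tokens: int, expected_turns: int,
                    tokens_per_turn: int, headroom: float = 0.2) -> int:
    """Pick one context window large enough for the whole projected session."""
    projected = system_tokens + expected_turns * tokens_per_turn
    needed = math.ceil(projected * (1 + headroom))
    for bucket in BUCKETS:
        if needed <= bucket:
            return bucket
    # Projection exceeds every bucket: cap at the largest (hypothetical policy).
    return BUCKETS[-1]


# Computed once before the loop and reused for every turn of the session.
num_ctx = session_ceiling(system_tokens=300, expected_turns=8, tokens_per_turn=200)
```

Holding `num_ctx` constant means the first turn allocates more than it strictly needs, which is the turn-1 cost traded for avoiding mid-session reallocation.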
Benchmark: code_debugger task, N=2 trials, Apple M2 16 GB, llama3.2:3b balanced profile. Timings from Ollama's internal Go nanosecond timers. Full methodology in AGENT_BENCHMARK.md.
Verify it yourself
autotune ships with a built-in benchmark that runs two head-to-head tests on your hardware in about 30 seconds. It uses Ollama's own internal Go nanosecond timers: nothing estimated, nothing made up.
Works with any model you have installed in Ollama. Picks the smallest installed model automatically if you don't specify one.
autotune proof -m qwen3:8b
Run `autotune proof --list-models` to see which Ollama models are available on your machine.
Quickstart
Two commands. No Ollama setup, no config: autotune handles everything.
pip install llm-autotune
autotune start
autotune chat --model qwen3:8b
autotune proof -m qwen3:8b
pip install "llm-autotune[mlx]"
import autotune
from openai import OpenAI
autotune.start() # start the optimizing proxy
client = OpenAI(**autotune.client_kwargs())
response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
autotune serve
Docker
The Docker image bundles Ollama and autotune in a single container. No local install needed: just pull the image, mount a volume for model storage, and your OpenAI-compatible endpoint is ready on port 8765.
docker build -t autotune .
docker run -p 8765:8765 \
-v ollama_models:/root/.ollama \
-e OLLAMA_MODEL=qwen3:8b \
autotune
`OLLAMA_MODEL` auto-pulls the model on first start. Models are cached in the named volume and persist across restarts.
`--profile single`: Ollama + autotune in one container. Simplest setup.
`--profile multi`: Separate services. Lighter autotune image (~200 MB). Set `AUTOTUNE_OLLAMA_URL=http://ollama:11434`.
`ollama/ollama:latest` includes CUDA and ROCm layers. Add `--gpus all` for NVIDIA, or mount `/dev/kfd` for AMD.
What autotune does
What to run
autotune works with any Ollama model. These are the best options as of June 2026. Run `autotune recommend` to get a hardware-specific recommendation.
| RAM / use case | Model | Size |
|---|---|---|
| 8 GB | qwen3.5:4b | ~2.6 GB |
| 16 GB | qwen3.5:9b | ~5.6 GB |
| 16 GB | gpt-oss:20b | ~14 GB |
| 16 GB | gemma4:12b | ~8.1 GB |
| 24 GB | qwen3.6:27b | ~17 GB |
| 32 GB | qwen3-coder:30b | ~19 GB |
| 48 GB+ | gpt-oss:120b | ~65 GB |
| Coding | devstral:24b | ~14 GB |
| Reasoning | deepseek-r1:32b | ~20 GB |
Open source, MIT licensed. Works with whatever Ollama models you already have. The `autotune proof` command will show you the exact improvement on your own hardware.
pip install llm-autotune