{"slug": "squish-the-fastest-way-to-run-local-llms-on-apple-silicon", "title": "Squish – The fastest way to run local LLMs on Apple Silicon", "summary": "Squish, a new local AI agent runtime for Apple Silicon, claims to load models 54× faster than standard paths and serve them faster than Ollama, with full offline capability and no cloud dependencies. The tool uses INT4 compression, memory-mapped loading, and a two-cache architecture to achieve sub-second cold starts and reduced memory usage, targeting developers who need private, on-device AI inference.", "body_md": "# Squish\n\nThe Local AI Agent Runtime.\n\nRun any AI model, fully local, on Apple Silicon. Squish loads models in under a second, **54× faster** than the standard path, and serves them faster than Ollama. **No cloud, no API keys, fully offline.**\n\nInstall once\n\nbrew install konjoai/squish/squish\n\n✓ squish 9.34.8 installed\n\nOne command does everything\n\nsquish run qwen2.5:7b\n\n↓ Pulling qwen2.5:7b 4.0 GB\n\n✓ Model ready 0.43s\n\n✓ Chat open at http://localhost:11435 🌐\n\n## Up and running in two steps\n\nInstall once. Then `squish run` handles pull, compress, and serve, and opens your chat UI automatically.\n\nOne Homebrew command. No Docker, no CUDA, no virtual environment setup.\n\n`brew install konjoai/squish/squish`\n\nDownloads the pre-optimised model if needed, loads it in milliseconds, and opens your chat UI in the browser.\n\n`squish run qwen2.5:7b`\n\n`squish serve` is an alias for `squish run`; use whichever feels right.\n\n## Your data never leaves your Mac\n\nEvery inference runs on your hardware, in your memory. No telemetry on conversations, no API quotas, no usage bills. Fast, private AI you own outright.\n\nEverything runs on-device: no API rate limits, no per-token billing, no data leaving your Mac.\n\nINT4 compression turns a 16 GB BF16 8B model into 4.4 GB. 
Run two models where you used to fit one.\n\nCalibrated quantisation holds benchmark accuracy loss to ≤1.5 pp across ARC-Easy, HellaSwag, WinoGrande, and PIQA at the tested sample size.\n\nSquish ships 100+ composable optimisation modules. Each release improves TTFT and decode throughput, and the optimisations are applied automatically.\n\n## Built for speed at every layer\n\nFrom storage format to HTTP serving, every decision is optimised for Apple Silicon unified memory.\n\nMemory-mapped INT4 tensors load directly into Metal unified memory with zero dtype conversion. A 1.5B model is ready in **0.33–0.53 s**, versus 28.8 s for the standard loader, on 160 MB of RAM.\n\nZero code changes. LangChain, LlamaIndex, the OpenAI SDK, Cursor, and any tool that speaks `/v1/chat/completions` work out of the box.\n\nAgents resend the same long system prompt every turn. Squish's **two-cache architecture** reuses the prefill instead of re-running it, so a repeated prompt skips straight to decode.\n\nSmall models hallucinate syntax. Squish uses engine-level **Finite State Machine (FSM) masking** to constrain every token to valid JSON matching your schema. Agents never crash a parser again.\n\nA 32k context window normally pushes a 16 GB Mac into swap. Squish's **Asymmetric INT4 KV Cache** shrinks the KV footprint by 75%, keeping all context hot in unified memory.\n\nProcess multiple prompts in a single request. This is essential for evals, data pipelines, and bulk generation, and it is a capability Ollama and LM Studio don't offer.\n\n## Why Squish beats the rest\n\nReal measurements, same hardware. 
Apple M3 MacBook Pro, 16 GB, thermally controlled.\n\n| Metric | Ollama | LM Studio | Squish ✶ |\n|---|---|---|---|\n| Cold start (load + first token) | 20–30 s | ~18–28 s | 0.5 s ✶ |\n| Decode throughput, 7B | 20.3 tok/s | n/a | 24.0 tok/s ✶ |\n| Inter-token tail latency (p95) | 52 ms | n/a | 43 ms ✶ |\n| Full response, 4000-token prompt | 37.5 s | n/a | 3.8 s (9.8× faster) ✶ |\n| Peak RAM, serving | 5.1 GB | n/a | 3.5 GB ✶ |\n| Disk size, 7B INT4 | 4.4 GB (GGUF Q4) | 4.7 GB (GGUF Q4) | 4.0 GB INT4 ✶ |\n| OpenAI API | ✓ | ✓ | ✓ |\n| Batch requests | ✗ | ✗ | ✓ |\n| Pre-optimised weights (HuggingFace) | ✗ | ✗ | ✓ 9 prebuilt |\n| Auto-open chat UI | ✗ | ✓ | ✓ |\n| Zero-copy mmap Metal load | ✗ | ✗ | ✓ |\n| Repeat-prompt TTFT (KV cache hit) | ~160 ms | n/a | 4–11 ms ✶ |\n| Guaranteed JSON Syntax (FSM) | ✗ | ✗ | ✓ 100% Reliable |\n| Context Window Compression | FP16 Only (High VRAM) | FP16 Only | INT4 (75% Less VRAM) |\n\n✶ M3 16 GB, thermally controlled. Cold start: Qwen2.5-1.5B. Serving (decode, tail, E2E, RAM): Qwen2.5-7B INT3 vs Ollama 0.30.7. Squish v9.34.8. On a loaded model, single-token TTFT is comparable (Ollama 167 ms / Squish 192 ms); Squish’s edge is everywhere else.\n\n## Everything you need, right here\n\nmacOS via Homebrew (recommended)\n\nbrew install konjoai/squish/squish\n\n✓ squish 9.34.8 installed\n\nOr via pip (Python 3.11–3.14)\n\npip install squish-ai\n\nVerify installation\n\nsquish --version\n\nsquish 9.34.8\n\nOne command: pull, optimise, serve, open browser\n\nsquish run qwen2.5:7b\n\n↓ Pulling qwen2.5:7b 4.0 GB ██████████ 100%\n\n✓ Model ready 0.43s\n\n✓ Server http://localhost:11435\n\n✓ Chat UI opening in browser... 🌐\n\nNo model? Interactive picker appears\n\nsquish run\n\n? 
Choose a model:\n\n> qwen2.5:7b 4.0 GB · INT4 (recommended)\n\nqwen3:4b 2.3 GB · INT4\n\nllama3.2:3b 1.5 GB · INT4\n\nBrowser UI opens automatically after `squish run`\n\n┌─────────────────────────────────────┐\n\n│ 🟣 squish localhost:11435 │\n\n├─────────────────────────────────────┤\n\n│ Model: qwen2.5:7b ▾ │\n\n├─────────────────────────────────────┤\n\n│ │\n\n│ 🟢 Hi! Running on your Mac. │\n\n│ No cloud. No cost. Fully private. │\n\n│ │\n\n│ You: [ ] → │\n\n└─────────────────────────────────────┘\n\nTerminal chat (no browser)\n\nsquish chat qwen2.5:7b\n\nLoading... 0.4s\n\nYou: Hello!\n\nAI: Hi! How can I help? <streams instantly>\n\n`squish run` already starts a server; you can also start one manually\n\nsquish serve qwen2.5:7b\n\n→ http://localhost:11435 (OpenAI-compatible)\n\nZero code changes from the OpenAI SDK\n\npython3\n\nimport openai\n\nclient = openai.OpenAI(\n\nbase_url=\"http://localhost:11435/v1\",\n\napi_key=\"local\"\n\n)\n\nr = client.chat.completions.create(\n\nmodel=\"qwen2.5:7b\",\n\nmessages=[{\"role\":\"user\",\"content\":\"Hello!\"}]\n\n)\n\nprint(r.choices[0].message.content)\n\nOllama-compatible too\n\nexport OLLAMA_HOST=http://localhost:11435\n\nCompress any HuggingFace model to INT4 locally\n\nsquish pull meta-llama/Llama-3.3-70B-Instruct --int4\n\nDownloading weights...\n\nQuantising INT4 ████████████ 100%\n\n✓ 18.2 GB → 4.9 GB (73% smaller)\n\nINT8 for near-lossless quality (~50% smaller)\n\nsquish pull meta-llama/Llama-3.3-70B-Instruct --int8\n\n✓ 18.2 GB → 9.2 GB (within measurement noise)\n\nRust quantizer: 4–6× faster compression (optional)\n\ncargo build --release -p squish_quant_rs\n\n## Join the Squish community\n\nChat, contribute, and share pre-squished models with others running local AI on Apple Silicon.\n\n## Reclaim your VRAM. Unleash your Agents.\n\nTurn your MacBook into a fast, private local AI runtime in under 60 seconds. 
No cloud, no API bills.", "url": "https://wpnews.pro/news/squish-the-fastest-way-to-run-local-llms-on-apple-silicon", "canonical_source": "https://squish.run/", "published_at": "2026-06-29 05:32:11+00:00", "updated_at": "2026-06-29 05:58:06.380094+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "ai-infrastructure", "developer-tools"], "entities": ["Squish", "Konjo AI", "Ollama", "LM Studio", "Apple Silicon", "Metal", "HuggingFace", "Qwen2.5"], "alternates": {"html": "https://wpnews.pro/news/squish-the-fastest-way-to-run-local-llms-on-apple-silicon", "markdown": "https://wpnews.pro/news/squish-the-fastest-way-to-run-local-llms-on-apple-silicon.md", "text": "https://wpnews.pro/news/squish-the-fastest-way-to-run-local-llms-on-apple-silicon.txt", "jsonld": "https://wpnews.pro/news/squish-the-fastest-way-to-run-local-llms-on-apple-silicon.jsonld"}}