{"slug": "mlx-optiq-per-layer-mixed-precision-llm-quantization-for-apple-silicon", "title": "Mlx-optiq: per-layer mixed-precision LLM quantization for Apple Silicon", "summary": "Mlx-optiq, a new open-source tool, enables per-layer mixed-precision quantization of large language models on Apple Silicon, allowing users to run, fine-tune, and serve LLMs locally on Macs without a GPU cluster. The tool includes pre-built quantized models on Hugging Face, LoRA fine-tuning, and an API server compatible with OpenAI and Anthropic protocols.", "body_md": "# Quantize, *fine-tune* and serve LLMs entirely on Apple Silicon.\n\nRun large language models locally on your Mac, from M1 to M5. Per-layer sensitivity analysis for mixed-precision weights. LoRA fine-tuning that respects the bit budget. A server that speaks both OpenAI and Anthropic APIs (point Claude Code at your local quant). Send it an image, not just text, on any vision-capable model. No GPU cluster, no API key.\n\n`$ pip install mlx-optiq`\n\n## Drop-in *4-bit* quants. Same weights, smarter bits.\n\nSixteen production mlx-optiq-quantized LLMs on Hugging Face. Nemotron 3, MiniCPM5, Qwen3.5, Qwen3.6 and Gemma-4 families, from 1 B dense to 35 B-A3B mixture-of-experts. They load directly into stock `mlx-lm`. No special runtime.\n\n[Gemma-4 · new](https://huggingface.co/mlx-community/gemma-4-12B-it-OptiQ-4bit)\n\n### gemma-4-12B-it-OptiQ-4bit\n\nGoogle's unified text+vision Gemma-4, at 8.3 GB, with image input. Capability Score 68.2 (+6.4 vs uniform-4-bit), one of our largest mixed-precision gains, and the strongest model we ship under 9 GB on disk.\n\n**8.3 GB** on disk\n\n**68.2** Capability\n\n**+6.4** vs U4\n\n[Gemma-4](https://huggingface.co/mlx-community/gemma-4-31B-it-OptiQ-4bit)\n\n### gemma-4-31B-it-OptiQ-4bit\n\nThe largest single quant we ship. 31 B parameters in 20.8 GB with Capability Score 79.7 (+3.5 vs uniform-4-bit). 
Pair with the matching `-assistant-bf16` drafter for speculative decoding.\n\n**20.8 GB** on disk\n\n**79.7** Capability\n\n**+3.5** vs U4\n\n[Qwen3.6](https://huggingface.co/mlx-community/Qwen3.6-27B-OptiQ-4bit)\n\n### Qwen3.6-27B-OptiQ-4bit\n\nFrontier-class reasoning at 17.5 GB with our highest Capability Score (83.0). Bundled MTP head gives ~1.4× decode via `optiq serve --mtp`.\n\n**17.5 GB** on disk\n\n**83.0** Capability\n\n**+0.5** vs U4\n\n[Qwen3.5](https://huggingface.co/mlx-community/Qwen3.5-9B-OptiQ-4bit)\n\n### Qwen3.5-9B-OptiQ-4bit\n\nThe default daily-driver. 9 B parameters in 6.6 GB. Capability Score 66.8 (+0.2 vs uniform-4-bit). Long context to 64 k via mixed-precision KV; bundled MTP head for speculative decoding.\n\n**6.6 GB** on disk\n\n**66.8** Capability\n\n**+0.2** vs U4\n\n## From *zero* to a serving LLM in three commands.\n\nEach step is reversible and works with stock MLX tools. mlx-optiq is additive. Skip any of these and you still have a working pipeline.\n\n### Install\n\nPure Python. Pulls in `mlx`, `mlx-lm` and `huggingface-hub`. Python 3.11+ on Apple Silicon.\n\n``` bash\n$ pip install mlx-optiq\n```\n\n### Use a pre-built quant\n\nPre-built mlx-optiq quants load with stock `mlx-lm`. Per-layer bit assignment is recorded in the model metadata. No special loader required.\n\n``` python\nfrom mlx_lm import load, generate\n\nmodel, tok = load(\"mlx-community/Qwen3.5-9B-OptiQ-4bit\")\nout = generate(model, tok, prompt=\"Explain mixed-precision quantization.\", max_tokens=200)\nprint(out)\n```\n\n### Serve with mixed-precision KV\n\nThe KV cache is its own sensitivity problem. 
`optiq kv-cache` measures it once per model; `optiq serve` serves with the resulting per-layer config behind an OpenAI-compatible API.\n\n``` bash\n# 1-2 min, once per model\n$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \\\n    --target-bits 5.0 -o ./kv\n\n# OpenAI + Anthropic compatible server on :8080\n# /v1/chat/completions  (OpenAI)\n# /v1/messages          (Anthropic; works with Claude Code, anthropic SDK, etc.)\n$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \\\n    --kv-config ./kv/kv_config.json \\\n    --port 8080\n```\n\n[getting-started guide](/docs/qwen3.5) with model-specific sampling defaults and recommended use cases. Building an agent? Drop [llms.txt](/llms.txt) into your IDE. It's the entire library reference in one Markdown file.\n\n## One sensitivity signal. *A whole toolkit* around it.\n\nA single per-layer KL-divergence pass drives weight, KV-cache and LoRA-rank allocation. The rest of the toolkit (hot-swap adapters, multi-protocol serving with five tested client integrations, image input on the vision models, and the OptIQ Lab GUI for quantize, fine-tune, dataset, and chat workflows) sits around that core.\n\n### Mixed-precision weights\n\nPer-layer KL on calibration data picks the bits. Sensitive layers stay high-precision, the rest go low, at the same average size as uniform-4.\n\n### Mixed-precision KV cache\n\nA separate sensitivity pass on the KV cache. Layer 0 is often 56× more sensitive than average, so uniform 4-bit KV is catastrophic; mixed-precision is not.\n\n### LoRA, two ways\n\nFine-tune with adapter rank scaled by each layer's bits, then keep N adapters mounted on one base and switch them per request, no reload.\n\n### Text and images, one stack\n\nRun text models, and send pictures to the ones that take vision. The vendored vision tower rides in a bf16 sidecar, so one repo loads text-only under `mlx-lm` or full image+text under OptIQ. 
[Vision docs](/docs/vision).\n\n### OpenAI and Anthropic APIs\n\n`optiq serve` speaks both the OpenAI and Anthropic protocols from one process. Point [Claude Code, Codex, OpenCode, OpenClaw, or Hermes Agent](/docs/integrations/) at a local quant.\n\n### OptIQ Lab, a local GUI\n\nA web UI for the whole workflow: quantize wizard, SFT/DPO fine-tuning with a dataset designer, and chat with sandboxed tools (web search, Python, terminal) and image upload.\n\n## Sensitivity, in *three steps*.\n\nUniform 4-bit quantization treats every layer the same, but layers are not the same. mlx-optiq measures, then allocates.\n\n### 1. Measure\n\nFor each layer, simulate-quantize *just that layer* at each candidate bit-width.\nForward-pass calibration data. Measure KL divergence between the perturbed\nlogits and the reference logits. Repeat for every layer; you now have a\n*(layer, bits) → quality cost* table.\n\n### 2. Allocate\n\nGreedy knapsack on the table: start every layer at the lowest bit-width,\nthen greedily upgrade the layer that buys the most KL-reduction per extra bit\nuntil the average bit-budget is exhausted. Layers like `lm_head` and the first/last attention blocks are protected at 8-bit by default.\n\n### 3. Convert\n\nHand the per-layer bit map to `mlx_lm.convert` as a quant\npredicate. The output is a standard MLX checkpoint that loads anywhere\nstock `mlx-lm` loads, with sensitivity metadata stashed on\nthe side for downstream LoRA training.\n\n``` bash\n# Auto-routes between bf16 and uniform-4-bit reference\n# based on available RAM.\n$ optiq convert Qwen/Qwen3.5-9B \\\n    --target-bpw 5.0 \\\n    --candidate-bits 4,8 \\\n    --reference auto \\\n    -o optiq_output/Qwen3.5-9B\n```\n\n`--reference auto` picks bf16 if it fits, otherwise falls back to a uniform-4-bit baseline\nwith bf16-streaming probes, so 27 B+ models still get a calibration-driven\nsignal on a 36 GB Mac. 
The full methodology lives in\n[our research write-up →](/blog/not-all-layers-are-equal)\n\n## Where mlx-optiq sits among the *Mac LLM* options.\n\nA snapshot of how the popular paths stack up on the things that actually move quality and speed on Apple Silicon. None of these are wrong; they're optimizing different axes.\n\n| | mlx-optiq | mlx-lm | llama.cpp |\n|---|---|---|---|\n| Per-layer mixed-precision weights | Yes, calibration-driven | Uniform 4-bit | Block-wise K-quant |\n| Per-layer mixed-precision KV cache | Yes | Uniform 4 / 8 / fp16 | Group-wise int8 only |\n| Sensitivity-aware LoRA fine-tuning | Rank scaled by per-layer bits | Constant rank LoRA | Inference only |\n| OpenAI and Anthropic compatible server | One process, both | OpenAI only | llama-server (OpenAI shim) |\n| Text and image input | Yes | Text only | Image via separate build |\n| Sandboxed tool support for chat | Three tools: web search, Python, terminal | — | — |\n\n## Running LLMs on a Mac, *answered*.\n\n### Can I run LLMs locally on a Mac?\n\nYes. mlx-optiq runs large language models natively on Apple Silicon, from M1 to M5, using Apple's MLX framework. Install it from PyPI with `pip install mlx-optiq`, then quantize, fine-tune, and serve models entirely on your Mac, fully offline.\n\n### Do I need a GPU or PyTorch to run LLMs on a Mac?\n\nNo. There is no PyTorch and no discrete GPU in the path. mlx-optiq is MLX-native and uses the unified memory of Apple Silicon directly, so a MacBook, Mac mini, or Mac Studio is enough. No CUDA, no cloud, no API key.\n\n### How much RAM do I need to run an LLM on a Mac?\n\nIt depends on the model. A 4-bit OptIQ quant of a 4B model needs roughly 3 GB; a 9B needs about 6 GB; larger mixture-of-experts models need more. 
Mixed-precision 4-bit quantization is what lets bigger models fit in a Mac's memory while staying close to full-precision quality.\n\n### What is mlx-optiq?\n\nAn MLX-native toolkit to quantize, fine-tune, and serve LLMs locally on Apple Silicon. Its core is data-driven mixed-precision quantization: it measures each layer's sensitivity and assigns per-layer bit-widths, so quants keep more quality than uniform 4-bit at the same size. It also ships a local web UI (OptIQ Lab) and an OpenAI and Anthropic compatible server.\n\n## Make your Mac an *LLM workstation*.\n\nPick a model, get a snippet, ship it. The docs cover every supported family, fine-tuning recipes, and the OpenAI-compatible serving stack.", "url": "https://wpnews.pro/news/mlx-optiq-per-layer-mixed-precision-llm-quantization-for-apple-silicon", "canonical_source": "https://mlx-optiq.com/", "published_at": "2026-06-14 16:20:43+00:00", "updated_at": "2026-06-14 16:42:37.078996+00:00", "lang": "en", "topics": ["large-language-models", "machine-learning", "ai-tools", "developer-tools", "ai-infrastructure"], "entities": ["Apple", "Hugging Face", "Google", "Gemma-4", "Qwen3.5", "Qwen3.6", "Nemotron", "MiniCPM5"], "alternates": {"html": "https://wpnews.pro/news/mlx-optiq-per-layer-mixed-precision-llm-quantization-for-apple-silicon", "markdown": "https://wpnews.pro/news/mlx-optiq-per-layer-mixed-precision-llm-quantization-for-apple-silicon.md", "text": "https://wpnews.pro/news/mlx-optiq-per-layer-mixed-precision-llm-quantization-for-apple-silicon.txt", "jsonld": "https://wpnews.pro/news/mlx-optiq-per-layer-mixed-precision-llm-quantization-for-apple-silicon.jsonld"}}