mlx-optiq: per-layer mixed-precision LLM quantization for Apple Silicon

mlx-optiq, a new open-source tool, performs per-layer mixed-precision quantization of large language models on Apple Silicon, so users can run, fine-tune, and serve LLMs locally on a Mac without a GPU cluster. It ships with pre-built quantized models on Hugging Face, LoRA fine-tuning, and an API server compatible with the OpenAI and Anthropic protocols.

Quantize, fine-tune and serve LLMs entirely on Apple Silicon.

Run large language models locally on your Mac, from M1 to M5. Per-layer sensitivity analysis for mixed-precision weights. LoRA fine-tuning that respects the bit budget. A server that speaks both OpenAI and Anthropic APIs (point Claude Code at your local quant). Send it an image, not just text, on any vision-capable model. No GPU cluster, no API key.

```bash
$ pip install mlx-optiq
```

Drop-in 4-bit quants. Same weights, smarter bits.

Sixteen production mlx-optiq-quantized LLMs on Hugging Face. Nemotron 3, MiniCPM5, Qwen3.5, Qwen3.6 and Gemma-4 families, from 1B dense to 35B-A3B mixture-of-experts. They load directly into stock `mlx-lm`. No special runtime.

gemma-4-12B-it-OptiQ-4bit (Gemma-4 · new)
https://huggingface.co/mlx-community/gemma-4-12B-it-OptiQ-4bit
Google's unified text+vision Gemma-4, at 8.3 GB, with image input. Capability Score 68.2 (+6.4 vs uniform 4-bit), one of our largest mixed-precision gains, and the strongest model we ship under 9 GB on disk.
8.3 GB on disk · Capability 68.2 · +6.4 vs U4

gemma-4-31B-it-OptiQ-4bit (Gemma-4)
https://huggingface.co/mlx-community/gemma-4-31B-it-OptiQ-4bit
The largest single quant we ship. 31B parameters in 20.8 GB with Capability Score 79.7 (+3.5 vs uniform 4-bit). Pair with the matching `-assistant-bf16` drafter for speculative decoding.
20.8 GB on disk · Capability 79.7 · +3.5 vs U4

Qwen3.6-27B-OptiQ-4bit (Qwen3.6)
https://huggingface.co/mlx-community/Qwen3.6-27B-OptiQ-4bit
Frontier-class reasoning at 17.5 GB with our highest Capability Score, 83.0. Bundled MTP head gives ~1.4× decode via `optiq serve --mtp`.
17.5 GB on disk · Capability 83.0 · +0.5 vs U4

Qwen3.5-9B-OptiQ-4bit (Qwen3.5)
https://huggingface.co/mlx-community/Qwen3.5-9B-OptiQ-4bit
The default daily-driver. 9B parameters in 6.6 GB. Capability Score 66.8 (+0.2 vs uniform 4-bit). Long context to 64k via mixed-precision KV; bundled MTP head for speculative decoding.
6.6 GB on disk · Capability 66.8 · +0.2 vs U4

From zero to a serving LLM in three commands.

Each step is reversible and works with stock MLX tools. mlx-optiq is additive. Skip any of these and you still have a working pipeline.

Install
Pure Python. Pulls in `mlx`, `mlx-lm` and `huggingface-hub`. Python 3.11+ on Apple Silicon.

```bash
$ pip install mlx-optiq
```

Use a pre-built quant
Pre-built mlx-optiq quants load with stock `mlx-lm`. Per-layer bit assignment is recorded in the model metadata. No special loader required.

```python
from mlx_lm import load, generate

# Download (or reuse the cached copy of) the quantized model and its tokenizer.
model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")

out = generate(
    model,
    tok,
    prompt="Explain mixed-precision quantization.",
    max_tokens=200,
)
print(out)
```

Serve with mixed-precision KV
The KV cache is its own sensitivity problem. `optiq kv-cache` measures it once per model; `optiq serve` serves with the resulting per-layer config behind an OpenAI-compatible API.

```bash
# 1-2 min, once per model
$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --target-bits 5.0 -o ./kv

# OpenAI + Anthropic compatible server on :8080
#   /v1/chat/completions   OpenAI
#   /v1/messages           Anthropic; works with Claude Code, anthropic SDK, etc.
$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --kv-config ./kv/kv_config.json \
    --port 8080
```

See the getting-started guide (/docs/qwen3.5) for model-specific sampling defaults and recommended use cases.

Building an agent? Drop llms.txt (/llms.txt) into your IDE. It's the entire library reference in one Markdown file.

One sensitivity signal. A whole toolkit around it.

A single per-layer KL-divergence pass drives weight, KV-cache and LoRA-rank allocation. The rest of the toolkit (hot-swap adapters, multi-protocol serving with five tested client integrations, image input on the vision models, and the OptIQ Lab GUI for quantize, fine-tune, dataset, and chat workflows) sits around that core.

Mixed-precision weights
Per-layer KL on calibration data picks the bits.
Sensitive layers stay high-precision, the rest go low, at the same average size as uniform 4-bit.

Mixed-precision KV cache
A separate sensitivity pass on the KV cache. Layer 0 is often 56× more sensitive than average, so uniform 4-bit KV is catastrophic; mixed-precision is not.

LoRA, two ways
Fine-tune with adapter rank scaled by each layer's bits, then keep N adapters mounted on one base and switch them per request, no reload.

Text and images, one stack
Run text models, and send pictures to the ones that take vision. The vendored vision tower rides in a bf16 sidecar, so one repo loads text-only under `mlx-lm` or full image+text under OptIQ. Vision docs: /docs/vision.

OpenAI and Anthropic APIs
`optiq serve` speaks both the OpenAI and Anthropic protocols from one process. Point Claude Code, Codex, OpenCode, OpenClaw, or Hermes Agent (/docs/integrations/) at a local quant.

OptIQ Lab, a local GUI
A web UI for the whole workflow: quantize wizard, SFT/DPO fine-tuning with a dataset designer, and chat with sandboxed tools (web search, Python, terminal) and image upload.

Sensitivity, in three steps.

Uniform 4-bit quantization treats every layer the same, but layers are not the same. mlx-optiq measures, then allocates.

1. Measure. For each layer, simulate-quantize just that layer at each candidate bit-width. Forward-pass calibration data. Measure KL divergence between the perturbed logits and the reference logits. Repeat for every layer; you now have a (layer, bits) → quality-cost table.

2. Allocate. Greedy knapsack on the table: start every layer at the lowest bit-width, then greedily upgrade the layer that buys the most KL-reduction per extra bit until the average bit-budget is exhausted. Layers like `lm_head` and the first/last attention blocks are protected at 8-bit by default.

3. Convert. Hand the per-layer bit map to `mlx_lm.convert` as a quant predicate.
The output is a standard MLX checkpoint that loads anywhere stock `mlx-lm` loads, with sensitivity metadata stashed on the side for downstream LoRA training.

```bash
# Auto-routes between bf16 and uniform 4-bit reference based on available RAM.
$ optiq convert Qwen/Qwen3.5-9B \
    --target-bpw 5.0 \
    --candidate-bits 4,8 \
    --reference auto \
    -o optiq_output/Qwen3.5-9B
```

`--reference auto` picks bf16 if it fits, otherwise falls back to a uniform 4-bit baseline with bf16-streaming probes, so 27B+ models still get a calibration-driven signal on a 36 GB Mac.

The full methodology lives in our research write-up → /blog/not-all-layers-are-equal

Where mlx-optiq sits among the Mac LLM options.

A snapshot of how the popular paths stack up on the things that actually move quality and speed on Apple Silicon. None of these are wrong; they're optimizing different axes.

| Feature | mlx-optiq | mlx-lm | llama.cpp |
|---|---|---|---|
| Per-layer mixed-precision weights | Yes, calibration-driven | Uniform 4-bit | Block-wise K-quant |
| Per-layer mixed-precision KV cache | Yes | Uniform 4 / 8 / fp16 | Group-wise int8 only |
| Sensitivity-aware LoRA fine-tuning | Rank scaled by per-layer bits | Constant rank LoRA | Inference only |
| OpenAI and Anthropic compatible server | One process, both | OpenAI only | llama-server OpenAI shim |
| Text and image input | Yes | Text only | Image via separate build |
| Sandboxed tool support for chat | Three tools: web search, Python, terminal | None | None |

Running LLMs on a Mac, answered.

Can I run LLMs locally on a Mac?
Yes. mlx-optiq runs large language models natively on Apple Silicon, from M1 to M5, using Apple's MLX framework. Install it from PyPI with `pip install mlx-optiq`, then quantize, fine-tune, and serve models entirely on your Mac, fully offline.

Do I need a GPU or PyTorch to run LLMs on a Mac?
No. There is no PyTorch and no discrete GPU in the path.
mlx-optiq is MLX-native and uses the unified memory of Apple Silicon directly, so a MacBook, Mac mini, or Mac Studio is enough. No CUDA, no cloud, no API key.

How much RAM do I need to run an LLM on a Mac?
It depends on the model. A 4-bit OptIQ quant of a 4B model needs roughly 3 GB; a 9B needs about 6 GB; larger mixture-of-experts models need more. Mixed-precision 4-bit quantization is what lets bigger models fit in a Mac's memory while staying close to full-precision quality.

What is mlx-optiq?
An MLX-native toolkit to quantize, fine-tune, and serve LLMs locally on Apple Silicon. Its core is data-driven mixed-precision quantization: it measures each layer's sensitivity and assigns per-layer bit-widths, so quants keep more quality than uniform 4-bit at the same size. It also ships a local web UI (OptIQ Lab) and an OpenAI and Anthropic compatible server.

Make your Mac an LLM workstation.

Pick a model, get a snippet, ship it. The docs cover every supported family, fine-tuning recipes, and the OpenAI-compatible serving stack.
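
For readers who want to see the Allocate step from "Sensitivity, in three steps" in concrete form, here is a minimal sketch of a greedy bit allocation over a (layer, bits) → KL cost table. It is not mlx-optiq's actual code: the function name `allocate_bits`, its arguments, the per-layer parameter counts, and every number in the example are assumptions made up for illustration. The sketch weights the bit budget by parameter count, which is one plausible reading of "average bit-budget".

```python
from typing import Dict, Optional, Sequence, Set


def allocate_bits(
    kl_cost: Dict[str, Dict[int, float]],  # layer -> {bits: KL vs reference logits}
    n_params: Dict[str, int],              # layer -> number of weights (assumed input)
    target_bpw: float,                     # average bits per weight to stay under
    candidate_bits: Sequence[int] = (4, 8),
    protected: Optional[Set[str]] = None,  # layers pinned at protected_bits
    protected_bits: int = 8,
) -> Dict[str, int]:
    """Greedy knapsack: start low, upgrade the best KL-per-bit layer until the budget is spent."""
    bits_sorted = sorted(candidate_bits)
    protected = protected or set()

    # Every layer starts at the lowest candidate; protected layers are pinned.
    alloc = {
        name: protected_bits if name in protected else bits_sorted[0]
        for name in kl_cost
    }
    budget = target_bpw * sum(n_params[name] for name in alloc)
    used = sum(alloc[name] * n_params[name] for name in alloc)

    while True:
        best = None  # (kl_gain_per_extra_bit, layer, next_bits, extra_bits)
        for name, cur in alloc.items():
            if name in protected:
                continue
            idx = bits_sorted.index(cur)
            if idx + 1 >= len(bits_sorted):
                continue  # already at the highest candidate
            nxt = bits_sorted[idx + 1]
            extra = (nxt - cur) * n_params[name]
            if used + extra > budget:
                continue  # this upgrade would exceed the budget
            gain = kl_cost[name][cur] - kl_cost[name][nxt]
            if gain <= 0:
                continue  # more bits do not help this layer
            ratio = gain / extra
            if best is None or ratio > best[0]:
                best = (ratio, name, nxt, extra)
        if best is None:
            break  # nothing affordable and useful is left
        _, name, nxt, extra = best
        alloc[name] = nxt
        used += extra
    return alloc


# Made-up sensitivity table: layers.0 is far more sensitive than the others.
kl_cost = {
    "lm_head": {4: 0.30, 8: 0.00},
    "layers.0": {4: 0.56, 8: 0.01},
    "layers.1": {4: 0.02, 8: 0.01},
    "layers.2": {4: 0.05, 8: 0.01},
}
n_params = {name: 100 for name in kl_cost}

alloc = allocate_bits(kl_cost, n_params, target_bpw=6.0, protected={"lm_head"})
print(alloc)
```

With these invented numbers the budget leaves room for exactly one upgrade beyond the protected `lm_head`, and it goes to `layers.0`, the layer whose KL drops the most per extra bit. A real pipeline would fill `kl_cost` from calibration runs rather than by hand.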