# Mlx-optiq: per-layer mixed-precision LLM quantization for Apple Silicon

> Source: <https://mlx-optiq.com/>
> Published: 2026-06-14 16:20:43+00:00

#
Quantize, *fine-tune*

and serve LLMs

entirely on Apple Silicon.

Run large language models locally on your Mac, from M1 to M5. Per-layer sensitivity analysis for mixed-precision weights. LoRA fine-tuning that respects the bit budget. A server that speaks both OpenAI and Anthropic APIs (point Claude Code at your local quant). Send it an image, not just text, on any vision-capable model. No GPU cluster, no API key.

`$ pip install mlx-optiq`

## Drop-in *4-bit* quants. Same weights, smarter bits.

Sixteen production mlx-optiq-quantized LLMs on Hugging Face. Nemotron 3, MiniCPM5, Qwen3.5, Qwen3.6 and Gemma-4 families, from 1 B dense to 35 B-A3B mixture-of-experts. They load directly into stock `mlx-lm`

. No special runtime.

[Gemma-4 · new
](https://huggingface.co/mlx-community/gemma-4-12B-it-OptiQ-4bit)

### gemma-4-12B-it-OptiQ-4bit

Google's unified text+vision Gemma-4, at 8.3 GB, with image input. Capability Score 68.2 (+6.4 vs uniform-4-bit), one of our largest mixed-precision gains, and the strongest model we ship under 9 GB on disk.

**8.3 GB** on disk

**68.2** Capability

**+6.4** vs U4

[Gemma-4
](https://huggingface.co/mlx-community/gemma-4-31B-it-OptiQ-4bit)

### gemma-4-31B-it-OptiQ-4bit

The largest single quant we ship. 31 B parameters in 20.8 GB with Capability Score 79.7 (+3.5 vs uniform-4-bit). Pair with the matching `-assistant-bf16`

drafter for speculative decoding.

**20.8 GB** on disk

**79.7** Capability

**+3.5** vs U4

[Qwen3.6
](https://huggingface.co/mlx-community/Qwen3.6-27B-OptiQ-4bit)

### Qwen3.6-27B-OptiQ-4bit

Frontier-class reasoning at 17.5 GB with our highest Capability Score (83.0). Bundled MTP head gives ~1.4× decode via `optiq serve --mtp`

.

**17.5 GB** on disk

**83.0** Capability

**+0.5** vs U4

[Qwen3.5
](https://huggingface.co/mlx-community/Qwen3.5-9B-OptiQ-4bit)

### Qwen3.5-9B-OptiQ-4bit

The default daily-driver. 9 B parameters in 6.6 GB. Capability Score 66.8 (+0.2 vs uniform-4-bit). Long context to 64 k via mixed-precision KV; bundled MTP head for speculative decoding.

**6.6 GB** on disk

**66.8** Capability

**+0.2** vs U4

## From *zero* to a serving LLM in three commands.

Each step is reversible and works with stock MLX tools. mlx-optiq is additive. Skip any of these and you still have a working pipeline.

### Install

Pure Python. Pulls in `mlx`

, `mlx-lm`

and `huggingface-hub`

. Python 3.11+ on Apple Silicon.

``` bash
$ pip install mlx-optiq
```

### Use a pre-built quant

Pre-built mlx-optiq quants load with stock `mlx-lm`

. Per-layer bit assignment is recorded in the model metadata. No special loader required.

``` python
from mlx_lm import load, generate

model, tok = load("mlx-community/Qwen3.5-9B-OptiQ-4bit")
out = generate(model, tok, prompt="Explain mixed-precision quantization.", max_tokens=200)
print(out)
```

### Serve with mixed-precision KV

The KV cache is its own sensitivity problem. `optiq kv-cache`

measures it once per model; `optiq serve`

serves with the resulting per-layer config behind an OpenAI-compatible API.

``` bash
# 1-2 min, once per model
$ optiq kv-cache mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --target-bits 5.0 -o ./kv

# OpenAI + Anthropic compatible server on :8080
# /v1/chat/completions  (OpenAI)
# /v1/messages          (Anthropic; works with Claude Code, anthropic SDK, etc.)
$ optiq serve --model mlx-community/Qwen3.5-9B-OptiQ-4bit \
    --kv-config ./kv/kv_config.json \
    --port 8080
```

[getting-started guide](/docs/qwen3.5)with model-specific sampling defaults and recommended use cases. Building an agent? Drop

[llms.txt](/llms.txt)into your IDE. It's the entire library reference in one Markdown file.

## One sensitivity signal. *A whole toolkit* around it.

A single per-layer KL-divergence pass drives weight, KV-cache and LoRA-rank allocation. The rest of the toolkit (hot-swap adapters, multi-protocol serving with five tested client integrations, image input on the vision models, and the OptIQ Lab GUI for quantize, fine-tune, dataset, and chat workflows) sits around that core.

### Mixed-precision weights

Per-layer KL on calibration data picks the bits. Sensitive layers stay high-precision, the rest go low, at the same average size as uniform-4.

### Mixed-precision KV cache

A separate sensitivity pass on the KV cache. Layer 0 is often 56× more sensitive than average, so uniform 4-bit KV is catastrophic; mixed-precision is not.

### LoRA, two ways

Fine-tune with adapter rank scaled by each layer's bits, then keep N adapters mounted on one base and switch them per request, no reload.

### Text and images, one stack

Run text models, and send pictures to the ones that take vision. The vendored vision tower rides in a bf16 sidecar, so one repo loads text-only under `mlx-lm`

or full image+text under OptIQ. [Vision docs](/docs/vision).

### OpenAI and Anthropic APIs

`optiq serve`

speaks both the OpenAI and Anthropic protocols from one process. Point [Claude Code, Codex, OpenCode, OpenClaw, or Hermes Agent](/docs/integrations/) at a local quant.

### OptIQ Lab, a local GUI

A web UI for the whole workflow: quantize wizard, SFT/DPO fine-tuning with a dataset designer, and chat with sandboxed tools (web search, Python, terminal) and image upload.

## Sensitivity, in *three steps*.

Uniform 4-bit quantization treats every layer the same, but layers are not the same. mlx-optiq measures, then allocates.

### 1. Measure

For each layer, simulate-quantize *just that layer* at each candidate bit-width.
Forward-pass calibration data. Measure KL divergence between the perturbed
logits and the reference logits. Repeat for every layer; you now have a
*(layer, bits) → quality cost* table.

### 2. Allocate

Greedy knapsack on the table: start every layer at the lowest bit-width,
then greedily upgrade the layer that buys the most KL-reduction per extra bit
until the average bit-budget is exhausted. Layers like `lm_head`

and the first/last attention blocks are protected at 8-bit by default.

### 3. Convert

Hand the per-layer bit map to `mlx_lm.convert`

as a quant
predicate. The output is a standard MLX checkpoint that loads anywhere
stock `mlx-lm`

loads, with sensitivity metadata stashed on
the side for downstream LoRA training.

```
# Auto-routes between bf16 and uniform-4-bit reference
# based on available RAM.
$ optiq convert Qwen/Qwen3.5-9B \
    --target-bpw 5.0 \
    --candidate-bits 4,8 \
    --reference auto \
    -o optiq_output/Qwen3.5-9B
```

`--reference auto`

picks bf16 if it fits, otherwise falls back to a uniform-4-bit baseline
with bf16-streaming probes, so 27 B+ models still get a calibration-driven
signal on a 36 GB Mac. The full methodology lives in
[our research write-up →](/blog/not-all-layers-are-equal)

## Where mlx-optiq sits among the *Mac LLM* options.

A snapshot of how the popular paths stack up on the things that actually move quality and speed on Apple Silicon. None of these are wrong; they're optimizing different axes.

mlx-optiq |
mlx-lm | llama.cpp | |
|---|---|---|---|
| Per-layer mixed-precision weights | Yes, calibration-driven | Uniform 4-bit | Block-wise K-quant |
| Per-layer mixed-precision KV cache | Yes | Uniform 4 / 8 / fp16 | Group-wise int8 only |
| Sensitivity-aware LoRA fine-tuning | Rank scaled by per-layer bits | Constant rank LoRA | Inference only |
OpenAI and Anthropic compatible server |
One process, both | OpenAI only | llama-server (OpenAI shim) |
| Text and image input | Yes | Text only | Image via separate build |
| Sandboxed tool support for chat | Three tools: web search, Python, terminal | — | — |

## Running LLMs on a Mac, *answered*.

### Can I run LLMs locally on a Mac?

Yes. mlx-optiq runs large language models natively on Apple Silicon, from M1 to M5, using Apple's MLX framework. Install it from PyPI with `pip install mlx-optiq`

, then quantize, fine-tune, and serve models entirely on your Mac, fully offline.

### Do I need a GPU or PyTorch to run LLMs on a Mac?

No. There is no PyTorch and no discrete GPU in the path. mlx-optiq is MLX-native and uses the unified memory of Apple Silicon directly, so a MacBook, Mac mini, or Mac Studio is enough. No CUDA, no cloud, no API key.

### How much RAM do I need to run an LLM on a Mac?

It depends on the model. A 4-bit OptIQ quant of a 4B model needs roughly 3 GB; a 9B needs about 6 GB; larger mixture-of-experts models need more. Mixed-precision 4-bit quantization is what lets bigger models fit in a Mac's memory while staying close to full-precision quality.

### What is mlx-optiq?

An MLX-native toolkit to quantize, fine-tune, and serve LLMs locally on Apple Silicon. Its core is data-driven mixed-precision quantization: it measures each layer's sensitivity and assigns per-layer bit-widths, so quants keep more quality than uniform 4-bit at the same size. It also ships a local web UI (OptIQ Lab) and an OpenAI and Anthropic compatible server.

## Make your Mac an *LLM workstation*.

Pick a model, get a snippet, ship it. The docs cover every supported family, fine-tuning recipes, and the OpenAI-compatible serving stack.
