Tuning CPU-only Qwen3-30B inference with an IBM Quantum sampling loop

A developer achieved 14.03 generation tokens per second running the Qwen3-30B-A3B-Instruct Mixture-of-Experts LLM on a 2017 Intel MacBook Air with only 8GB RAM and no GPU, using an IBM Quantum sampling loop to optimize hyperparameters. The quantum-enhanced workflow, which uses the QPU only for candidate selection while the local machine runs inference and judges output quality, represents a 155x improvement over the baseline 0.09 tokens per second. The project's benchmark harness, experiment logs, and paper draft are publicly available on GitHub and Hugging Face.

Quantum-enhanced autoresearch for high-performance, CPU-only Mixture-of-Experts LLM inference on legacy hardware. This repository contains the benchmark harness, MCP-style tool boundary, experiment logs, paper draft, and IBM Quantum candidate-sampling workflow from the 2017 Intel MacBook Air Qwen3 MoE project. Public pages: - GitHub: https://github.com/Shack870/qwen-air-qpu-mcp-lab https://github.com/Shack870/qwen-air-qpu-mcp-lab - GitHub preprint release: https://github.com/Shack870/qwen-air-qpu-mcp-lab/releases/tag/v0.1-preprint https://github.com/Shack870/qwen-air-qpu-mcp-lab/releases/tag/v0.1-preprint - Hugging Face collection: https://huggingface.co/collections/Shack870/qwen-air-qpu-mcp-lab-6a174dd8d752afe40a429846 https://huggingface.co/collections/Shack870/qwen-air-qpu-mcp-lab-6a174dd8d752afe40a429846 - Hugging Face dataset artifacts: https://huggingface.co/datasets/Shack870/qwen-air-qpu-mcp-lab https://huggingface.co/datasets/Shack870/qwen-air-qpu-mcp-lab - Hugging Face interactive dashboard Space: https://huggingface.co/spaces/Shack870/qwen-air-qpu-dashboard https://huggingface.co/spaces/Shack870/qwen-air-qpu-dashboard The short version: - Model: Qwen3-30B-A3B-Instruct-2507-GGUF , Q3 K S 2.66bpw - Hardware: 2017 Intel MacBook Air, 8GB RAM, CPU-only - Context: 16,384 tokens - Starting point: about 0.09 generation tokens/sec - Classical systems optimization frontier: 6.49 generation tokens/sec - First IBM Quantum-informed breakthrough: 13.12 generation tokens/sec - Strict quality-gated record: 14.03 generation tokens/sec - Clean-room Codex-off check: 13.91 generation tokens/sec - Speed-only rejected lane: 16.53 generation tokens/sec, not claimed because output coherence failed This is not a claim that an IBM QPU ran Qwen. It did not. The core contribution is the synchronized loop: php Human Experimenter sets the goal and constraints - Codex proposes, edits, runs, logs, and interprets experiments - the MacBook runs real llama.cpp inference and judges candidates - the local database scores the run frontier - compact candidate choices are compressed into QUBO form - IBM Quantum samples candidate bitstrings - Codex decodes those bitstrings into concrete llama.cpp configs - the MacBook tests them - the loop repeats The QPU improves the research loop's candidate selection. The MacBook remains the judge. The model remains local. The result is a small hybrid quantum optimization lab for routed MoE inference. See the paper draft: Quantum-Enhanced Hyperparameter Tuning for High-Performance On-Device CPU-Only Inference of Mixture-of-Experts LLMs on Legacy Hardware /Shack870/qwen-air-qpu-mcp-lab/blob/main/paper/quantum enhanced legacy moe inference.md Generated preprint PDF /Shack870/qwen-air-qpu-mcp-lab/blob/main/paper/quantum enhanced legacy moe inference.pdf paper/ - paper draft, selected run snapshots, and generated SVG figures paper/data/qpu lab public.sqlite - sanitized public SQLite benchmark and QPU job database paper/data/public runs.csv - sanitized public run log powering the Space dashboard qpu mcp lab/ - benchmark harness, objective scorer, optimizer, QUBO builder, IBM Quantum adapter, and MCP-style server huggingface/space/ - Gradio leaderboard and config explorer source scripts/ - experiment drivers and reproducibility scripts docs/REPRODUCIBILITY.md - validation protocol docs/COMMUNITY VALIDATION.md - guide for outside benchmark reports docs/HUGGINGFACE BLOG DRAFT.md - draft article for the Hugging Face Blog editor docs/PRESS KIT.md - concise public launch material docs/RESULTS.md - result narrative and milestone summary SECURITY.md - secret handling and QPU guardrails config.example.json - local config template This repo does not include model weights or a compiled llama-cli . You need: - Python 3.11 or newer - a local llama-cli or compatible fork build - the ByteShape GGUF model file: Qwen3-30B-A3B-Instruct-2507-Q3 K S-2.66bpw.gguf - optional IBM Quantum credentials for real QPU jobs Reference local paths from the original lab: ~/src/ik llama.cpp/build-air-iqk-lean/bin/llama-cli ~/qwen-air-tests/models/byteshape-qwen3-30b-a3b-2507/Qwen3-30B-A3B-Instruct-2507-Q3 K S-2.66bpw.gguf git clone https://github.com/Shack870/qwen-air-qpu-mcp-lab.git cd qwen-air-qpu-mcp-lab python3 -m venv .venv . .venv/bin/activate pip install -r requirements.txt cp config.example.json config.json Edit config.json : { "llama bin": "~/src/ik llama.cpp/build-air-iqk-lean/bin/llama-cli", "model path": "~/qwen-air-tests/models/byteshape-qwen3-30b-a3b-2507/Qwen3-30B-A3B-Instruct-2507-Q3 K S-2.66bpw.gguf", "llama repo": "~/src/ik llama.cpp", "safe memory gb": 6.5, "default backend": "local-simulator", "allow real qpu jobs by default": false } You can also provide paths through environment variables: export QPU MCP LAB LLAMA BIN="$HOME/src/ik llama.cpp/build-air-iqk-lean/bin/llama-cli" export QPU MCP LAB MODEL PATH="$HOME/qwen-air-tests/models/byteshape-qwen3-30b-a3b-2507/Qwen3-30B-A3B-Instruct-2507-Q3 K S-2.66bpw.gguf" Validate the environment: .venv/bin/python scripts/validate environment.py Initialize the database: .venv/bin/python -m qpu mcp lab.cli init-db Run the record-family config: .venv/bin/python -m qpu mcp lab.cli run --config-json '{ "label": "strict record reproduction", "prompt": "<|im start| user\nContinue this comma-separated list of Mars facts: red planet, thin atmosphere,<|im end| \n<|im start| assistant\n", "ctx size": 16384, "batch size": 2456, "ubatch size": 144, "threads": 4, "threads batch": 4, "cache type k": "q6 0", "cache type v": "q6 0", "flash attn": true, "smart expert reduction": "3,1", "env veclib threads": 1, "env omp wait policy": "ACTIVE", "env omp dynamic": "FALSE", "env ser cheap ranges": "24:30", "env ser cheap min": 2, "env ser cheap thresh": 1.0, "n predict": 128, "temp": 0.0, "ignore eos": true, "no display prompt": true, "timeout seconds": 420 }' Reference results from the original machine: - strict record: 14.03 tok/s - clean-room lane: 13.91 tok/s - first QPU-informed jump: 13.12 tok/s - classical frontier before QPU sampling: 6.49 tok/s - original proof-of-life baseline: about 0.09 tok/s Exact repeats vary with thermals, page-cache state, context switches, and prompt shape. Report both throughput and output quality. A speed result is not a quality result unless the output remains coherent. The strict gate used short factual/code prompts such as: What is the capital of Serbia? What is the capital of Mars? Write a compact Python function named is prime that checks whether n is prime. Known pattern: - broad speed-only expert reductions can produce high tokens/sec and broken text - the accepted record lane is lower than the fastest raw lane because it preserves coherence Do not put IBM API keys in Git, config.json , .env , shell history, screenshots, paper drafts, logs, or chat messages. Preferred macOS setup: ./scripts/store ibm key.sh That script prompts for the key without echoing it and stores it in macOS Keychain under: ibm quantum api key - optional ibm quantum instance crn The harness reads credentials in this order: IBM QUANTUM API KEY , then Keychain service ibm quantum api key IBM QUANTUM INSTANCE , then Keychain service ibm quantum instance crn Temporary environment-variable setup also works: export IBM QUANTUM API KEY="paste-token-here" export IBM QUANTUM INSTANCE="optional-instance-or-crn" For safety, Keychain storage is preferred. Check credential status without printing secrets: .venv/bin/python -m qpu mcp lab.cli quantum-credentials List available IBM backends: .venv/bin/python -m qpu mcp lab.cli quantum-backends Real QPU submission is guarded. The harness defaults to dry-run or local simulation unless the command includes --allow-real-qpu . Example guarded workflow: .venv/bin/python -m qpu mcp lab.cli build-qubo .venv/bin/python -m qpu mcp lab.cli sweep-qaoa-angles --limit 5 .venv/bin/python -m qpu mcp lab.cli submit-micro-frontier \ --backend ibm fez \ --shots 256 \ --allow-real-qpu After an IBM job completes: .venv/bin/python -m qpu mcp lab.cli quantum-jobs --limit 5 .venv/bin/python -m qpu mcp lab.cli job-result JOB ID --refresh .venv/bin/python -m qpu mcp lab.cli decode-job-candidates JOB ID --top-k 12 The decoded candidates still need to be tested locally. The QPU suggests; the MacBook judges. The local MCP-style server exposes narrow, auditable tools for Codex or other clients. It does not expose arbitrary shell access and it does not return secret values. ./scripts/run mcp server.sh Representative tool categories: bench run config bench get best runs objective score run optimizer build qubo optimizer propose classical candidates quantum credential status quantum list backends quantum submit micro frontier job quantum decode job candidates Regenerate the SVG figures: python3 paper/make figures.py Generated figures: paper/figures/throughput progression.svg paper/figures/qpu jump.svg paper/figures/quality boundary.svg paper/figures/prompt examples.svg - Model weights are not included. - IBM secrets are not included. config.json , .env , logs, SQLite WAL/SHM files, and local model files are ignored by Git.- Real IBM QPU use requires an explicit --allow-real-qpu flag. - Publish benchmark claims with command, output, quality gate, context length, page faults, swaps, and system state. This project was shaped by: - Dan Woods' Flash-MoE work on SSD-backed MoE inference - Andrej Karpathy's autoresearch loop - ByteShape and Potato OS Raspberry Pi Qwen3-30B-A3B demonstrations - IBM Quantum and Qiskit Runtime candidate sampling - Codex/GPT-5 as the research loop collaborator and experiment agent See CITATION.cff /Shack870/qwen-air-qpu-mcp-lab/blob/main/CITATION.cff .