Quantum-enhanced autoresearch for high-performance, CPU-only Mixture-of-Experts LLM inference on legacy hardware.
This repository contains the benchmark harness, MCP-style tool boundary, experiment logs, paper draft, and IBM Quantum candidate-sampling workflow from the 2017 Intel MacBook Air Qwen3 MoE project.
Public pages:
- GitHub: https://github.com/Shack870/qwen-air-qpu-mcp-lab
- GitHub preprint release: https://github.com/Shack870/qwen-air-qpu-mcp-lab/releases/tag/v0.1-preprint
- Hugging Face collection: https://huggingface.co/collections/Shack870/qwen-air-qpu-mcp-lab-6a174dd8d752afe40a429846
- Hugging Face dataset artifacts: https://huggingface.co/datasets/Shack870/qwen-air-qpu-mcp-lab
- Hugging Face interactive dashboard Space: https://huggingface.co/spaces/Shack870/qwen-air-qpu-dashboard
The short version:
- Model: Qwen3-30B-A3B-Instruct-2507-GGUF, Q3_K_S 2.66bpw
- Hardware: 2017 Intel MacBook Air, 8GB RAM, CPU-only
- Context: 16,384 tokens
- Starting point: about 0.09 generation tokens/sec
- Classical systems optimization frontier: 6.49 generation tokens/sec
- First IBM Quantum-informed breakthrough: 13.12 generation tokens/sec
- Strict quality-gated record: 14.03 generation tokens/sec
- Clean-room Codex-off check: 13.91 generation tokens/sec
- Speed-only rejected lane: 16.53 generation tokens/sec, not claimed because output coherence failed
This is not a claim that an IBM QPU ran Qwen. It did not.
The core contribution is the synchronized loop:
Human Experimenter sets the goal and constraints
-> Codex proposes, edits, runs, logs, and interprets experiments
-> the MacBook runs real llama.cpp inference and judges candidates
-> the local database scores the run frontier
-> compact candidate choices are compressed into QUBO form
-> IBM Quantum samples candidate bitstrings
-> Codex decodes those bitstrings into concrete llama.cpp configs
-> the MacBook tests them
-> the loop repeats
The QPU improves the research loop's candidate selection. The MacBook remains the judge. The model remains local. The result is a small hybrid quantum optimization lab for routed MoE inference.
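To make the "compressed into QUBO form" step concrete, here is a minimal sketch. It is not the repository's QUBO builder. It assumes two hypothetical knobs with made-up scores, gives each option its own bit, penalizes picking zero or several options for one knob, and brute-forces every bitstring as a stand-in for QPU sampling.

```python
from itertools import product

# Hypothetical knobs and made-up scores (higher is better); not measured values.
KNOBS = {
    "ubatch_size": {128: 0.4, 144: 0.9},
    "cache_type_k": {"q8_0": 0.5, "q6_0": 0.7},
}
PENALTY = 4.0  # weight enforcing exactly one option per knob

# One bit per (knob, option) pair, in a fixed order.
BITS = [(knob, opt) for knob, opts in KNOBS.items() for opt in opts]


def build_qubo():
    """Return Q as {(i, j): weight} with i <= j, to be minimized over bits in {0, 1}."""
    q = {}
    for i, (knob_i, opt_i) in enumerate(BITS):
        # Reward term (-score) plus the linear part of PENALTY * (sum(x) - 1)^2.
        q[(i, i)] = -KNOBS[knob_i][opt_i] - PENALTY
        for j in range(i + 1, len(BITS)):
            if BITS[j][0] == knob_i:
                q[(i, j)] = 2.0 * PENALTY  # quadratic part of the same penalty
    return q


def energy(q, bits):
    """Evaluate the QUBO objective for one bit assignment."""
    return sum(w * bits[i] * bits[j] for (i, j), w in q.items())


def decode(bits):
    """Map a one-hot bit tuple back to a config dict, or None if it is invalid."""
    config = {}
    for bit, (knob, opt) in zip(bits, BITS):
        if bit:
            if knob in config:
                return None  # two options chosen for one knob
            config[knob] = opt
    return config if len(config) == len(KNOBS) else None


q = build_qubo()
# Exhaustive search over 2^4 bitstrings; a sampler would replace this step.
best = min(product((0, 1), repeat=len(BITS)), key=lambda b: energy(q, b))
print(best, decode(best))
```

In this toy setup the lowest-energy bitstring selects the highest-scoring option for each knob. In the loop described above, decoded candidates would still be benchmarked locally before being accepted.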
See the paper draft:
- Quantum-Enhanced Hyperparameter Tuning for High-Performance On-Device CPU-Only Inference of Mixture-of-Experts LLMs on Legacy Hardware
- Generated preprint PDF
- paper/: paper draft, selected run snapshots, and generated SVG figures
- paper/data/qpu_lab_public.sqlite: sanitized public SQLite benchmark and QPU job database
- paper/data/public_runs.csv: sanitized public run log powering the Space dashboard
- qpu_mcp_lab/: benchmark harness, objective scorer, optimizer, QUBO builder, IBM Quantum adapter, and MCP-style server
- huggingface/space/: Gradio leaderboard and config explorer source
- scripts/: experiment drivers and reproducibility scripts
- docs/REPRODUCIBILITY.md: validation protocol
- docs/COMMUNITY_VALIDATION.md: guide for outside benchmark reports
- docs/HUGGINGFACE_BLOG_DRAFT.md: draft article for the Hugging Face Blog editor
- docs/PRESS_KIT.md: concise public launch material
- docs/RESULTS.md: result narrative and milestone summary
- SECURITY.md: secret handling and QPU guardrails
- config.example.json: local config template
This repo does not include model weights or a compiled llama-cli.
You need:
- Python 3.11 or newer
- a local llama-cli or compatible fork build
- the ByteShape GGUF model file: Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.66bpw.gguf
- optional IBM Quantum credentials for real QPU jobs
Reference local paths from the original lab:
~/src/ik_llama.cpp/build-air-iqk-lean/bin/llama-cli
~/qwen-air-tests/models/byteshape-qwen3-30b-a3b-2507/Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.66bpw.gguf
git clone https://github.com/Shack870/qwen-air-qpu-mcp-lab.git
cd qwen-air-qpu-mcp-lab
python3 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
cp config.example.json config.json
Edit config.json:
{
  "llama_bin": "~/src/ik_llama.cpp/build-air-iqk-lean/bin/llama-cli",
  "model_path": "~/qwen-air-tests/models/byteshape-qwen3-30b-a3b-2507/Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.66bpw.gguf",
  "llama_repo": "~/src/ik_llama.cpp",
  "safe_memory_gb": 6.5,
  "default_backend": "local-simulator",
  "allow_real_qpu_jobs_by_default": false
}
You can also provide paths through environment variables:
export QPU_MCP_LAB_LLAMA_BIN="$HOME/src/ik_llama.cpp/build-air-iqk-lean/bin/llama-cli"
export QPU_MCP_LAB_MODEL_PATH="$HOME/qwen-air-tests/models/byteshape-qwen3-30b-a3b-2507/Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.66bpw.gguf"
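As a rough illustration of how a config file and these environment variables could be combined, here is a minimal sketch. The variable names come from this README, but the mapping to config keys, the "environment wins" precedence, and the `load_config` helper are assumptions, not the harness's confirmed behavior.

```python
import json
import os

# Assumed mapping from config keys to the environment variables named above.
ENV_OVERRIDES = {
    "llama_bin": "QPU_MCP_LAB_LLAMA_BIN",
    "model_path": "QPU_MCP_LAB_MODEL_PATH",
}
PATH_KEYS = ("llama_bin", "model_path", "llama_repo")


def load_config(text, environ=None):
    """Parse config JSON, apply env overrides, and expand "~" in path values."""
    environ = os.environ if environ is None else environ
    config = json.loads(text)
    for key, env_name in ENV_OVERRIDES.items():
        if environ.get(env_name):
            config[key] = environ[env_name]  # assumption: environment wins
    for key in PATH_KEYS:
        if key in config:
            config[key] = os.path.expanduser(config[key])
    return config


example = (
    '{"llama_bin": "~/src/ik_llama.cpp/build-air-iqk-lean/bin/llama-cli",'
    ' "safe_memory_gb": 6.5}'
)
# The override path below is a placeholder, not a real model location.
cfg = load_config(example, environ={"QPU_MCP_LAB_MODEL_PATH": "/tmp/model.gguf"})
print(cfg)
```

Passing `environ` explicitly keeps the function easy to test without touching the real shell environment.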
Validate the environment:
.venv/bin/python scripts/validate_environment.py
Initialize the database:
.venv/bin/python -m qpu_mcp_lab.cli init-db
Run the record-family config:
.venv/bin/python -m qpu_mcp_lab.cli run --config-json '{
  "label": "strict_record_reproduction",
  "prompt": "<|im_start|>user\nContinue this comma-separated list of Mars facts: red planet, thin atmosphere,<|im_end|>\n<|im_start|>assistant\n",
  "ctx_size": 16384,
  "batch_size": 2456,
  "ubatch_size": 144,
  "threads": 4,
  "threads_batch": 4,
  "cache_type_k": "q6_0",
  "cache_type_v": "q6_0",
  "flash_attn": true,
  "smart_expert_reduction": "3,1",
  "env_veclib_threads": 1,
  "env_omp_wait_policy": "ACTIVE",
  "env_omp_dynamic": "FALSE",
  "env_ser_cheap_ranges": "24:30",
  "env_ser_cheap_min": 2,
  "env_ser_cheap_thresh": 1.0,
  "n_predict": 128,
  "temp": 0.0,
  "ignore_eos": true,
  "no_display_prompt": true,
  "timeout_seconds": 420
}'
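For readers wondering how a config like this might turn into a llama-cli invocation, here is a hedged sketch. The key-to-flag mapping is an assumption about the harness. The flags shown are long options documented by upstream llama.cpp, and a fork such as ik_llama.cpp may differ, so check `llama-cli --help` on your build. Fork-specific keys (flash_attn, smart_expert_reduction, env_ser_*) are deliberately left out because their exact spelling is not stated here.

```python
# Assumed mapping from config keys to upstream llama.cpp long options.
VALUE_FLAGS = {
    "ctx_size": "--ctx-size",
    "batch_size": "--batch-size",
    "ubatch_size": "--ubatch-size",
    "threads": "--threads",
    "threads_batch": "--threads-batch",
    "cache_type_k": "--cache-type-k",
    "cache_type_v": "--cache-type-v",
    "n_predict": "--n-predict",
    "temp": "--temp",
}
BOOL_FLAGS = {
    "ignore_eos": "--ignore-eos",
    "no_display_prompt": "--no-display-prompt",
}
# Assumed mapping from env_* keys to standard OpenMP / Accelerate variables.
ENV_VARS = {
    "env_veclib_threads": "VECLIB_MAXIMUM_THREADS",
    "env_omp_wait_policy": "OMP_WAIT_POLICY",
    "env_omp_dynamic": "OMP_DYNAMIC",
}


def build_command(llama_bin, model_path, config):
    """Return (argv, extra_env) for one benchmark run; unknown keys are ignored."""
    argv = [llama_bin, "--model", model_path, "--prompt", config["prompt"]]
    for key, flag in VALUE_FLAGS.items():
        if key in config:
            argv += [flag, str(config[key])]
    for key, flag in BOOL_FLAGS.items():
        if config.get(key):
            argv.append(flag)
    env = {name: str(config[key]) for key, name in ENV_VARS.items() if key in config}
    return argv, env


# Paths and prompt below are placeholders for illustration.
argv, env = build_command(
    "llama-cli",
    "model.gguf",
    {"prompt": "hi", "ctx_size": 16384, "threads": 4, "ignore_eos": True,
     "env_omp_wait_policy": "ACTIVE"},
)
print(argv, env)
```

Building an argument list rather than a shell string avoids quoting problems with prompts that contain newlines or special tokens.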
Reference results from the original machine:
- strict record: 14.03 tok/s
- clean-room lane: 13.91 tok/s
- first QPU-informed jump: 13.12 tok/s
- classical frontier before QPU sampling: 6.49 tok/s
- original proof-of-life baseline: about 0.09 tok/s
Exact repeats vary with thermals, page-cache state, context switches, and prompt shape. Report both throughput and output quality.
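Because single runs vary, it can help to report a summary over several repeats. The helper below is a small sketch of one way to do that. The `summarize` function and the sample numbers are illustrative placeholders, not part of the harness and not measured results.

```python
import statistics


def summarize(tok_per_sec):
    """Return median, min, max, and relative spread for repeated runs."""
    median = statistics.median(tok_per_sec)
    low, high = min(tok_per_sec), max(tok_per_sec)
    return {
        "runs": len(tok_per_sec),
        "median": median,
        "min": low,
        "max": high,
        # Range as a percentage of the median, a crude stability indicator.
        "spread_pct": 100.0 * (high - low) / median,
    }


# Placeholder numbers for illustration only.
summary = summarize([10.0, 12.0, 11.0])
print(summary)
```

The median is less sensitive than the mean to one thermally throttled or swap-heavy run.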
A speed result is not a quality result unless the output remains coherent.
The strict gate used short factual/code prompts such as:
What is the capital of Serbia?
What is the capital of Mars?
Write a compact Python function named is_prime that checks whether n is prime.
Known pattern:
- broad speed-only expert reductions can produce high tokens/sec and broken text
- the accepted record lane is lower than the fastest raw lane because it preserves coherence
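The pattern above can be approximated with a simple automated check. This is a hypothetical gate, not the project's actual scorer: it assumes a run passes when the expected answer appears in the output and the text is not dominated by repeated word trigrams. The 0.3 threshold is an arbitrary illustration.

```python
def repeated_ngram_ratio(text, n=3):
    """Fraction of word n-grams that duplicate another n-gram (0.0 = no repeats)."""
    words = text.lower().split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


def passes_gate(output, must_contain, max_repeat=0.3):
    """Hypothetical gate: expected substring present and little looping."""
    has_answer = must_contain.lower() in output.lower()
    return has_answer and repeated_ngram_ratio(output) <= max_repeat


print(passes_gate("The capital of Serbia is Belgrade.", "Belgrade"))
print(passes_gate("Belgrade is is is is is is is is", "Belgrade"))
```

A check like this only catches obvious degeneration such as loops. It does not replace reading the output.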
Do not put IBM API keys in Git, config.json, .env, shell history, screenshots, paper drafts, logs, or chat messages.
Preferred macOS setup:
./scripts/store_ibm_key.sh
That script prompts for the key without echoing it and stores it in macOS Keychain under:
- ibm_quantum_api_key
- optional ibm_quantum_instance_crn
The harness reads credentials in this order:
- IBM_QUANTUM_API_KEY, then Keychain service ibm_quantum_api_key
- IBM_QUANTUM_INSTANCE, then Keychain service ibm_quantum_instance_crn
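The lookup order above could be sketched as follows. This is an illustration, not the harness's code. It assumes the Keychain entries are generic passwords readable with the macOS `security find-generic-password -s <service> -w` command, and the helper names are hypothetical.

```python
import os
import subprocess


def keychain_lookup(service):
    """Read a generic password from macOS Keychain, or None if unavailable."""
    try:
        result = subprocess.run(
            ["security", "find-generic-password", "-s", service, "-w"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:  # `security` is missing, e.g. not on macOS
        return None
    value = result.stdout.strip()
    return value if result.returncode == 0 and value else None


def resolve_credential(env_name, service, environ=None, lookup=keychain_lookup):
    """Environment variable first, then the Keychain service."""
    environ = os.environ if environ is None else environ
    return environ.get(env_name) or lookup(service)


def credential_status(environ=None, lookup=keychain_lookup):
    """Report presence only, so the secret itself is never printed."""
    key = resolve_credential(
        "IBM_QUANTUM_API_KEY", "ibm_quantum_api_key", environ, lookup)
    instance = resolve_credential(
        "IBM_QUANTUM_INSTANCE", "ibm_quantum_instance_crn", environ, lookup)
    return {"api_key_present": key is not None,
            "instance_present": instance is not None}


# Dummy value and a stub lookup so the example runs anywhere.
status = credential_status(
    environ={"IBM_QUANTUM_API_KEY": "dummy"}, lookup=lambda service: None)
print(status)
```

Returning booleans instead of the values mirrors the goal of checking credentials without printing secrets.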
Temporary environment-variable setup also works:
export IBM_QUANTUM_API_KEY="paste-token-here"
export IBM_QUANTUM_INSTANCE="optional-instance-or-crn"
For safety, Keychain storage is preferred.
Check credential status without printing secrets:
.venv/bin/python -m qpu_mcp_lab.cli quantum-credentials
List available IBM backends:
.venv/bin/python -m qpu_mcp_lab.cli quantum-backends
Real QPU submission is guarded. The harness defaults to dry-run or local simulation unless the command includes --allow-real-qpu.
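One way to implement a guard like this is an explicit opt-in flag that must accompany any hardware backend. The sketch below assumes a hypothetical `resolve_backend` helper and reuses the `local-simulator` and `ibm_fez` names from this README. The real CLI may be structured differently.

```python
import argparse


def resolve_backend(argv, default_backend="local-simulator"):
    """Return (backend, is_real_qpu); hardware needs the explicit opt-in flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--backend", default=default_backend)
    parser.add_argument("--allow-real-qpu", action="store_true")
    args = parser.parse_args(argv)
    wants_hardware = args.backend != default_backend
    if wants_hardware and args.allow_real_qpu:
        return args.backend, True
    # Missing opt-in (or no hardware requested): stay on the simulator.
    return default_backend, False


print(resolve_backend(["--backend", "ibm_fez"]))
print(resolve_backend(["--backend", "ibm_fez", "--allow-real-qpu"]))
```

Falling back to simulation rather than raising an error is a design choice in this sketch. A stricter variant could refuse to run at all when the flag is missing.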
Example guarded workflow:
.venv/bin/python -m qpu_mcp_lab.cli build-qubo
.venv/bin/python -m qpu_mcp_lab.cli sweep-qaoa-angles --limit 5
.venv/bin/python -m qpu_mcp_lab.cli submit-micro-frontier \
--backend ibm_fez \
--shots 256 \
--allow-real-qpu
After an IBM job completes:
.venv/bin/python -m qpu_mcp_lab.cli quantum-jobs --limit 5
.venv/bin/python -m qpu_mcp_lab.cli job-result JOB_ID --refresh
.venv/bin/python -m qpu_mcp_lab.cli decode-job-candidates JOB_ID --top-k 12
The decoded candidates still need to be tested locally. The QPU suggests; the MacBook judges.
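To illustrate what decoding might involve, the sketch below ranks sampled bitstrings by shot count and maps each to a config. The bit layout, option lists, and sample counts are all hypothetical, and real job results may use a different bit order, so the actual decoder in this repository should be treated as the reference.

```python
from collections import Counter

# Hypothetical layout: (config key, number of bits, options indexed by those bits).
FIELDS = [
    ("ubatch_size", 2, [96, 128, 144, 192]),
    ("threads", 1, [2, 4]),
]


def decode_bitstring(bitstring):
    """Slice the bitstring left to right and look up each field's option."""
    config, pos = {}, 0
    for name, width, options in FIELDS:
        index = int(bitstring[pos:pos + width], 2)
        config[name] = options[index]
        pos += width
    return config


def top_candidates(counts, top_k):
    """Return the top_k most frequently sampled bitstrings as configs."""
    return [
        {"bitstring": bits, "shots": shots, "config": decode_bitstring(bits)}
        for bits, shots in Counter(counts).most_common(top_k)
    ]


# Made-up counts standing in for a completed job's measurement results.
counts = {"101": 120, "011": 90, "000": 46}
candidates = top_candidates(counts, top_k=2)
for candidate in candidates:
    print(candidate)
```

Each decoded config would then go through a local benchmark run and the quality gate before it counts as a result.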
The local MCP-style server exposes narrow, auditable tools for Codex or other clients. It does not expose arbitrary shell access and it does not return secret values.
./scripts/run_mcp_server.sh
Representative tool categories:
bench_run_config
bench_get_best_runs
objective_score_run
optimizer_build_qubo
optimizer_propose_classical_candidates
quantum_credential_status
quantum_list_backends
quantum_submit_micro_frontier_job
quantum_decode_job_candidates
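A narrow tool boundary of this kind can be sketched as an allowlist dispatcher. The example below is a simplified illustration with placeholder handlers and a hypothetical `call_tool` function. It does not implement the MCP protocol or reflect the server's real internals.

```python
def quantum_credential_status():
    # Placeholder handler: reports presence only, never the secret value.
    return {"api_key_present": False}


def bench_get_best_runs(limit=5):
    # Placeholder handler: a real one would query the local SQLite database.
    return {"runs": [], "limit": limit}


# Only names registered here can be called; there is no shell passthrough.
TOOLS = {
    "quantum_credential_status": quantum_credential_status,
    "bench_get_best_runs": bench_get_best_runs,
}


def call_tool(name, arguments=None):
    """Dispatch a tool call by name, rejecting anything not on the allowlist."""
    handler = TOOLS.get(name)
    if handler is None:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        return {"ok": True, "result": handler(**(arguments or {}))}
    except TypeError as exc:  # unexpected or missing arguments
        return {"ok": False, "error": str(exc)}


print(call_tool("bench_get_best_runs", {"limit": 3}))
print(call_tool("run_shell", {"cmd": "ls"}))
```

Because every callable is listed in one table, the set of actions a client can trigger is easy to audit.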
Regenerate the SVG figures:
python3 paper/make_figures.py
Generated figures:
paper/figures/throughput_progression.svg
paper/figures/qpu_jump.svg
paper/figures/quality_boundary.svg
paper/figures/prompt_examples.svg
- Model weights are not included.
- IBM secrets are not included.
- config.json, .env, logs, SQLite WAL/SHM files, and local model files are ignored by Git.
- Real IBM QPU use requires an explicit --allow-real-qpu flag.
- Publish benchmark claims with command, output, quality gate, context length, page faults, swaps, and system state.
This project was shaped by:
- Dan Woods' Flash-MoE work on SSD-backed MoE inference
- Andrej Karpathy's autoresearch loop
- ByteShape and Potato OS Raspberry Pi Qwen3-30B-A3B demonstrations
- IBM Quantum and Qiskit Runtime candidate sampling
- Codex/GPT-5 as the research loop collaborator and experiment agent
See CITATION.cff.