{"slug": "tuning-cpu-only-qwen3-30b-inference-with-an-ibm-quantum-sampling-loop", "title": "Tuning CPU-only Qwen3-30B inference with an IBM Quantum sampling loop", "summary": "A developer achieved 14.03 generation tokens per second running the Qwen3-30B-A3B-Instruct Mixture-of-Experts LLM on a 2017 Intel MacBook Air with only 8GB RAM and no GPU, using an IBM Quantum sampling loop to optimize hyperparameters. The quantum-enhanced workflow, which uses the QPU only for candidate selection while the local machine runs inference and judges output quality, represents a roughly 155x improvement over the baseline of about 0.09 tokens per second. The project's benchmark harness, experiment logs, and paper draft are publicly available on GitHub and Hugging Face.", "body_md": "Quantum-enhanced autoresearch for high-performance, CPU-only Mixture-of-Experts LLM inference on legacy hardware.\n\nThis repository contains the benchmark harness, MCP-style tool boundary, experiment logs, paper draft, and IBM Quantum candidate-sampling workflow from the 2017 Intel MacBook Air Qwen3 MoE project.\n\nPublic pages:\n\n- GitHub: [https://github.com/Shack870/qwen-air-qpu-mcp-lab](https://github.com/Shack870/qwen-air-qpu-mcp-lab)\n- GitHub preprint release: [https://github.com/Shack870/qwen-air-qpu-mcp-lab/releases/tag/v0.1-preprint](https://github.com/Shack870/qwen-air-qpu-mcp-lab/releases/tag/v0.1-preprint)\n- Hugging Face collection: [https://huggingface.co/collections/Shack870/qwen-air-qpu-mcp-lab-6a174dd8d752afe40a429846](https://huggingface.co/collections/Shack870/qwen-air-qpu-mcp-lab-6a174dd8d752afe40a429846)\n- Hugging Face dataset artifacts: [https://huggingface.co/datasets/Shack870/qwen-air-qpu-mcp-lab](https://huggingface.co/datasets/Shack870/qwen-air-qpu-mcp-lab)\n- Hugging Face interactive dashboard Space: [https://huggingface.co/spaces/Shack870/qwen-air-qpu-dashboard](https://huggingface.co/spaces/Shack870/qwen-air-qpu-dashboard)\n\nThe short version:\n\n- 
Model: `Qwen3-30B-A3B-Instruct-2507-GGUF`, `Q3_K_S 2.66bpw`\n- Hardware: 2017 Intel MacBook Air, 8GB RAM, CPU-only\n- Context: 16,384 tokens\n- Starting point: about `0.09` generation tokens/sec\n- Classical systems optimization frontier: `6.49` generation tokens/sec\n- First IBM Quantum-informed breakthrough: `13.12` generation tokens/sec\n- Strict quality-gated record: `14.03` generation tokens/sec\n- Clean-room Codex-off check: `13.91` generation tokens/sec\n- Speed-only rejected lane: `16.53` generation tokens/sec, not claimed because output coherence failed\n\nThis is not a claim that an IBM QPU ran Qwen. It did not.\n\nThe core contribution is the synchronized loop:\n\n```\nHuman Experimenter sets the goal and constraints\n    -> Codex proposes, edits, runs, logs, and interprets experiments\n    -> the MacBook runs real llama.cpp inference and judges candidates\n    -> the local database scores the run frontier\n    -> compact candidate choices are compressed into QUBO form\n    -> IBM Quantum samples candidate bitstrings\n    -> Codex decodes those bitstrings into concrete llama.cpp configs\n    -> the MacBook tests them\n    -> the loop repeats\n```\n\nThe QPU improves the research loop's candidate selection. The MacBook remains the judge. The model remains local. 
The result is a small hybrid quantum optimization lab for routed MoE inference.\n\nSee the paper draft:\n\n- [Quantum-Enhanced Hyperparameter Tuning for High-Performance On-Device CPU-Only Inference of Mixture-of-Experts LLMs on Legacy Hardware](/Shack870/qwen-air-qpu-mcp-lab/blob/main/paper/quantum_enhanced_legacy_moe_inference.md)\n- [Generated preprint PDF](/Shack870/qwen-air-qpu-mcp-lab/blob/main/paper/quantum_enhanced_legacy_moe_inference.pdf)\n\nRepository layout:\n\n- `paper/` - paper draft, selected run snapshots, and generated SVG figures\n- `paper/data/qpu_lab_public.sqlite` - sanitized public SQLite benchmark and QPU job database\n- `paper/data/public_runs.csv` - sanitized public run log powering the Space dashboard\n- `qpu_mcp_lab/` - benchmark harness, objective scorer, optimizer, QUBO builder, IBM Quantum adapter, and MCP-style server\n- `huggingface/space/` - Gradio leaderboard and config explorer source\n- `scripts/` - experiment drivers and reproducibility scripts\n- `docs/REPRODUCIBILITY.md` - validation protocol\n- `docs/COMMUNITY_VALIDATION.md` - guide for outside benchmark reports\n- `docs/HUGGINGFACE_BLOG_DRAFT.md` - draft article for the Hugging Face Blog editor\n- `docs/PRESS_KIT.md` - concise public launch material\n- `docs/RESULTS.md` - result narrative and milestone summary\n- `SECURITY.md` - secret handling and QPU guardrails\n- `config.example.json` - local config template\n\nThis repo does not include model weights or a compiled `llama-cli`.\n\nYou need:\n\n- Python 3.11 or newer\n- a local `llama-cli` or compatible fork build\n- the ByteShape GGUF model file: `Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.66bpw.gguf`\n- optional IBM Quantum credentials for real QPU jobs\n\nReference local paths from the original lab:\n\n```\n~/src/ik_llama.cpp/build-air-iqk-lean/bin/llama-cli\n~/qwen-air-tests/models/byteshape-qwen3-30b-a3b-2507/Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.66bpw.gguf\n```\n\nClone the repository and set up a virtual environment:\n\n```\ngit clone https://github.com/Shack870/qwen-air-qpu-mcp-lab.git\ncd 
qwen-air-qpu-mcp-lab\n\npython3 -m venv .venv\n. .venv/bin/activate\npip install -r requirements.txt\n\ncp config.example.json config.json\n```\n\nEdit `config.json`:\n\n```\n{\n  \"llama_bin\": \"~/src/ik_llama.cpp/build-air-iqk-lean/bin/llama-cli\",\n  \"model_path\": \"~/qwen-air-tests/models/byteshape-qwen3-30b-a3b-2507/Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.66bpw.gguf\",\n  \"llama_repo\": \"~/src/ik_llama.cpp\",\n  \"safe_memory_gb\": 6.5,\n  \"default_backend\": \"local-simulator\",\n  \"allow_real_qpu_jobs_by_default\": false\n}\n```\n\nYou can also provide paths through environment variables:\n\n```\nexport QPU_MCP_LAB_LLAMA_BIN=\"$HOME/src/ik_llama.cpp/build-air-iqk-lean/bin/llama-cli\"\nexport QPU_MCP_LAB_MODEL_PATH=\"$HOME/qwen-air-tests/models/byteshape-qwen3-30b-a3b-2507/Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.66bpw.gguf\"\n```\n\nValidate the environment:\n\n```\n.venv/bin/python scripts/validate_environment.py\n```\n\nInitialize the database:\n\n```\n.venv/bin/python -m qpu_mcp_lab.cli init-db\n```\n\nRun the record-family config:\n\n```\n.venv/bin/python -m qpu_mcp_lab.cli run --config-json '{\n  \"label\": \"strict_record_reproduction\",\n  \"prompt\": \"<|im_start|>user\\nContinue this comma-separated list of Mars facts: red planet, thin atmosphere,<|im_end|>\\n<|im_start|>assistant\\n\",\n  \"ctx_size\": 16384,\n  \"batch_size\": 2456,\n  \"ubatch_size\": 144,\n  \"threads\": 4,\n  \"threads_batch\": 4,\n  \"cache_type_k\": \"q6_0\",\n  \"cache_type_v\": \"q6_0\",\n  \"flash_attn\": true,\n  \"smart_expert_reduction\": \"3,1\",\n  \"env_veclib_threads\": 1,\n  \"env_omp_wait_policy\": \"ACTIVE\",\n  \"env_omp_dynamic\": \"FALSE\",\n  \"env_ser_cheap_ranges\": \"24:30\",\n  \"env_ser_cheap_min\": 2,\n  \"env_ser_cheap_thresh\": 1.0,\n  \"n_predict\": 128,\n  \"temp\": 0.0,\n  \"ignore_eos\": true,\n  \"no_display_prompt\": true,\n  \"timeout_seconds\": 420\n}'\n```\n\nReference results from the original machine:\n\n- strict record: `14.03 
tok/s`\n- clean-room lane: `13.91 tok/s`\n- first QPU-informed jump: `13.12 tok/s`\n- classical frontier before QPU sampling: `6.49 tok/s`\n- original proof-of-life baseline: about `0.09 tok/s`\n\nExact repeats vary with thermals, page-cache state, context switches, and prompt shape. Report both throughput and output quality.\n\nA speed result is not a quality result unless the output remains coherent.\n\nThe strict gate used short factual/code prompts such as:\n\n- `What is the capital of Serbia?`\n- `What is the capital of Mars?`\n- `Write a compact Python function named is_prime that checks whether n is prime.`\n\nKnown pattern:\n\n- broad speed-only expert reductions can produce high tokens/sec and broken text\n- the accepted record lane is lower than the fastest raw lane because it preserves coherence\n\nDo not put IBM API keys in Git, `config.json`, `.env`, shell history, screenshots, paper drafts, logs, or chat messages.\n\nPreferred macOS setup:\n\n```\n./scripts/store_ibm_key.sh\n```\n\nThat script prompts for the key without echoing it and stores it in macOS Keychain under:\n\n- `ibm_quantum_api_key`\n- optional `ibm_quantum_instance_crn`\n\nThe harness reads credentials in this order:\n\n- `IBM_QUANTUM_API_KEY`, then Keychain service `ibm_quantum_api_key`\n- `IBM_QUANTUM_INSTANCE`, then Keychain service `ibm_quantum_instance_crn`\n\nTemporary environment-variable setup also works:\n\n```\nexport IBM_QUANTUM_API_KEY=\"paste-token-here\"\nexport IBM_QUANTUM_INSTANCE=\"optional-instance-or-crn\"\n```\n\nFor safety, Keychain storage is preferred.\n\nCheck credential status without printing secrets:\n\n```\n.venv/bin/python -m qpu_mcp_lab.cli quantum-credentials\n```\n\nList available IBM backends:\n\n```\n.venv/bin/python -m qpu_mcp_lab.cli quantum-backends\n```\n\nReal QPU submission is guarded. 
The harness defaults to dry-run or local simulation unless the command includes `--allow-real-qpu`.\n\nExample guarded workflow:\n\n```\n.venv/bin/python -m qpu_mcp_lab.cli build-qubo\n.venv/bin/python -m qpu_mcp_lab.cli sweep-qaoa-angles --limit 5\n.venv/bin/python -m qpu_mcp_lab.cli submit-micro-frontier \\\n  --backend ibm_fez \\\n  --shots 256 \\\n  --allow-real-qpu\n```\n\nAfter an IBM job completes:\n\n```\n.venv/bin/python -m qpu_mcp_lab.cli quantum-jobs --limit 5\n.venv/bin/python -m qpu_mcp_lab.cli job-result JOB_ID --refresh\n.venv/bin/python -m qpu_mcp_lab.cli decode-job-candidates JOB_ID --top-k 12\n```\n\nThe decoded candidates still need to be tested locally. The QPU suggests; the MacBook judges.\n\nThe local MCP-style server exposes narrow, auditable tools for Codex or other clients. It does not expose arbitrary shell access and it does not return secret values.\n\n```\n./scripts/run_mcp_server.sh\n```\n\nRepresentative tool categories:\n\n- `bench_run_config`\n- `bench_get_best_runs`\n- `objective_score_run`\n- `optimizer_build_qubo`\n- `optimizer_propose_classical_candidates`\n- `quantum_credential_status`\n- `quantum_list_backends`\n- `quantum_submit_micro_frontier_job`\n- `quantum_decode_job_candidates`\n\nRegenerate the SVG figures:\n\n```\npython3 paper/make_figures.py\n```\n\nGenerated figures:\n\n- `paper/figures/throughput_progression.svg`\n- `paper/figures/qpu_jump.svg`\n- `paper/figures/quality_boundary.svg`\n- `paper/figures/prompt_examples.svg`\n\n- Model weights are not included.\n- IBM secrets are not included.\n- `config.json`, `.env`, logs, SQLite WAL/SHM files, and local model files are ignored by Git.\n- Real IBM QPU use requires an explicit `--allow-real-qpu` flag.\n
- Publish benchmark claims with command, output, quality gate, context length, page faults, swaps, and system state.\n\nThis project was shaped by:\n\n- Dan Woods' Flash-MoE work on SSD-backed MoE inference\n- Andrej Karpathy's autoresearch loop\n- ByteShape and Potato OS Raspberry Pi Qwen3-30B-A3B demonstrations\n- IBM Quantum and Qiskit Runtime candidate sampling\n- Codex/GPT-5 as the research loop collaborator and experiment agent\n\nSee [CITATION.cff](/Shack870/qwen-air-qpu-mcp-lab/blob/main/CITATION.cff).", "url": "https://wpnews.pro/news/tuning-cpu-only-qwen3-30b-inference-with-an-ibm-quantum-sampling-loop", "canonical_source": "https://github.com/Shack870/qwen-air-qpu-mcp-lab", "published_at": "2026-05-30 01:55:25+00:00", "updated_at": "2026-05-30 02:15:40.174937+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "machine-learning", "ai-research", "ai-infrastructure"], "entities": ["IBM Quantum", "Qwen3-30B-A3B-Instruct-2507-GGUF", "GitHub", "Hugging Face", "Intel MacBook Air", "Shack870"], "alternates": {"html": "https://wpnews.pro/news/tuning-cpu-only-qwen3-30b-inference-with-an-ibm-quantum-sampling-loop", "markdown": "https://wpnews.pro/news/tuning-cpu-only-qwen3-30b-inference-with-an-ibm-quantum-sampling-loop.md", "text": "https://wpnews.pro/news/tuning-cpu-only-qwen3-30b-inference-with-an-ibm-quantum-sampling-loop.txt", "jsonld": "https://wpnews.pro/news/tuning-cpu-only-qwen3-30b-inference-with-an-ibm-quantum-sampling-loop.jsonld"}}