
Tuning CPU-only Qwen3-30B inference with an IBM Quantum sampling loop

A developer reports 14.03 generation tokens per second running the Qwen3-30B-A3B-Instruct Mixture-of-Experts LLM on a 2017 Intel MacBook Air with 8GB of RAM and no GPU, using an IBM Quantum sampling loop to tune inference hyperparameters. The workflow uses the QPU only for candidate selection, while the local machine runs inference and judges output quality. The reported figure is roughly a 155x improvement over the baseline of about 0.09 tokens per second. The project's benchmark harness, experiment logs, and paper draft are publicly available on GitHub and Hugging Face.

5 min read · published May 30, 2026

Quantum-enhanced autoresearch for high-performance, CPU-only Mixture-of-Experts LLM inference on legacy hardware.

This repository contains the benchmark harness, MCP-style tool boundary, experiment logs, paper draft, and IBM Quantum candidate-sampling workflow from the 2017 Intel MacBook Air Qwen3 MoE project.

Public pages:

The short version:

  • Model: Qwen3-30B-A3B-Instruct-2507-GGUF, Q3_K_S 2.66bpw
  • Hardware: 2017 Intel MacBook Air, 8GB RAM, CPU-only
  • Context: 16,384 tokens
  • Starting point: about 0.09 generation tokens/sec
  • Classical systems optimization frontier: 6.49 generation tokens/sec
  • First IBM Quantum-informed breakthrough: 13.12 generation tokens/sec
  • Strict quality-gated record: 14.03 generation tokens/sec
  • Clean-room Codex-off check: 13.91 generation tokens/sec
  • Speed-only rejected lane: 16.53 generation tokens/sec, not claimed because output coherence failed
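The ratios implied by these figures can be checked with a few lines of Python. The numbers are the ones listed above; the starting point is stated only as "about 0.09", so any ratio against it is approximate.

```python
# Throughput milestones (generation tokens/sec) as listed above.
milestones = {
    "starting point": 0.09,  # stated as "about 0.09"
    "classical frontier": 6.49,
    "first QPU-informed": 13.12,
    "clean-room check": 13.91,
    "strict record": 14.03,
}

def speedup(new: float, old: float) -> float:
    """Return how many times faster `new` is than `old`."""
    return new / old

record = milestones["strict record"]
for name, rate in milestones.items():
    print(f"{name:>20}: {rate:6.2f} tok/s, record is {speedup(record, rate):6.1f}x this")
```

Against the 0.09 baseline the strict record works out to a little under 156x, and to about 2.2x over the classical frontier.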

This is not a claim that an IBM QPU ran Qwen. It did not.

The core contribution is the synchronized loop:

Human Experimenter sets the goal and constraints
    -> Codex proposes, edits, runs, logs, and interprets experiments
    -> the MacBook runs real llama.cpp inference and judges candidates
    -> the local database scores the run frontier
    -> compact candidate choices are compressed into QUBO form
    -> IBM Quantum samples candidate bitstrings
    -> Codex decodes those bitstrings into concrete llama.cpp configs
    -> the MacBook tests them
    -> the loop repeats

The QPU improves the research loop's candidate selection. The MacBook remains the judge. The model remains local. The result is a small hybrid quantum optimization lab for routed MoE inference.
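To make the "candidate choices compressed into QUBO form" step concrete, here is a minimal sketch of one common encoding: one binary variable per option, with a penalty that pushes each parameter toward exactly one selected option. This is an illustration, not the project's QUBO builder. The option lists, the scores, and the names `CHOICES`, `SCORES`, `build_qubo`, `energy`, and `decode` are all hypothetical, and brute-force enumeration stands in for QPU sampling.

```python
from itertools import combinations, product

# Hypothetical search space: option lists are illustrative, not the project's.
CHOICES = {
    "ubatch_size": [128, 144],
    "cache_type_k": ["q8_0", "q6_0"],
}
# Hypothetical per-option scores (higher is better), e.g. derived from past local runs.
SCORES = {
    ("ubatch_size", 128): 0.2, ("ubatch_size", 144): 0.9,
    ("cache_type_k", "q8_0"): 0.4, ("cache_type_k", "q6_0"): 0.7,
}

def build_qubo(choices, scores, penalty=2.0):
    """Return (variables, Q); Q maps index pairs (i, j) with i <= j to weights."""
    variables = [(p, o) for p, opts in choices.items() for o in opts]
    index = {v: i for i, v in enumerate(variables)}
    Q = {}
    for v, i in index.items():
        # Reward good options. The -penalty term comes from expanding
        # penalty * (sum(x) - 1)^2 with x^2 == x; the constant is dropped.
        Q[(i, i)] = -scores[v] - penalty
    for p, opts in choices.items():
        for a, b in combinations(opts, 2):
            i, j = sorted((index[(p, a)], index[(p, b)]))
            Q[(i, j)] = 2.0 * penalty  # discourage two options for one parameter
    return variables, Q

def energy(bits, Q):
    """QUBO energy of a bit tuple; lower is better."""
    return sum(w * bits[i] * bits[j] for (i, j), w in Q.items())

def decode(bits, variables):
    """Turn a bit tuple into a config, or None if it is not one-hot per parameter."""
    config = {}
    for bit, (p, o) in zip(bits, variables):
        if bit:
            if p in config:
                return None
            config[p] = o
    return config if len(config) == len({p for p, _ in variables}) else None

variables, Q = build_qubo(CHOICES, SCORES)
# Brute force stands in for the sampler; a QPU would return sampled bitstrings instead.
best = min(product((0, 1), repeat=len(variables)), key=lambda b: energy(b, Q))
print(decode(best, variables))
```

Whatever the sampler returns, a decoded config is only a proposal: it still has to be run and judged on the local machine.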

See the paper draft:

  • Quantum-Enhanced Hyperparameter Tuning for High-Performance On-Device CPU-Only Inference of Mixture-of-Experts LLMs on Legacy Hardware
  • Generated preprint PDF

  • paper/: paper draft, selected run snapshots, and generated SVG figures
  • paper/data/qpu_lab_public.sqlite: sanitized public SQLite benchmark and QPU job database
  • paper/data/public_runs.csv: sanitized public run log powering the Space dashboard
  • qpu_mcp_lab/: benchmark harness, objective scorer, optimizer, QUBO builder, IBM Quantum adapter, and MCP-style server
  • huggingface/space/: Gradio leaderboard and config explorer source
  • scripts/: experiment drivers and reproducibility scripts
  • docs/REPRODUCIBILITY.md: validation protocol
  • docs/COMMUNITY_VALIDATION.md: guide for outside benchmark reports
  • docs/HUGGINGFACE_BLOG_DRAFT.md: draft article for the Hugging Face Blog editor
  • docs/PRESS_KIT.md: concise public launch material
  • docs/RESULTS.md: result narrative and milestone summary
  • SECURITY.md: secret handling and QPU guardrails
  • config.example.json: local config template

This repo does not include model weights or a compiled llama-cli.

You need:

  • Python 3.11 or newer
  • a local llama-cli or compatible fork build
  • the ByteShape GGUF model file: Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.66bpw.gguf
  • optional IBM Quantum credentials for real QPU jobs

Reference local paths from the original lab:

~/src/ik_llama.cpp/build-air-iqk-lean/bin/llama-cli
~/qwen-air-tests/models/byteshape-qwen3-30b-a3b-2507/Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.66bpw.gguf

Clone the repository and set up a virtual environment:

git clone https://github.com/Shack870/qwen-air-qpu-mcp-lab.git
cd qwen-air-qpu-mcp-lab

python3 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt

cp config.example.json config.json

Edit config.json:

{
  "llama_bin": "~/src/ik_llama.cpp/build-air-iqk-lean/bin/llama-cli",
  "model_path": "~/qwen-air-tests/models/byteshape-qwen3-30b-a3b-2507/Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.66bpw.gguf",
  "llama_repo": "~/src/ik_llama.cpp",
  "safe_memory_gb": 6.5,
  "default_backend": "local-simulator",
  "allow_real_qpu_jobs_by_default": false
}

You can also provide paths through environment variables:

export QPU_MCP_LAB_LLAMA_BIN="$HOME/src/ik_llama.cpp/build-air-iqk-lean/bin/llama-cli"
export QPU_MCP_LAB_MODEL_PATH="$HOME/qwen-air-tests/models/byteshape-qwen3-30b-a3b-2507/Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.66bpw.gguf"
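One detail worth keeping in mind: a "~" inside a JSON string is not expanded by the shell, so whatever reads config.json has to expand it. The sketch below shows one way a loader could combine the file with the two environment variables above. It assumes environment variables take precedence over config.json, which this document does not state, and `load_config`, `ENV_OVERRIDES`, and `PATH_KEYS` are hypothetical names rather than the harness's API.

```python
import json
import os
from pathlib import Path

# Config keys mapped to the environment variables named above.
ENV_OVERRIDES = {
    "llama_bin": "QPU_MCP_LAB_LLAMA_BIN",
    "model_path": "QPU_MCP_LAB_MODEL_PATH",
}
PATH_KEYS = ("llama_bin", "model_path", "llama_repo")

def load_config(text: str, env=None) -> dict:
    """Parse config JSON, apply env overrides, and expand "~" in path values.

    Assumption: environment variables win over config.json. The real harness
    may use a different precedence.
    """
    env = os.environ if env is None else env
    config = json.loads(text)
    for key, var in ENV_OVERRIDES.items():
        if env.get(var):
            config[key] = env[var]
    for key in PATH_KEYS:
        if key in config:
            config[key] = str(Path(config[key]).expanduser())
    return config

example = '{"llama_bin": "~/src/ik_llama.cpp/build-air-iqk-lean/bin/llama-cli", "safe_memory_gb": 6.5}'
cfg = load_config(example, env={"QPU_MCP_LAB_MODEL_PATH": "/tmp/model.gguf"})
print(cfg["llama_bin"])
print(cfg["model_path"])
```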

Validate the environment:

.venv/bin/python scripts/validate_environment.py

Initialize the database:

.venv/bin/python -m qpu_mcp_lab.cli init-db

Run the record-family config:

.venv/bin/python -m qpu_mcp_lab.cli run --config-json '{
  "label": "strict_record_reproduction",
  "prompt": "<|im_start|>user\nContinue this comma-separated list of Mars facts: red planet, thin atmosphere,<|im_end|>\n<|im_start|>assistant\n",
  "ctx_size": 16384,
  "batch_size": 2456,
  "ubatch_size": 144,
  "threads": 4,
  "threads_batch": 4,
  "cache_type_k": "q6_0",
  "cache_type_v": "q6_0",
  "flash_attn": true,
  "smart_expert_reduction": "3,1",
  "env_veclib_threads": 1,
  "env_omp_wait_policy": "ACTIVE",
  "env_omp_dynamic": "FALSE",
  "env_ser_cheap_ranges": "24:30",
  "env_ser_cheap_min": 2,
  "env_ser_cheap_thresh": 1.0,
  "n_predict": 128,
  "temp": 0.0,
  "ignore_eos": true,
  "no_display_prompt": true,
  "timeout_seconds": 420
}'

Reference results from the original machine:

  • strict record: 14.03 tok/s

  • clean-room lane: 13.91 tok/s

  • first QPU-informed jump: 13.12 tok/s

  • classical frontier before QPU sampling: 6.49 tok/s

  • original proof-of-life baseline: about 0.09 tok/s

Exact repeats vary with thermals, page-cache state, context switches, and prompt shape. Report both throughput and output quality.
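Because single runs vary, it can help to report a median and a spread over several repeats rather than one number. The helper below is a minimal sketch; `summarize` is a hypothetical name, and the sample values are placeholders, not measurements from the project.

```python
import statistics

def summarize(samples):
    """Return the median and relative spread of repeated tokens/sec measurements."""
    if len(samples) < 3:
        raise ValueError("collect at least three repeats before summarizing")
    median = statistics.median(samples)
    return {
        "n": len(samples),
        "median": median,
        "min": min(samples),
        "max": max(samples),
        # (max - min) / median: a quick, unitless indicator of run-to-run noise.
        "relative_spread": (max(samples) - min(samples)) / median,
    }

# Placeholder values for illustration only; these are not project measurements.
repeats = [10.0, 11.0, 12.0]
print(summarize(repeats))
```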

A speed result is not a quality result unless the output remains coherent.

The strict gate used short factual/code prompts such as:

What is the capital of Serbia?

What is the capital of Mars?

Write a compact Python function named is_prime that checks whether n is prime.

Known pattern:

  • broad speed-only expert reductions can produce high tokens/sec and broken text
  • the accepted record lane is lower than the fastest raw lane because it preserves coherence
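The selection rule this implies is "fastest run among those that passed the gate", not "fastest run". A minimal sketch, using the two throughput figures reported in this document; the `Run` type and `best_accepted` function are hypothetical names, and the coherence flag is assumed to come from the local quality check.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Run:
    label: str
    tokens_per_sec: float
    coherent: bool  # outcome of the quality gate, judged on the local machine

def best_accepted(runs):
    """Return the fastest run that passed the quality gate, or None."""
    accepted = [r for r in runs if r.coherent]
    return max(accepted, key=lambda r: r.tokens_per_sec, default=None)

# Throughput figures are the ones reported in this document.
runs = [
    Run("speed_only_lane", 16.53, coherent=False),
    Run("strict_record", 14.03, coherent=True),
]
print(best_accepted(runs).label)  # strict_record
```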

Do not put IBM API keys in Git, config.json, .env, shell history, screenshots, paper drafts, logs, or chat messages.

Preferred macOS setup:

./scripts/store_ibm_key.sh

That script prompts for the key without echoing it and stores it in macOS Keychain under:

  • ibm_quantum_api_key
  • optional ibm_quantum_instance_crn

The harness reads credentials in this order:

  • IBM_QUANTUM_API_KEY, then Keychain service ibm_quantum_api_key
  • IBM_QUANTUM_INSTANCE, then Keychain service ibm_quantum_instance_crn
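The lookup order above can be sketched as follows. This is an assumption about how such a lookup might be written, not the harness's code: it shells out to the macOS `security find-generic-password` command, and `read_keychain` and `resolve_secret` are hypothetical names.

```python
import os
import subprocess

def read_keychain(service: str):
    """Read a generic password from the macOS Keychain; return None if unavailable."""
    try:
        result = subprocess.run(
            ["security", "find-generic-password", "-s", service, "-w"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None  # not macOS, or no such Keychain item
    return result.stdout.strip() or None

def resolve_secret(env_var: str, service: str, env=None):
    """Environment variable first, then the Keychain service."""
    env = os.environ if env is None else env
    return env.get(env_var) or read_keychain(service)

api_key = resolve_secret("IBM_QUANTUM_API_KEY", "ibm_quantum_api_key")
# Report presence only; never print the secret itself.
print("api key configured:", api_key is not None)
```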

Temporary environment-variable setup also works:

export IBM_QUANTUM_API_KEY="paste-token-here"
export IBM_QUANTUM_INSTANCE="optional-instance-or-crn"

For safety, Keychain storage is preferred.

Check credential status without printing secrets:

.venv/bin/python -m qpu_mcp_lab.cli quantum-credentials

List available IBM backends:

.venv/bin/python -m qpu_mcp_lab.cli quantum-backends

Real QPU submission is guarded. The harness defaults to dry-run or local simulation unless the command includes --allow-real-qpu.
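A guard of this kind is simple to express with argparse. The sketch below is illustrative only: `choose_backend` is a hypothetical function, the flag names are the ones shown in this document, the "local-simulator" name comes from the config template, and `allow_by_default` mirrors the `allow_real_qpu_jobs_by_default` setting.

```python
import argparse

def choose_backend(argv, allow_by_default=False):
    """Return the backend to use; a real QPU only with explicit permission."""
    parser = argparse.ArgumentParser(prog="submit-micro-frontier")
    parser.add_argument("--backend", default="local-simulator")
    parser.add_argument("--shots", type=int, default=256)
    parser.add_argument("--allow-real-qpu", action="store_true")
    args = parser.parse_args(argv)
    real_allowed = args.allow_real_qpu or allow_by_default
    if args.backend != "local-simulator" and not real_allowed:
        # Fall back instead of submitting: no flag, no QPU job.
        return "local-simulator"
    return args.backend

print(choose_backend(["--backend", "ibm_fez"]))                      # local-simulator
print(choose_backend(["--backend", "ibm_fez", "--allow-real-qpu"]))  # ibm_fez
```

Making the safe path the default means a mistyped or copied command cannot spend QPU time by accident.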

Example guarded workflow:

.venv/bin/python -m qpu_mcp_lab.cli build-qubo
.venv/bin/python -m qpu_mcp_lab.cli sweep-qaoa-angles --limit 5
.venv/bin/python -m qpu_mcp_lab.cli submit-micro-frontier \
  --backend ibm_fez \
  --shots 256 \
  --allow-real-qpu

After an IBM job completes:

.venv/bin/python -m qpu_mcp_lab.cli quantum-jobs --limit 5
.venv/bin/python -m qpu_mcp_lab.cli job-result JOB_ID --refresh
.venv/bin/python -m qpu_mcp_lab.cli decode-job-candidates JOB_ID --top-k 12

The decoded candidates still need to be tested locally. The QPU suggests; the MacBook judges.
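As a rough picture of what decoding involves, the sketch below ranks sampled bitstrings by frequency and maps each one to a config. The bit layout, the option values, and the names `top_k_bitstrings`, `LAYOUT`, and `decode` are hypothetical; the counts are placeholders, not real job output, and the project's actual encoding may differ.

```python
def top_k_bitstrings(counts: dict, k: int):
    """Return the k most frequently sampled bitstrings, most frequent first."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [bits for bits, _ in ranked[:k]]

# Hypothetical bit layout: each bit picks between two illustrative options.
LAYOUT = [
    ("ubatch_size", (128, 144)),
    ("flash_attn", (False, True)),
]

def decode(bits: str) -> dict:
    """Map one bitstring to a config dict using LAYOUT (leftmost bit first)."""
    if len(bits) != len(LAYOUT):
        raise ValueError(f"expected {len(LAYOUT)} bits, got {len(bits)}")
    return {name: options[int(b)] for b, (name, options) in zip(bits, LAYOUT)}

# Placeholder counts, not real job output.
counts = {"11": 120, "01": 90, "10": 40, "00": 6}
for bits in top_k_bitstrings(counts, 2):
    print(bits, decode(bits))
```

Qiskit reports bitstrings with qubit 0 as the rightmost character, so a real decoder has to agree with the circuit on bit order.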

The local MCP-style server exposes narrow, auditable tools for Codex or other clients. It does not expose arbitrary shell access and it does not return secret values.

./scripts/run_mcp_server.sh

Representative tool categories:

  • bench_run_config
  • bench_get_best_runs
  • objective_score_run
  • optimizer_build_qubo
  • optimizer_propose_classical_candidates
  • quantum_credential_status
  • quantum_list_backends
  • quantum_submit_micro_frontier_job
  • quantum_decode_job_candidates
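The "narrow, auditable tools" idea amounts to an allowlist: a client can call only registered functions, and responses carry status rather than secrets. A minimal sketch of that pattern follows. It is not the project's server; `TOOLS`, `tool`, and `dispatch` are hypothetical names, and only one tool name from the list above is used for illustration.

```python
from typing import Callable

TOOLS: dict[str, Callable[..., dict]] = {}

def tool(name: str):
    """Register a function under an allowlisted tool name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("quantum_credential_status")
def credential_status(api_key=None) -> dict:
    # Report presence only; the key itself is never placed in the response.
    return {"api_key_present": bool(api_key)}

def dispatch(name: str, **kwargs) -> dict:
    """Run an allowlisted tool; anything else is rejected, never executed."""
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    return TOOLS[name](**kwargs)

print(dispatch("quantum_credential_status", api_key="example-not-a-real-key"))
print(dispatch("run_shell", command="ls"))
```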

Regenerate the SVG figures:

python3 paper/make_figures.py

Generated figures:

  • paper/figures/throughput_progression.svg
  • paper/figures/qpu_jump.svg
  • paper/figures/quality_boundary.svg
  • paper/figures/prompt_examples.svg

  • Model weights are not included.
  • IBM secrets are not included.
  • config.json, .env, logs, SQLite WAL/SHM files, and local model files are ignored by Git.
  • Real IBM QPU use requires an explicit --allow-real-qpu flag.
  • Publish benchmark claims with command, output, quality gate, context length, page faults, swaps, and system state.

This project was shaped by:

  • Dan Woods' Flash-MoE work on SSD-backed MoE inference
  • Andrej Karpathy's autoresearch loop
  • ByteShape and Potato OS Raspberry Pi Qwen3-30B-A3B demonstrations
  • IBM Quantum and Qiskit Runtime candidate sampling
  • Codex/GPT-5 as the research loop collaborator and experiment agent

See CITATION.cff.
