# Qiskit QuantumKatas Adapts Microsoft Exercises for LLM Evaluation

> Source: <https://letsdatascience.com/news/qiskit-quantumkatas-adapts-microsoft-exercises-for-llm-evalu-fd02b867>
> Published: 2026-05-27 05:30:42.529543+00:00

Per the arXiv preprint by Juan Cruz-Benito and Ismael Faro (arXiv:2605.27210), the authors adapt Microsoft's QuantumKatas from Q# to **Qiskit** and provide a framework for evaluating large language models (LLMs). The reported benchmark contains **350 tasks** across **26 categories**, each with a natural-language prompt, a canonical solution, and deterministic test verification via classical circuit simulation. The paper evaluates **16 LLMs** under **7 prompting configurations**, for a total of **39,200 model runs**; reported best-configuration pass rates range from **32.3% to 83.1%**, with a **26.1 percentage-point** average gap between frontier and open-source models. The authors report that models excel at implementing known algorithms (e.g., SimonsAlgorithm **82.1%**, BasicGates **81.6%**) but struggle on problem-encoding tasks (SolveSATWithGrover **34.4%**, DistinguishUnitaries **40.0%**). Per the paper, chain-of-thought prompting has a mixed effect: it helps three models but degrades others, yielding a **56.3%** mean pass rate versus **57.8%** for few-shot-5. The benchmark and evaluation artifacts are released alongside the paper.

### What happened

Per the arXiv paper by Juan Cruz-Benito and Ismael Faro (arXiv:2605.27210), the authors port Microsoft's **QuantumKatas** from **Q#** to **Qiskit** and package the exercises with an automated evaluation framework for LLM-generated quantum code. The published benchmark comprises **350 tasks** across **26 categories**, covering fundamental gates, the Grover, Simon, and Deutsch-Jozsa algorithms, error correction, quantum key distribution, and quantum games. Each task includes a natural-language prompt, a canonical solution, and deterministic test verification implemented via classical circuit simulation. The evaluation in the paper covers **16 LLMs** across **7 prompting configurations**, totaling **39,200 model runs**, with reported best-configuration pass rates between **32.3% and 83.1%** and a **26.1 percentage-point** average gap between frontier and open-source models.
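The reported run total is consistent with the benchmark dimensions if every model is run once on every task under every prompting configuration. That one-run-per-combination reading is an assumption here, since the article gives only the totals. A minimal sketch of the arithmetic follows, together with a hypothetical `pass_rate` helper (not part of the paper's framework) showing how a pass rate falls out of deterministic pass/fail results:

```python
# Benchmark dimensions as reported in the article.
NUM_TASKS = 350
NUM_MODELS = 16
NUM_PROMPT_CONFIGS = 7

# Assumption: exactly one run per (task, model, configuration) combination.
total_runs = NUM_TASKS * NUM_MODELS * NUM_PROMPT_CONFIGS


def pass_rate(results):
    """Return the percentage of passing runs in a list of booleans.

    Hypothetical helper for illustration; the paper's own scoring
    code may differ.
    """
    if not results:
        raise ValueError("results must not be empty")
    return 100.0 * sum(results) / len(results)


print(total_runs)  # 39200
```

Under this reading, each model and configuration pair is scored on the same 350 tasks, so pass rates are directly comparable across models.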

### Editorial analysis - technical context

The paper's use of deterministic classical-circuit simulation and canonical solutions produces an explicit, automatable pass/fail signal, which simplifies large-scale scoring of generated quantum code compared with fuzzier functional tests. Benchmarks that separate algorithm implementation from problem encoding, as this one does, make it easier to diagnose whether a model fails due to quantum-concept reasoning or translation/encoding errors. The reported per-task performance differences (e.g., SimonsAlgorithm **82.1%** vs SolveSATWithGrover **34.4%**) illustrate this distinction in practice.
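To make the pass/fail idea concrete, the sketch below checks a candidate circuit against a canonical one by simulating both classically and comparing the resulting statevectors. It is a toy, standard-library-only illustration and not the paper's harness: the function names, the single-qubit gate set, the all-zeros input state, and the choice to ignore global phase are all assumptions made for this example. In a Qiskit-based harness, a class such as `qiskit.quantum_info.Statevector` could play the role of `simulate`.

```python
import math

# Single-qubit gate matrices (row-major). Illustrative gate set only.
S2 = 1 / math.sqrt(2)
H = ((S2, S2), (S2, -S2))
X = ((0, 1), (1, 0))
Z = ((1, 0), (0, -1))


def apply_gate(state, gate, qubit):
    """Apply a 2x2 gate to one qubit (qubit 0 = least significant bit)."""
    new_state = list(state)
    mask = 1 << qubit
    for i in range(len(state)):
        if i & mask == 0:
            j = i | mask
            a, b = state[i], state[j]
            new_state[i] = gate[0][0] * a + gate[0][1] * b
            new_state[j] = gate[1][0] * a + gate[1][1] * b
    return new_state


def simulate(circuit, num_qubits):
    """Run a list of (gate, qubit) steps on the all-zeros state."""
    state = [0j] * (2 ** num_qubits)
    state[0] = 1 + 0j
    for gate, qubit in circuit:
        state = apply_gate(state, gate, qubit)
    return state


def passes(candidate, canonical, num_qubits, tol=1e-9):
    """Deterministic check: same output state up to a global phase.

    Assumption: comparing outputs for one input state is enough for
    this toy task; a real verifier may test more inputs.
    """
    a = simulate(candidate, num_qubits)
    b = simulate(canonical, num_qubits)
    overlap = sum(x.conjugate() * y for x, y in zip(a, b))
    return abs(abs(overlap) - 1.0) < tol


canonical = [(H, 0), (Z, 0)]      # prepares (|0> - |1>) / sqrt(2)
candidate_ok = [(X, 0), (H, 0)]   # different gates, same state
candidate_bad = [(H, 0)]          # prepares (|0> + |1>) / sqrt(2)
```

Here `passes(candidate_ok, canonical, 1)` accepts a solution that differs gate-for-gate from the canonical one, while `passes(candidate_bad, canonical, 1)` rejects a plausible-looking but wrong circuit. That is the kind of unambiguous signal the paragraph above describes.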

### Industry context

Adapting an established pedagogical corpus to a widely used framework like **Qiskit** creates a reusable evaluation resource that connects education-grade exercises with LLM evaluation needs. Aggregate findings reported in the paper, such as the **26.1** percentage-point frontier/open-source gap and the mixed effect of chain-of-thought prompting (mean **56.3%**, versus **57.8%** for few-shot-5), provide empirical baselines for researchers benchmarking model capabilities on quantum-code generation tasks. These baselines help situate model performance along concrete quantum-programming competencies rather than generic code metrics.

### What to watch

Indicators observers might follow include adoption of the dataset on platforms such as Hugging Face, extension of the suite to hardware-backed verification beyond classical simulation, incorporation of more diverse prompt engineering strategies, and follow-up studies that compare model-generated circuits on real quantum devices. For practitioners, seeing how models perform on problem-encoding versus algorithmic-implementation tasks will be a useful diagnostic when selecting models for quantum-software assistance.

## Scoring Rationale

This paper provides a practical, runnable benchmark that fills a niche for evaluating LLMs on quantum-code generation. It is especially relevant for researchers benchmarking code-generation quality, though its immediate impact is narrower than mainstream model releases.

