QuantumLean-Bench is the benchmark repo for the paper
"QuantumLean-Bench: A Unified Benchmark for Informal and Formal Quantum
Reasoning", appearing at the AI4Physics Workshop at ICML 2026
(AI4Physics@ICML 2026, Seoul, South Korea).
Authors: Isha Goswami, Anushka Paulchoudhury, Robert Joseph George, and Anima Anandkumar.
Paper: OpenReview
QuantumLean-Bench is a unified benchmark for quantum-science reasoning across informal natural-language solutions and Lean-oriented formal representations. It contains 931 problems spanning chemistry, computing, information theory, physics, and quantum machine learning. Each problem pairs an informal response slot with a formal response slot, making it possible to measure how much mathematical structure survives when a model moves from explanation to formalization.
This repo includes the benchmark data, prompt templates, evaluation scripts, and Lean scaffolds needed to inspect and run the benchmark directly.
What is in this repo:
QuantumLean/BenchmarkData/: benchmark datasets, prompt templates, and manual evaluation tables.benchmarking/core/: shared Python helpers for paths, prompts, schemas, dataset IO, and LLM clients.benchmarking/eval/: local evaluation entrypoints and Lean checking.benchmarking/dataset/: JSON validation and dataset maintenance scripts.benchmarking/formal/: Lean formal scaffolds inlined into formal prompts.benchmarking/providers/modal/: optional Modal apps for hosted model endpoints.
Lean dependencies are pinned to Lean v4.24.0 and include Mathlib plus PhysLean
for physics-oriented formalization experiments.
Install Lean with elan, then fetch Lake dependencies:
lake update
lake build QuantumLeanFor Python tooling, use the requirement files under QuantumLean/BenchmarkData/:
python3 -m pip install -r QuantumLean/BenchmarkData/requirements-data.txt
python3 -m pip install -r QuantumLean/BenchmarkData/requirements-llm.txtRun the schema validator before changing dataset files.
python3 benchmarking/dataset/validate.pyFor formal runs, typecheck generated snippets with the pinned Lean environment.
Inline code:
python3 benchmarking/eval/lean_verify.py --code 'example : (1 : Nat) = 1 := rfl'The informal runner writes model responses back into the dataset JSON files under
model-specific keys. Use a fresh --response-key for each run.
Informal generation against an OpenAI-compatible endpoint:
OPENAI_API_KEY=... python3 benchmarking/eval/run_informal.py \
--datasets QuantumLean/BenchmarkData/Computing/qiskit-quantum-katas_problems.json \
--base-url https://openrouter.ai/api \
--model openrouter/free \
--response-key openrouter_free_response \
--write-metaFormal generation plus Lean checking:
python3 benchmarking/eval/run_formal.py \
--problems QuantumLean/BenchmarkData/Computing/qiskit-quantum-katas_problems.json \
--out .benchmark_tmp/results.jsonl \
--backend openai_compat \
--base-url https://example-openai-compatible-endpoint/v1 \
--model model-nameMore detailed Modal/OpenRouter examples are in benchmarking/README.md.