QuantumLean-Bench

QuantumLean-Bench is the benchmark repo for the paper "QuantumLean-Bench: A Unified Benchmark for Informal and Formal Quantum Reasoning", appearing at the AI4Physics Workshop at ICML 2026 (AI4Physics@ICML 2026, Seoul, South Korea).

Authors: Isha Goswami, Anushka Paulchoudhury, Robert Joseph George, and Anima Anandkumar.

Paper: OpenReview

Abstract

QuantumLean-Bench is a unified benchmark for quantum-science reasoning across informal natural-language solutions and Lean-oriented formal representations. It contains 931 problems spanning chemistry, computing, information theory, physics, and quantum machine learning. Each problem pairs an informal response slot with a formal response slot, making it possible to measure how much mathematical structure survives when a model moves from explanation to formalization.

This repo includes the benchmark data, prompt templates, evaluation scripts, and Lean scaffolds needed to inspect and run the benchmark directly.

What is in this repo:

QuantumLean/BenchmarkData/: benchmark datasets, prompt templates, and manual evaluation tables.
benchmarking/core/: shared Python helpers for paths, prompts, schemas, dataset IO, and LLM clients.
benchmarking/eval/: local evaluation entrypoints and Lean checking.
benchmarking/dataset/: JSON validation and dataset maintenance scripts.
benchmarking/formal/: Lean formal scaffolds inlined into formal prompts.
benchmarking/providers/modal/: optional Modal apps for hosted model endpoints.

Lean dependencies are pinned to Lean v4.24.0 and include Mathlib plus PhysLean for physics-oriented formalization experiments.

Setup

Install Lean with elan, then fetch Lake dependencies:

lake update
lake build QuantumLean

For Python tooling, use the requirement files under QuantumLean/BenchmarkData/:

python3 -m pip install -r QuantumLean/BenchmarkData/requirements-data.txt
python3 -m pip install -r QuantumLean/BenchmarkData/requirements-llm.txt

Validate Data

Run the schema validator before changing dataset files.

python3 benchmarking/dataset/validate.py

Check Lean Snippets

For formal runs, typecheck generated snippets with the pinned Lean environment.

Inline code:

python3 benchmarking/eval/lean_verify.py --code 'example : (1 : Nat) = 1 := rfl'

Run Benchmarking

The informal runner writes model responses back into the dataset JSON files under model-specific keys. Use a fresh --response-key for each run.

Informal generation against an OpenAI-compatible endpoint:

OPENAI_API_KEY=... python3 benchmarking/eval/run_informal.py \
  --datasets QuantumLean/BenchmarkData/Computing/qiskit-quantum-katas_problems.json \
  --base-url https://openrouter.ai/api \
  --model openrouter/free \
  --response-key openrouter_free_response \
  --write-meta

Formal generation plus Lean checking:

python3 benchmarking/eval/run_formal.py \
  --problems QuantumLean/BenchmarkData/Computing/qiskit-quantum-katas_problems.json \
  --out .benchmark_tmp/results.jsonl \
  --backend openai_compat \
  --base-url https://example-openai-compatible-endpoint/v1 \
  --model model-name

More detailed Modal/OpenRouter examples are in benchmarking/README.md.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
QuantumLean		QuantumLean
benchmarking		benchmarking
.gitignore		.gitignore
LICENSE		LICENSE
QuantumLean.lean		QuantumLean.lean
README.md		README.md
lake-manifest.json		lake-manifest.json
lakefile.toml		lakefile.toml
lean-toolchain		lean-toolchain

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

QuantumLean-Bench

Abstract

Setup

Validate Data

Check Lean Snippets

Run Benchmarking

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

QuantumLean-Bench

Abstract

Setup

Validate Data

Check Lean Snippets

Run Benchmarking

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages