Skip to content

lean-dojo/QuantumLean-Bench

Repository files navigation

QuantumLean-Bench

QuantumLean-Bench is the benchmark repo for the paper "QuantumLean-Bench: A Unified Benchmark for Informal and Formal Quantum Reasoning", appearing at the AI4Physics Workshop at ICML 2026 (AI4Physics@ICML 2026, Seoul, South Korea).

Authors: Isha Goswami, Anushka Paulchoudhury, Robert Joseph George, and Anima Anandkumar.

Paper: OpenReview

Abstract

QuantumLean-Bench is a unified benchmark for quantum-science reasoning across informal natural-language solutions and Lean-oriented formal representations. It contains 931 problems spanning chemistry, computing, information theory, physics, and quantum machine learning. Each problem pairs an informal response slot with a formal response slot, making it possible to measure how much mathematical structure survives when a model moves from explanation to formalization.

This repo includes the benchmark data, prompt templates, evaluation scripts, and Lean scaffolds needed to inspect and run the benchmark directly.

What is in this repo:

  • QuantumLean/BenchmarkData/: benchmark datasets, prompt templates, and manual evaluation tables.
  • benchmarking/core/: shared Python helpers for paths, prompts, schemas, dataset IO, and LLM clients.
  • benchmarking/eval/: local evaluation entrypoints and Lean checking.
  • benchmarking/dataset/: JSON validation and dataset maintenance scripts.
  • benchmarking/formal/: Lean formal scaffolds inlined into formal prompts.
  • benchmarking/providers/modal/: optional Modal apps for hosted model endpoints.

Lean dependencies are pinned to Lean v4.24.0 and include Mathlib plus PhysLean for physics-oriented formalization experiments.

Setup

Install Lean with elan, then fetch Lake dependencies:

lake update
lake build QuantumLean

For Python tooling, use the requirement files under QuantumLean/BenchmarkData/:

python3 -m pip install -r QuantumLean/BenchmarkData/requirements-data.txt
python3 -m pip install -r QuantumLean/BenchmarkData/requirements-llm.txt

Validate Data

Run the schema validator before changing dataset files.

python3 benchmarking/dataset/validate.py

Check Lean Snippets

For formal runs, typecheck generated snippets with the pinned Lean environment.

Inline code:

python3 benchmarking/eval/lean_verify.py --code 'example : (1 : Nat) = 1 := rfl'

Run Benchmarking

The informal runner writes model responses back into the dataset JSON files under model-specific keys. Use a fresh --response-key for each run.

Informal generation against an OpenAI-compatible endpoint:

OPENAI_API_KEY=... python3 benchmarking/eval/run_informal.py \
  --datasets QuantumLean/BenchmarkData/Computing/qiskit-quantum-katas_problems.json \
  --base-url https://openrouter.ai/api \
  --model openrouter/free \
  --response-key openrouter_free_response \
  --write-meta

Formal generation plus Lean checking:

python3 benchmarking/eval/run_formal.py \
  --problems QuantumLean/BenchmarkData/Computing/qiskit-quantum-katas_problems.json \
  --out .benchmark_tmp/results.jsonl \
  --backend openai_compat \
  --base-url https://example-openai-compatible-endpoint/v1 \
  --model model-name

More detailed Modal/OpenRouter examples are in benchmarking/README.md.

About

The first unified benchmark for quantum-science reasoning across informal natural-language solutions and Lean-oriented formal representations.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors