
PolicyBench

How well can frontier models calculate tax and benefit outcomes without tools?

PolicyBench measures how well frontier AI models estimate selected household tax and benefit outputs without tools.

For benchmark scope, snapshot policy, and terminology, see the benchmark card.

US benchmark scenarios are sampled from Enhanced CPS households and evaluated under tax year 2026 rules with PolicyEngine-US. The public UK path uses a UK-calibrated transfer dataset and PolicyEngine-UK reference outputs for fiscal year 2026-27.

Condition

  1. AI alone: Models estimate tax/benefit values using only their training knowledge

Benchmark scope

Benchmark outputs are defined in policybench/benchmark_specs.json. New CLI runs default to the headline set, which focuses the main ranking on person- or household-facing outputs that contribute to household net income. PolicyEngine variables may be native to lower-level entities, but benchmark outputs are either expanded to the people shown in the prompt or aggregated to the household before scoring. Coverage outputs are binary flags in the headline ranking; the separate household-equal impact score uses PolicyEngine value proxies to give those flags a dollar-scale weight. Intermediate tax bases and payroll subcomponents are excluded from the headline ranking. WIC is requested as person-level eligibility, not as a dollar amount.
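
As a quick illustration, here is a minimal Python sketch of inspecting that spec file and listing which outputs fall in the headline set. The "headline" and "entity" keys and the dict-of-dicts layout are assumptions for illustration, not the documented schema:

# Hypothetical sketch: list headline outputs from the spec file.
# The "headline" and "entity" keys and the dict-of-dicts layout are assumptions.
import json

with open("policybench/benchmark_specs.json") as f:
    specs = json.load(f)

for name, spec in specs.items():
    if spec.get("headline"):
        print(name, spec.get("entity"))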

Programs evaluated

The current public release covers selected federal taxes, credits, benefits, health-related support, coverage labels, and state-tax outputs in the US, plus selected tax and transfer outputs in the UK. US federal income tax is scored as a compact decomposition: tax after nonrefundable credits and before refundable credits, plus refundable federal credits excluding the ACA Premium Tax Credit. The ACA Premium Tax Credit is scored separately as a health-related output.
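
A hedged worked example of that decomposition, using made-up numbers and placeholder names rather than PolicyEngine variable names:

# Illustrative arithmetic only; names and values are placeholders.
tax_before_refundable_credits = 4_500.0   # federal tax after nonrefundable credits
refundable_credits_excl_ptc = 3_200.0     # refundable credits, excluding the ACA PTC

# The two scored components recombine into net federal income tax.
net_federal_income_tax = tax_before_refundable_credits - refundable_credits_excl_ptc
print(net_federal_income_tax)  # 1300.0; a negative result would mean a net refund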

Quick start

pip install policybench
policybench --help

For repository development, clone the full Git repository before running tests:

pip install -e ".[dev]"
pytest

Benchmark run

# Generate reference outputs for 100 sampled households using headline outputs
policybench reference-outputs -n 100 --seed 42

# Run AI-alone evaluations on the exported scenario manifest.
# The standard response contract includes numeric answers and explanations.
policybench eval-no-tools -n 100 --seed 42

# For larger runs, use resumable per-model chunks.
policybench eval-no-tools-chunked \
  --scenario-manifest results/local/scenarios.csv \
  --output-dir results/local/no_tools_chunked \
  --country us \
  --chunk-size 10 \
  --parallel 2

# Analyze local results and export local artifacts
policybench analyze --output-dir results/local/analysis
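
To spot-check local artifacts after a run, a small pandas sketch can help; the path comes from the commands above, and nothing is assumed about the column layout beyond the file being a CSV:

# Minimal sketch for spot-checking the exported scenario manifest.
import pandas as pd

scenarios = pd.read_csv("results/local/scenarios.csv")
print(scenarios.shape)             # number of scenarios and columns
print(scenarios.columns.tolist())  # inspect whatever columns the manifest carries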

Repeated runs

# Optional: run the same benchmark multiple times on the saved scenario manifest
policybench eval-no-tools-repeated -n 100 --seed 42 --repeats 3 -o results/local/no_tools/runs

# Analyze the canonical point estimate plus across-run stability
policybench analyze --runs-dir results/local/no_tools/runs --output-dir results/local/analysis
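
One way to picture the across-run stability check, assuming each repeat writes a per-model score table. The "run_*.csv" pattern and the "model" and "score" columns are assumptions, not the actual analysis output:

# Hypothetical sketch of across-run stability: mean and spread per model.
# The "run_*.csv" pattern and "model"/"score" columns are assumptions.
import glob
import pandas as pd

runs = [pd.read_csv(p).assign(run=p) for p in glob.glob("results/local/no_tools/runs/run_*.csv")]
scores = pd.concat(runs)
print(scores.groupby("model")["score"].agg(["mean", "std"]))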

policybench reference-outputs writes PolicyEngine reference outputs, not administrative truth. It also writes results/local/scenarios.csv, and the eval commands reuse that manifest by default instead of regenerating households from the current source dataset. Prediction CSVs also get a .meta.json sidecar so resumes only happen against the exact same manifest, model set, and program set.
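
The resume guard described above can be sketched roughly as follows; the sidecar field names ("manifest_path", "models", "programs") are assumptions about what the .meta.json records, not its actual schema:

# Hypothetical sketch of the resume check on a prediction CSV's sidecar.
# Field names are assumptions, not the real .meta.json format.
import json

def can_resume(meta_path, manifest_path, models, programs):
    with open(meta_path) as f:
        meta = json.load(f)
    return (
        meta.get("manifest_path") == manifest_path
        and meta.get("models") == list(models)
        and meta.get("programs") == list(programs)
    )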
