Move calibration to a non-stratified, chunked-builder production path (follow-up to #818 / #821) #851

@juaristi22

Description

Summary

#821 demonstrates that the chunked, parallel matrix builder is feasible at production scale on Modal. This is the assignment-driven approach #598 called for: per-chunk mixed-geography Microsimulation instead of Cartesian-product precomputation across 51 states × 3,143 counties. Once that path is validated end-to-end, the upstream create_stratified_cps step is no longer load-bearing — it exists primarily to keep precomputation tractable. Removing it lets us keep the current ~5.16M total clone-household columns while replacing repeated clones with true ECPS households, which improves tail coverage and reduces the weight pile-up class of bug surfaced in #555.

Goal

Decommission create_stratified_cps and the per-county precomputation branch. Make --chunked-matrix --parallel the default and only matrix-building path in production, fed by the full source-imputed ECPS rather than a 12K-household stratified subset. Total column count stays at ~5.16M; n_clones is reduced proportionally so the additional records are real survey households rather than copies of the same 12K.

Sequenced milestones (each gates the next)

  1. End-to-end validation of #821 (Parallelize chunked matrix building approach, follow-up to #818) at production scale. After #821 merges, dispatch one gh workflow run pipeline.yaml -f chunked_matrix=true -f parallel_matrix=true -f num_matrix_workers=50 run against the existing 12K × 430 stratified pipeline. Acceptance: the run completes inside Modal's 14h ceiling; the final CSR shape / nnz / row-sums match a serial baseline; resume is tested by interrupting and redispatching with the same run_id; wall time and peak RSS are recorded in this issue.
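The CSR-equivalence part of that acceptance check can be sketched as below. This is a hedged illustration, not the real harness: the helper name and the dense toy matrices are stand-ins for the actual sparse matrices loaded from the two runs' artifacts.

```python
# Sketch: equivalence checks between a chunked-parallel build and a serial
# baseline. matrices_match and the toy arrays are hypothetical; production
# would load the two CSR artifacts instead.
import numpy as np

def matrices_match(chunked, serial, rtol=1e-9):
    """Compare shape, nonzero count, and per-row sums of two arrays."""
    if chunked.shape != serial.shape:
        return False
    if np.count_nonzero(chunked) != np.count_nonzero(serial):
        return False
    # Row sums catch value-level divergence that shape/nnz checks miss.
    return np.allclose(chunked.sum(axis=1), serial.sum(axis=1), rtol=rtol)

serial = np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]])
chunked = serial.copy()
assert matrices_match(chunked, serial)          # identical builds pass
assert not matrices_match(chunked + 0.5, serial)  # perturbed build fails
```

Row sums alone are not a full fingerprint (they miss within-row permutations), so the real check should also compare indices/indptr or a hash of the sorted nonzeros.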

  2. Run source imputation on the full ECPS. Today create_source_imputed_cps.py consumes stratified_extended_cps_2024.h5, so SIPP/SCF/ORG variables only exist for the 12K stratified households. Either move source imputation upstream of stratification or keep create_source_imputed_cps.py and have it consume the full ECPS — implementation choice up to whoever takes this on, same outcome either way. Verify the source-imputed full ECPS builds and that imputed-variable distributions match the stratified version on overlapping records.
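The "distributions match on overlapping records" verification could look like the following sketch. The variable values, sample sizes, and 5% tolerance are illustrative assumptions, not part of the pipeline.

```python
# Sketch: check that an imputed variable's distribution on overlapping
# records matches between the stratified and full-ECPS source imputations.
import numpy as np

def quantiles_close(a, b, qs=(0.1, 0.25, 0.5, 0.75, 0.9), rtol=0.05):
    """Compare a few quantiles of two samples within a relative tolerance."""
    return np.allclose(np.quantile(a, qs), np.quantile(b, qs), rtol=rtol)

rng = np.random.default_rng(0)
stratified_vals = rng.lognormal(10, 1, 20_000)  # e.g. an SCF-imputed income
# Full-ECPS values on the same records should be near-identical:
full_ecps_vals = stratified_vals * rng.normal(1.0, 0.01, 20_000)
assert quantiles_close(stratified_vals, full_ecps_vals)
assert not quantiles_close(stratified_vals, full_ecps_vals * 2)  # drift fails
```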

  3. Build a non-stratified calibration matrix. Drop create_stratified_cps from the pipeline driver. Feed the source-imputed full ECPS to the chunked + parallel builder with n_clones set so total columns ≈ 5.16M (i.e. ceil(5_160_000 / N_ecps_households) — concretely ~85 if ECPS lands near 60K households; pin the exact number once the rebuild lands). No code change to the builder; only the driver and DEFAULT_N_CLONES. Acceptance: matrix builds, calibration converges, and a calibration package is produced.
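The n_clones arithmetic above is trivial but worth pinning down, since the exact value depends on the not-yet-known full-ECPS household count. A sketch, with the 60K figure carried over as an assumption from the text:

```python
# Sketch of the n_clones arithmetic; 60K households is an assumption
# pending the actual full-ECPS rebuild.
import math

TARGET_TOTAL_COLUMNS = 5_160_000

def n_clones_for(n_ecps_households: int) -> int:
    """Smallest clone count whose total column budget covers the target."""
    return math.ceil(TARGET_TOTAL_COLUMNS / n_ecps_households)

# At exactly 60K households this lands at 86; "~85" in the text reflects
# that the true household count is not pinned down yet.
assert n_clones_for(60_000) == 86
assert n_clones_for(12_000) == 430  # recovers today's 12K × 430 configuration
```

The 12K case reproducing today's 430 clones is a useful sanity check that the same formula describes both the old and new configurations.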

  4. Quality regression suite — every downstream dataset must be at least as good as today's production path. Compare the non-stratified calibration package and every H5 it produces (national enhanced_cps_2024.h5, all state H5s, the local-area / CD H5s) against today's 12K × 430 baseline. Concretely:

    • Aggregate ratios (the #555 failure mode, where uncapped PUF incomes plus calibration weights produced ~19x-inflated state-level aggregates): national and state-level totals for employment income, self-employment income, partnership/S-corp income, capital gains, dividends, interest income, rental income — sum of weighted records vs. SOI/published controls. None of the new datasets may regress relative to the current production datasets.
    • High-AGI tail preservation: weighted-record counts in the $500K–$1M, $1M–$2M, $2M–$5M, $5M–$10M, $10M+ brackets at national and state levels.
    • Calibration loss / sparsity: final L0 loss and nonzero-gate fraction at the production hyperparameters; report headline targets at which constraints worsened by more than a noise threshold.
    • Distributional impact on a fixed reform fixture: pick one calibration-stable reform, run the impact through both datasets, compare poverty-rate, decile-impact, and state-level winners/losers.
    • Local-area / CD report card: rebuild and confirm no new collisions or missing-block warnings; check a handful of CDs by hand.

    Acceptance: each downstream dataset is at least as good as today's on every metric above. Any regression blocks the rollout and is filed as its own issue.
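The "at least as good on every metric" gate for the aggregate-ratio check can be sketched as a simple per-metric comparison against the baseline's error. All names and dollar figures below are illustrative, not real controls.

```python
# Sketch of the aggregate-ratio regression gate: for each income concept,
# the candidate dataset's weighted total must be no further from the
# published control than today's baseline. Values are made up.
def ratio_error(weighted_total, control):
    return abs(weighted_total / control - 1.0)

def no_regression(candidate, baseline, controls, slack=0.0):
    """True iff every metric's error is <= the baseline error (+ slack)."""
    return all(
        ratio_error(candidate[k], controls[k])
        <= ratio_error(baseline[k], controls[k]) + slack
        for k in controls
    )

controls = {"employment_income": 9.0e12, "capital_gains": 2.0e12}
baseline = {"employment_income": 9.1e12, "capital_gains": 2.1e12}
good = {"employment_income": 9.05e12, "capital_gains": 2.05e12}
bad = {"employment_income": 9.05e12, "capital_gains": 2.4e12}  # 20% off
assert no_regression(good, baseline, controls)
assert not no_regression(bad, baseline, controls)
```

A small slack term lets milestone 4 distinguish true regressions from the noise threshold the loss/sparsity bullet already anticipates.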

  5. Decommission stratification and per-county precomputation. Remove create_stratified_cps.py from the production pipeline (keep the script available as a tool until clearly unused, then delete). Remove COUNTY_DEPENDENT_VARS and the cartesian-product precompute branch from UnifiedMatrixBuilder. Flip --chunked-matrix --parallel to default-on. Update docs/methodology.md and docs/calibration.md to describe the single path. One way to build the matrix, not two.
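The default-on flip in milestone 5 is a one-line driver change in spirit. A hedged sketch, assuming an argparse-style driver CLI; the flag names match the issue text but the real driver's setup may differ:

```python
# Sketch of flipping --chunked-matrix --parallel to default-on, with
# opt-outs kept only for local debugging until they are removed outright.
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="calibration matrix driver")
    # BooleanOptionalAction (Python 3.9+) gives --no-chunked-matrix /
    # --no-parallel as explicit opt-outs.
    p.add_argument("--chunked-matrix", action=argparse.BooleanOptionalAction,
                   default=True)
    p.add_argument("--parallel", action=argparse.BooleanOptionalAction,
                   default=True)
    p.add_argument("--num-matrix-workers", type=int, default=50)
    return p

args = build_parser().parse_args([])  # no flags: the production path
assert args.chunked_matrix and args.parallel
assert args.num_matrix_workers == 50
```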

Why this is worth doing

One production path instead of two halves the surface area to maintain and test. Replacing hundreds of clones of the same 12K stratified households with true ECPS households improves high-income tail coverage and shrinks the weight pile-up failure mode from #555, and dropping the per-county Cartesian-product precomputation removes the scaling bottleneck #598 identified.
Out of scope

  • The broader UnifiedMatrixBuilder Phase 4 refactor (extraction of MatrixAssembler / SimulationBatchEvaluator / TargetRepository / ConstraintEvaluator) — separate effort, doesn't gate this one.
  • New target geographies (ZIP, SLD, school district). Unblocked by this work but tracked in #598 (matrix builder precomputation doesn't scale beyond state/county geography levels) and future issues.
  • Changes to L0 hyperparameters or target config beyond what milestone 4 needs to rule out regressions.
