#821 demonstrates that the chunked, parallel matrix builder is feasible at production scale on Modal. This is the assignment-driven approach #598 called for: per-chunk mixed-geography Microsimulation instead of cartesian-product precomputation across 51 states × 3,143 counties. Once that path is validated end-to-end, the upstream create_stratified_cps step is no longer load-bearing — it exists primarily to keep precomputation tractable. Removing it lets us keep the current ~5.16M total clone-household columns while replacing repeated clones with true ECPS households, which improves tail coverage and reduces the weight pile-up class of bug surfaced in #555.
Goal
Decommission create_stratified_cps and the per-county precomputation branch. Make --chunked-matrix --parallel the default and only matrix-building path in production, fed by the full source-imputed ECPS rather than a 12K-household stratified subset. Total column count stays at ~5.16M; n_clones is reduced proportionally so the additional records are real survey households rather than copies of the same 12K.
Sequenced milestones (each gates the next)
End-to-end validation of #821 (parallelizing the chunked matrix building approach from #818) at production scale. After #821 merges, dispatch a single gh workflow run pipeline.yaml -f chunked_matrix=true -f parallel_matrix=true -f num_matrix_workers=50 against the existing 12K × 430 stratified pipeline. Acceptance: the run completes inside Modal's 14h ceiling; the final CSR shape, nnz, and row sums match a serial baseline; resume is exercised by interrupting and redispatching with the same run_id; wall time and peak RSS are recorded in this issue.
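A minimal sketch of the parity check between the parallel and serial builds, assuming both matrices are saved as scipy .npz artifacts; the file names and tolerances are placeholders:

```python
# Parity check between the parallel and serial builds (file names are
# placeholders for the real artifacts).
import numpy as np
import scipy.sparse as sp

parallel = sp.load_npz("matrix_parallel.npz").tocsr()
serial = sp.load_npz("matrix_serial.npz").tocsr()

assert parallel.shape == serial.shape, "CSR shape mismatch"
assert parallel.nnz == serial.nnz, "nnz mismatch"
# Row sums should agree to floating-point tolerance even if worker
# scheduling perturbed summation order within rows.
np.testing.assert_allclose(
    np.asarray(parallel.sum(axis=1)).ravel(),
    np.asarray(serial.sum(axis=1)).ravel(),
    rtol=1e-9,
    atol=1e-6,
)
print("parallel build matches the serial baseline")
```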
Run source imputation on the full ECPS. Today create_source_imputed_cps.py consumes stratified_extended_cps_2024.h5, so SIPP/SCF/ORG variables only exist for the 12K stratified households. Either move source imputation upstream of stratification or keep create_source_imputed_cps.py and have it consume the full ECPS — implementation choice up to whoever takes this on, same outcome either way. Verify the source-imputed full ECPS builds and that imputed-variable distributions match the stratified version on overlapping records.
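A hedged sketch of the overlap check, assuming both source-imputed files expose flat arrays keyed by household_id; the variable list and file names are illustrative, not the pipeline's actual schema:

```python
# Compare imputed-variable distributions on households present in both files.
# Dataset layout (flat arrays keyed by household_id) and variable names are
# assumptions about the schema, not confirmed.
import h5py
import numpy as np
from scipy.stats import ks_2samp

IMPUTED_VARS = ["net_worth", "hourly_wage"]  # hypothetical SIPP/SCF/ORG variables

with h5py.File("source_imputed_stratified.h5", "r") as strat, \
     h5py.File("source_imputed_full_ecps.h5", "r") as full:
    strat_ids = strat["household_id"][:]
    full_ids = full["household_id"][:]
    overlap = np.intersect1d(strat_ids, full_ids)
    for var in IMPUTED_VARS:
        a = strat[var][:][np.isin(strat_ids, overlap)]
        b = full[var][:][np.isin(full_ids, overlap)]
        stat, p = ks_2samp(a, b)
        print(f"{var}: KS={stat:.4f} p={p:.3f}")
```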
Build a non-stratified calibration matrix. Drop create_stratified_cps from the pipeline driver. Feed the source-imputed full ECPS to the chunked + parallel builder with n_clones set so total columns ≈ 5.16M (i.e. ceil(5_160_000 / N_ecps_households) — concretely ~85 if ECPS lands near 60K households; pin the exact number once the rebuild lands). No code change to the builder; only the driver and DEFAULT_N_CLONES. Acceptance: matrix builds, calibration converges, and a calibration package is produced.
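The arithmetic behind the clone count, with the household count as a placeholder until the full-ECPS rebuild reports the real number:

```python
# Clone-count arithmetic from the milestone; n_ecps_households is a
# placeholder until the full-ECPS rebuild lands.
import math

TARGET_COLUMNS = 5_160_000
n_ecps_households = 60_000  # placeholder

n_clones = math.ceil(TARGET_COLUMNS / n_ecps_households)
print(n_clones, n_clones * n_ecps_households)
# 86 clones -> exactly 5,160,000 columns at 60K households; a slightly
# larger ECPS brings this down to the ~85 quoted above.
```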
Quality regression suite — every downstream dataset must be at least as good as today's production path. Compare the non-stratified calibration package and every H5 it produces (national enhanced_cps_2024.h5, all state H5s, the local-area / CD H5s) against today's 12K × 430 baseline. Concretely (a sketch of the comparison harness follows this list):
Aggregate ratios (the #555 failure mode: uncapped PUF incomes plus calibration weights produced ~19× inflated state-level aggregates): national and state-level totals for employment income, self-employment income, partnership/S-corp income, capital gains, dividends, interest income, and rental income — sum of weighted records vs. SOI/published controls. None of the new datasets may regress relative to the current production datasets.
High-AGI tail preservation: weighted-record counts in the $500K–$1M, $1M–$2M, $2M–$5M, $5M–$10M, $10M+ brackets at national and state levels.
Calibration loss / sparsity: final L0 loss and nonzero-gate fraction at the production hyperparameters; flag any headline targets whose constraint error worsened by more than a noise threshold.
Distributional impact on a fixed reform fixture: pick one calibration-stable reform, run the impact through both datasets, compare poverty-rate, decile-impact, and state-level winners/losers.
Local-area / CD report card: rebuild and confirm no new collisions or missing-block warnings; check a handful of CDs by hand.
Acceptance: each downstream dataset is at least as good as today's on every metric above. Any regression blocks the rollout and is filed as its own issue.
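A hedged sketch of the per-dataset comparison for one H5 pair, covering the weighted income aggregates and the high-AGI tail counts above. The column names (household_weight, adjusted_gross_income, the income variables) are assumptions, and the real harness should compare against the SOI/published controls used in calibration rather than just candidate/baseline ratios:

```python
# One H5 pair from the regression suite: weighted income aggregates and
# high-AGI tail counts, candidate vs. baseline. Column names are assumptions.
import h5py
import numpy as np

INCOME_VARS = [
    "employment_income", "self_employment_income", "partnership_s_corp_income",
    "capital_gains", "dividend_income", "interest_income", "rental_income",
]
AGI_BRACKETS = {
    "$500K-$1M": (500e3, 1e6), "$1M-$2M": (1e6, 2e6), "$2M-$5M": (2e6, 5e6),
    "$5M-$10M": (5e6, 10e6), "$10M+": (10e6, np.inf),
}

def summarize(path):
    with h5py.File(path, "r") as f:
        w = f["household_weight"][:]
        agi = f["adjusted_gross_income"][:]
        totals = {v: float(np.sum(w * f[v][:])) for v in INCOME_VARS}
        tails = {label: float(w[(agi >= lo) & (agi < hi)].sum())
                 for label, (lo, hi) in AGI_BRACKETS.items()}
    return totals, tails

base_totals, base_tails = summarize("enhanced_cps_2024_baseline.h5")
cand_totals, cand_tails = summarize("enhanced_cps_2024_candidate.h5")
for v in INCOME_VARS:
    print(f"{v}: candidate/baseline = {cand_totals[v] / base_totals[v]:.3f}")
for label in AGI_BRACKETS:
    print(f"{label}: candidate/baseline = {cand_tails[label] / base_tails[label]:.3f}")
```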
Decommission stratification and per-county precomputation. Remove create_stratified_cps.py from the production pipeline (keep the script available as a tool until clearly unused, then delete). Remove COUNTY_DEPENDENT_VARS and the cartesian-product precompute branch from UnifiedMatrixBuilder. Flip --chunked-matrix --parallel to default-on. Update docs/methodology.md and docs/calibration.md to describe the single path. One way to build the matrix, not two.
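One possible shape of the default flip, assuming the driver wires these as argparse booleans; BooleanOptionalAction (Python 3.9+) keeps --no-chunked-matrix / --no-parallel as escape hatches while the old path is decommissioned. The actual flag plumbing may differ:

```python
# Sketch of the driver-side default flip; actual flag wiring may differ.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--chunked-matrix", action=argparse.BooleanOptionalAction, default=True)
parser.add_argument("--parallel", action=argparse.BooleanOptionalAction, default=True)

args = parser.parse_args([])
print(args.chunked_matrix, args.parallel)  # True True: chunked + parallel by default
```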
Why this is worth doing
More true survey records, fewer clones: at ~5.16M columns total, going from 12K × 430 to ~60K × ~85 means ~5× as many distinct CPS households entering calibration. Each calibrated weight is then less likely to be the artifact of a single high-leverage ultra-high-income clone (the #555 failure mode: ~19× inflated state-level aggregates from uncapped PUF incomes plus calibration weights).
Eliminates the precompute scaling wall (#598: matrix builder precomputation doesn't scale beyond state/county geography levels). The chunked builder's costs scale with the number of unique (state, county, …) tuples actually assigned, not with the cartesian product; the toy sketch below illustrates the difference. ZIP- and SLD-level targets become tractable as a follow-up rather than blocked-by-architecture.
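A toy illustration of the counting argument; the chunk size and the random assignment are made up, and the point is only that the chunked builder pays per unique assigned tuple:

```python
# Precomputation pays for the full cartesian product up front, while
# assignment-driven chunking pays only for the unique (state, county) tuples
# that cloned households actually land in.
import random

N_STATES, N_COUNTIES = 51, 3_143
cartesian_cost = N_STATES * N_COUNTIES  # 160,293 geography cells precomputed

random.seed(0)
chunk_assignments = [(random.randrange(N_STATES), random.randrange(N_COUNTIES))
                     for _ in range(10_000)]  # one chunk's household placements
assigned_cost = len(set(chunk_assignments))   # unique tuples actually simulated

print(cartesian_cost, assigned_cost)  # 160293 vs roughly 9,700
```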
Simpler pipeline. One matrix-building path, no stratification step, no COUNTY_DEPENDENT_VARS carve-out.
Out of scope
The broader UnifiedMatrixBuilder Phase 4 refactor (extraction of MatrixAssembler / SimulationBatchEvaluator / TargetRepository / ConstraintEvaluator) — separate effort, doesn't gate this one.