Move calibration to a non-stratified, chunked-builder production path (follow-up to #818 / #821) #851

@juaristi22

Description

Summary

#821 demonstrates that the chunked, parallel matrix builder is feasible at production scale on Modal. This is the assignment-driven approach #598 called for: per-chunk mixed-geography Microsimulation instead of Cartesian-product precomputation across 51 states × 3,143 counties. Once that path is validated end-to-end, the upstream create_stratified_cps step is no longer load-bearing — it exists primarily to keep precomputation tractable. Removing it lets us keep the current ~5.16M total clone-household columns while replacing repeated clones with true ECPS households, which improves tail coverage and reduces the weight pile-up class of bug surfaced in #555.

Goal

Decommission create_stratified_cps and the per-county precomputation branch. Make --chunked-matrix --parallel the default and only matrix-building path in production, fed by the full source-imputed ECPS rather than a 12K-household stratified subset. Total column count stays at ~5.16M; n_clones is reduced proportionally so the additional records are real survey households rather than copies of the same 12K.

Sequenced milestones (each gates the next)

  1. End-to-end validation of #821 (Parallelize chunked matrix building approach, follow-up to #818) at production scale. After #821 merges, dispatch one gh workflow run pipeline.yaml -f chunked_matrix=true -f parallel_matrix=true -f num_matrix_workers=50 run against the existing 12K × 430 stratified pipeline. Acceptance: the run completes inside Modal's 14h ceiling; the final CSR shape / nnz / row-sums match a serial baseline; resume is tested by interrupting and redispatching with the same run_id; wall time and peak RSS are recorded in this issue.
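The CSR-equivalence part of that acceptance check can be sketched as below. This is a hedged illustration, not the real harness: the helper name and the dense toy matrices are stand-ins for the actual sparse matrices loaded from the two runs' artifacts.

```python
# Sketch: equivalence checks between a chunked-parallel build and a serial
# baseline. matrices_match and the toy arrays are hypothetical; production
# would load the two CSR artifacts instead.
import numpy as np

def matrices_match(chunked, serial, rtol=1e-9):
    """Compare shape, nonzero count, and per-row sums of two arrays."""
    if chunked.shape != serial.shape:
        return False
    if np.count_nonzero(chunked) != np.count_nonzero(serial):
        return False
    # Row sums catch value-level divergence that shape/nnz checks miss.
    return np.allclose(chunked.sum(axis=1), serial.sum(axis=1), rtol=rtol)

serial = np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]])
chunked = serial.copy()
assert matrices_match(chunked, serial)          # identical builds pass
assert not matrices_match(chunked + 0.5, serial)  # perturbed build fails
```

Row sums alone are not a full fingerprint (they miss within-row permutations), so the real check should also compare indices/indptr or a hash of the sorted nonzeros.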

  2. Run source imputation on the full ECPS. Today create_source_imputed_cps.py consumes stratified_extended_cps_2024.h5, so SIPP/SCF/ORG variables only exist for the 12K stratified households. Either move source imputation upstream of stratification or keep create_source_imputed_cps.py and have it consume the full ECPS — implementation choice up to whoever takes this on, same outcome either way. Verify the source-imputed full ECPS builds and that imputed-variable distributions match the stratified version on overlapping records.
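The "distributions match on overlapping records" verification could look like the following sketch. The variable values, sample sizes, and 5% tolerance are illustrative assumptions, not part of the pipeline.

```python
# Sketch: check that an imputed variable's distribution on overlapping
# records matches between the stratified and full-ECPS source imputations.
import numpy as np

def quantiles_close(a, b, qs=(0.1, 0.25, 0.5, 0.75, 0.9), rtol=0.05):
    """Compare a few quantiles of two samples within a relative tolerance."""
    return np.allclose(np.quantile(a, qs), np.quantile(b, qs), rtol=rtol)

rng = np.random.default_rng(0)
stratified_vals = rng.lognormal(10, 1, 20_000)  # e.g. an SCF-imputed income
# Full-ECPS values on the same records should be near-identical:
full_ecps_vals = stratified_vals * rng.normal(1.0, 0.01, 20_000)
assert quantiles_close(stratified_vals, full_ecps_vals)
assert not quantiles_close(stratified_vals, full_ecps_vals * 2)  # drift fails
```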

  3. Build a non-stratified calibration matrix. Drop create_stratified_cps from the pipeline driver. Feed the source-imputed full ECPS to the chunked + parallel builder with n_clones set so total columns ≈ 5.16M (i.e. ceil(5_160_000 / N_ecps_households) — concretely ~85 if ECPS lands near 60K households; pin the exact number once the rebuild lands). No code change to the builder; only the driver and DEFAULT_N_CLONES. Acceptance: matrix builds, calibration converges, and a calibration package is produced.
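The n_clones arithmetic above is trivial but worth pinning down, since the exact value depends on the not-yet-known full-ECPS household count. A sketch, with the 60K figure carried over as an assumption from the text:

```python
# Sketch of the n_clones arithmetic; 60K households is an assumption
# pending the actual full-ECPS rebuild.
import math

TARGET_TOTAL_COLUMNS = 5_160_000

def n_clones_for(n_ecps_households: int) -> int:
    """Smallest clone count whose total column budget covers the target."""
    return math.ceil(TARGET_TOTAL_COLUMNS / n_ecps_households)

# At exactly 60K households this lands at 86; "~85" in the text reflects
# that the true household count is not pinned down yet.
assert n_clones_for(60_000) == 86
assert n_clones_for(12_000) == 430  # recovers today's 12K × 430 configuration
```

The 12K case reproducing today's 430 clones is a useful sanity check that the same formula describes both the old and new configurations.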

  4. Quality regression suite — every downstream dataset must be at least as good as today's production path. Compare the non-stratified calibration package and every H5 it produces (national enhanced_cps_2024.h5, all state H5s, the local-area / CD H5s) against today's 12K × 430 baseline. Concretely:

    • Aggregate ratios (the #555 failure mode, where uncapped PUF incomes plus calibration weights produced ~19x-inflated state-level aggregates): national and state-level totals for employment income, self-employment income, partnership/S-corp income, capital gains, dividends, interest income, rental income — sum of weighted records vs. SOI/published controls. None of the new datasets may regress relative to the current production datasets.
    • High-AGI tail preservation: weighted-record counts in the $500K–$1M, $1M–$2M, $2M–$5M, $5M–$10M, $10M+ brackets at national and state levels.
    • Calibration loss / sparsity: final L0 loss and nonzero-gate fraction at the production hyperparameters; report headline targets at which constraints worsened by more than a noise threshold.
    • Distributional impact on a fixed reform fixture: pick one calibration-stable reform, run the impact through both datasets, compare poverty-rate, decile-impact, and state-level winners/losers.
    • Local-area / CD report card: rebuild and confirm no new collisions or missing-block warnings; check a handful of CDs by hand.

    Acceptance: each downstream dataset is at least as good as today's on every metric above. Any regression blocks the rollout and is filed as its own issue.
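The "at least as good on every metric" gate for the aggregate-ratio check can be sketched as a simple per-metric comparison against the baseline's error. All names and dollar figures below are illustrative, not real controls.

```python
# Sketch of the aggregate-ratio regression gate: for each income concept,
# the candidate dataset's weighted total must be no further from the
# published control than today's baseline. Values are made up.
def ratio_error(weighted_total, control):
    return abs(weighted_total / control - 1.0)

def no_regression(candidate, baseline, controls, slack=0.0):
    """True iff every metric's error is <= the baseline error (+ slack)."""
    return all(
        ratio_error(candidate[k], controls[k])
        <= ratio_error(baseline[k], controls[k]) + slack
        for k in controls
    )

controls = {"employment_income": 9.0e12, "capital_gains": 2.0e12}
baseline = {"employment_income": 9.1e12, "capital_gains": 2.1e12}
good = {"employment_income": 9.05e12, "capital_gains": 2.05e12}
bad = {"employment_income": 9.05e12, "capital_gains": 2.4e12}  # 20% off
assert no_regression(good, baseline, controls)
assert not no_regression(bad, baseline, controls)
```

A small slack term lets milestone 4 distinguish true regressions from the noise threshold the loss/sparsity bullet already anticipates.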

  5. Decommission stratification and per-county precomputation. Remove create_stratified_cps.py from the production pipeline (keep the script available as a tool until clearly unused, then delete). Remove COUNTY_DEPENDENT_VARS and the cartesian-product precompute branch from UnifiedMatrixBuilder. Flip --chunked-matrix --parallel to default-on. Update docs/methodology.md and docs/calibration.md to describe the single path. One way to build the matrix, not two.
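The default-on flip in milestone 5 is a one-line driver change in spirit. A hedged sketch, assuming an argparse-style driver CLI; the flag names match the issue text but the real driver's setup may differ:

```python
# Sketch of flipping --chunked-matrix --parallel to default-on, with
# opt-outs kept only for local debugging until they are removed outright.
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="calibration matrix driver")
    # BooleanOptionalAction (Python 3.9+) gives --no-chunked-matrix /
    # --no-parallel as explicit opt-outs.
    p.add_argument("--chunked-matrix", action=argparse.BooleanOptionalAction,
                   default=True)
    p.add_argument("--parallel", action=argparse.BooleanOptionalAction,
                   default=True)
    p.add_argument("--num-matrix-workers", type=int, default=50)
    return p

args = build_parser().parse_args([])  # no flags: the production path
assert args.chunked_matrix and args.parallel
assert args.num_matrix_workers == 50
```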

Why this is worth doing

One production path instead of two halves the surface area to maintain and test. Replacing hundreds of clones of the same 12K stratified households with true ECPS households improves high-income tail coverage and shrinks the weight pile-up failure mode from #555, and dropping the per-county Cartesian-product precomputation removes the scaling bottleneck #598 identified.
Out of scope

  • The broader UnifiedMatrixBuilder Phase 4 refactor (extraction of MatrixAssembler / SimulationBatchEvaluator / TargetRepository / ConstraintEvaluator) — separate effort, doesn't gate this one.
  • New target geographies (ZIP, SLD, school district). Unblocked by this work but tracked in #598 (matrix builder precomputation doesn't scale beyond state/county geography levels) and future issues.
  • Changes to L0 hyperparameters or target config beyond what milestone 4 needs to rule out regressions.
