---
phase: "12-dlrm-dataset-columns"
created_at: "2026-02-25T17:54:00Z"
status: planning
---

# Phase Context: DLRM Dataset Columns

## Goal
Expand the DLRM workload from 3 packed columns to 200 individual columns with mixed dtypes (int8, float16, float32, float64), where 40 randomly distributed columns are read during the workload and 160 are not.

## Requirements Covered
- TRAIN-07: Expand DLRM parquet columns to 200 individual features with mixed dtypes and selective column reading

## Success Criteria
1. DLRM parquet data generation produces files with 200 individual columns using int8, float16, float32, and float64 dtypes
2. The DLRM benchmark reads only the 40 columns marked `read: true`, totaling exactly 160 bytes per record
3. The 40 read columns and 160 unread columns are randomly distributed across all 200 columns (not grouped by index)
4. End-to-end verification passes: generate parquet data with the 200-column config and read it back programmatically, confirming all dtypes work correctly

## Locked Decisions
These decisions are NON-NEGOTIABLE. All plans and execution must respect them.

| Decision | Value | Rationale |
|----------|-------|-----------|
| Total column count | 200 columns (`feature_000`–`feature_199`), each size 1 | Expand from 3 packed columns to individual feature columns |
| Read column count | 40 columns with `read: true`, randomly distributed across all 200 | Simulate realistic column access patterns (not sequential grouping) |
| Unread column count | 160 columns with `read: false`, randomly distributed (complement of read) | Simulate a real DLRM wide table with selective column reads |
| Read columns byte total | Read columns must total exactly 160 bytes | Preserve original workload I/O characteristics |
| Read column dtype pattern | Categoricals use a mix of int8/float16/float32/float64; numericals use float32 — weighted toward real DLRM patterns | Realistic DLRM representation |
| `record_length_bytes` value | Total bytes of all 200 columns (not just read columns) | Represents the full on-disk record size |
| Scalar PyArrow types for size=1 | Use scalar `pa.int8()`, `pa.float16()`, `pa.float32()`, `pa.float64()` — not `FixedSizeListArray` for size=1 | Optimize read throughput, minimize copies |
| Add int8 and float16 dtype support | Add to the DLIO parquet generator's `_build_schema()` and `_generate_column_data_batch()` | Required for mixed-dtype columns |
| Verify and fix ParquetReader | Ensure ParquetReader handles int8 and float16 dtypes correctly; fix if needed | Reader must handle all new dtypes |
| Default size to 1 | DLIO parquet generator and reader default `size` to 1 when not specified in config | Simplify config, reduce verbosity |
| Omit `size: 1` from configs | Config files should not include `size: 1` — only specify `size` when it differs from the default | Keep 200-column configs compact |
| E2E verification required | Generate parquet data with the 200-column config and read it back programmatically | Confirm the full pipeline works |
| No validation rule changes | mlpstorage validation rules do not need updates for this phase | Existing rules are sufficient |
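The 160-byte constraint is pure arithmetic over the per-dtype sizes (int8 = 1, float16 = 2, float32 = 4, float64 = 8 bytes). One hypothetical mix that satisfies it, as a sanity-check sketch only — the actual distribution is a deferred decision for the planner:

```python
DTYPE_BYTES = {"int8": 1, "float16": 2, "float32": 4, "float64": 8}

# Hypothetical read-column mix (NOT the locked distribution): the label plus
# 13 numericals are float32 (14 x 4 = 56 bytes), leaving 104 bytes for the
# 26 categoricals.
read_columns = (
    ["float32"] * 14                      # label + 13 numerical features
    + ["int8"] * 4 + ["float16"] * 6      # 26 categorical features, mixed
    + ["float32"] * 10 + ["float64"] * 6  # ...summing to 104 bytes
)

assert len(read_columns) == 40
assert sum(DTYPE_BYTES[d] for d in read_columns) == 160
```

Any planner-chosen distribution can be validated the same way before it is written into the configs.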

## Deferred Decisions
These decisions should NOT be made during this phase. They belong to a later phase.

- Exact dtype distribution for the 26 categorical read columns (planner determines, constrained so all 40 read columns total 160 bytes)
- Exact dtype distribution for the 160 unread columns (any reasonable mix of int8/float16/float32/float64)
- Which specific column indices get `read: true` (planner uses a seed for reproducibility)
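The seeded index selection can be a one-liner. A sketch (the seed value and RNG choice are the planner's; `12345` here is a placeholder):

```python
import random

def pick_read_indices(seed: int = 12345, total: int = 200, n_read: int = 40) -> set:
    """Deterministically choose which column indices get read: true."""
    return set(random.Random(seed).sample(range(total), n_read))

read = pick_read_indices()
assert len(read) == 40 and all(0 <= i < 200 for i in read)
# Same seed -> same selection, so the generated configs are reproducible.
assert read == pick_read_indices()
```

Because `random.Random(seed)` is a self-contained generator, the selection is stable across runs and does not disturb global RNG state elsewhere in the tooling.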

## Assumptions
These are assumed true unless explicitly contradicted by the user or requirements.

- All 3 DLRM config files (`dlrm_b200.yaml`, `dlrm_mi355.yaml`, `dlrm_datagen.yaml`) get identical column definitions
- The `read` flag filtering in `config.py` and `parquet_reader.py` already works correctly (no changes needed to the filtering logic)
- PyArrow supports `pa.int8()` and `pa.float16()` natively in the installed version
- Existing configs with explicit `size` values continue to work (backward-compatible default)
- The original DLRM workload had 1 label + 13 numerical + 26 categorical = 40 values at 4 bytes each = 160 bytes
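The backward-compatible `size` default can be a one-line change in the column parser. A hypothetical sketch — the real `config.py` structures and field names may differ, and the `read: False` default is an assumption, not a locked decision:

```python
def parse_column(col: dict) -> dict:
    # Hypothetical column-config parser: `size` defaults to 1 when omitted,
    # so existing configs with explicit sizes keep working unchanged.
    # The read-flag default of False is an assumption for illustration.
    return {
        "name": col["name"],
        "dtype": col["dtype"],
        "size": col.get("size", 1),
        "read": col.get("read", False),
    }

# New compact style: size omitted, defaults to 1.
assert parse_column({"name": "feature_000", "dtype": "int8", "read": True})["size"] == 1
# Old style with an explicit size still parses unchanged.
assert parse_column({"name": "numerical_features", "dtype": "float32", "size": 13})["size"] == 13
```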

## Anti-Goals
This phase explicitly does NOT do the following. Do not add tasks for these.

- No changes to mlpstorage validation rules or submission checkers
- No changes to non-DLRM workload configs (cosmoflow, resnet50, unet3d, retinanet, flux, llama3)
- No changes to the `read` flag filtering logic in `config.py` or `parquet_reader.py` (unless broken for the new dtypes)
- No new CLI commands or arguments
- No changes to benchmark execution flow or metadata structure

## Dependencies
- **Depends on**: Phase 9 (DLIO Parquet Support), Phase 11 (Comprehensive Parquet Support) — both complete
- **Depended on by**: None currently

## Integration Points

### Consumes (from previous phases)
- DLIO parquet generator (`parquet_generator.py`) from Phase 9/11 — extended with new dtypes
- DLIO parquet reader (`parquet_reader.py`) from Phase 9/11 — verified for new dtypes
- DLIO config parser (`config.py`) parquet column parsing with the `read` flag from Phase 11
- Existing DLRM YAML configs from Phase 8

### Produces (for future phases)
- Updated DLRM workload configs with 200 individual columns and mixed dtypes
- DLIO parquet generator with int8/float16 dtype support (reusable for other workloads)
- DLIO parquet reader verified for int8/float16 dtypes
- Default `size: 1` behavior in the parquet generator/reader (simplifies future configs)

## Notes
- The current DLRM configs define 3 columns: `label` (float32, size 1), `numerical_features` (float32, size 13), `categorical_features` (float32, size 26), with `record_length_bytes: 160`
- The target is 200 individual columns, each with `size: 1`, using scalar PyArrow types for efficiency
- The `read: true`/`read: false` flags must be randomly scattered across all 200 columns, not grouped sequentially
- The planner must ensure the 40 read columns total exactly 160 bytes when their dtype sizes are summed
- The existing plan (PLAN-01.md) needs revision to incorporate these locked decisions — particularly the random distribution of read/unread flags, the 160-byte read constraint, and the default size=1 behavior
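For reference, a hypothetical excerpt of the target config shape — key names follow the current 3-column configs, but the exact schema is illustrative, and the `record_length_bytes` value shown is a placeholder, not a computed total:

```yaml
record_length_bytes: 680   # placeholder; must equal the sum of all 200 column sizes
columns:
  - name: feature_000
    dtype: float32
    read: true             # one of the 40 randomly placed read columns
  - name: feature_001
    dtype: int8            # size omitted everywhere; defaults to 1
    read: false
  - name: feature_002
    dtype: float16
    read: false
```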