
Commit f874472

committed
Last workload yaml changes.
1 parent 0201c44 commit f874472

File tree

16 files changed, +739 −76 lines


.gitignore

Lines changed: 8 additions & 0 deletions

```diff
@@ -29,3 +29,11 @@ htmlcov/
 # OS files
 .DS_Store
 Thumbs.db
+
+
+# Coding Agents
+.agent/
+.roo/
+.vscode/
+CLAUDE.md
+.roomodes
```

.planning/STATE.md

Lines changed: 20 additions & 18 deletions
````diff
@@ -4,14 +4,14 @@
 
 **Core Value:** Orchestrate multiple benchmark types (training, checkpointing, kv-cache, vectordb) across distributed systems and produce verified, rules-compliant results.
 
-**Current Focus:** Phase 11 - Comprehensive Parquet Support
+**Current Focus:** Phase 12 - DLRM Dataset Columns
 
 ## Current Position
 
-**Phase:** 11 of 11 - Comprehensive Parquet Support
-**Plan:** All complete
-**Status:** Phase complete
-**Last activity:** 2026-02-02 - Completed Phase 11 (Comprehensive Parquet Support)
+**Phase:** 12 of 12 - DLRM Dataset Columns
+**Plan:** Not yet planned (CONTEXT.md written)
+**Status:** Discussion complete, ready for planning
+**Last activity:** 2026-02-25 - Completed Phase 12 discussion, wrote CONTEXT.md
 
 **Progress:**
 ```
@@ -26,7 +26,8 @@ Phase 8: [##########] 100% (2/2 plans) COMPLETE
 Phase 9: [##########] 100% (2/2 plans) COMPLETE
 Phase 10: [##########] 100% (3/3 plans) COMPLETE
 Phase 11: [##########] 100% (3/3 plans) COMPLETE
-Overall: [##########] 100% (35/35 plans complete)
+Phase 12: [          ] 0% (0/? plans) CONTEXT WRITTEN
+Overall: [#########-] 97% (35/35 + Phase 12 pending)
 ```
 
 ## Performance Metrics
@@ -269,26 +270,27 @@ None currently.
 ## Session Continuity
 
 ### Last Session
-- **Date:** 2026-02-02
-- **Accomplished:** Completed 11-02 (Parquet Reader/Generator Rewrite)
-- **Next:** Execute 11-03 plan
+- **Date:** 2026-02-25
+- **Accomplished:** Completed Phase 12 discussion, wrote CONTEXT.md for 12-dlrm-dataset-columns
+- **Next:** Plan phase 12 (existing PLAN-01.md needs revision to match CONTEXT.md locked decisions)
 
 ### Context for Next Session
-- **MILESTONE COMPLETE:** All 10 phases executed and verified
-- Phase 10: Progress Indication COMPLETE
-  - 10-01: Progress Indication Foundation - Rich library, TTY detection
-  - 10-02: Benchmark Progress Integration - Stage indicators, spinners
-  - 10-03: Main.py Integration - Environment/lockfile progress (human verified)
-- All 21 v3.0 requirements delivered
-- All 32 plans executed
+- Phase 12: DLRM Dataset Columns — CONTEXT.md written, ready for planning
+- Key locked decisions:
+  - 200 columns with mixed dtypes (int8, float16, float32, float64)
+  - 40 read columns randomly distributed (not grouped), totaling exactly 160 bytes
+  - Scalar PyArrow types for size=1 (not FixedSizeListArray)
+  - Default size=1 in DLIO generator/reader when not specified
+  - Omit `size: 1` from config files for compactness
+  - E2E verification required (generate + read back)
+- Existing PLAN-01.md needs revision to incorporate these constraints
 - User setup tasks remain (from previous phases):
   - Fork argonne-lcf/dlio_benchmark to personal GitHub
   - Push parquet-support branch to fork
   - Update pyproject.toml with fork URL
 - Note: vectordb and kvcache not yet wired into cli_parser.py (minor CLI wiring)
 - Note: Pre-existing test failures in test_rules_calculations.py and test_reporting.py (unrelated)
-- Ready for milestone audit or archival
 
 ---
 
-*State updated: 2026-01-25*
+*State updated: 2026-02-25*
````
Lines changed: 88 additions & 0 deletions (new file)

---
phase: "12-dlrm-dataset-columns"
created_at: "2026-02-25T17:54:00Z"
status: planning
---

# Phase Context: DLRM Dataset Columns

## Goal

Expand the DLRM workload from 3 packed columns to 200 individual columns with mixed dtypes (int8, float16, float32, float64), where 40 randomly distributed columns are read during the workload and 160 are not.

## Requirements Covered

- TRAIN-07: Expand DLRM parquet columns to 200 individual features with mixed dtypes and selective column reading

## Success Criteria

1. DLRM parquet data generation produces files with 200 individual columns using int8, float16, float32, and float64 dtypes
2. DLRM benchmark reads only the 40 columns marked `read: true`, totaling exactly 160 bytes per record
3. The 40 read columns and 160 unread columns are randomly distributed across all 200 columns (not grouped by index)
4. End-to-end verification passes: generate parquet data with the 200-column config and read it back programmatically, confirming all dtypes work correctly
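For concreteness, column definitions meeting these criteria might look like the excerpt below in the workload YAML. This is a hypothetical sketch, not the generated config: the field names follow the existing 3-column DLRM configs, but the dtype assigned to each index is made up.

```yaml
# Illustrative excerpt only -- dtype-per-index choices are placeholders.
dataset:
  # record_length_bytes: <total bytes of all 200 columns>
  columns:
    - name: feature_000
      dtype: int8
      read: false   # `size` omitted: defaults to 1
    - name: feature_001
      dtype: float32
      read: true
    - name: feature_002
      dtype: float16
      read: false
```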
## Locked Decisions

These decisions are NON-NEGOTIABLE. All plans and execution must respect them.

| Decision | Value | Rationale |
|----------|-------|-----------|
| Total column count | 200 columns (`feature_000`–`feature_199`), each size 1 | Expand from 3 packed columns to individual feature columns |
| Read column count | 40 columns with `read: true`, randomly distributed across all 200 | Simulate realistic column access patterns (not sequential grouping) |
| Unread column count | 160 columns with `read: false`, randomly distributed (complement of read) | Simulate real DLRM wide-table with selective column reads |
| Read columns byte total | Read columns must total exactly 160 bytes | Preserve original workload I/O characteristics |
| Read column dtype pattern | Categoricals use mix of int8/float16/float32/float64; numericals use float32 — weighted toward real DLRM patterns | Realistic DLRM representation |
| `record_length_bytes` value | Total bytes of all 200 columns (not just read columns) | Represents full on-disk record size |
| Scalar PyArrow types for size=1 | Use scalar `pa.int8()`, `pa.float16()`, `pa.float32()`, `pa.float64()` — not `FixedSizeListArray` for size=1 | Optimize read throughput, minimize copies |
| Add int8 and float16 dtype support | Add to DLIO parquet generator `_build_schema()` and `_generate_column_data_batch()` | Required for mixed-dtype columns |
| Verify and fix ParquetReader | Ensure ParquetReader handles int8 and float16 dtypes correctly; fix if needed | Reader must handle all new dtypes |
| Default size to 1 | DLIO parquet generator and reader default `size` to 1 when not specified in config | Simplify config, reduce verbosity |
| Omit `size: 1` from configs | Config files should not include `size: 1` — only specify `size` when it differs from default | Keep 200-column configs compact |
| E2E verification required | Generate parquet data with 200-column config and read it back programmatically | Confirm full pipeline works |
| No validation rule changes | mlpstorage validation rules do not need updates for this phase | Existing rules sufficient |
## Deferred Decisions

These decisions should NOT be made during this phase. They belong to a later phase.

- Exact dtype distribution for the 26 categorical read columns (planner determines, constrained so all 40 read columns total 160 bytes)
- Exact dtype distribution for the 160 unread columns (any reasonable mix of int8/float16/float32/float64)
- Which specific column indices get `read: true` (planner uses a seed for reproducibility)
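The seeded-selection idea in the last bullet can be sketched in a few lines. The helper name, defaults, and seed value here are illustrative assumptions; the real seed choice belongs to the planner.

```python
import random

def pick_read_columns(total=200, n_read=40, seed=1234):
    """Choose which column indices get `read: true`, scattered randomly
    across the full range rather than grouped, reproducible via the seed."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(total), n_read))

read_idx = pick_read_columns()
# The 160-index complement gets `read: false`.
unread_idx = sorted(set(range(200)) - set(read_idx))
```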
## Assumptions

These are assumed true unless explicitly contradicted by the user or requirements.

- All 3 DLRM config files (`dlrm_b200.yaml`, `dlrm_mi355.yaml`, `dlrm_datagen.yaml`) get identical column definitions
- The `read` flag filtering in `config.py` and `parquet_reader.py` already works correctly (no changes needed to filtering logic)
- PyArrow supports `pa.int8()` and `pa.float16()` natively in the installed version
- Existing configs with explicit `size` values continue to work (backward compatible default)
- The original DLRM workload had 1 label + 13 numerical + 26 categorical = 40 values at 4 bytes each = 160 bytes
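The 160-byte arithmetic in the last assumption can be checked mechanically. The particular dtype mix below is just one of many valid distributions summing to 160 bytes; picking the real one is a deferred decision.

```python
# Bytes per value for each supported dtype.
DTYPE_BYTES = {"int8": 1, "float16": 2, "float32": 4, "float64": 8}

# Original packed workload: (1 label + 13 numerical + 26 categorical)
# values at 4 bytes each = 160 bytes.
assert (1 + 13 + 26) * DTYPE_BYTES["float32"] == 160

# One illustrative mixed-dtype split for the 40 read columns: label and
# the 13 numericals stay float32, and the 26 categoricals are divided so
# the total still lands on exactly 160 bytes.
read_mix = {
    "float32": 14 + 16,  # label + numericals, plus 16 categoricals
    "int8": 4,           # categoricals
    "float16": 2,        # categoricals
    "float64": 4,        # categoricals
}
n_cols = sum(read_mix.values())
n_bytes = sum(DTYPE_BYTES[d] * n for d, n in read_mix.items())
```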
## Anti-Goals

This phase explicitly does NOT do the following. Do not add tasks for these.

- No changes to mlpstorage validation rules or submission checkers
- No changes to non-DLRM workload configs (cosmoflow, resnet50, unet3d, retinanet, flux, llama3)
- No changes to the `read` flag filtering logic in `config.py` or `parquet_reader.py` (unless broken for new dtypes)
- No new CLI commands or arguments
- No changes to benchmark execution flow or metadata structure
## Dependencies

- **Depends on**: Phase 9 (DLIO Parquet Support), Phase 11 (Comprehensive Parquet Support) — both complete
- **Depended on by**: None currently

## Integration Points

### Consumes (from previous phases)

- DLIO parquet generator (`parquet_generator.py`) from Phase 9/11 — extended with new dtypes
- DLIO parquet reader (`parquet_reader.py`) from Phase 9/11 — verified for new dtypes
- DLIO config parser (`config.py`) parquet column parsing with `read` flag from Phase 11
- Existing DLRM YAML configs from Phase 8

### Produces (for future phases)

- Updated DLRM workload configs with 200 individual columns and mixed dtypes
- DLIO parquet generator with int8/float16 dtype support (reusable for other workloads)
- DLIO parquet reader verified for int8/float16 dtypes
- Default `size: 1` behavior in parquet generator/reader (simplifies future configs)
## Notes

- The current DLRM configs define 3 columns: `label` (float32, size 1), `numerical_features` (float32, size 13), `categorical_features` (float32, size 26) with `record_length_bytes: 160`
- The target is 200 individual columns each with `size: 1`, using scalar PyArrow types for efficiency
- The `read: true`/`read: false` flags must be randomly scattered across all 200 columns, not grouped sequentially
- The planner must ensure the 40 read columns total exactly 160 bytes when their dtype sizes are summed
- The existing plan (PLAN-01.md) needs revision to incorporate these locked decisions — particularly the random distribution of read/unread flags, the 160-byte read constraint, and the default size=1 behavior