
Tf vdb bench - Integration#256

Closed
russfellows wants to merge 67 commits into mlcommons:TF_VDBBench from russfellows:TF_VDBBench

Conversation

@russfellows

This PR will hopefully merge the VDB Bench branch into the main branch of this repository, enabling easier integration back into the MLCommons storage repository.

hazemawadalla and others added 30 commits November 21, 2025 11:47
This commit introduces a comprehensive KV Cache benchmark suite designed to
measure storage system performance under AI/ML inference workloads, specifically
targeting Large Language Model (LLM) key-value cache operations.

Key components added:
- Core benchmark scripts (kv-cache.py, kv-cache_sharegpt_replay.py)
- Benchmark wrapper and validation tools (kv-cache-wrapper.sh, validate.sh)
- Comprehensive proposal documentation for MLPerf Storage v3 integration
- README with benchmark overview and usage guidelines

The benchmark simulates realistic LLM inference patterns including:
- Key-value cache read/write operations
- Mixed sequential and random access patterns
- Multi-threaded concurrent access scenarios
- Conversation-based workload replay using ShareGPT dataset

This work addresses the growing need to standardize storage performance
measurements for AI inference workloads and provides a foundation for
MLPerf Storage v3.0 KV cache benchmark specification.
Add initial KV Cache benchmark implementation for MLPerf Storage v3
This is a major architectural upgrade to the core benchmark logic, replacing
the original "Spillover" memory management strategy with the new "Waterfall
LRU" implementation to accurately simulate enterprise storage hierarchies.

Key Changes:
- Waterfall Eviction: Implemented recursive eviction (GPU -> CPU -> NVMe).
  New data now correctly lands in the fastest available tier, pushing cold
  data down, rather than the old behavior where new data skipped directly
  to NVMe if RAM was full.
- Static Buffer Optimization: Replaced the CPU-bound np.random generation
  with a pre-allocated static noise buffer. This removes the CPU bottleneck
  that was masking true storage latency, allowing us to fully saturate
  high-performance NVMe drives.
- Concurrency Hardening: Added semaphore-based concurrency limits
  (max_concurrent_allocs) and atomic memory reservations to prevent OOM
  crashes under heavy load.
- Storage Metrics: Added explicit tracking for nvme_tokens_processed to
  calculate true storage throughput separate from system throughput.
- Stress Test Validation: Verified that this new architecture correctly
  exposes storage latency limits (e.g., pushing P95 write latency >1000ms)
  where the old script artificially throttled the load.
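The waterfall eviction described above can be sketched as follows. This is a simplified, entry-count-based model for illustration only (the class and tier names are hypothetical); the real MultiTierCache tracks bytes and uses high/low watermarks:

```python
# Hypothetical sketch of waterfall eviction (GPU -> CPU -> NVMe): inserting
# into a full tier recursively pushes the coldest entry down one level,
# rather than letting new data skip straight to NVMe.
from collections import OrderedDict

TIERS = ["gpu", "cpu", "nvme"]

class WaterfallCache:
    def __init__(self, capacities):
        self.capacity = capacities                      # max entries per tier
        self.store = {t: OrderedDict() for t in TIERS}  # insertion order = LRU order

    def put(self, key, value, tier="gpu"):
        store = self.store[tier]
        while len(store) >= self.capacity[tier]:
            cold_key, cold_val = store.popitem(last=False)  # evict the LRU entry
            nxt = TIERS.index(tier) + 1
            if nxt < len(TIERS):
                self.put(cold_key, cold_val, TIERS[nxt])    # push cold data down
        store[key] = value  # new data always lands in the requested (fastest) tier
```

With capacities of one entry per upper tier, inserting "a", "b", "c" leaves "c" on GPU, "b" on CPU, and "a" on NVMe — the waterfall behavior, as opposed to "c" spilling directly to NVMe.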
This patch addresses two bugs that surface when running the benchmark
with --enable-rag:

1. Race condition in process_requests (line 2693)

   Worker threads begin processing requests immediately upon benchmark
   start, while RAG document ingestion runs in a separate daemon thread.
   When a worker hits the 10% RAG query path before any documents have
   been ingested, random.choice() is called on an empty list, raising
   IndexError.

   Fixed by adding a truthiness check on self.rag_manager.documents
   before entering the RAG code path. An empty dict evaluates to False,
   so RAG queries are safely skipped until ingestion populates at least
   one document.

2. Division by zero in KVCacheGenerator.generate (line 1097)

   The buffer slicing logic uses modulo to compute a pseudo-random start
   index: seed % (buffer_size - total_elements). When total_elements
   exactly equals buffer_size (an edge case permitted by the <= guard),
   the divisor becomes zero, raising ZeroDivisionError.

   Fixed by computing the divisor separately and defaulting start_idx
   to 0 when the divisor is zero.
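The second fix can be sketched in isolation (the function name and signature below are hypothetical; the real logic lives inside KVCacheGenerator.generate):

```python
# Sketch of the divide-by-zero fix described above. The original computed
# start_idx = seed % (buffer_size - total_elements), which raises
# ZeroDivisionError when total_elements exactly equals buffer_size.

def safe_start_index(seed: int, buffer_size: int, total_elements: int) -> int:
    """Pick a pseudo-random slice start, tolerating the full-buffer edge case."""
    assert total_elements <= buffer_size  # the <= guard mentioned above
    divisor = buffer_size - total_elements
    # When the request spans the whole buffer, the only valid start is 0.
    return seed % divisor if divisor > 0 else 0
```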
… 4G of DRAM to reduce Queue contention and unrealistic read amplification
…lcommons#219)

* feat: Replace legacy spillover logic with Waterfall LRU architecture


* Fix two runtime errors in RAG-enabled benchmark mode


* Add detailed README.md for running the different invocations of kv-cache.py

* fix: line endings from dos2unix; increase cpu memory to 4GB for mlperf invocation

* Update MLperf v3 KV cache proposal.md to recommend using a minimum of 4G of DRAM to reduce Queue contention and unrealistic read amplification
…on suite, and production tooling for MLPerf Storage v3

Summary

This commit transforms the MLPerf KV Cache benchmark from a standalone simulation tool into a production-ready validation framework. The core changes address a fundamental measurement problem discovered during real-world testing: wall-clock throughput masks true storage performance under concurrent workloads. We introduce storage throughput as the canonical metric, validate the benchmark against LMCache/vLLM production systems, and add the tooling necessary for reproducible MLPerf submissions.

Changes

1. Storage Throughput Metric (Critical Fix)

Added `storage_throughput_tokens_per_sec` to the benchmark output. This metric divides total tokens by total storage I/O time rather than elapsed wall-clock time.

Why this matters: With 50 concurrent users and async NVMe I/O, wall-clock throughput showed NVMe-only at 13,500 tok/s. The actual storage throughput is 263 tok/s. The 51x discrepancy is not a bug—it is a concurrency illusion where elapsed time hides overlapping I/O operations. Storage throughput eliminates this artifact and enables fair comparison across tiers.

Validation results confirm the expected hierarchy:
- GPU HBM: 1,691 tok/s (6.4x vs NVMe)
- GPU+CPU: 1,546 tok/s (5.9x vs NVMe)
- GPU+CPU+NVMe: 1,175 tok/s (4.4x vs NVMe)
- NVMe Only: 263 tok/s (baseline)
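The distinction between the two metrics can be shown with a toy calculation (the helper names and the 74 s / 3,800 s figures are illustrative; only the formulas mirror the description above):

```python
# Wall-clock vs. storage throughput under concurrency (illustrative only).
# With overlapping async I/O, elapsed time understates per-operation cost.

def wall_clock_throughput(total_tokens: int, elapsed_s: float) -> float:
    return total_tokens / elapsed_s

def storage_throughput(total_tokens: int, total_storage_io_s: float) -> float:
    # Divides by the *sum* of individual I/O times, not elapsed wall time,
    # so overlapped operations no longer inflate the result.
    return total_tokens / total_storage_io_s

# 50 concurrent users moving 1,000,000 tokens in 74 s of wall time,
# while the individual I/O operations summed to 3,800 s:
print(wall_clock_throughput(1_000_000, 74))   # ~13,500 tok/s, flattered by overlap
print(storage_throughput(1_000_000, 3_800))   # ~263 tok/s, the tier's real rate
```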

2. ShareGPT Dataset Integration

Merged `kv-cache_sharegpt_replay.py` into the main `kv-cache.py` implementation. The benchmark now supports both synthetic workloads and trace replay from the ShareGPT conversation dataset.

New arguments:
- `--dataset-path`: Path to ShareGPT JSON
- `--max-conversations`: Limit conversations loaded
- `--request-rate`: Control request arrival rate
- `--max-requests`: Fixed-length benchmark runs

The ShareGPT workload exhibits different characteristics from the synthetic one: a 93% cache hit rate vs 50-70% synthetic, and a 133-token mean context vs 2,676 tokens synthetic. Both modes remain available: ShareGPT for production fidelity, synthetic for stress testing.

3. LMCache/vLLM Validation Suite

Created `utils/validate_lmcache.sh`, a 988-line comprehensive test harness that runs parallel benchmarks against:
- vLLM baseline (no KV caching)
- LMCache GPU-only
- LMCache CPU offload
- kv-cache.py across all tier configurations

The script includes:
- Automated hardware detection
- Environment variable configuration for LMCache
- Multi-trial averaging with statistical analysis
- Side-by-side comparison report generation

Validated against vLLM 0.13.0 and LMCache 0.3.12 on NVIDIA H100 NVL hardware.

4. Validation Results Documentation

Added `vllm_lmcache_validate/validation_results.md`, a 545-line technical document covering:
- Raw trial data across 12 benchmark configurations
- Per-tier latency analysis (P95 breakdown by GPU/CPU/NVMe)
- I/O volume analysis confirming 12:1 read/write ratio
- Bandwidth efficiency analysis explaining why trace replay shows low utilization
- KV cache sizing derivation (128KB per token for Mistral-7B)
- Complete CLI invocations for reproducibility

This document serves as the reference for understanding benchmark behavior and validating future results.

5. Excel Export and JSON Processing

Added `utils/json_to_xlsx.py` for converting benchmark JSON output to Excel format. The tool:
- Extracts both wall-clock and storage throughput metrics
- Calculates storage throughput from raw fields when summary data is unavailable
- Supports batch processing of result directories
- Falls back to CSV when openpyxl is not installed

6. Hardware Discovery Scripts

Added `utils/discovery.sh` and `utils/discovery_sharegpt.sh` for automated system interrogation:
- GPU detection via nvidia-smi
- Memory and CPU topology
- NVMe device enumeration
- Recommended benchmark parameters based on hardware profile

7. Unit Test Suite

Added `tests/test_kv_cache.py` with pytest coverage for:
- Model configuration and KV cache size calculations
- All three storage backends (GPU, CPU, NVMe)
- Multi-tier cache allocation and eviction
- Conversation manager state tracking
- User simulation and QoS distribution

GPU tests auto-skip when CUDA is unavailable.

8. Documentation Updates

Rewrote `README.md` with:
- Architecture diagrams showing waterfall eviction
- ShareGPT download instructions (wget commands)
- Complete CLI reference with all new arguments
- MLPerf submission guidelines with exact invocations
- Troubleshooting section for common issues

Files Added

utils/
  validate_lmcache.sh      - LMCache comparison suite (988 lines)
  json_to_xlsx.py          - Excel export utility
  discovery.sh             - Hardware detection
  discovery_sharegpt.sh    - ShareGPT-specific discovery

vllm_lmcache_validate/
  validation_results.md    - Technical validation document (545 lines)
  lmcache_results_20260106_233959/
    comparison_report.txt  - Side-by-side results
    kvcache_*.json         - Raw trial data (12 files)
    lmcache_*.json         - LMCache trial data (6 files)
    vllm_baseline_*.json   - vLLM baseline (3 files)
    system_info.txt        - Hardware snapshot

tests/test_kv_cache.py     - Unit test suite
requirements.txt           - Python dependencies

Files Modified

kv-cache.py
  - Added storage_throughput_tokens_per_sec to summary output
  - Added elapsed_time and total_storage_io_time fields
  - Merged ShareGPTDatasetLoader class
  - Added --dataset-path, --max-conversations, --request-rate, --max-requests arguments

README.md
  - Complete rewrite with ShareGPT documentation
  - Added architecture diagrams
  - Added MLPerf submission guidelines

MLperf v3 KV cache proposal.md
  - Added CHANGES-01-09-2026 section documenting all updates

Breaking Changes

None. All changes are additive. Existing invocations continue to work. The synthetic workload mode remains the default when `--dataset-path` is not specified.

Validation

Tested on:
- Supermicro SYS-621H-TN12R
- 2x Intel Xeon Silver 4510
- 256 GB DDR5-4800 ECC
- NVIDIA H100 NVL (94 GB HBM3)
- 7 TB NVMe SSD

Software:
- Ubuntu 22.04 (kernel 6.5.0-15)
- Python 3.10.12
- PyTorch 2.9.0+cu128
- vLLM 0.13.0
- LMCache 0.3.12

Results reproducible with `--seed 42`.
- Update recommended benchmark invocations based on 1,411+ discovery tests
- Add metric selection guidance: Decode Bytes Read (2.62x) at cpu_mem=0GB,
  Storage Throughput (2.2x) at cpu_mem=4GB
- Document that Storage Throughput is unreliable at cpu_mem=0GB (only 1.1x)
- Add trial requirements (3-5 trials) due to high variance (CV 50-125%)
- Update kv-cache-wrapper.sh MLPerf workload from 2 to 4 tests
- Include discovery_results_and_analysis/ with full test validation data
- Add pytest-html support for generating self-contained test reports
- Update multi-tier cache tests to handle GPU tier presence flexibly
- Fix tier order test to check expected tiers rather than exact order
- Add pytest_configure hook for custom report metadata
- Generate HTML report by default when running tests directly
Corrected values based on actual model configs:
- llama3.1-8b: 128 KB (was incorrectly ~0.5 MB)
- llama3.1-70b: 320 KB (was incorrectly ~5 MB)
- mistral-7b: 128 KB (was incorrectly ~0.5 MB)
- llama2-7b: 512 KB (MHA has 4x more KV heads than GQA)

The 70b model is ~2.5x larger per token than 8b (not 10x)
due to 80 layers vs 32 layers with same kv_heads=8.
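These per-token figures follow from the standard KV cache formula (K and V stored per layer, 2 bytes per fp16 element). A quick check, assuming head_dim=128 for all four models:

```python
# KV cache bytes per token =
#   2 (K and V) * layers * kv_heads * head_dim * dtype_bytes
def kv_bytes_per_token(layers, kv_heads, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

print(kv_bytes_per_token(32, 8) // 1024)   # llama3.1-8b (GQA): 128 KB
print(kv_bytes_per_token(80, 8) // 1024)   # llama3.1-70b (GQA): 320 KB
print(kv_bytes_per_token(32, 8) // 1024)   # mistral-7b (GQA): 128 KB
print(kv_bytes_per_token(32, 32) // 1024)  # llama2-7b (MHA): 512 KB
```

The 320/128 ratio also confirms the ~2.5x per-token difference between the 70b and 8b models noted above.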
- Add ConfigLoader class with YAML config file support and schema validation
- Add cfg() helper function for config-driven parameter access
- Add validate_args() with safety limits for protected system paths
- Rename all nvme_* metrics to storage_* for MLPerf terminology compliance
- Add extended QoS percentiles: P99.9 and P99.99 latency tracking
- Add per-tier bandwidth metrics (read/write GB/s per tier)
- Add per-tier KV bytes tracking for detailed storage analysis
- Fix GPU metadata desync bug via on_eviction_callback pattern
- Change eviction from single-shot to iterative loop until space freed
- Replace print statements with Python logging module
- Add waterfall LRU eviction with configurable high/low watermarks
- Add storage_health section with PASS/FAIL criteria
- Add storage_throughput_tokens_per_sec as primary MLPerf metric
- Add -c DIR option for custom config directory
- Generate and pass config.yaml to Python script via --config flag
- Add --xlsx-output support for Excel export
- Update jq queries for new storage_* metric names
- Add mlperf_submission workload with required trial parameters
- Enhance system detection for thread counts and memory limits
- Update metric parsing for storage_throughput primary metric
- Add 170+ tests covering all new functionality
- Add ConfigLoader tests: schema validation, defaults, file loading
- Add cfg() helper tests for config-driven parameters
- Add validate_args() tests for path safety and input validation
- Add extended QoS tests for P99.9 and P99.99 percentiles
- Add GPU eviction callback tests for metadata sync
- Add per-tier bandwidth and KV bytes metric tests
- Add storage_* metric naming tests for MLPerf compliance
- Add waterfall eviction tests with high/low watermarks
- Add storage_health PASS/FAIL criteria tests
- Add Configuration section with YAML parameter reference
- Add MLPerf Submission Guidelines with validated commands
- Add Excel metrics reference table with all output columns
- Add installation instructions including pyyaml dependency
- Add CLI arguments vs config file precedence documentation
- Add workload definitions and tier configuration examples
- Add troubleshooting section for common issues
- Add kv-cache-test-report.html with full test execution results
- All 170+ tests passing for v3.0 features
- Create unit_test_results directory for test artifacts
- Add P99.9 and P99.99 latency columns
- Add per-tier KV bytes columns (GPU, CPU, Storage)
- Add per-tier bandwidth columns (read/write GB/s)
- Add storage tier device vs host latency breakdown
- Rename nvme_entries to storage_entries for MLPerf compliance
- Add storage_throughput_tokens_per_sec as primary metric
- Add pyyaml>=6.0 for YAML configuration file parsing
- Required for ConfigLoader and --config CLI argument
- Add user_templates section with conversation patterns
- Add qos_profiles with latency thresholds per tier
- Add eviction settings with waterfall LRU parameters
- Add storage_health criteria for PASS/FAIL determination
- Add cache_sizing defaults for GPU/CPU/Storage tiers
- Provides validated defaults for all tunable parameters
Split the single ~3500-line kv-cache.py into a structured Python package
(kv_cache/) with 12 modules. Added MLA attention support, NVMe capacity
management, SSD preconditioning, disaggregated inference modes, and
streaming BurstGPT trace replay. Updated proposal and README with
corrected DeepSeek-V3 MLA calculations, capacity planning scope notes,
and repo cleanup.

Structural changes:
- kv_cache/ package: __init__, _compat, config, models, backends, cache,
  conversation, prefix_cache, rag, monitoring, workload, benchmark, cli
- kv-cache.py is now a thin shim importing from kv_cache
- Added pyproject.toml for pip-installable package

New features:
- MLA attention support (DeepSeek-V3: 70,272 bytes/token vs 1.7M MHA)
- 4 new models: deepseek-v3, qwen3-32b, gpt-oss-120b, gpt-oss-20b
- NVMe capacity tracking with LRU eviction (prevents disk exhaustion)
- SSD preconditioning (--precondition)
- Disaggregated inference (--prefill-only, --decode-only)
- Streaming BurstGPT trace replay (--trace-speedup, --replay-cycles)
- Config-driven model definitions via config.yaml
- RAG retrieval distribution (zipfian/uniform), document eviction

Documentation:
- Corrected DeepSeek-V3 from MHA formula to MLA in all capacity tables
- Scoped capacity planning claims to storage throughput (no tier promotion)
- Restructured GDS section around production GPU-origin KV cache
- Added NVMe terminology note (benchmark works with any block device)
- Fixed stale class names and default ranges in README

Repo cleanup:
- Moved kv-cache-wrapper.sh to utils/
- Added utils/run_benchmarks_256gb.sh
- Removed kv-cache_sharegpt_replay.py (merged into package)
- Removed discovery_results_and_analysis/, lmcache_results_*, proposal PDF
README: Corrected DeepSeek-V3 KV cache from MHA formula (1,748,992
bytes/token, 1.7 MB) to MLA formula (70,272 bytes/token, 69 KB).
Updated all derived tables: per-user RAM 13.4 GB -> 0.54 GB, removed
from 128 GB exclusion list, fixed model reference table.

Moved validate.sh to utils/ alongside other shell scripts.
The code reads decode_batch_size from config.yaml via
cfg('decode', 'batch_size', default=32). Updated the proposal
code snippet to match the actual implementation.
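A minimal sketch of what a cfg() helper like this might look like (hypothetical; the real implementation in the kv_cache package loads config.yaml and validates a schema, which is omitted here):

```python
# Hypothetical cfg() helper: walk nested config keys with a fallback default.
_CONFIG = {"decode": {"batch_size": 32}}  # stand-in for the parsed config.yaml

def cfg(*keys, default=None):
    """Return _CONFIG[k1][k2]... or `default` if any key is missing."""
    node = _CONFIG
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

print(cfg('decode', 'batch_size', default=32))  # -> 32
```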
The "Two Separate Eviction Mechanisms" section now explicitly
distinguishes metadata-only eviction (ConversationManager removes
dict entries; .npy files remain on disk) from physical file deletion
(MultiTierCache calls path.unlink(), permanently removing .npy files
from the filesystem). Added actual code paths from backends.py and
cache.py to replace the pseudocode.
Removed optional dependencies and ShareGPT dataset loader from kv-cache.py.
Russ Fellows and others added 23 commits February 19, 2026 08:10
… checkpoint I/O

Merge streaming checkpoint implementation from streaming-checkpoint-poc branch
to complete the dgen-py optimization feature set.

This provides two complementary optimizations:
1. dgen-py integration: 155x faster data generation (already in dlio_benchmark/)
2. StreamingCheckpointing: Producer-consumer pattern with minimal memory footprint

StreamingCheckpointing Features:
- Producer-consumer architecture with shared memory buffers
- Multi-backend support (file, s3dlio) via StorageWriter interface
- Buffer pool pattern (4 buffers default, ~128MB vs 24GB for original)
- Overlapping generation and I/O for maximum throughput
- Configurable fadvise modes (none, sequential, dontneed)

Example Usage:
  checkpoint = StreamingCheckpointing(
      chunk_size=32 * 1024 * 1024,  # 32 MB chunks
      num_buffers=4,                 # 128 MB total memory
      use_dgen=True,                 # Use dgen-py for generation
      fadvise_mode='dontneed'        # Drop pages after write
  )
  checkpoint.write_checkpoint(output_path, total_bytes)

Test Suite:
- tests/checkpointing/compare_methods.py demonstrates both approaches:
  - Method 1: Original DLIO (pre-generate all data, uses dgen-py)
  - Method 2: Streaming (producer-consumer, uses dgen-py + StreamingCheckpointing)
  - Method 3: S3Checkpoint compatibility layer test

Files Added:
- mlpstorage/checkpointing/__init__.py
- mlpstorage/checkpointing/streaming_checkpoint.py (427 lines)
- mlpstorage/checkpointing/storage_writers/__init__.py
- mlpstorage/checkpointing/storage_writers/base.py
- mlpstorage/checkpointing/storage_writers/file_writer.py
- mlpstorage/checkpointing/storage_writers/s3dlio_writer.py

This completes the checkpoint optimization work, providing both:
- Speed: dgen-py 155x faster generation
- Memory: StreamingCheckpointing reduces memory from 24GB to 128MB for 24GB checkpoint
- Implement StreamingCheckpointing with producer-consumer pattern
- Add storage writers for s3dlio, minio, and s3torch backends
- Support multi-endpoint load balancing via environment variables
- Enable concurrent checkpoint I/O without blocking training loops
- Add test_streaming_backends.py for multi-library backend testing
- Add demo_checkpoint_methods.sh to demonstrate different checkpoint approaches
- Add demo_streaming_checkpoint.sh for interactive streaming checkpoint demo
- Update tests/README.md with detailed test documentation
- Add MULTI_ENDPOINT_GUIDE.md with comprehensive multi-endpoint documentation
- Add Streaming-Chkpt-Guide.md with StreamingCheckpointing usage guide
- Add pr-stream-chkpt/ directory with PR-specific documentation
- Update README.md with StreamingCheckpointing section
- Remove redundant MULTI_ENDPOINT.md and PR_Readiness_Plan.md
- Update .gitignore to exclude Test-Backup/ and development artifacts
- Remove hardcoded AWS credentials from test_streaming_backends.py
- Remove hardcoded AWS credentials from test_mlp_*.sh scripts
- Replace with environment variable validation and helpful error messages
- Remove internal IP address exposure (172.16.1.40)
- All tests now require AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_ENDPOINT_URL to be set
- Workflow documented elsewhere, not needed in PR
This change enables users to clone the fork and get a complete working
environment with all multi-library storage and StreamingCheckpointing
features without needing to separately manage the dlio_benchmark fork.

Note: This change is ONLY for the integrated-main branch in the personal
fork. The formal PR mlcommons#249 to mlcommons/storage maintains the upstream
argonne-lcf/dlio_benchmark reference.
…ences

- Remove outdated docs files: IMPLEMENTATION_COMPARISON.md, STORAGE_LIBRARY_HANDOFF.md, TF_ObjectBranch-Strategy.md
- Remove all azstoragetorch references from STORAGE_LIBRARIES.md (library removed from project)
- Remove specific performance numbers from PERFORMANCE_TESTING.md (environment-dependent)
- Update PERFORMANCE_TESTING.md to show relative performance only
- Rewrite STORAGE_LIBRARY_TESTING_STATUS.md to focus on HOW to run tests
- Update documentation to reflect 3 supported libraries: s3dlio, minio, s3torchconnector
- Remove azstoragetorch support from benchmark_write_comparison.py
- Remove azstoragetorch support from benchmark_read_comparison.py
- Update documentation to reflect 3 supported libraries (s3dlio, minio, s3torchconnector)
- Remove azstoragetorch examples from PARQUET_FORMATS.md
- Update QUICK_START.md and README_S3DLIO_CONFIGS.md
- Delete outdated HANDOFF_2026-02-07.md document

azstoragetorch was never fully integrated and is not part of the project scope.
The 3 core storage libraries provide complete S3/Azure/GCS coverage via s3dlio.
Update s3dlio dependency to require version 0.9.50 or newer from PyPI.
This version includes all necessary features for multi-library storage
support and StreamingCheckpointing.
Remove dlio_benchmark directory from git repository since it's now
installed as a dependency from GitHub. This eliminates redundancy:

- dlio_benchmark is installed via: git+https://github.com/russfellows/dlio_benchmark.git@main
- Local directory kept for development but not tracked in git
- Added dlio_benchmark/ to .gitignore
- Backup created: Test-Backup/dlio_benchmark_full_20260219_105808.tar.gz

This makes the repository cleaner and ensures users get dlio_benchmark
from the correct source (russfellows fork with multi-library support).
- Add dgen-py>=0.2.0, minio, s3torchconnector to dependencies
- Remove native Azure backend support (Azure only via s3dlio with az:// URIs)
- Update documentation to clarify Azure Blob Storage exclusively via s3dlio
- Remove broken references to azure_writer.AzureStorageWriter
…support

When --io-trace-log <path> is specified the benchmark runs in pure logical
trace mode: no real GPU/CPU/NVMe I/O is performed. Instead every KV cache
operation is recorded to a structured CSV file for offline replay by an
external storage tool (fio, sai3-bench, warp, etc.).

This enables clean separation between workload generation (what the
benchmark does) and storage validation (what an external tool measures),
which is essential for MLPerf Storage submission workflows.

New flags
---------
--io-trace-log <path>
    Activates trace mode. Path ending in .zst enables streaming zstd
    compression (level 3, ~10-20x ratio). Requires the 'zstandard' package.

--num-gpus N  (default: 1)
    Total GPUs in the tensor-parallel group.
    Effective GPU tier capacity = N x --gpu-mem-gb.
    Example: --num-gpus 8 --gpu-mem-gb 141 models an 8xH200 node (1128 GB HBM).

--tensor-parallel N  (default: 1)
    TP degree for KV cache sharding. Per-rank object sizes in the trace,
    cache stats, and XLSX export are divided by N.
    Must be >= 1 and <= --num-gpus. Non-power-of-2 values emit a warning.

CSV output format
-----------------
Columns: Timestamp, Operation, Object_Size_Bytes, Tier, Key, Phase
  Timestamp        Unix epoch (float, 6 decimal places)
  Operation        'Write' or 'Read'
  Object_Size_Bytes  TP-adjusted byte size of the KV cache object
  Tier             'Tier-0' (GPU), 'Tier-1' (CPU), 'Tier-2' (NVMe)
  Key              Cache entry identifier for replay tool correlation
  Phase            'Prefill', 'Decode', or 'Evict'
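A trace row following this schema could be emitted like the sketch below (illustrative only; the real IOTracer in kv_cache/tracer.py is thread-safe and adds optional zstd compression). Note the TP-adjusted per-rank object size:

```python
# Illustrative trace-row emission following the CSV schema above.
import csv
import io
import time

def trace_row(writer, op, size_bytes, tier, key, phase, tensor_parallel=1):
    # Per-rank object size: divided by the TP degree, as described above.
    writer.writerow([f"{time.time():.6f}", op,
                     size_bytes // tensor_parallel, tier, key, phase])

buf = io.StringIO()
w = csv.writer(buf)
w.writerow(["Timestamp", "Operation", "Object_Size_Bytes", "Tier", "Key", "Phase"])
# A 1 MiB KV object written to the GPU tier, sharded across TP=8 ranks:
trace_row(w, "Write", 8 * 131072, "Tier-0", "conv42_turn3", "Prefill",
          tensor_parallel=8)
print(buf.getvalue())
```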

Files changed
-------------
kv_cache/tracer.py      New. IOTracer: thread-safe CSV writer with optional
                        zstd compression, Key and Phase columns, context-manager
                        support, clean close() sequence.
kv_cache/backends.py    New NullBackend: no-op write/read that tracks byte
                        counts only; used for all tiers in trace mode.
kv_cache/cache.py       MultiTierCache accepts io_tracer= and tensor_parallel=;
                        TP-adjusted size_bytes in all trace rows; per-rank
                        data slicing in real mode.
kv_cache/benchmark.py   IntegratedBenchmark accepts io_trace_log=, num_gpus=,
                        tensor_parallel=; manages IOTracer lifecycle; banner
                        shows '8x 141 GB GPU (total 1128 GB HBM) | TP=8'.
kv_cache/cli.py         --io-trace-log, --num-gpus, --tensor-parallel args;
                        XLSX export includes Num GPUs, Tensor Parallel, and
                        Total GPU Memory columns.
kv_cache/workload.py    Validates TP <= num_gpus; warns if TP not power-of-2;
                        MAX_GPU_MEMORY_GB 1024->65536; MAX_CPU_MEMORY_GB
                        16384->131072 to support large multi-GPU nodes.
pyproject.toml          'compression' optional extra (zstandard>=0.21);
                        included in 'full' extra.
docs/io_trace_log_usage.md  New user guide: all flags, CSV schema, compression
                        size estimates, seven ready-to-run examples (single GPU,
                        8xH200 TP=8, prefill-only, decode-only, DeepSeek V3),
                        trace inspection shell snippets, model table.
feat: add --io-trace-log trace with tensor-parallel & multi-GPU
Replaces the legacy KVCacheGenerator approach (one fixed 256 MB NumPy
buffer re-used for every write) with a double-buffered producer-consumer
pool backed by dgen-py (GIL-free Rayon Xoshiro256++). Every buffer
produced is unique; no block is ever repeated across time.

Background
----------
The old KVCacheGenerator allocated a single 256 MB float16 array at
startup (seeded with np.random.default_rng) and served every subsequent
generate() call as a per-key hash-offset view into that same pool.
At dataset scales above ~1 TB this produces ~97% block-level dedup savings
and ~1.12x zstd compressibility — making any storage benchmark using it
susceptible to being gamed by dedup/compression-capable storage tiers.

The new DataGeneratorPool uses dgen-py fill_chunk() (Rayon-parallel
Xoshiro256++, SIMD-accelerated, GIL-free) to produce cryptographic-quality
random bytes. Measured: 0% dedup savings and 1.00x compression at every
dataset size.

Changes
-------
kv_cache/data_producer.py   New. DataGeneratorPool: double-buffered pool
                            with configurable buffer size and worker count.
                            Producer thread runs dgen_py.Generator.fill_chunk()
                            while consumer holds the previous buffer, ensuring
                            no generation stall. Thread-safe handoff via
                            threading.Event. stop() cleanly joins the producer.
kv_cache/cache.py           KVCacheGenerator replaced with DataGeneratorPool;
                            generate() now draws from the live pool buffer
                            instead of indexing a static precomputed array.

Test / analysis artifacts
-------------------------
tests/bench_datagen_comparison.py
    Self-contained benchmark comparing LegacyKVCacheGenerator (pre-PR) vs
    InlineDgenPool (dgen-py) across: generation throughput, zstd-1
    compressibility, and SHA-256 block-level dedup rate. Supports
    --write-gb, --analyze-existing, --java-heap-mb, --block-size-kb,
    --entry-mb, --data-dir. Calls vdbench dsim with configurable Java heap
    (default 8 GB) and falls back to native SHA-256 analysis.

docs/datagen_dedup_analysis.md
    Full write-up of the Feb 26 2026 analysis run on 10 GB files. Documents
    the birthday-problem scaling behaviour of the old pooled generator,
    explains why naive dedup predictions fail at small dataset sizes
    (hash-scattered offsets vs sequential cycling), and provides the
    per-scale dedup table (10 GB → 10 TB). Includes raw vdbench dsim and
    SHA-256 outputs for both methods plus the vdbench heap workaround.

Performance (measured on NVMe)
-------------------------------
Generation throughput (no I/O):
  Old method:  ~4,300 GB/s  (memory copy within cached 256 MB buffer)
  New method:  ~36 GB/s     (Xoshiro256++ SIMD fill — real data generation)

NVMe write throughput: ~1.0 GB/s for both (I/O bound, not generation bound)

Data quality at 10 GB (4 KB block dedup):
  Old method:  1.02:1 dedup ratio, 1.12x zstd compression
  New method:  1.00:1 dedup ratio, 1.00x zstd compression (incompressible)
feat: zero-copy data generation via dgen-py producer-consumer pool
…ool)

bench_fill_comparison.py: three-section benchmark isolating the fill
function as the only variable between two identical producer-consumer
pools. Replaces the old single-buffer-reuse baseline (which produced
100% deduplicatable data) with a continuously-regenerating pool for both
backends.

Sections:
  1. Single-fill latency (1 thread, 10 iterations) — irreducible fill cost
  2. Pure fill throughput (N threads, no queues, no consumer) — max fill rate
  3. End-to-end consumer get_view() throughput — full pipeline

Results on 12-core Xeon (4 producers, 256 MB buffers):
  Single fill:  numpy 0.63 GB/s vs dgen-py 37.80 GB/s (60x)
  Pure fill:    numpy 2.63 GB/s vs dgen-py 39.66 GB/s (15x)
  Consumer:     numpy 2.70 GB/s vs dgen-py 39.76 GB/s (15x)

docs/fill_comparison_results.md: results, methodology, how-to-run, and
rebuttal to single-buffer numpy comparisons.
feat: add fill-rate comparison benchmark (numpy vs dgen-py)
@russfellows russfellows requested a review from a team March 3, 2026 17:03
@russfellows russfellows requested a review from a team as a code owner March 3, 2026 17:03
@github-actions

github-actions bot commented Mar 3, 2026

MLCommons CLA bot:
Thank you very much for your submission, we really appreciate it. Before we can accept your contribution, we ask that you sign the MLCommons CLA (Apache 2). Please use this [Google form](https://forms.gle/Ew1KkBVpyeJDuRw67) to initiate authorization. If you are from an MLCommons member organization, we will request that you be added to the CLA. If you are not from a member organization, we will email you a CLA to sign. For any questions, please contact support@mlcommons.org.
5 out of 8 committers have signed the MLCommons CLA.
@hazemawadalla
@FileSystemGuy
@BarnacleBob
@dslik
@idevasena
Eva Luator
Russ Fellows
@russfellows
Eva Luator and Russ Fellows do not appear to be GitHub users. You need a GitHub account once you become an MLCommons member. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request

@russfellows russfellows closed this Mar 3, 2026
@github-actions github-actions bot locked and limited conversation to collaborators Mar 3, 2026
@russfellows
Author

This was being merged into the wrong fork. Sorry.
