
Tf vdb bench - Integration#256

Closed
russfellows wants to merge 67 commits into mlcommons:TF_VDBBench from russfellows:TF_VDBBench

Conversation

@russfellows

This PR will hopefully merge the VDB Bench branch into the main branch of this repository, enabling easier integration back into the MLCommons storage repository.

hazemawadalla and others added 30 commits November 21, 2025 11:47
This commit introduces a comprehensive KV Cache benchmark suite designed to
measure storage system performance under AI/ML inference workloads, specifically
targeting Large Language Model (LLM) key-value cache operations.

Key components added:
- Core benchmark scripts (kv-cache.py, kv-cache_sharegpt_replay.py)
- Benchmark wrapper and validation tools (kv-cache-wrapper.sh, validate.sh)
- Comprehensive proposal documentation for MLPerf Storage v3 integration
- README with benchmark overview and usage guidelines

The benchmark simulates realistic LLM inference patterns including:
- Key-value cache read/write operations
- Mixed sequential and random access patterns
- Multi-threaded concurrent access scenarios
- Conversation-based workload replay using ShareGPT dataset

This work addresses the growing need to standardize storage performance
measurements for AI inference workloads and provides a foundation for
MLPerf Storage v3.0 KV cache benchmark specification.
Add initial KV Cache benchmark implementation for MLPerf Storage v3
This is a major architectural upgrade to the core benchmark logic, replacing
the original "Spillover" memory management strategy with the new "Waterfall
LRU" implementation to accurately simulate enterprise storage hierarchies.

Key Changes:
- Waterfall Eviction: Implemented recursive eviction (GPU -> CPU -> NVMe).
  New data now correctly lands in the fastest available tier, pushing cold
  data down, rather than the old behavior where new data skipped directly
  to NVMe if RAM was full.
- Static Buffer Optimization: Replaced the CPU-bound np.random generation
  with a pre-allocated static noise buffer. This removes the CPU bottleneck
  that was masking true storage latency, allowing us to fully saturate
  high-performance NVMe drives.
- Concurrency Hardening: Added semaphore-based concurrency limits
  (max_concurrent_allocs) and atomic memory reservations to prevent OOM
  crashes under heavy load.
- Storage Metrics: Added explicit tracking for nvme_tokens_processed to
  calculate true storage throughput separate from system throughput.
- Stress Test Validation: Verified that this new architecture correctly
  exposes storage latency limits (e.g., pushing P95 write latency >1000ms)
  where the old script artificially throttled the load.
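The waterfall eviction described above can be sketched as follows. This is a simplified, entry-count-based model for illustration only (the class and tier names are hypothetical); the real MultiTierCache tracks bytes and uses high/low watermarks:

```python
# Hypothetical sketch of waterfall eviction (GPU -> CPU -> NVMe): inserting
# into a full tier recursively pushes the coldest entry down one level,
# rather than letting new data skip straight to NVMe.
from collections import OrderedDict

TIERS = ["gpu", "cpu", "nvme"]

class WaterfallCache:
    def __init__(self, capacities):
        self.capacity = capacities                      # max entries per tier
        self.store = {t: OrderedDict() for t in TIERS}  # insertion order = LRU order

    def put(self, key, value, tier="gpu"):
        store = self.store[tier]
        while len(store) >= self.capacity[tier]:
            cold_key, cold_val = store.popitem(last=False)  # evict the LRU entry
            nxt = TIERS.index(tier) + 1
            if nxt < len(TIERS):
                self.put(cold_key, cold_val, TIERS[nxt])    # push cold data down
        store[key] = value  # new data always lands in the requested (fastest) tier
```

With capacities of one entry per upper tier, inserting "a", "b", "c" leaves "c" on GPU, "b" on CPU, and "a" on NVMe — the waterfall behavior, as opposed to "c" spilling directly to NVMe.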
This patch addresses two bugs that surface when running the benchmark
with --enable-rag:

1. Race condition in process_requests (line 2693)

   Worker threads begin processing requests immediately upon benchmark
   start, while RAG document ingestion runs in a separate daemon thread.
   When a worker hits the 10% RAG query path before any documents have
   been ingested, random.choice() is called on an empty list, raising
   IndexError.

   Fixed by adding a truthiness check on self.rag_manager.documents
   before entering the RAG code path. An empty dict evaluates to False,
   so RAG queries are safely skipped until ingestion populates at least
   one document.

2. Division by zero in KVCacheGenerator.generate (line 1097)

   The buffer slicing logic uses modulo to compute a pseudo-random start
   index: seed % (buffer_size - total_elements). When total_elements
   exactly equals buffer_size (an edge case permitted by the <= guard),
   the divisor becomes zero, raising ZeroDivisionError.

   Fixed by computing the divisor separately and defaulting start_idx
   to 0 when the divisor is zero.
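The second fix can be sketched in isolation (the function name and signature below are hypothetical; the real logic lives inside KVCacheGenerator.generate):

```python
# Sketch of the divide-by-zero fix described above. The original computed
# start_idx = seed % (buffer_size - total_elements), which raises
# ZeroDivisionError when total_elements exactly equals buffer_size.

def safe_start_index(seed: int, buffer_size: int, total_elements: int) -> int:
    """Pick a pseudo-random slice start, tolerating the full-buffer edge case."""
    assert total_elements <= buffer_size  # the <= guard mentioned above
    divisor = buffer_size - total_elements
    # When the request spans the whole buffer, the only valid start is 0.
    return seed % divisor if divisor > 0 else 0
```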
… 4G of DRAM to reduce Queue contention and unrealistic read amplification
…lcommons#219)

* feat: Replace legacy spillover logic with Waterfall LRU architecture


* Fix two runtime errors in RAG-enabled benchmark mode


* Add detailed README.md for running the different invocations of kv-cache.py

* fix: line endings from dos2unix; increase cpu memory to 4GB for mlperf invocation

* Update MLperf v3 KV cache proposal.md to recommend using a minimum of 4G of DRAM to reduce Queue contention and unrealistic read amplification
…on suite, and production tooling for MLPerf Storage v3

Summary

This commit transforms the MLPerf KV Cache benchmark from a standalone simulation tool into a production-ready validation framework. The core changes address a fundamental measurement problem discovered during real-world testing: wall-clock throughput masks true storage performance under concurrent workloads. We introduce storage throughput as the canonical metric, validate the benchmark against LMCache/vLLM production systems, and add the tooling necessary for reproducible MLPerf submissions.

Changes

1. Storage Throughput Metric (Critical Fix)

Added `storage_throughput_tokens_per_sec` to the benchmark output. This metric divides total tokens by total storage I/O time rather than elapsed wall-clock time.

Why this matters: With 50 concurrent users and async NVMe I/O, wall-clock throughput showed NVMe-only at 13,500 tok/s. The actual storage throughput is 263 tok/s. The 51x discrepancy is not a bug—it is a concurrency illusion where elapsed time hides overlapping I/O operations. Storage throughput eliminates this artifact and enables fair comparison across tiers.

Validation results confirm the expected hierarchy:
- GPU HBM: 1,691 tok/s (6.4x vs NVMe)
- GPU+CPU: 1,546 tok/s (5.9x vs NVMe)
- GPU+CPU+NVMe: 1,175 tok/s (4.4x vs NVMe)
- NVMe Only: 263 tok/s (baseline)
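The distinction between the two metrics can be shown with a toy calculation (the helper names and the 74 s / 3,800 s figures are illustrative; only the formulas mirror the description above):

```python
# Wall-clock vs. storage throughput under concurrency (illustrative only).
# With overlapping async I/O, elapsed time understates per-operation cost.

def wall_clock_throughput(total_tokens: int, elapsed_s: float) -> float:
    return total_tokens / elapsed_s

def storage_throughput(total_tokens: int, total_storage_io_s: float) -> float:
    # Divides by the *sum* of individual I/O times, not elapsed wall time,
    # so overlapped operations no longer inflate the result.
    return total_tokens / total_storage_io_s

# 50 concurrent users moving 1,000,000 tokens in 74 s of wall time,
# while the individual I/O operations summed to 3,800 s:
print(wall_clock_throughput(1_000_000, 74))   # ~13,500 tok/s, flattered by overlap
print(storage_throughput(1_000_000, 3_800))   # ~263 tok/s, the tier's real rate
```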

2. ShareGPT Dataset Integration

Merged `kv-cache_sharegpt_replay.py` into the main `kv-cache.py` implementation. The benchmark now supports both synthetic workloads and trace replay from the ShareGPT conversation dataset.

New arguments:
- `--dataset-path`: Path to ShareGPT JSON
- `--max-conversations`: Limit conversations loaded
- `--request-rate`: Control request arrival rate
- `--max-requests`: Fixed-length benchmark runs

The ShareGPT workload exhibits different characteristics from the synthetic one: a 93% cache hit rate vs 50-70% synthetic, and a 133-token mean context vs 2,676 tokens synthetic. Both modes remain available: ShareGPT for production fidelity, synthetic for stress testing.

3. LMCache/vLLM Validation Suite

Created `utils/validate_lmcache.sh`, a 988-line comprehensive test harness that runs parallel benchmarks against:
- vLLM baseline (no KV caching)
- LMCache GPU-only
- LMCache CPU offload
- kv-cache.py across all tier configurations

The script includes:
- Automated hardware detection
- Environment variable configuration for LMCache
- Multi-trial averaging with statistical analysis
- Side-by-side comparison report generation

Validated against vLLM 0.13.0 and LMCache 0.3.12 on NVIDIA H100 NVL hardware.

4. Validation Results Documentation

Added `vllm_lmcache_validate/validation_results.md`, a 545-line technical document covering:
- Raw trial data across 12 benchmark configurations
- Per-tier latency analysis (P95 breakdown by GPU/CPU/NVMe)
- I/O volume analysis confirming 12:1 read/write ratio
- Bandwidth efficiency analysis explaining why trace replay shows low utilization
- KV cache sizing derivation (128KB per token for Mistral-7B)
- Complete CLI invocations for reproducibility

This document serves as the reference for understanding benchmark behavior and validating future results.

5. Excel Export and JSON Processing

Added `utils/json_to_xlsx.py` for converting benchmark JSON output to Excel format. The tool:
- Extracts both wall-clock and storage throughput metrics
- Calculates storage throughput from raw fields when summary data is unavailable
- Supports batch processing of result directories
- Falls back to CSV when openpyxl is not installed

6. Hardware Discovery Scripts

Added `utils/discovery.sh` and `utils/discovery_sharegpt.sh` for automated system interrogation:
- GPU detection via nvidia-smi
- Memory and CPU topology
- NVMe device enumeration
- Recommended benchmark parameters based on hardware profile

7. Unit Test Suite

Added `tests/test_kv_cache.py` with pytest coverage for:
- Model configuration and KV cache size calculations
- All three storage backends (GPU, CPU, NVMe)
- Multi-tier cache allocation and eviction
- Conversation manager state tracking
- User simulation and QoS distribution

GPU tests auto-skip when CUDA is unavailable.

8. Documentation Updates

Rewrote `README.md` with:
- Architecture diagrams showing waterfall eviction
- ShareGPT download instructions (wget commands)
- Complete CLI reference with all new arguments
- MLPerf submission guidelines with exact invocations
- Troubleshooting section for common issues

Files Added

utils/
  validate_lmcache.sh      - LMCache comparison suite (988 lines)
  json_to_xlsx.py          - Excel export utility
  discovery.sh             - Hardware detection
  discovery_sharegpt.sh    - ShareGPT-specific discovery

vllm_lmcache_validate/
  validation_results.md    - Technical validation document (545 lines)
  lmcache_results_20260106_233959/
    comparison_report.txt  - Side-by-side results
    kvcache_*.json         - Raw trial data (12 files)
    lmcache_*.json         - LMCache trial data (6 files)
    vllm_baseline_*.json   - vLLM baseline (3 files)
    system_info.txt        - Hardware snapshot

tests/test_kv_cache.py     - Unit test suite
requirements.txt           - Python dependencies

Files Modified

kv-cache.py
  - Added storage_throughput_tokens_per_sec to summary output
  - Added elapsed_time and total_storage_io_time fields
  - Merged ShareGPTDatasetLoader class
  - Added --dataset-path, --max-conversations, --request-rate, --max-requests arguments

README.md
  - Complete rewrite with ShareGPT documentation
  - Added architecture diagrams
  - Added MLPerf submission guidelines

MLperf v3 KV cache proposal.md
  - Added CHANGES-01-09-2026 section documenting all updates

Breaking Changes

None. All changes are additive. Existing invocations continue to work. The synthetic workload mode remains the default when `--dataset-path` is not specified.

Validation

Tested on:
- Supermicro SYS-621H-TN12R
- 2x Intel Xeon Silver 4510
- 256 GB DDR5-4800 ECC
- NVIDIA H100 NVL (94 GB HBM3)
- 7 TB NVMe SSD

Software:
- Ubuntu 22.04 (kernel 6.5.0-15)
- Python 3.10.12
- PyTorch 2.9.0+cu128
- vLLM 0.13.0
- LMCache 0.3.12

Results reproducible with `--seed 42`.
- Update recommended benchmark invocations based on 1,411+ discovery tests
- Add metric selection guidance: Decode Bytes Read (2.62x) at cpu_mem=0GB,
  Storage Throughput (2.2x) at cpu_mem=4GB
- Document that Storage Throughput is unreliable at cpu_mem=0GB (only 1.1x)
- Add trial requirements (3-5 trials) due to high variance (CV 50-125%)
- Update kv-cache-wrapper.sh MLPerf workload from 2 to 4 tests
- Include discovery_results_and_analysis/ with full test validation data
- Add pytest-html support for generating self-contained test reports
- Update multi-tier cache tests to handle GPU tier presence flexibly
- Fix tier order test to check expected tiers rather than exact order
- Add pytest_configure hook for custom report metadata
- Generate HTML report by default when running tests directly
Corrected values based on actual model configs:
- llama3.1-8b: 128 KB (was incorrectly ~0.5 MB)
- llama3.1-70b: 320 KB (was incorrectly ~5 MB)
- mistral-7b: 128 KB (was incorrectly ~0.5 MB)
- llama2-7b: 512 KB (MHA has 4x more KV heads than GQA)

The 70b model is ~2.5x larger per token than 8b (not 10x)
due to 80 layers vs 32 layers with same kv_heads=8.
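These per-token figures follow from the standard KV cache formula (K and V stored per layer, 2 bytes per fp16 element). A quick check, assuming head_dim=128 for all four models:

```python
# KV cache bytes per token =
#   2 (K and V) * layers * kv_heads * head_dim * dtype_bytes
def kv_bytes_per_token(layers, kv_heads, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

print(kv_bytes_per_token(32, 8) // 1024)   # llama3.1-8b (GQA): 128 KB
print(kv_bytes_per_token(80, 8) // 1024)   # llama3.1-70b (GQA): 320 KB
print(kv_bytes_per_token(32, 8) // 1024)   # mistral-7b (GQA): 128 KB
print(kv_bytes_per_token(32, 32) // 1024)  # llama2-7b (MHA): 512 KB
```

The 320/128 ratio also confirms the ~2.5x per-token difference between the 70b and 8b models noted above.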
- Add ConfigLoader class with YAML config file support and schema validation
- Add cfg() helper function for config-driven parameter access
- Add validate_args() with safety limits for protected system paths
- Rename all nvme_* metrics to storage_* for MLPerf terminology compliance
- Add extended QoS percentiles: P99.9 and P99.99 latency tracking
- Add per-tier bandwidth metrics (read/write GB/s per tier)
- Add per-tier KV bytes tracking for detailed storage analysis
- Fix GPU metadata desync bug via on_eviction_callback pattern
- Change eviction from single-shot to iterative loop until space freed
- Replace print statements with Python logging module
- Add waterfall LRU eviction with configurable high/low watermarks
- Add storage_health section with PASS/FAIL criteria
- Add storage_throughput_tokens_per_sec as primary MLPerf metric
- Add -c DIR option for custom config directory
- Generate and pass config.yaml to Python script via --config flag
- Add --xlsx-output support for Excel export
- Update jq queries for new storage_* metric names
- Add mlperf_submission workload with required trial parameters
- Enhance system detection for thread counts and memory limits
- Update metric parsing for storage_throughput primary metric
- Add 170+ tests covering all new functionality
- Add ConfigLoader tests: schema validation, defaults, file loading
- Add cfg() helper tests for config-driven parameters
- Add validate_args() tests for path safety and input validation
- Add extended QoS tests for P99.9 and P99.99 percentiles
- Add GPU eviction callback tests for metadata sync
- Add per-tier bandwidth and KV bytes metric tests
- Add storage_* metric naming tests for MLPerf compliance
- Add waterfall eviction tests with high/low watermarks
- Add storage_health PASS/FAIL criteria tests
- Add Configuration section with YAML parameter reference
- Add MLPerf Submission Guidelines with validated commands
- Add Excel metrics reference table with all output columns
- Add installation instructions including pyyaml dependency
- Add CLI arguments vs config file precedence documentation
- Add workload definitions and tier configuration examples
- Add troubleshooting section for common issues
- Add kv-cache-test-report.html with full test execution results
- All 170+ tests passing for v3.0 features
- Create unit_test_results directory for test artifacts
- Add P99.9 and P99.99 latency columns
- Add per-tier KV bytes columns (GPU, CPU, Storage)
- Add per-tier bandwidth columns (read/write GB/s)
- Add storage tier device vs host latency breakdown
- Rename nvme_entries to storage_entries for MLPerf compliance
- Add storage_throughput_tokens_per_sec as primary metric
- Add pyyaml>=6.0 for YAML configuration file parsing
- Required for ConfigLoader and --config CLI argument
- Add user_templates section with conversation patterns
- Add qos_profiles with latency thresholds per tier
- Add eviction settings with waterfall LRU parameters
- Add storage_health criteria for PASS/FAIL determination
- Add cache_sizing defaults for GPU/CPU/Storage tiers
- Provides validated defaults for all tunable parameters
Split the single ~3500-line kv-cache.py into a structured Python package
(kv_cache/) with 12 modules. Added MLA attention support, NVMe capacity
management, SSD preconditioning, disaggregated inference modes, and
streaming BurstGPT trace replay. Updated proposal and README with
corrected DeepSeek-V3 MLA calculations, capacity planning scope notes,
and repo cleanup.

Structural changes:
- kv_cache/ package: __init__, _compat, config, models, backends, cache,
  conversation, prefix_cache, rag, monitoring, workload, benchmark, cli
- kv-cache.py is now a thin shim importing from kv_cache
- Added pyproject.toml for pip-installable package

New features:
- MLA attention support (DeepSeek-V3: 70,272 bytes/token vs 1.7M MHA)
- 4 new models: deepseek-v3, qwen3-32b, gpt-oss-120b, gpt-oss-20b
- NVMe capacity tracking with LRU eviction (prevents disk exhaustion)
- SSD preconditioning (--precondition)
- Disaggregated inference (--prefill-only, --decode-only)
- Streaming BurstGPT trace replay (--trace-speedup, --replay-cycles)
- Config-driven model definitions via config.yaml
- RAG retrieval distribution (zipfian/uniform), document eviction

Documentation:
- Corrected DeepSeek-V3 from MHA formula to MLA in all capacity tables
- Scoped capacity planning claims to storage throughput (no tier promotion)
- Restructured GDS section around production GPU-origin KV cache
- Added NVMe terminology note (benchmark works with any block device)
- Fixed stale class names and default ranges in README

Repo cleanup:
- Moved kv-cache-wrapper.sh to utils/
- Added utils/run_benchmarks_256gb.sh
- Removed kv-cache_sharegpt_replay.py (merged into package)
- Removed discovery_results_and_analysis/, lmcache_results_*, proposal PDF
README: Corrected DeepSeek-V3 KV cache from MHA formula (1,748,992
bytes/token, 1.7 MB) to MLA formula (70,272 bytes/token, 69 KB).
Updated all derived tables: per-user RAM 13.4 GB -> 0.54 GB, removed
from 128 GB exclusion list, fixed model reference table.

Moved validate.sh to utils/ alongside other shell scripts.
The code reads decode_batch_size from config.yaml via
cfg('decode', 'batch_size', default=32). Updated the proposal
code snippet to match the actual implementation.
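A minimal sketch of what a cfg() helper like this might look like (hypothetical; the real implementation in the kv_cache package loads config.yaml and validates a schema, which is omitted here):

```python
# Hypothetical cfg() helper: walk nested config keys with a fallback default.
_CONFIG = {"decode": {"batch_size": 32}}  # stand-in for the parsed config.yaml

def cfg(*keys, default=None):
    """Return _CONFIG[k1][k2]... or `default` if any key is missing."""
    node = _CONFIG
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

print(cfg('decode', 'batch_size', default=32))  # -> 32
```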
The "Two Separate Eviction Mechanisms" section now explicitly
distinguishes metadata-only eviction (ConversationManager removes
dict entries; .npy files remain on disk) from physical file deletion
(MultiTierCache calls path.unlink(), permanently removing .npy files
from the filesystem). Added actual code paths from backends.py and
cache.py to replace the pseudocode.
Removed optional dependencies and ShareGPT dataset loader from kv-cache.py.
Russ Fellows and others added 23 commits February 19, 2026 08:10
… checkpoint I/O

Merge streaming checkpoint implementation from streaming-checkpoint-poc branch
to complete the dgen-py optimization feature set.

This provides two complementary optimizations:
1. dgen-py integration: 155x faster data generation (already in dlio_benchmark/)
2. StreamingCheckpointing: Producer-consumer pattern with minimal memory footprint

StreamingCheckpointing Features:
- Producer-consumer architecture with shared memory buffers
- Multi-backend support (file, s3dlio) via StorageWriter interface
- Buffer pool pattern (4 buffers default, ~128MB vs 24GB for original)
- Overlapping generation and I/O for maximum throughput
- Configurable fadvise modes (none, sequential, dontneed)

Example Usage:
  checkpoint = StreamingCheckpointing(
      chunk_size=32 * 1024 * 1024,  # 32 MB chunks
      num_buffers=4,                 # 128 MB total memory
      use_dgen=True,                 # Use dgen-py for generation
      fadvise_mode='dontneed'        # Drop pages after write
  )
  checkpoint.write_checkpoint(output_path, total_bytes)

Test Suite:
- tests/checkpointing/compare_methods.py demonstrates both approaches:
  - Method 1: Original DLIO (pre-generate all data, uses dgen-py)
  - Method 2: Streaming (producer-consumer, uses dgen-py + StreamingCheckpointing)
  - Method 3: S3Checkpoint compatibility layer test

Files Added:
- mlpstorage/checkpointing/__init__.py
- mlpstorage/checkpointing/streaming_checkpoint.py (427 lines)
- mlpstorage/checkpointing/storage_writers/__init__.py
- mlpstorage/checkpointing/storage_writers/base.py
- mlpstorage/checkpointing/storage_writers/file_writer.py
- mlpstorage/checkpointing/storage_writers/s3dlio_writer.py

This completes the checkpoint optimization work, providing both:
- Speed: dgen-py 155x faster generation
- Memory: StreamingCheckpointing reduces memory from 24GB to 128MB for 24GB checkpoint
- Implement StreamingCheckpointing with producer-consumer pattern
- Add storage writers for s3dlio, minio, and s3torch backends
- Support multi-endpoint load balancing via environment variables
- Enable concurrent checkpoint I/O without blocking training loops
- Add test_streaming_backends.py for multi-library backend testing
- Add demo_checkpoint_methods.sh to demonstrate different checkpoint approaches
- Add demo_streaming_checkpoint.sh for interactive streaming checkpoint demo
- Update tests/README.md with detailed test documentation
- Add MULTI_ENDPOINT_GUIDE.md with comprehensive multi-endpoint documentation
- Add Streaming-Chkpt-Guide.md with StreamingCheckpointing usage guide
- Add pr-stream-chkpt/ directory with PR-specific documentation
- Update README.md with StreamingCheckpointing section
- Remove redundant MULTI_ENDPOINT.md and PR_Readiness_Plan.md
- Update .gitignore to exclude Test-Backup/ and development artifacts
- Remove hardcoded AWS credentials from test_streaming_backends.py
- Remove hardcoded AWS credentials from test_mlp_*.sh scripts
- Replace with environment variable validation and helpful error messages
- Remove internal IP address exposure (172.16.1.40)
- All tests now require AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_ENDPOINT_URL to be set
- Workflow documented elsewhere, not needed in PR
This change enables users to clone the fork and get a complete working
environment with all multi-library storage and StreamingCheckpointing
features without needing to separately manage the dlio_benchmark fork.

Note: This change is ONLY for the integrated-main branch in the personal
fork. The formal PR mlcommons#249 to mlcommons/storage maintains the upstream
argonne-lcf/dlio_benchmark reference.
…ences

- Remove outdated docs files: IMPLEMENTATION_COMPARISON.md, STORAGE_LIBRARY_HANDOFF.md, TF_ObjectBranch-Strategy.md
- Remove all azstoragetorch references from STORAGE_LIBRARIES.md (library removed from project)
- Remove specific performance numbers from PERFORMANCE_TESTING.md (environment-dependent)
- Update PERFORMANCE_TESTING.md to show relative performance only
- Rewrite STORAGE_LIBRARY_TESTING_STATUS.md to focus on HOW to run tests
- Update documentation to reflect 3 supported libraries: s3dlio, minio, s3torchconnector
- Remove azstoragetorch support from benchmark_write_comparison.py
- Remove azstoragetorch support from benchmark_read_comparison.py
- Update documentation to reflect 3 supported libraries (s3dlio, minio, s3torchconnector)
- Remove azstoragetorch examples from PARQUET_FORMATS.md
- Update QUICK_START.md and README_S3DLIO_CONFIGS.md
- Delete outdated HANDOFF_2026-02-07.md document

azstoragetorch was never fully integrated and is not part of the project scope.
The 3 core storage libraries provide complete S3/Azure/GCS coverage via s3dlio.
Update s3dlio dependency to require version 0.9.50 or newer from PyPI.
This version includes all necessary features for multi-library storage
support and StreamingCheckpointing.
Remove dlio_benchmark directory from git repository since it's now
installed as a dependency from GitHub. This eliminates redundancy:

- dlio_benchmark is installed via: git+https://github.com/russfellows/dlio_benchmark.git@main
- Local directory kept for development but not tracked in git
- Added dlio_benchmark/ to .gitignore
- Backup created: Test-Backup/dlio_benchmark_full_20260219_105808.tar.gz

This makes the repository cleaner and ensures users get dlio_benchmark
from the correct source (russfellows fork with multi-library support).
- Add dgen-py>=0.2.0, minio, s3torchconnector to dependencies
- Remove native Azure backend support (Azure only via s3dlio with az:// URIs)
- Update documentation to clarify Azure Blob Storage exclusively via s3dlio
- Remove broken references to azure_writer.AzureStorageWriter
…support

When --io-trace-log <path> is specified the benchmark runs in pure logical
trace mode: no real GPU/CPU/NVMe I/O is performed. Instead every KV cache
operation is recorded to a structured CSV file for offline replay by an
external storage tool (fio, sai3-bench, warp, etc.).

This enables clean separation between workload generation (what the
benchmark does) and storage validation (what an external tool measures),
which is essential for MLPerf Storage submission workflows.

New flags
---------
--io-trace-log <path>
    Activates trace mode. Path ending in .zst enables streaming zstd
    compression (level 3, ~10-20x ratio). Requires the 'zstandard' package.

--num-gpus N  (default: 1)
    Total GPUs in the tensor-parallel group.
    Effective GPU tier capacity = N x --gpu-mem-gb.
    Example: --num-gpus 8 --gpu-mem-gb 141 models an 8xH200 node (1128 GB HBM).

--tensor-parallel N  (default: 1)
    TP degree for KV cache sharding. Per-rank object sizes in the trace,
    cache stats, and XLSX export are divided by N.
    Must be >= 1 and <= --num-gpus. Non-power-of-2 values emit a warning.

CSV output format
-----------------
Columns: Timestamp, Operation, Object_Size_Bytes, Tier, Key, Phase
  Timestamp        Unix epoch (float, 6 decimal places)
  Operation        'Write' or 'Read'
  Object_Size_Bytes  TP-adjusted byte size of the KV cache object
  Tier             'Tier-0' (GPU), 'Tier-1' (CPU), 'Tier-2' (NVMe)
  Key              Cache entry identifier for replay tool correlation
  Phase            'Prefill', 'Decode', or 'Evict'
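A trace row following this schema could be emitted like the sketch below (illustrative only; the real IOTracer in kv_cache/tracer.py is thread-safe and adds optional zstd compression). Note the TP-adjusted per-rank object size:

```python
# Illustrative trace-row emission following the CSV schema above.
import csv
import io
import time

def trace_row(writer, op, size_bytes, tier, key, phase, tensor_parallel=1):
    # Per-rank object size: divided by the TP degree, as described above.
    writer.writerow([f"{time.time():.6f}", op,
                     size_bytes // tensor_parallel, tier, key, phase])

buf = io.StringIO()
w = csv.writer(buf)
w.writerow(["Timestamp", "Operation", "Object_Size_Bytes", "Tier", "Key", "Phase"])
# A 1 MiB KV object written to the GPU tier, sharded across TP=8 ranks:
trace_row(w, "Write", 8 * 131072, "Tier-0", "conv42_turn3", "Prefill",
          tensor_parallel=8)
print(buf.getvalue())
```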

Files changed
-------------
kv_cache/tracer.py      New. IOTracer: thread-safe CSV writer with optional
                        zstd compression, Key and Phase columns, context-manager
                        support, clean close() sequence.
kv_cache/backends.py    New NullBackend: no-op write/read that tracks byte
                        counts only; used for all tiers in trace mode.
kv_cache/cache.py       MultiTierCache accepts io_tracer= and tensor_parallel=;
                        TP-adjusted size_bytes in all trace rows; per-rank
                        data slicing in real mode.
kv_cache/benchmark.py   IntegratedBenchmark accepts io_trace_log=, num_gpus=,
                        tensor_parallel=; manages IOTracer lifecycle; banner
                        shows '8x 141 GB GPU (total 1128 GB HBM) | TP=8'.
kv_cache/cli.py         --io-trace-log, --num-gpus, --tensor-parallel args;
                        XLSX export includes Num GPUs, Tensor Parallel, and
                        Total GPU Memory columns.
kv_cache/workload.py    Validates TP <= num_gpus; warns if TP not power-of-2;
                        MAX_GPU_MEMORY_GB 1024->65536; MAX_CPU_MEMORY_GB
                        16384->131072 to support large multi-GPU nodes.
pyproject.toml          'compression' optional extra (zstandard>=0.21);
                        included in 'full' extra.
docs/io_trace_log_usage.md  New user guide: all flags, CSV schema, compression
                        size estimates, seven ready-to-run examples (single GPU,
                        8xH200 TP=8, prefill-only, decode-only, DeepSeek V3),
                        trace inspection shell snippets, model table.
feat: add --io-trace-log trace with tensor-parallel & multi-GPU
Replaces the legacy KVCacheGenerator approach (one fixed 256 MB NumPy
buffer re-used for every write) with a double-buffered producer-consumer
pool backed by dgen-py (GIL-free Rayon Xoshiro256++). Every buffer
produced is unique; no block is ever repeated across time.

Background
----------
The old KVCacheGenerator allocated a single 256 MB float16 array at
startup (seeded with np.random.default_rng) and served every subsequent
generate() call as a per-key hash-offset view into that same pool.
At dataset scales above ~1 TB this produces ~97% block-level dedup savings
and ~1.12x zstd compressibility — making any storage benchmark using it
susceptible to being gamed by dedup/compression-capable storage tiers.

The new DataGeneratorPool uses dgen-py fill_chunk() (Rayon-parallel
Xoshiro256++, SIMD-accelerated, GIL-free) to produce cryptographic-quality
random bytes. Measured: 0% dedup savings and 1.00x compression at every
dataset size.

Changes
-------
kv_cache/data_producer.py   New. DataGeneratorPool: double-buffered pool
                            with configurable buffer size and worker count.
                            Producer thread runs dgen_py.Generator.fill_chunk()
                            while consumer holds the previous buffer, ensuring
                            no generation stall. Thread-safe handoff via
                            threading.Event. stop() cleanly joins the producer.
kv_cache/cache.py           KVCacheGenerator replaced with DataGeneratorPool;
                            generate() now draws from the live pool buffer
                            instead of indexing a static precomputed array.

Test / analysis artifacts
-------------------------
tests/bench_datagen_comparison.py
    Self-contained benchmark comparing LegacyKVCacheGenerator (pre-PR) vs
    InlineDgenPool (dgen-py) across: generation throughput, zstd-1
    compressibility, and SHA-256 block-level dedup rate. Supports
    --write-gb, --analyze-existing, --java-heap-mb, --block-size-kb,
    --entry-mb, --data-dir. Calls vdbench dsim with configurable Java heap
    (default 8 GB) and falls back to native SHA-256 analysis.

docs/datagen_dedup_analysis.md
    Full write-up of the Feb 26 2026 analysis run on 10 GB files. Documents
    the birthday-problem scaling behaviour of the old pooled generator,
    explains why naive dedup predictions fail at small dataset sizes
    (hash-scattered offsets vs sequential cycling), and provides the
    per-scale dedup table (10 GB → 10 TB). Includes raw vdbench dsim and
    SHA-256 outputs for both methods plus the vdbench heap workaround.

Performance (measured on NVMe)
-------------------------------
Generation throughput (no I/O):
  Old method:  ~4,300 GB/s  (memory copy within cached 256 MB buffer)
  New method:  ~36 GB/s     (Xoshiro256++ SIMD fill — real data generation)

NVMe write throughput: ~1.0 GB/s for both (I/O bound, not generation bound)

Data quality at 10 GB (4 KB block dedup):
  Old method:  1.02:1 dedup ratio, 1.12x zstd compression
  New method:  1.00:1 dedup ratio, 1.00x zstd compression (incompressible)
feat: zero-copy data generation via dgen-py producer-consumer pool
…ool)

bench_fill_comparison.py: three-section benchmark isolating the fill
function as the only variable between two identical producer-consumer
pools. Replaces the old single-buffer-reuse baseline (which produced
100% deduplicatable data) with a continuously-regenerating pool for both
backends.

Sections:
  1. Single-fill latency (1 thread, 10 iterations) — irreducible fill cost
  2. Pure fill throughput (N threads, no queues, no consumer) — max fill rate
  3. End-to-end consumer get_view() throughput — full pipeline

Results on 12-core Xeon (4 producers, 256 MB buffers):
  Single fill:  numpy 0.63 GB/s vs dgen-py 37.80 GB/s (60x)
  Pure fill:    numpy 2.63 GB/s vs dgen-py 39.66 GB/s (15x)
  Consumer:     numpy 2.70 GB/s vs dgen-py 39.76 GB/s (15x)

docs/fill_comparison_results.md: results, methodology, how-to-run, and
rebuttal to single-buffer numpy comparisons.
feat: add fill-rate comparison benchmark (numpy vs dgen-py)
@russfellows russfellows requested a review from a team March 3, 2026 17:03
@russfellows russfellows requested a review from a team as a code owner March 3, 2026 17:03
@github-actions

github-actions bot commented Mar 3, 2026

MLCommons CLA bot:
Thank you very much for your submission, we really appreciate it. Before we can accept your contribution, we ask that you sign the MLCommons CLA (Apache 2). Please use this [Google form](https://forms.gle/Ew1KkBVpyeJDuRw67) to initiate authorization. If you are from an MLCommons member organization, we will request that you be added to the CLA. If you are not from a member organization, we will email you a CLA to sign. For any questions, please contact support@mlcommons.org.
5 out of 8 committers have signed the MLCommons CLA.
@hazemawadalla
@FileSystemGuy
@BarnacleBob
@dslik
@idevasena
Eva Luator
Russ Fellows
@russfellows
Eva Luator and Russ Fellows do not appear to be GitHub users. You need a GitHub account once you become an MLCommons member. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request

@russfellows russfellows closed this Mar 3, 2026
@github-actions github-actions bot locked and limited conversation to collaborators Mar 3, 2026
@russfellows
Author

This was being merged into the wrong fork. Sorry.
