Tf vdb bench - Integration#256
Conversation
This commit introduces a comprehensive KV Cache benchmark suite designed to measure storage system performance under AI/ML inference workloads, specifically targeting Large Language Model (LLM) key-value cache operations.

Key components added:
- Core benchmark scripts (kv-cache.py, kv-cache_sharegpt_replay.py)
- Benchmark wrapper and validation tools (kv-cache-wrapper.sh, validate.sh)
- Comprehensive proposal documentation for MLPerf Storage v3 integration
- README with benchmark overview and usage guidelines

The benchmark simulates realistic LLM inference patterns, including:
- Key-value cache read/write operations
- Mixed sequential and random access patterns
- Multi-threaded concurrent access scenarios
- Conversation-based workload replay using the ShareGPT dataset

This work addresses the growing need to standardize storage performance measurement for AI inference workloads and provides a foundation for the MLPerf Storage v3.0 KV cache benchmark specification.
Add initial KV Cache benchmark implementation for MLPerf Storage v3
This is a major architectural upgrade to the core benchmark logic, replacing the original "Spillover" memory management strategy with a new "Waterfall LRU" implementation that more accurately simulates enterprise storage hierarchies.

Key Changes:
- Waterfall Eviction: Implemented recursive eviction (GPU -> CPU -> NVMe). New data now correctly lands in the fastest available tier, pushing cold data down, rather than the old behavior where new data skipped directly to NVMe if RAM was full.
- Static Buffer Optimization: Replaced CPU-bound np.random generation with a pre-allocated static noise buffer. This removes the CPU bottleneck that was masking true storage latency, allowing the benchmark to fully saturate high-performance NVMe drives.
- Concurrency Hardening: Added semaphore-based concurrency limits (max_concurrent_allocs) and atomic memory reservations to prevent OOM crashes under heavy load.
- Storage Metrics: Added explicit tracking of nvme_tokens_processed to calculate true storage throughput separately from system throughput.
- Stress Test Validation: Verified that the new architecture correctly exposes storage latency limits (e.g., pushing P95 write latency above 1000 ms) where the old script artificially throttled the load.
This patch addresses two bugs that surface when running the benchmark with --enable-rag:

1. Race condition in process_requests (line 2693)
   Worker threads begin processing requests immediately at benchmark start, while RAG document ingestion runs in a separate daemon thread. When a worker hits the 10% RAG query path before any documents have been ingested, random.choice() is called on an empty list, raising IndexError. Fixed by adding a truthiness check on self.rag_manager.documents before entering the RAG code path: an empty dict evaluates to False, so RAG queries are safely skipped until ingestion populates at least one document.

2. Division by zero in KVCacheGenerator.generate (line 1097)
   The buffer slicing logic uses modulo to compute a pseudo-random start index: seed % (buffer_size - total_elements). When total_elements exactly equals buffer_size (an edge case permitted by the <= guard), the divisor becomes zero, raising ZeroDivisionError. Fixed by computing the divisor separately and defaulting start_idx to 0 when the divisor is zero.
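The second fix can be sketched as follows (a hedged sketch: the variable names come from the bug description above, and the surrounding function context is assumed):

```python
# Sketch of the divide-by-zero fix; buffer_size, total_elements, and seed
# are from the bug description, the rest is assumed context.
def safe_start_idx(seed: int, buffer_size: int, total_elements: int) -> int:
    divisor = buffer_size - total_elements
    # When total_elements == buffer_size the modulo divisor would be zero,
    # so default the slice start to 0 instead of raising ZeroDivisionError.
    return seed % divisor if divisor > 0 else 0
```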
… 4G of DRAM to reduce Queue contention and unrealistic read amplification
…lcommons#219)
* feat: Replace legacy spillover logic with Waterfall LRU architecture
* Fix two runtime errors in RAG-enabled benchmark mode
* Add detailed README.md for running the different invocations of kv-cache.py
* fix: line endings from dos2unix; increase CPU memory to 4 GB for MLPerf invocation
* Update MLperf v3 KV cache proposal.md to recommend using a minimum of 4G of DRAM to reduce queue contention and unrealistic read amplification
…on suite, and production tooling for MLPerf Storage v3
Summary
This commit transforms the MLPerf KV Cache benchmark from a standalone simulation tool into a production-ready validation framework. The core changes address a fundamental measurement problem discovered during real-world testing: wall-clock throughput masks true storage performance under concurrent workloads. We introduce storage throughput as the canonical metric, validate the benchmark against LMCache/vLLM production systems, and add the tooling necessary for reproducible MLPerf submissions.
Changes
1. Storage Throughput Metric (Critical Fix)
Added `storage_throughput_tokens_per_sec` to the benchmark output. This metric divides total tokens by total storage I/O time rather than elapsed wall-clock time.
Why this matters: With 50 concurrent users and async NVMe I/O, wall-clock throughput showed NVMe-only at 13,500 tok/s. The actual storage throughput is 263 tok/s. The 51x discrepancy is not a bug—it is a concurrency illusion where elapsed time hides overlapping I/O operations. Storage throughput eliminates this artifact and enables fair comparison across tiers.
Validation results confirm the expected hierarchy:
- GPU HBM: 1,691 tok/s (6.4x vs NVMe)
- GPU+CPU: 1,546 tok/s (5.9x vs NVMe)
- GPU+CPU+NVMe: 1,175 tok/s (4.4x vs NVMe)
- NVMe Only: 263 tok/s (baseline)
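The distinction between the two metrics can be illustrated with a small sketch (illustrative only, not the benchmark's actual code):

```python
# Illustrative sketch: wall-clock throughput divides tokens by elapsed time,
# while storage throughput divides tokens by the summed per-operation I/O
# time, so overlapped I/O no longer inflates the result.
def wall_clock_throughput(total_tokens: int, elapsed_s: float) -> float:
    return total_tokens / elapsed_s

def storage_throughput(total_tokens: int, io_durations_s: list) -> float:
    # Each concurrent operation contributes its full duration, even when
    # the operations overlap in wall-clock time.
    return total_tokens / sum(io_durations_s)

# 50 concurrent users, each issuing a fully overlapped 1 s read of 100 tokens:
tokens = 50 * 100
wall = wall_clock_throughput(tokens, 1.0)        # 5000.0 tok/s (concurrency illusion)
storage = storage_throughput(tokens, [1.0] * 50) # 100.0 tok/s (true device rate)
```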
2. ShareGPT Dataset Integration
Merged `kv-cache_sharegpt_replay.py` into the main `kv-cache.py` implementation. The benchmark now supports both synthetic workloads and trace replay from the ShareGPT conversation dataset.
New arguments:
- `--dataset-path`: Path to ShareGPT JSON
- `--max-conversations`: Limit conversations loaded
- `--request-rate`: Control request arrival rate
- `--max-requests`: Fixed-length benchmark runs
The ShareGPT workload exhibits different characteristics from the synthetic one: a 93% cache hit rate vs 50-70% synthetic, and a 133-token mean context vs 2,676 tokens synthetic. Both modes remain available: ShareGPT for production fidelity, synthetic for stress testing.
3. LMCache/vLLM Validation Suite
Created `utils/validate_lmcache.sh`, a comprehensive 988-line test harness that runs parallel benchmarks against:
- vLLM baseline (no KV caching)
- LMCache GPU-only
- LMCache CPU offload
- kv-cache.py across all tier configurations
The script includes:
- Automated hardware detection
- Environment variable configuration for LMCache
- Multi-trial averaging with statistical analysis
- Side-by-side comparison report generation
Validated against vLLM 0.13.0 and LMCache 0.3.12 on NVIDIA H100 NVL hardware.
4. Validation Results Documentation
Added `vllm_lmcache_validate/validation_results.md`, a 545-line technical document covering:
- Raw trial data across 12 benchmark configurations
- Per-tier latency analysis (P95 breakdown by GPU/CPU/NVMe)
- I/O volume analysis confirming 12:1 read/write ratio
- Bandwidth efficiency analysis explaining why trace replay shows low utilization
- KV cache sizing derivation (128KB per token for Mistral-7B)
- Complete CLI invocations for reproducibility
This document serves as the reference for understanding benchmark behavior and validating future results.
5. Excel Export and JSON Processing
Added `utils/json_to_xlsx.py` for converting benchmark JSON output to Excel format. The tool:
- Extracts both wall-clock and storage throughput metrics
- Calculates storage throughput from raw fields when summary data is unavailable
- Supports batch processing of result directories
- Falls back to CSV when openpyxl is not installed
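The fallback pattern might look roughly like this (function and variable names are illustrative, not the tool's actual API):

```python
# Hypothetical sketch of the openpyxl-to-CSV fallback; names are illustrative.
import csv

def export_rows(rows, xlsx_path):
    """Write a list of dicts to .xlsx when openpyxl is available, else .csv."""
    try:
        from openpyxl import Workbook
    except ImportError:
        csv_path = xlsx_path.rsplit(".", 1)[0] + ".csv"
        with open(csv_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
        return csv_path
    wb = Workbook()
    ws = wb.active
    ws.append(list(rows[0]))          # header row
    for row in rows:
        ws.append(list(row.values()))
    wb.save(xlsx_path)
    return xlsx_path
```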
6. Hardware Discovery Scripts
Added `utils/discovery.sh` and `utils/discovery_sharegpt.sh` for automated system interrogation:
- GPU detection via nvidia-smi
- Memory and CPU topology
- NVMe device enumeration
- Recommended benchmark parameters based on hardware profile
7. Unit Test Suite
Added `tests/test_kv_cache.py` with pytest coverage for:
- Model configuration and KV cache size calculations
- All three storage backends (GPU, CPU, NVMe)
- Multi-tier cache allocation and eviction
- Conversation manager state tracking
- User simulation and QoS distribution
GPU tests auto-skip when CUDA is unavailable.
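The auto-skip behavior typically follows the standard pytest pattern (a sketch of the common idiom, not necessarily the suite's exact code):

```python
# Common pytest pattern for skipping GPU tests without CUDA (assumed, not
# necessarily the suite's exact code).
import pytest

try:
    import torch
    CUDA_AVAILABLE = torch.cuda.is_available()
except ImportError:
    CUDA_AVAILABLE = False

requires_cuda = pytest.mark.skipif(not CUDA_AVAILABLE, reason="CUDA not available")

@requires_cuda
def test_gpu_backend_roundtrip():
    tensor = torch.zeros(16, device="cuda")  # runs only when CUDA is present
    assert tensor.sum().item() == 0.0
```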
8. Documentation Updates
Rewrote `README.md` with:
- Architecture diagrams showing waterfall eviction
- ShareGPT download instructions (wget commands)
- Complete CLI reference with all new arguments
- MLPerf submission guidelines with exact invocations
- Troubleshooting section for common issues
Files Added
utils/
validate_lmcache.sh - LMCache comparison suite (988 lines)
json_to_xlsx.py - Excel export utility
discovery.sh - Hardware detection
discovery_sharegpt.sh - ShareGPT-specific discovery
vllm_lmcache_validate/
validation_results.md - Technical validation document (545 lines)
lmcache_results_20260106_233959/
comparison_report.txt - Side-by-side results
kvcache_*.json - Raw trial data (12 files)
lmcache_*.json - LMCache trial data (6 files)
vllm_baseline_*.json - vLLM baseline (3 files)
system_info.txt - Hardware snapshot
tests/test_kv_cache.py - Unit test suite
requirements.txt - Python dependencies
Files Modified
kv-cache.py
- Added storage_throughput_tokens_per_sec to summary output
- Added elapsed_time and total_storage_io_time fields
- Merged ShareGPTDatasetLoader class
- Added --dataset-path, --max-conversations, --request-rate, --max-requests arguments
README.md
- Complete rewrite with ShareGPT documentation
- Added architecture diagrams
- Added MLPerf submission guidelines
MLperf v3 KV cache proposal.md
- Added CHANGES-01-09-2026 section documenting all updates
Breaking Changes
None. All changes are additive. Existing invocations continue to work. The synthetic workload mode remains the default when `--dataset-path` is not specified.
Validation
Tested on:
- Supermicro SYS-621H-TN12R
- 2x Intel Xeon Silver 4510
- 256 GB DDR5-4800 ECC
- NVIDIA H100 NVL (94 GB HBM3)
- 7 TB NVMe SSD
Software:
- Ubuntu 22.04 (kernel 6.5.0-15)
- Python 3.10.12
- PyTorch 2.9.0+cu128
- vLLM 0.13.0
- LMCache 0.3.12
Results reproducible with `--seed 42`.
- Update recommended benchmark invocations based on 1,411+ discovery tests
- Add metric selection guidance: Decode Bytes Read (2.62x) at cpu_mem=0GB, Storage Throughput (2.2x) at cpu_mem=4GB
- Document that Storage Throughput is unreliable at cpu_mem=0GB (only 1.1x)
- Add trial requirements (3-5 trials) due to high variance (CV 50-125%)
- Update kv-cache-wrapper.sh MLPerf workload from 2 to 4 tests
- Include discovery_results_and_analysis/ with full test validation data
- Add pytest-html support for generating self-contained test reports
- Update multi-tier cache tests to handle GPU tier presence flexibly
- Fix tier order test to check expected tiers rather than exact order
- Add pytest_configure hook for custom report metadata
- Generate HTML report by default when running tests directly
Corrected values based on actual model configs:
- llama3.1-8b: 128 KB (was incorrectly ~0.5 MB)
- llama3.1-70b: 320 KB (was incorrectly ~5 MB)
- mistral-7b: 128 KB (was incorrectly ~0.5 MB)
- llama2-7b: 512 KB (MHA has 4x more KV heads than GQA)

The 70b model is ~2.5x larger per token than the 8b (not 10x) because it has 80 layers vs 32 layers with the same kv_heads=8.
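These figures follow from the standard per-token KV formula, 2 (K and V) x layers x kv_heads x head_dim x dtype_bytes. A worked check, assuming head_dim=128 and fp16 (2 bytes per element):

```python
# Worked check of the corrected sizes (assumes head_dim=128, fp16 = 2 bytes).
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int = 128,
                       dtype_bytes: int = 2) -> int:
    # K and V each store kv_heads * head_dim elements per layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

assert kv_bytes_per_token(32, 8) == 128 * 1024    # llama3.1-8b (GQA, 32 layers)
assert kv_bytes_per_token(80, 8) == 320 * 1024    # llama3.1-70b: 2.5x the 8b
assert kv_bytes_per_token(32, 32) == 512 * 1024   # llama2-7b (MHA, 4x KV heads)
```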
Production Tooling and Validation
- Add ConfigLoader class with YAML config file support and schema validation
- Add cfg() helper function for config-driven parameter access
- Add validate_args() with safety limits for protected system paths
- Rename all nvme_* metrics to storage_* for MLPerf terminology compliance
- Add extended QoS percentiles: P99.9 and P99.99 latency tracking
- Add per-tier bandwidth metrics (read/write GB/s per tier)
- Add per-tier KV bytes tracking for detailed storage analysis
- Fix GPU metadata desync bug via on_eviction_callback pattern
- Change eviction from single-shot to an iterative loop until enough space is freed
- Replace print statements with the Python logging module
- Add waterfall LRU eviction with configurable high/low watermarks
- Add storage_health section with PASS/FAIL criteria
- Add storage_throughput_tokens_per_sec as the primary MLPerf metric
- Add -c DIR option for custom config directory
- Generate and pass config.yaml to the Python script via --config flag
- Add --xlsx-output support for Excel export
- Update jq queries for new storage_* metric names
- Add mlperf_submission workload with required trial parameters
- Enhance system detection for thread counts and memory limits
- Update metric parsing for the storage_throughput primary metric
- Add 170+ tests covering all new functionality
- Add ConfigLoader tests: schema validation, defaults, file loading
- Add cfg() helper tests for config-driven parameters
- Add validate_args() tests for path safety and input validation
- Add extended QoS tests for P99.9 and P99.99 percentiles
- Add GPU eviction callback tests for metadata sync
- Add per-tier bandwidth and KV bytes metric tests
- Add storage_* metric naming tests for MLPerf compliance
- Add waterfall eviction tests with high/low watermarks
- Add storage_health PASS/FAIL criteria tests
- Add Configuration section with YAML parameter reference
- Add MLPerf Submission Guidelines with validated commands
- Add Excel metrics reference table with all output columns
- Add installation instructions including the pyyaml dependency
- Add CLI arguments vs config file precedence documentation
- Add workload definitions and tier configuration examples
- Add troubleshooting section for common issues
- Add kv-cache-test-report.html with full test execution results
- All 170+ tests passing for v3.0 features
- Create unit_test_results directory for test artifacts
- Add P99.9 and P99.99 latency columns
- Add per-tier KV bytes columns (GPU, CPU, Storage)
- Add per-tier bandwidth columns (read/write GB/s)
- Add storage tier device vs host latency breakdown
- Rename nvme_entries to storage_entries for MLPerf compliance
- Add storage_throughput_tokens_per_sec as the primary metric
- Add pyyaml>=6.0 for YAML configuration file parsing
- Required for ConfigLoader and the --config CLI argument
- Add user_templates section with conversation patterns
- Add qos_profiles with latency thresholds per tier
- Add eviction settings with waterfall LRU parameters
- Add storage_health criteria for PASS/FAIL determination
- Add cache_sizing defaults for GPU/CPU/Storage tiers
- Provides validated defaults for all tunable parameters
Split the single ~3500-line kv-cache.py into a structured Python package (kv_cache/) with 12 modules. Added MLA attention support, NVMe capacity management, SSD preconditioning, disaggregated inference modes, and streaming BurstGPT trace replay. Updated the proposal and README with corrected DeepSeek-V3 MLA calculations, capacity planning scope notes, and repo cleanup.

Structural changes:
- kv_cache/ package: __init__, _compat, config, models, backends, cache, conversation, prefix_cache, rag, monitoring, workload, benchmark, cli
- kv-cache.py is now a thin shim importing from kv_cache
- Added pyproject.toml for a pip-installable package

New features:
- MLA attention support (DeepSeek-V3: 70,272 bytes/token vs 1.7M MHA)
- 4 new models: deepseek-v3, qwen3-32b, gpt-oss-120b, gpt-oss-20b
- NVMe capacity tracking with LRU eviction (prevents disk exhaustion)
- SSD preconditioning (--precondition)
- Disaggregated inference (--prefill-only, --decode-only)
- Streaming BurstGPT trace replay (--trace-speedup, --replay-cycles)
- Config-driven model definitions via config.yaml
- RAG retrieval distribution (zipfian/uniform), document eviction

Documentation:
- Corrected DeepSeek-V3 from the MHA formula to MLA in all capacity tables
- Scoped capacity planning claims to storage throughput (no tier promotion)
- Restructured the GDS section around production GPU-origin KV cache
- Added NVMe terminology note (the benchmark works with any block device)
- Fixed stale class names and default ranges in the README

Repo cleanup:
- Moved kv-cache-wrapper.sh to utils/
- Added utils/run_benchmarks_256gb.sh
- Removed kv-cache_sharegpt_replay.py (merged into the package)
- Removed discovery_results_and_analysis/, lmcache_results_*, and the proposal PDF
README: Corrected DeepSeek-V3 KV cache from MHA formula (1,748,992 bytes/token, 1.7 MB) to MLA formula (70,272 bytes/token, 69 KB). Updated all derived tables: per-user RAM 13.4 GB -> 0.54 GB, removed from 128 GB exclusion list, fixed model reference table. Moved validate.sh to utils/ alongside other shell scripts.
The code reads decode_batch_size from config.yaml via
cfg('decode', 'batch_size', default=32). Updated the proposal
code snippet to match the actual implementation.
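A minimal cfg() helper matching that call shape might look like this (hypothetical; the real ConfigLoader implementation may differ):

```python
# Hypothetical minimal cfg() helper; _CONFIG stands in for the parsed config.yaml.
_CONFIG = {"decode": {"batch_size": 32}}

def cfg(*keys, default=None):
    """Walk nested config keys, returning default when any level is missing."""
    node = _CONFIG
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

decode_batch_size = cfg('decode', 'batch_size', default=32)
```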
The "Two Separate Eviction Mechanisms" section now explicitly distinguishes metadata-only eviction (ConversationManager removes dict entries; .npy files remain on disk) from physical file deletion (MultiTierCache calls path.unlink(), permanently removing .npy files from the filesystem). Added actual code paths from backends.py and cache.py to replace the pseudocode.
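In sketch form, the contrast is (illustrative only, not the actual code from backends.py and cache.py):

```python
# Illustrative contrast between the two eviction mechanisms described above.
from pathlib import Path

def metadata_only_evict(conversations: dict, conv_id: str) -> None:
    # ConversationManager-style: drop the dict entry; the .npy file stays on disk.
    conversations.pop(conv_id, None)

def physical_evict(entry_path: Path) -> None:
    # MultiTierCache-style: permanently delete the .npy file from the filesystem.
    entry_path.unlink(missing_ok=True)
```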
Removed optional dependencies and ShareGPT dataset loader from kv-cache.py.
… checkpoint I/O
Merge streaming checkpoint implementation from streaming-checkpoint-poc branch
to complete the dgen-py optimization feature set.
This provides two complementary optimizations:
1. dgen-py integration: 155x faster data generation (already in dlio_benchmark/)
2. StreamingCheckpointing: Producer-consumer pattern with minimal memory footprint
StreamingCheckpointing Features:
- Producer-consumer architecture with shared memory buffers
- Multi-backend support (file, s3dlio) via StorageWriter interface
- Buffer pool pattern (4 buffers default, ~128MB vs 24GB for original)
- Overlapping generation and I/O for maximum throughput
- Configurable fadvise modes (none, sequential, dontneed)
Example Usage:
checkpoint = StreamingCheckpointing(
    chunk_size=32 * 1024 * 1024,  # 32 MB chunks
    num_buffers=4,                # 4 x 32 MB = 128 MB total memory
    use_dgen=True,                # Use dgen-py for generation
    fadvise_mode='dontneed'       # Drop pages after write
)
checkpoint.write_checkpoint(output_path, total_bytes)
Test Suite:
- tests/checkpointing/compare_methods.py demonstrates three approaches:
- Method 1: Original DLIO (pre-generate all data, uses dgen-py)
- Method 2: Streaming (producer-consumer, uses dgen-py + StreamingCheckpointing)
- Method 3: S3Checkpoint compatibility layer test
Files Added:
- mlpstorage/checkpointing/__init__.py
- mlpstorage/checkpointing/streaming_checkpoint.py (427 lines)
- mlpstorage/checkpointing/storage_writers/__init__.py
- mlpstorage/checkpointing/storage_writers/base.py
- mlpstorage/checkpointing/storage_writers/file_writer.py
- mlpstorage/checkpointing/storage_writers/s3dlio_writer.py
This completes the checkpoint optimization work, providing both:
- Speed: dgen-py 155x faster generation
- Memory: StreamingCheckpointing reduces memory from 24 GB to 128 MB for a 24 GB checkpoint
- Implement StreamingCheckpointing with producer-consumer pattern
- Add storage writers for s3dlio, minio, and s3torch backends
- Support multi-endpoint load balancing via environment variables
- Enable concurrent checkpoint I/O without blocking training loops
- Add test_streaming_backends.py for multi-library backend testing
- Add demo_checkpoint_methods.sh to demonstrate different checkpoint approaches
- Add demo_streaming_checkpoint.sh for interactive streaming checkpoint demo
- Update tests/README.md with detailed test documentation
- Add MULTI_ENDPOINT_GUIDE.md with comprehensive multi-endpoint documentation
- Add Streaming-Chkpt-Guide.md with StreamingCheckpointing usage guide
- Add pr-stream-chkpt/ directory with PR-specific documentation
- Update README.md with StreamingCheckpointing section
- Remove redundant MULTI_ENDPOINT.md and PR_Readiness_Plan.md
- Update .gitignore to exclude Test-Backup/ and development artifacts
- Remove hardcoded AWS credentials from test_streaming_backends.py
- Remove hardcoded AWS credentials from test_mlp_*.sh scripts
- Replace with environment variable validation and helpful error messages
- Remove internal IP address exposure (172.16.1.40)
- All tests now require AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_ENDPOINT_URL to be set
- Workflow documented elsewhere, not needed in PR
This change enables users to clone the fork and get a complete working environment with all multi-library storage and StreamingCheckpointing features without needing to separately manage the dlio_benchmark fork. Note: This change is ONLY for the integrated-main branch in the personal fork. The formal PR mlcommons#249 to mlcommons/storage maintains the upstream argonne-lcf/dlio_benchmark reference.
…ences
- Remove outdated docs files: IMPLEMENTATION_COMPARISON.md, STORAGE_LIBRARY_HANDOFF.md, TF_ObjectBranch-Strategy.md
- Remove all azstoragetorch references from STORAGE_LIBRARIES.md (library removed from the project)
- Remove specific performance numbers from PERFORMANCE_TESTING.md (environment-dependent)
- Update PERFORMANCE_TESTING.md to show relative performance only
- Rewrite STORAGE_LIBRARY_TESTING_STATUS.md to focus on HOW to run tests
- Update documentation to reflect the 3 supported libraries: s3dlio, minio, s3torchconnector
- Remove azstoragetorch support from benchmark_write_comparison.py
- Remove azstoragetorch support from benchmark_read_comparison.py
- Update documentation to reflect the 3 supported libraries (s3dlio, minio, s3torchconnector)
- Remove azstoragetorch examples from PARQUET_FORMATS.md
- Update QUICK_START.md and README_S3DLIO_CONFIGS.md
- Delete outdated HANDOFF_2026-02-07.md document

azstoragetorch was never fully integrated and is not part of the project scope. The 3 core storage libraries provide complete S3/Azure/GCS coverage via s3dlio.
Update s3dlio dependency to require version 0.9.50 or newer from PyPI. This version includes all necessary features for multi-library storage support and StreamingCheckpointing.
Remove the dlio_benchmark directory from the git repository, since it is now installed as a dependency from GitHub. This eliminates redundancy:
- dlio_benchmark is installed via: git+https://github.com/russfellows/dlio_benchmark.git@main
- Local directory kept for development but not tracked in git
- Added dlio_benchmark/ to .gitignore
- Backup created: Test-Backup/dlio_benchmark_full_20260219_105808.tar.gz

This makes the repository cleaner and ensures users get dlio_benchmark from the correct source (the russfellows fork with multi-library support).
- Add dgen-py>=0.2.0, minio, and s3torchconnector to dependencies
- Remove native Azure backend support (Azure only via s3dlio with az:// URIs)
- Update documentation to clarify Azure Blob Storage is supported exclusively via s3dlio
- Remove broken references to azure_writer.AzureStorageWriter
…support
When --io-trace-log <path> is specified the benchmark runs in pure logical
trace mode: no real GPU/CPU/NVMe I/O is performed. Instead every KV cache
operation is recorded to a structured CSV file for offline replay by an
external storage tool (fio, sai3-bench, warp, etc.).
This enables clean separation between workload generation (what the
benchmark does) and storage validation (what an external tool measures),
which is essential for MLPerf Storage submission workflows.
New flags
---------
--io-trace-log <path>
Activates trace mode. Path ending in .zst enables streaming zstd
compression (level 3, ~10-20x ratio). Requires the 'zstandard' package.
--num-gpus N (default: 1)
Total GPUs in the tensor-parallel group.
Effective GPU tier capacity = N x --gpu-mem-gb.
Example: --num-gpus 8 --gpu-mem-gb 141 models an 8xH200 node (1128 GB HBM).
--tensor-parallel N (default: 1)
TP degree for KV cache sharding. Per-rank object sizes in the trace,
cache stats, and XLSX export are divided by N.
Must be >= 1 and <= --num-gpus. Non-power-of-2 values emit a warning.
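The validation described above might be sketched as follows (assumed logic, not the exact code in workload.py):

```python
# Assumed sketch of the tensor-parallel validation: TP must lie in
# [1, num_gpus], and non-power-of-2 values only emit a warning.
import warnings

def validate_tp(tensor_parallel: int, num_gpus: int) -> None:
    if not 1 <= tensor_parallel <= num_gpus:
        raise ValueError("--tensor-parallel must be >= 1 and <= --num-gpus")
    if tensor_parallel & (tensor_parallel - 1):  # non-zero => not a power of two
        warnings.warn("non-power-of-2 tensor parallelism is unusual")
```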
CSV output format
-----------------
Columns: Timestamp, Operation, Object_Size_Bytes, Tier, Key, Phase
Timestamp Unix epoch (float, 6 decimal places)
Operation 'Write' or 'Read'
Object_Size_Bytes TP-adjusted byte size of the KV cache object
Tier 'Tier-0' (GPU), 'Tier-1' (CPU), 'Tier-2' (NVMe)
Key Cache entry identifier for replay tool correlation
Phase 'Prefill', 'Decode', or 'Evict'
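A row in this schema might look like the following (a sketch; the field values, including the Key, are made up for illustration):

```python
# Illustrative writer for one trace row in the documented schema.
import csv
import io

FIELDS = ["Timestamp", "Operation", "Object_Size_Bytes", "Tier", "Key", "Phase"]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerow({
    "Timestamp": "1767744000.123456",  # Unix epoch, 6 decimal places
    "Operation": "Write",              # 'Write' or 'Read'
    "Object_Size_Bytes": 70272,        # TP-adjusted object size (illustrative)
    "Tier": "Tier-2",                  # Tier-0 GPU, Tier-1 CPU, Tier-2 NVMe
    "Key": "conv_0042_blk_0007",       # replay-tool correlation key (made up)
    "Phase": "Prefill",                # 'Prefill', 'Decode', or 'Evict'
})
trace_csv = buf.getvalue()
```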
Files changed
-------------
kv_cache/tracer.py New. IOTracer: thread-safe CSV writer with optional
zstd compression, Key and Phase columns, context-manager
support, clean close() sequence.
kv_cache/backends.py New NullBackend: no-op write/read that tracks byte
counts only; used for all tiers in trace mode.
kv_cache/cache.py MultiTierCache accepts io_tracer= and tensor_parallel=;
TP-adjusted size_bytes in all trace rows; per-rank
data slicing in real mode.
kv_cache/benchmark.py IntegratedBenchmark accepts io_trace_log=, num_gpus=,
tensor_parallel=; manages IOTracer lifecycle; banner
shows '8x 141 GB GPU (total 1128 GB HBM) | TP=8'.
kv_cache/cli.py --io-trace-log, --num-gpus, --tensor-parallel args;
XLSX export includes Num GPUs, Tensor Parallel, and
Total GPU Memory columns.
kv_cache/workload.py Validates TP <= num_gpus; warns if TP not power-of-2;
MAX_GPU_MEMORY_GB 1024->65536; MAX_CPU_MEMORY_GB
16384->131072 to support large multi-GPU nodes.
pyproject.toml 'compression' optional extra (zstandard>=0.21);
included in 'full' extra.
docs/io_trace_log_usage.md New user guide: all flags, CSV schema, compression
size estimates, seven ready-to-run examples (single GPU,
8xH200 TP=8, prefill-only, decode-only, DeepSeek V3),
trace inspection shell snippets, model table.
feat: add --io-trace-log trace with tensor-parallel & multi-GPU
Replaces the legacy KVCacheGenerator approach (one fixed 256 MB NumPy
buffer re-used for every write) with a double-buffered producer-consumer
pool backed by dgen-py (GIL-free Rayon Xoshiro256++). Every buffer
produced is unique; no block is ever repeated across time.
Background
----------
The old KVCacheGenerator allocated a single 256 MB float16 array at
startup (seeded with np.random.default_rng) and served every subsequent
generate() call as a per-key hash-offset view into that same pool.
At dataset scales above ~1 TB this produces ~97% block-level dedup savings
and ~1.12x zstd compressibility — making any storage benchmark using it
susceptible to being gamed by dedup/compression-capable storage tiers.
The new DataGeneratorPool uses dgen-py fill_chunk() (Rayon-parallel
Xoshiro256++, SIMD-accelerated, GIL-free) to produce cryptographic-quality
random bytes. Measured: 0% dedup savings and 1.00x compression at every
dataset size.
Changes
-------
kv_cache/data_producer.py New. DataGeneratorPool: double-buffered pool
with configurable buffer size and worker count.
Producer thread runs dgen_py.Generator.fill_chunk()
while consumer holds the previous buffer, ensuring
no generation stall. Thread-safe handoff via
threading.Event. stop() cleanly joins the producer.
kv_cache/cache.py KVCacheGenerator replaced with DataGeneratorPool;
generate() now draws from the live pool buffer
instead of indexing a static precomputed array.
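The handoff can be sketched as a two-buffer pool (a hedged sketch: os.urandom stands in for dgen-py's fill_chunk(), and the real DataGeneratorPool differs in buffer count and detail):

```python
# Hedged sketch of a double-buffered producer-consumer pool. os.urandom
# stands in for dgen-py's fill_chunk(); the real DataGeneratorPool differs.
import os
import threading

class DoubleBufferPool:
    def __init__(self, buffer_size: int = 1 << 20):
        self._bufs = [bytearray(buffer_size), bytearray(buffer_size)]
        self._ready = threading.Event()     # a freshly filled buffer is available
        self._consumed = threading.Event()  # consumer released the other buffer
        self._consumed.set()
        self._active = 0
        self._stop = False
        self._thread = threading.Thread(target=self._produce, daemon=True)
        self._thread.start()

    def _produce(self):
        while not self._stop:
            self._consumed.wait()
            if self._stop:
                break
            self._consumed.clear()
            nxt = 1 - self._active
            # Fill the idle buffer while the consumer may still hold the other.
            self._bufs[nxt][:] = os.urandom(len(self._bufs[nxt]))
            self._active = nxt
            self._ready.set()

    def get_view(self) -> bytearray:
        """Block until a freshly filled buffer is ready, then hand it out."""
        self._ready.wait()
        self._ready.clear()
        buf = self._bufs[self._active]
        self._consumed.set()  # let the producer refill the other buffer
        return buf

    def stop(self):
        self._stop = True
        self._consumed.set()
        self._ready.set()
        self._thread.join()
```

Each buffer handed out stays valid until the next get_view() call, since the producer only ever refills the buffer the consumer is not holding.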
Test / analysis artifacts
-------------------------
tests/bench_datagen_comparison.py
Self-contained benchmark comparing LegacyKVCacheGenerator (pre-PR) vs
InlineDgenPool (dgen-py) across: generation throughput, zstd-1
compressibility, and SHA-256 block-level dedup rate. Supports
--write-gb, --analyze-existing, --java-heap-mb, --block-size-kb,
--entry-mb, --data-dir. Calls vdbench dsim with configurable Java heap
(default 8 GB) and falls back to native SHA-256 analysis.
docs/datagen_dedup_analysis.md
Full write-up of the Feb 26 2026 analysis run on 10 GB files. Documents
the birthday-problem scaling behaviour of the old pooled generator,
explains why naive dedup predictions fail at small dataset sizes
(hash-scattered offsets vs sequential cycling), and provides the
per-scale dedup table (10 GB → 10 TB). Includes raw vdbench dsim and
SHA-256 outputs for both methods plus the vdbench heap workaround.
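The block-level dedup measurement can be approximated with a small stand-in (a sketch; the actual analysis used vdbench dsim and a native SHA-256 fallback):

```python
# Sketch of a 4 KB block-level dedup measurement: hash every block and
# compare total vs unique counts. A ratio of 1.00 means no dedup savings.
import hashlib

def dedup_ratio(data: bytes, block_size: int = 4096) -> float:
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    unique = {hashlib.sha256(b).digest() for b in blocks}
    return len(blocks) / len(unique)

repeated = bytes(range(256)) * 64     # 16 KB of a repeating 256-byte pattern
assert dedup_ratio(repeated) == 4.0   # 4 identical 4 KB blocks dedup to 1
```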
Performance (measured on NVMe)
-------------------------------
Generation throughput (no I/O):
Old method: ~4,300 GB/s (memory copy within cached 256 MB buffer)
New method: ~36 GB/s (Xoshiro256++ SIMD fill — real data generation)
NVMe write throughput: ~1.0 GB/s for both (I/O bound, not generation bound)
Data quality at 10 GB (4 KB block dedup):
Old method: 1.02:1 dedup ratio, 1.12x zstd compression
New method: 1.00:1 dedup ratio, 1.00x zstd compression (incompressible)
feat: zero-copy data generation via dgen-py producer-consumer pool
…ool)

bench_fill_comparison.py: three-section benchmark isolating the fill function as the only variable between two identical producer-consumer pools. Replaces the old single-buffer-reuse baseline (which produced 100% deduplicatable data) with a continuously-regenerating pool for both backends.

Sections:
1. Single-fill latency (1 thread, 10 iterations): irreducible fill cost
2. Pure fill throughput (N threads, no queues, no consumer): max fill rate
3. End-to-end consumer get_view() throughput: full pipeline

Results on a 12-core Xeon (4 producers, 256 MB buffers):
- Single fill: numpy 0.63 GB/s vs dgen-py 37.80 GB/s (60x)
- Pure fill: numpy 2.63 GB/s vs dgen-py 39.66 GB/s (15x)
- Consumer: numpy 2.70 GB/s vs dgen-py 39.76 GB/s (15x)

docs/fill_comparison_results.md: results, methodology, how-to-run, and a rebuttal to single-buffer numpy comparisons.
feat: add fill-rate comparison benchmark (numpy vs dgen-py)
MLCommons CLA bot:

This was being merged into the wrong fork. Sorry.
This PR will hopefully merge the VDB Bench branch into the main branch of this repo, enabling easier integration back into the MLCommons storage repository.