SSD benchmarks based on Mooncake Trace #1613
alogfans wants to merge 6 commits into kvcache-ai:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request introduces a new benchmarking tool designed to evaluate the I/O performance of KVCache storage systems. It simulates realistic workloads based on Mooncake's OffsetAllocator and vLLM's PagedAttention architectures, providing detailed metrics on latency, bandwidth, and cache hit rates to help optimize storage performance.
Activity
Code Review
The pull request introduces a new SSD benchmark tool and its accompanying documentation. The Python script is well-structured, utilizing pathlib, dataclasses, and argparse effectively. It employs low-level file I/O operations (os.pread, os.pwrite, os.fsync) appropriate for a storage benchmark, including sparse file pre-allocation and detailed statistics collection. The documentation is clear and provides good guidance on usage and metrics. Several areas for improvement have been identified, primarily concerning the use of magic numbers, resource management, and clarity in metric calculations.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
```python
# Source: https://lmcache.ai/kv_cache_calculator.html
MODEL_BYTES_PER_TOKEN = {
    # Small models (7B-13B)
    "llama-2-7b": 512,
```
Seems like most of the numbers in this table are not correct
I have updated the values using the default configuration of https://lmcache.ai/kv_cache_calculator.html
Yes. The results from this calculator seem somewhat inaccurate, since it applies the same formula to all models regardless of their architectural differences. There is a similar calculator on Hugging Face: https://huggingface.co/spaces/gaunernst/kv-cache-calculator. The results from these two calculators also seem inconsistent with each other.
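For context, calculators like these typically use the standard per-token KV cache size formula for multi-head or grouped-query attention (it does not apply to MLA-style caches such as DeepSeek's, which is one plausible reason the two calculators disagree). A minimal sketch, with Llama-2-7B parameters (32 layers, 32 KV heads, head dim 128, fp16) as an illustrative check against the table's 512 KiB entry:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """Per-token KV cache size: 2 (K and V) x layers x KV heads x head dim x dtype size."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Llama-2-7B: 32 layers, 32 KV heads (MHA), head_dim 128, fp16 (2 bytes)
print(kv_bytes_per_token(32, 32, 128, 2))  # 524288 bytes = 512 KiB per token
```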
```python
start = time.time()

# Generate simulated data
data = os.urandom(self.block_size_bytes)
```
os.urandom() is called after start = time.time(), so the CPU time for generating random data is included in the measured write latency. For a 1MB block, os.urandom typically takes 3–8ms on a modern CPU, while an actual NVMe write takes 0.1–0.5ms. This means the write latency numbers are dominated by CPU cost, not disk I/O. Please move the os.urandom() call before start = time.time().
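The suggested fix can be sketched as follows: generate the payload before starting the timer so only the pwrite/fsync pair is measured (file path and cleanup here are illustrative, not from the PR):

```python
import os
import tempfile
import time

block_size_bytes = 1 << 20  # 1 MiB

# Generate the payload *before* starting the timer so the CPU cost of
# os.urandom() is excluded from the measured write latency.
data = os.urandom(block_size_bytes)

path = os.path.join(tempfile.gettempdir(), "bench_block.bin")
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
try:
    start = time.time()
    written = os.pwrite(fd, data, 0)  # timed region covers only the I/O
    os.fsync(fd)
    write_ms = (time.time() - start) * 1000
finally:
    os.close(fd)
    os.unlink(path)
```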
```python
        int: File descriptor
    """
    if self.fd is None:
        # Use O_RDWR | O_CREAT, no O_DIRECT (Python compatibility)
```
The file is opened without O_DIRECT, so all reads go through the OS page cache. After a block is written, subsequent reads of the same block will be served from memory rather than SSD. The sample output in the docs appears to confirm this — read P50/P95/P99 are all identical at 0.280ms, which is inconsistent with real SSD behavior. As a result, the benchmark does not actually measure SSD read performance. Consider using O_DIRECT with aligned buffers, or calling posix_fadvise(POSIX_FADV_DONTNEED) after each write to evict the page from cache. At minimum, the docs should note that results are affected by the page cache.
We have added posix_fadvise.
```markdown
## Requirements

- Python 3.8+
```
I think we only support Python 3.10+ now, since PyTorch also does.
Thoughts: It can be used as a standalone tool, so I think keeping this may be better.
stmatengss left a comment
Code Review Summary
Verdict: Approve with suggestions
This PR adds a valuable storage benchmark tool for evaluating KVCache I/O performance using Mooncake traces. The implementation is well-structured and documented.
Strengths ✅
Architecture
- Clean offset allocator design mimicking Mooncake's approach
- Single-file storage avoids file explosion
- Uses pread/pwrite for thread-safe operations
- Comprehensive statistics tracking
Documentation
- Excellent inline comments explaining design decisions
- Clear examples in docstrings
- Well-structured code with logical sections
Functionality
- Supports multiple model configurations (LLaMA, Qwen, DeepSeek, etc.)
- Timestamp replay for realistic workload simulation
- Detailed latency percentile tracking (p50/p95/p99)
Issues & Suggestions
1. Performance: fsync on every write is expensive
Line 260: os.fsync(fd) after every block write will severely impact throughput. Consider:
- Batch fsync every N writes
- Make fsync optional via flag
- Use O_SYNC flag instead for better performance
2. Memory: Unbounded latency list growth
Lines 140-141, 231: Latency lists grow unbounded. For large traces (millions of requests), this will consume GBs of memory. Consider:
- Streaming percentile calculation (t-digest or reservoir sampling)
- Periodic aggregation and reset
- Max list size with sampling
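One way to bound memory, sketched here with reservoir sampling (Vitter's Algorithm R); this is an illustrative helper, not part of the PR, and percentiles become approximate once the trace exceeds the reservoir capacity:

```python
import random

class LatencyReservoir:
    """Fixed-size uniform sample of latencies; memory stays bounded
    regardless of how many requests the trace contains."""

    def __init__(self, capacity: int = 100_000, seed: int = 0):
        self.capacity = capacity
        self.count = 0                      # total latencies seen
        self.sample: list[float] = []       # at most `capacity` entries
        self._rng = random.Random(seed)

    def add(self, latency_ms: float) -> None:
        self.count += 1
        if len(self.sample) < self.capacity:
            self.sample.append(latency_ms)
        else:
            # Replace a random slot with probability capacity/count,
            # keeping the sample uniform over everything seen so far.
            j = self._rng.randrange(self.count)
            if j < self.capacity:
                self.sample[j] = latency_ms

    def percentile(self, p: float) -> float:
        s = sorted(self.sample)
        idx = min(len(s) - 1, int(p / 100 * len(s)))
        return s[idx]
```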
3. Security: os.urandom is slow
Line 251: os.urandom() is cryptographically secure but slow. For benchmarking, use:

```python
data = bytes(self.block_size_bytes)  # Zero-filled
# or
data = bytearray(self.block_size_bytes)
```

4. Bug: File descriptor leak risk
No try/finally in _get_fd(). If exception occurs, fd may leak. Add proper cleanup.
5. Missing: Error handling
- No validation for trace file format
- No handling of disk full scenarios
- No cleanup on benchmark failure
6. Configuration: Hard-coded constants
- BLOCK_SIZE_TOKENS=512 should be configurable
- MIN_LATENCY_MS seems arbitrary
7. Testing: No unit tests
Checklist shows tests not added. Consider adding:
- Unit tests for OffsetAllocatorStorage
- Mock trace processing tests
- Edge case validation
See inline comments for specific code suggestions.
Description
- Adds a new storage benchmark tool (benchmarks/storage_benchmark/storage_benchmark.py) to evaluate KVCache storage I/O performance
- Adds documentation (docs/source/performance/storage-benchmark.md)

Module
- mooncake-transfer-engine
- mooncake-store
- mooncake-ep
- mooncake-integration
- mooncake-p2p-store
- mooncake-wheel
- mooncake-pg
- mooncake-rl

Type of Change
How Has This Been Tested?
N/A
Checklist
- Run ./scripts/code_format.sh before submitting.