SSD benchmarks based on Mooncake Trace #1613

Open
alogfans wants to merge 6 commits into kvcache-ai:main from alogfans:add-storage-benchmark

Conversation

@alogfans
Collaborator

@alogfans alogfans commented Mar 5, 2026

Description

  • Add new storage benchmark tool (benchmarks/storage_benchmark/storage_benchmark.py) to evaluate KVCache storage I/O performance
  • Add comprehensive documentation (docs/source/performance/storage-benchmark.md)

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Other

How Has This Been Tested?

N/A

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new benchmarking tool designed to evaluate the I/O performance of KVCache storage systems. It simulates realistic workloads based on Mooncake's OffsetAllocator and vLLM's PagedAttention architectures, providing detailed metrics on latency, bandwidth, and cache hit rates to help optimize storage performance.

Highlights

  • New KVCache Storage Benchmark Tool: A new Python script (benchmarks/storage_benchmark/storage_benchmark.py) has been added to evaluate KVCache storage I/O performance. It simulates realistic workloads using an Offset Allocator architecture and vLLM's PagedAttention block mechanism.
  • Comprehensive Documentation: Detailed documentation (docs/source/performance/storage-benchmark.md) has been included, explaining the benchmark tool's overview, test flow, command-line options, test scenarios, output examples, and performance metrics.


Changelog
  • benchmarks/storage_benchmark/storage_benchmark.py
    • Added a new Python script for KVCache storage benchmarking.
    • Implemented OffsetAllocatorStorage class for high-performance block storage using a single file and offset management.
    • Included StorageBenchmark class to simulate KVCache requests with prefix caching.
    • Added TraceLoader to parse request traces from JSONL files.
    • Provided functions to run the benchmark with configurable parameters and print detailed results.
  • docs/source/performance/storage-benchmark.md
    • Added new documentation detailing the KVCache storage benchmark tool.
    • Described the tool's overview, test flow, command-line options, test scenarios, output format, and performance metrics.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces a new SSD benchmark tool and its accompanying documentation. The Python script is well-structured, utilizing pathlib, dataclasses, and argparse effectively. It employs low-level file I/O operations (os.pread, os.pwrite, os.fsync) appropriate for a storage benchmark, including sparse file pre-allocation and detailed statistics collection. The documentation is clear and provides good guidance on usage and metrics. Several areas for improvement have been identified, primarily concerning the use of magic numbers, resource management, and clarity in metric calculations.

@codecov-commenter
Copy link

⚠️ Please install the Codecov GitHub App to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.


@alogfans alogfans marked this pull request as draft March 6, 2026 01:31
@alogfans alogfans changed the title from "Add SSD benchmarks based on Mooncake Trace" to "[WIP] Add SSD benchmarks based on Mooncake Trace" on Mar 6, 2026
@alogfans alogfans marked this pull request as ready for review March 6, 2026 02:56
@alogfans alogfans changed the title from "[WIP] Add SSD benchmarks based on Mooncake Trace" to "SSD benchmarks based on Mooncake Trace" on Mar 6, 2026
# Source: https://lmcache.ai/kv_cache_calculator.html
MODEL_BYTES_PER_TOKEN = {
# Small models (7B-13B)
"llama-2-7b": 512,
Collaborator


Seems like most of the numbers in this table are not correct

Collaborator Author


I have updated the values using the default configuration of https://lmcache.ai/kv_cache_calculator.html

Collaborator


Yes. The results from this calculator seem somewhat inaccurate, since it uses the same formula to compute results for all models regardless of their differences. Similarly, there is another calculator on Hugging Face: https://huggingface.co/spaces/gaunernst/kv-cache-calculator. The results from these two calculators also seem inconsistent with each other.
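For context on why the calculators can disagree: both are approximations of the same first-order estimate (K and V tensors × layers × KV heads × head dimension × dtype bytes), and differences in how they handle GQA/MQA or quantization change the result. The sketch below is illustrative only, using commonly cited Llama-2-7B parameters (32 layers, 32 KV heads under MHA, head dim 128, fp16); it is not code from the PR.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """First-order KV cache size per token: the factor 2 accounts
    for the separate K and V tensors stored per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Llama-2-7B-style parameters (assumed): 32 layers, 32 KV heads,
# head_dim 128, fp16 (2 bytes) -> 524288 bytes = 512 KiB per token
llama2_7b = kv_bytes_per_token(32, 32, 128, 2)
```

Models with grouped-query attention (fewer KV heads than query heads) shrink this number substantially, which is one source of the inconsistency between the two calculators.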

start = time.time()

# Generate simulated data
data = os.urandom(self.block_size_bytes)
Collaborator


os.urandom() is called after start = time.time(), so the CPU time for generating random data is included in the measured write latency. For a 1 MB block, os.urandom typically takes 3–8 ms on a modern CPU, while an actual NVMe write takes 0.1–0.5 ms. This means the write latency numbers are dominated by CPU cost, not disk I/O. Please move os.urandom() before start = time.time().
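The suggested fix can be sketched as follows (variable names are illustrative, not taken from the PR): generate the payload before the timer starts so only the I/O itself is measured.

```python
import os
import time

block_size_bytes = 1 << 20  # 1 MiB, illustrative

# Generate the payload BEFORE starting the timer, so the CPU cost
# of os.urandom() is excluded from the measured write latency.
data = os.urandom(block_size_bytes)

start = time.time()
# os.pwrite(fd, data, offset)  # the timed I/O operation goes here
elapsed = time.time() - start
```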

Collaborator Author


Fixed.

int: File descriptor
"""
if self.fd is None:
# Use O_RDWR | O_CREAT, no O_DIRECT (Python compatibility)
Collaborator


The file is opened without O_DIRECT, so all reads go through the OS page cache. After a block is written, subsequent reads of the same block will be served from memory rather than SSD. The sample output in the docs appears to confirm this — read P50/P95/P99 are all identical at 0.280ms, which is inconsistent with real SSD behavior. As a result, the benchmark does not actually measure SSD read performance. Consider using O_DIRECT with aligned buffers, or calling posix_fadvise(POSIX_FADV_DONTNEED) after each write to evict the page from cache. At minimum, the docs should note that results are affected by the page cache.
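The POSIX_FADV_DONTNEED option mentioned above could look roughly like this (a Linux-specific sketch; the function name is illustrative and not from the PR). Note that dirty pages must be flushed with fsync first, since DONTNEED cannot evict pages that are still dirty.

```python
import os

def write_block_evict(fd: int, data: bytes, offset: int) -> int:
    """Write a block, then drop it from the OS page cache so a later
    read of the same block hits the SSD instead of memory."""
    written = os.pwrite(fd, data, offset)
    os.fsync(fd)  # flush dirty pages; DONTNEED only evicts clean pages
    os.posix_fadvise(fd, offset, len(data), os.POSIX_FADV_DONTNEED)
    return written
```

The O_DIRECT alternative avoids the page cache entirely but requires buffers and offsets aligned to the device's logical block size, which is awkward from pure Python; posix_fadvise is the lower-friction option.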

Collaborator Author


We have added posix_fadvise.


## Requirements

- Python 3.8+
Collaborator


I think we only support Python 3.10+ now, since PyTorch also does.

Collaborator Author


Fixed

@stmatengss
Collaborator

Thoughts:
It could be added as a module in the wheel, so we could run it as python3 -m mooncake.benchmark.storage_bench [args]

@alogfans
Collaborator Author

Thoughts: It can be added as a module in wheel. So we can run it as python3 -m mooncake.benchmark.storage_bench [args]

It can also be used standalone, so I think keeping it as a separate script may be better.

Collaborator

@stmatengss stmatengss left a comment


Code Review Summary

Verdict: Approve with suggestions

This PR adds a valuable storage benchmark tool for evaluating KVCache I/O performance using Mooncake traces. The implementation is well-structured and documented.

Strengths ✅

Architecture

  • Clean offset allocator design mimicking Mooncake's approach
  • Single-file storage avoids file explosion
  • Uses pread/pwrite for thread-safe operations
  • Comprehensive statistics tracking

Documentation

  • Excellent inline comments explaining design decisions
  • Clear examples in docstrings
  • Well-structured code with logical sections

Functionality

  • Supports multiple model configurations (LLaMA, Qwen, DeepSeek, etc.)
  • Timestamp replay for realistic workload simulation
  • Detailed latency percentile tracking (p50/p95/p99)

Issues & Suggestions

1. Performance: fsync on every write is expensive
Line 260: os.fsync(fd) after every block write will severely impact throughput. Consider:

  • Batch fsync every N writes
  • Make fsync optional via flag
  • Use O_SYNC flag instead for better performance
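The "batch fsync every N writes" option can be sketched as below (class and parameter names are illustrative, not from the PR); it amortizes the fsync cost over several block writes at the cost of a larger window of unsynced data.

```python
import os

class BatchedSyncWriter:
    """Amortize fsync cost by syncing once per N block writes."""

    def __init__(self, fd: int, sync_every: int = 32):
        self.fd = fd
        self.sync_every = sync_every
        self._pending = 0  # writes since the last fsync

    def write(self, data: bytes, offset: int) -> None:
        os.pwrite(self.fd, data, offset)
        self._pending += 1
        if self._pending >= self.sync_every:
            os.fsync(self.fd)
            self._pending = 0

    def flush(self) -> None:
        """Sync any writes still pending (call at benchmark end)."""
        if self._pending:
            os.fsync(self.fd)
            self._pending = 0
```

Note that batching changes what is measured: per-write latencies no longer include durability cost, so the docs would need to state which mode a given result was collected under.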

2. Memory: Unbounded latency list growth
Lines 140-141, 231: Latency lists grow unbounded. For large traces (millions of requests), this will consume GBs of memory. Consider:

  • Streaming percentile calculation (t-digest or reservoir sampling)
  • Periodic aggregation and reset
  • Max list size with sampling
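The reservoir-sampling option keeps memory bounded while still giving approximate percentiles; a minimal sketch (classic Algorithm R; the class name is illustrative, not from the PR):

```python
import random

class LatencyReservoir:
    """Bounded-memory latency sample for percentile estimation."""

    def __init__(self, capacity: int = 100_000):
        self.capacity = capacity
        self.samples: list[float] = []
        self.count = 0  # total observations seen

    def add(self, latency_ms: float) -> None:
        self.count += 1
        if len(self.samples) < self.capacity:
            self.samples.append(latency_ms)
        else:
            # Replace a random slot with probability capacity/count,
            # keeping a uniform sample over everything seen so far.
            j = random.randrange(self.count)
            if j < self.capacity:
                self.samples[j] = latency_ms

    def percentile(self, p: float) -> float:
        s = sorted(self.samples)
        return s[min(len(s) - 1, int(p / 100 * len(s)))]
```

With the default capacity this caps memory at a few hundred KB regardless of trace size, while p50/p95/p99 remain accurate enough for benchmark reporting.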

3. Security: os.urandom is slow
Line 251: os.urandom() is cryptographically secure but slow. For benchmarking, use:

data = bytes(self.block_size_bytes)  # Zero-filled
# or
data = bytearray(self.block_size_bytes)

4. Bug: File descriptor leak risk
_get_fd() has no try/finally. If an exception occurs during setup, the file descriptor may leak. Add proper cleanup.
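The cleanup pattern being suggested looks roughly like this (a sketch with illustrative names, not the PR's actual _get_fd): close the descriptor on any failure between open and return.

```python
import os

def open_storage_file(path: str, size_bytes: int) -> int:
    """Open or create the backing file, pre-allocating it sparsely.
    Closes the fd if setup fails, so the descriptor never leaks."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.ftruncate(fd, size_bytes)  # sparse pre-allocation
    except OSError:
        os.close(fd)  # don't leak the descriptor on failure
        raise
    return fd
```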

5. Missing: Error handling

  • No validation for trace file format
  • No handling of disk full scenarios
  • No cleanup on benchmark failure

6. Configuration: Hard-coded constants

  • BLOCK_SIZE_TOKENS=512 should be configurable
  • MIN_LATENCY_MS seems arbitrary

7. Testing: No unit tests
Checklist shows tests not added. Consider adding:

  • Unit tests for OffsetAllocatorStorage
  • Mock trace processing tests
  • Edge case validation

See inline comments for specific code suggestions.

