Skip to content

E2E storage backend test#1660

Open
maheshrbapatu wants to merge 1 commit intokvcache-ai:mainfrom
maheshrbapatu:e2e_storage_backend
Open

E2E storage backend test#1660
maheshrbapatu wants to merge 1 commit intokvcache-ai:mainfrom
maheshrbapatu:e2e_storage_backend

Conversation

@maheshrbapatu
Copy link
Collaborator

Description

Implemented a new storage_backend_e2e_test that validates the storage demotion pipeline using an in-process master (InProcMaster) instead of external master + etcd orchestration.

Key changes:

Added storage_backend_e2e_test target and test registration in CMakeLists.txt.
Added new E2E coverage in storage_backend_e2e_test.cpp for:
offload and readback
cross-client readback
disk-only read after eviction
disk replica survival after writer exit
graceful behavior under quota pressure
Extended client_wrapper.h and client_wrapper.cpp with replica inspection helpers (HasDiskReplica, HasMemoryReplica, etc.) and GetWithExpectedSize to support disk-only assertions.
Stabilized eviction coverage by enforcing valid mount sizing and asserting eviction behavior in a way that matches the implementation (at least one expired seed key evicted, then validated readable from disk).

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Other

How Has This Been Tested?

This change itself is about e2e tests for testing storage backend.

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness of the storage backend by introducing a comprehensive suite of end-to-end tests. These tests validate critical aspects of the storage demotion pipeline, ensuring data integrity and accessibility even when memory replicas are evicted or client processes fail. The new test framework utilizes an in-process master, simplifying test orchestration and improving reliability.

Highlights

  • New E2E Test Suite: Introduced a new end-to-end test suite, storage_backend_e2e_test, to validate the storage demotion pipeline using an in-process master, simplifying test orchestration and improving reliability.
  • Expanded Test Coverage: Expanded test coverage to include critical scenarios such as offload and readback, cross-client readback, disk-only reads after eviction, disk replica survival after writer exit, and graceful behavior under quota pressure.
  • Client Wrapper Enhancements: Enhanced client_wrapper with new helper functions (HasDiskReplica, HasMemoryReplica, GetWithExpectedSize) for comprehensive replica inspection and enabling disk-only assertions.
  • Eviction Test Stability: Improved eviction test stability by enforcing valid mount sizing and asserting eviction behavior more precisely to match the implementation.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • mooncake-store/tests/e2e/CMakeLists.txt
    • Added storage_backend_e2e_test as a new executable target.
    • Registered storage_backend_e2e_test as a CTest test.
  • mooncake-store/tests/e2e/client_wrapper.cpp
    • Modified Client::Create call to pass std::nullopt for an optional parameter.
    • Implemented HasDiskReplica to check for disk-only replicas.
    • Implemented HasLocalDiskReplica to check for local disk replicas.
    • Implemented HasMemoryReplica to check for memory replicas.
    • Implemented GetWithExpectedSize to retrieve data from disk-only replicas.
  • mooncake-store/tests/e2e/client_wrapper.h
    • Declared HasDiskReplica, HasLocalDiskReplica, and HasMemoryReplica methods.
    • Declared GetWithExpectedSize method for reading disk-only replicas.
  • mooncake-store/tests/e2e/storage_backend_e2e_test.cpp
    • Added a new E2E test file storage_backend_e2e_test.cpp.
    • Defined StorageBackendE2ETest class inheriting from ::testing::Test.
    • Included test cases for OffloadAndReadback, CrossClientReadback, DiskOnlyReadAfterEviction, DiskReplicaSurvivesWriterDeath, and QuotaExhaustionGracefulBehavior.
Activity
  • The pull request was created by maheshrbapatu.
  • No human activity (comments, reviews) has occurred yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a valuable end-to-end test suite for the storage backend, which is crucial for validating the storage demotion pipeline. The tests cover key scenarios including offload/readback, cross-client access, eviction, and fault tolerance. The implementation is solid, but I have a few suggestions to enhance code quality, correctness, and test robustness. Specifically, there's a bug in the client wrapper creation where a parameter is ignored, a test that could be made stricter, and opportunities to improve code consistency and reduce duplication.

Comment on lines +28 to +29
auto client_opt = Client::Create(hostname, metadata_connstring, protocol,
std::nullopt, master_server_entry);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The device_name parameter is unused in CreateClientWrapper. It should be passed to Client::Create to properly configure the client with the specified device. The current implementation hardcodes std::nullopt, which ignores the device name provided by the caller in the new test.

Suggested change
auto client_opt = Client::Create(hostname, metadata_connstring, protocol,
std::nullopt, master_server_entry);
auto client_opt = Client::Create(hostname, metadata_connstring, protocol,
device_name, master_server_entry);

Comment on lines +343 to +351
auto err = reader->GetWithExpectedSize(keys[i], kValueSize, got);
if (err == ErrorCode::OK) {
EXPECT_EQ(got, expected_values[i])
<< "Data mismatch for " << keys[i];
} else {
LOG(WARNING)
<< "Key " << keys[i] << " not readable after writer "
<< "death: " << toString(err);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The test DiskReplicaSurvivesWriterDeath should fail if reading from the disk replica fails after the writer's death. Currently, it only logs a warning, which could mask a regression. It's better to use ASSERT_EQ to ensure the read operation is successful, making the test more robust.

            std::string got;
            ASSERT_EQ(reader->GetWithExpectedSize(keys[i], kValueSize, got), ErrorCode::OK)
                << "Failed to read key " << keys[i] << " after writer death";
            EXPECT_EQ(got, expected_values[i])
                << "Data mismatch for " << keys[i];

Comment on lines +153 to +178
bool ClientTestWrapper::HasDiskReplica(const std::string& key) {
auto query_result = client_->Query(key);
if (!query_result.has_value()) return false;
for (const auto& replica : query_result.value().replicas) {
if (replica.is_disk_replica()) return true;
}
return false;
}

bool ClientTestWrapper::HasLocalDiskReplica(const std::string& key) {
auto query_result = client_->Query(key);
if (!query_result.has_value()) return false;
for (const auto& replica : query_result.value().replicas) {
if (replica.is_local_disk_replica()) return true;
}
return false;
}

bool ClientTestWrapper::HasMemoryReplica(const std::string& key) {
auto query_result = client_->Query(key);
if (!query_result.has_value()) return false;
for (const auto& replica : query_result.value().replicas) {
if (replica.is_memory_replica()) return true;
}
return false;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

These three functions (HasDiskReplica, HasLocalDiskReplica, HasMemoryReplica) have very similar implementations. You can reduce code duplication and improve readability by using std::any_of. This would make the code more concise and maintainable. Note that you might need to include <algorithm>.

bool ClientTestWrapper::HasDiskReplica(const std::string& key) {
    auto query_result = client_->Query(key);
    if (!query_result.has_value()) return false;
    const auto& replicas = query_result.value().replicas;
    return std::any_of(replicas.begin(), replicas.end(),
                       [](const auto& r) { return r.is_disk_replica(); });
}

bool ClientTestWrapper::HasLocalDiskReplica(const std::string& key) {
    auto query_result = client_->Query(key);
    if (!query_result.has_value()) return false;
    const auto& replicas = query_result.value().replicas;
    return std::any_of(replicas.begin(), replicas.end(),
                       [](const auto& r) { return r.is_local_disk_replica(); });
}

bool ClientTestWrapper::HasMemoryReplica(const std::string& key) {
    auto query_result = client_->Query(key);
    if (!query_result.has_value()) return false;
    const auto& replicas = query_result.value().replicas;
    return std::any_of(replicas.begin(), replicas.end(),
                       [](const auto& r) { return r.is_memory_replica(); });
}

}

// Wait for client TTL expiry so master cleans up the writer.
sleep(DEFAULT_CLIENT_LIVE_TTL_SEC + 5);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For consistency with the rest of the file, which uses std::this_thread::sleep_for, consider replacing the C-style sleep() call with its modern C++ equivalent. This improves code consistency and type safety.

Suggested change
sleep(DEFAULT_CLIENT_LIVE_TTL_SEC + 5);
std::this_thread::sleep_for(std::chrono::seconds(DEFAULT_CLIENT_LIVE_TTL_SEC + 5));

@codecov-commenter
Copy link

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 91.73913% with 19 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ncake-store/tests/e2e/storage_backend_e2e_test.cpp 94.00% 12 Missing ⚠️
mooncake-store/tests/e2e/client_wrapper.cpp 76.66% 7 Missing ⚠️

📢 Thoughts on this report? Let us know!

@ykwd
Copy link
Collaborator

ykwd commented Mar 16, 2026

Thanks a lot. This is a very thorough test! We will run it locally to conduct more detailed testing of the storage backend.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants