[RFC]: Rolling Upgrade Support for Mooncake Store #1920

@dtcccc

Description

Changes proposed

Summary

Enable rolling upgrades of Mooncake Store clusters — replacing nodes one at a time without service interruption — across minor versions (e.g., 2.0 → 2.1).

Motivation

Mooncake Store clusters consist of Master nodes (metadata), Compute Clients (stateless, embedded in inference frameworks), and Storage Clients (KV cache data). Currently, upgrading any component requires a full cluster restart. Rolling upgrades are essential for production availability.

Scope

  • In scope: Mooncake codebase changes to enable rolling upgrades across minor versions.
  • Out of scope: Kubernetes Operator/Controller, cross-major-version upgrades, Transfer Engine wire protocol changes.

Current Capabilities

Master HA

  • Leader election via LeaderCoordinator (etcd and Redis backends).
  • HotStandbyService with OpLog replication and Snapshot bootstrap; Standby nodes catch up in real time.
  • Master lifecycle state machine: Starting → Standby → Candidate → Recovering → CatchingUp → LeaderWarmup → Serving.
  • Clients detect Master switches via ViewVersionId changes in PingResponse and auto re-mount when NEED_REMOUNT is returned (also triggered by Ping TTL expiration and first-time connection).

Compute Client

  • Stateless by design. Can be restarted directly without data migration.

Storage Client Data Migration (Drain)

  • Segment lifecycle: OK → DRAINING → DRAINED → UNMOUNTING. DRAINING segments remain readable but reject new allocations.
  • Drain Job system via HTTP API (/api/v1/drain_jobs): Master schedules REPLICA_MOVE tasks, Clients pull via FetchTasks() and report via MarkTaskToComplete().
  • Target selection: SelectDrainTargetForKey picks the segment with lowest utilization.

Serialization

  • Client-Master RPC uses coro_rpc + struct_pack (binary). coro_rpc only supports struct_pack as its codec.
  • HTTP API uses struct_json (yalantinglibs). OpLog uses jsoncpp. Both are JSON-based and naturally forward-compatible.

Proposed Changes

1. Forward-Compatible RPC Serialization

All RPC struct fields today are plain members — none use struct_pack::compatible<T>. If any version adds or reorders a field, deserialization between old and new versions will fail. Additionally, MasterClient::Connect performs a strict server_version != client_version check that blocks any mixed-version communication.

Proposal:

  1. Adopt struct_pack::compatible<T, version> for all new RPC struct fields. Older deserializers skip unknown compatible fields; newer deserializers see fields absent from older payloads as empty (std::nullopt). No transport-layer changes are needed.
  2. Relax version check: modify ServiceReady to return a structured response including protocol_version (major version integer). This is a one-time breaking change as part of establishing the compatibility baseline. Replace the strict equality check with a major-version check: connection is allowed when both sides share the same major version.
  3. Define MOONCAKE_PROTOCOL_VERSION in version.h.in. Within the same major version, all struct changes must be additive-only. Breaking changes require bumping the major version.
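A minimal sketch of the relaxed handshake check described above (kProtocolMajor and the ServiceReadyResponse field are illustrative stand-ins, not the actual Mooncake definitions):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-in for MOONCAKE_PROTOCOL_VERSION from version.h.in.
constexpr std::uint32_t kProtocolMajor = 2;

// Hypothetical structured ServiceReady response carrying the major version.
struct ServiceReadyResponse {
    std::uint32_t protocol_version;
};

// Old behavior: strict equality on the full version blocked any mixed-version
// communication. New behavior: allow the connection whenever both sides share
// the same major protocol version; minor versions may differ.
inline bool VersionCompatible(std::uint32_t client_major,
                              std::uint32_t server_major) {
    return client_major == server_major;
}
```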

2. Master Graceful StepDown API

Current HA failover relies on passive lease expiration. Rolling upgrades need active, controlled leader transfer.

Proposal:

Add kSteppingDown to MasterRuntimeState (rejects new writes, allows reads, drains in-flight writes). Expose via POST /api/v1/admin/stepdown.

Master Upgrade Flow: Start new-version Standby → wait for IsReadyForPromotion() → StepDown old Primary → new Standby wins election → Clients auto re-mount → remove old container.
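The gating behavior of the new state could look like the following sketch (enum values abbreviated; this is not the actual MasterRuntimeState definition):

```cpp
#include <cassert>

// Abbreviated sketch of the runtime state machine with the proposed state.
enum class MasterRuntimeState { kServing, kSteppingDown, kStandby };

// kSteppingDown rejects new writes but keeps serving reads while in-flight
// writes drain, so Clients see no read interruption during leader transfer.
inline bool AllowWrite(MasterRuntimeState s) {
    return s == MasterRuntimeState::kServing;
}

inline bool AllowRead(MasterRuntimeState s) {
    return s == MasterRuntimeState::kServing ||
           s == MasterRuntimeState::kSteppingDown;
}
```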

3. Version-Aware Drain Scheduling

During rolling upgrades, SelectDrainTargetForKey may migrate data to old-version nodes that will themselves be drained later, causing redundant data movement.

Proposal:

  1. Client version reporting: add client_version as a parameter to MountSegment. Master stores a client_id → version mapping.
  2. Version-aware target selection: add a prefer_newer_targets flag to CreateDrainJobRequest. When enabled, SelectDrainTargetForKey filters candidates by utilization threshold, then prefers higher client_version among eligible candidates, with fallback to lowest utilization.
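The selection policy above can be sketched as follows, assuming a hypothetical candidate descriptor (field names are illustrative, not the actual Mooncake types):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical candidate descriptor for a mounted segment.
struct SegmentCandidate {
    std::string segment_name;
    double utilization;            // 0.0 .. 1.0
    std::uint32_t client_version;  // reported via MountSegment
};

// Sketch of the proposed policy: among candidates under the utilization
// threshold, prefer the highest client_version (ties broken by lowest
// utilization); if nothing is eligible or the flag is off, fall back to
// the current behavior of picking the lowest-utilization candidate.
inline std::optional<SegmentCandidate> SelectDrainTarget(
    const std::vector<SegmentCandidate>& candidates,
    double utilization_threshold,
    bool prefer_newer_targets) {
    if (candidates.empty()) return std::nullopt;

    if (prefer_newer_targets) {
        std::vector<SegmentCandidate> eligible;
        for (const auto& c : candidates)
            if (c.utilization < utilization_threshold) eligible.push_back(c);
        if (!eligible.empty()) {
            return *std::min_element(
                eligible.begin(), eligible.end(),
                [](const SegmentCandidate& a, const SegmentCandidate& b) {
                    if (a.client_version != b.client_version)
                        return a.client_version > b.client_version;  // newer first
                    return a.utilization < b.utilization;            // then emptier
                });
        }
    }
    // Default / fallback: lowest utilization wins.
    return *std::min_element(
        candidates.begin(), candidates.end(),
        [](const SegmentCandidate& a, const SegmentCandidate& b) {
            return a.utilization < b.utilization;
        });
}
```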

Storage Client Upgrade Flow: Start new-version Client → Drain old node's segments (with prefer_newer_targets=true) → old node GracefulShutdown → remove old container → repeat.

4. Drain Progress Enhancement

QueryJobResponse lacks total_units, making progress tracking difficult.

Proposal: Add total_units and progress_percent to QueryJobResponse. total_units is computed at Job creation time and is stable (DRAINING segments reject new writes). Expose the existing QuerySegmentStatus RPC as GET /api/v1/segments/status.
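The derived progress field could be computed as in this sketch (struct and field names are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the proposed progress fields. total_units is fixed at job
// creation time (DRAINING segments reject new writes, so the unit count
// cannot grow); progress_percent is derived from completed units.
struct QueryJobProgress {
    std::uint64_t total_units;
    std::uint64_t completed_units;
};

inline double ProgressPercent(const QueryJobProgress& p) {
    if (p.total_units == 0) return 100.0;  // an empty job is trivially done
    return 100.0 * static_cast<double>(p.completed_units) /
           static_cast<double>(p.total_units);
}
```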

5. Client Graceful Shutdown

No structured shutdown sequence exists for Storage Clients.

Proposal: Add GracefulShutdown to the Client class: stop FetchTasks → wait for DRAINING segments to reach DRAINED (with timeout) → wait for in-flight transfers → UnmountSegment → disconnect. For standalone processes, add GET /healthz/ready (200 OK / 503 draining).
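The sequence could be wired roughly as follows; the step functions here are hypothetical stand-ins for the real Client internals named above:

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical hooks standing in for the Client internals in the proposal.
struct ShutdownSteps {
    std::function<void()> stop_fetch_tasks;
    std::function<bool(std::chrono::milliseconds)> wait_segments_drained;
    std::function<bool(std::chrono::milliseconds)> wait_inflight_transfers;
    std::function<void()> unmount_segments;
    std::function<void()> disconnect;
};

// Returns false when a wait step timed out; the caller can then decide to
// force-unmount or rely on replica redundancy.
inline bool GracefulShutdown(ShutdownSteps& s,
                             std::chrono::milliseconds drain_timeout,
                             std::chrono::milliseconds transfer_timeout) {
    s.stop_fetch_tasks();                                    // take no new drain work
    bool ok = s.wait_segments_drained(drain_timeout);        // DRAINING -> DRAINED
    ok = s.wait_inflight_transfers(transfer_timeout) && ok;  // let transfers finish
    s.unmount_segments();                                    // UnmountSegment on Master
    s.disconnect();
    return ok;
}
```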

6. Client Reconnection Robustness

When Master switches, all Clients simultaneously flood the new Master with ReMountSegment requests (thundering herd).

Proposal: Add jitter and exponential backoff to the reconnection logic when NEED_REMOUNT is detected. All parameters configurable via ClientConfig.
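A sketch of full-jitter exponential backoff; the BackoffConfig parameter names mirror what a ClientConfig might expose and are illustrative, not the actual Mooncake configuration keys:

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <random>

// Hypothetical backoff parameters, configurable via ClientConfig.
struct BackoffConfig {
    std::chrono::milliseconds base{100};
    std::chrono::milliseconds max{10'000};
    double multiplier = 2.0;
};

// Delay before the (attempt+1)-th retry: exponential growth capped at `max`,
// with full jitter (uniform in [0, cap]) so that Clients re-mounting after a
// Master switch do not all hit the new Master at the same instant.
inline std::chrono::milliseconds NextBackoff(const BackoffConfig& cfg,
                                             std::uint32_t attempt,
                                             std::mt19937& rng) {
    double exp_ms = static_cast<double>(cfg.base.count());
    for (std::uint32_t i = 0; i < attempt; ++i) exp_ms *= cfg.multiplier;
    exp_ms = std::min(exp_ms, static_cast<double>(cfg.max.count()));
    std::uniform_real_distribution<double> jitter(0.0, exp_ms);
    return std::chrono::milliseconds(static_cast<std::int64_t>(jitter(rng)));
}
```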

Risks and Mitigations

| Risk | Mitigation |
| --- | --- |
| compatible<T> only supports additive changes | Enforce via code review; breaking changes require a major version bump |
| Data unavailability during Drain | DRAINING segments remain readable; MoveTask is atomic |
| Drain timeout or failure | Built-in retry (kMaxDrainUnitRetries); Cancel API for rollback |
| Thundering herd on Master failover | Reconnection backoff with jitter (see Proposed Change 6) |

Rollback Strategy

  • Master: StepDown the failed new Master (or let its lease expire); old-version Standby wins the next election.
  • Storage Client: Drain data off the failed node (or rely on replica redundancy); start old-version replacement.
  • Version mismatch: Master rejects connections with mismatched major version; Client logs error and does not retry.

Implementation Phases

  • Phase 1 (Foundation): (1) forward-compatible RPC serialization, a prerequisite for all other changes; (6) reconnection robustness can proceed in parallel.
  • Phase 2 (Master Upgrade): (2) StepDown API.
  • Phase 3 (Storage Client Upgrade): (3) version-aware Drain scheduling, (4) Drain progress, and (5) graceful shutdown.

Non-Goals

  • Cross-major-version rolling upgrades (e.g., 2.x → 3.x)
  • Kubernetes Operator / Controller integration
  • Transfer Engine wire protocol changes (assumed stable across minor versions)
