Changes proposed
Summary
Enable rolling upgrades of Mooncake Store clusters — replacing nodes one at a time without service interruption — across minor versions (e.g., 2.0 → 2.1).
Motivation
Mooncake Store clusters consist of Master nodes (metadata), Compute Clients (stateless, embedded in inference frameworks), and Storage Clients (KV cache data). Currently, upgrading any component requires a full cluster restart. Rolling upgrades are essential for production availability.
Scope
- In scope: Mooncake codebase changes to enable rolling upgrades across minor versions.
- Out of scope: Kubernetes Operator/Controller, cross-major-version upgrades, Transfer Engine wire protocol changes.
Current Capabilities
Master HA
- Leader election via `LeaderCoordinator` (etcd and Redis backends).
- `HotStandbyService` with OpLog replication and Snapshot bootstrap; Standby nodes catch up in real time.
- Master lifecycle state machine: Starting → Standby → Candidate → Recovering → CatchingUp → LeaderWarmup → Serving.
- Clients detect Master switches via `ViewVersionId` changes in `PingResponse` and auto re-mount when `NEED_REMOUNT` is returned (also triggered by Ping TTL expiration and first-time connection).
Compute Client
- Stateless by design. Can be restarted directly without data migration.
Storage Client Data Migration (Drain)
- Segment lifecycle: `OK → DRAINING → DRAINED → UNMOUNTING`. DRAINING segments remain readable but reject new allocations.
- Drain Job system via HTTP API (`/api/v1/drain_jobs`): Master schedules `REPLICA_MOVE` tasks, Clients pull via `FetchTasks()` and report via `MarkTaskToComplete()`.
- Target selection: `SelectDrainTargetForKey` picks the segment with lowest utilization.
Serialization
- Client-Master RPC uses `coro_rpc` + `struct_pack` (binary). `coro_rpc` only supports `struct_pack` as its codec.
- HTTP API uses `struct_json` (yalantinglibs). OpLog uses jsoncpp. Both are JSON-based and naturally forward-compatible.
Proposed Changes
1. Forward-Compatible RPC Serialization
All RPC struct fields today are plain members; none use `struct_pack::compatible<T>`. If any version adds or reorders a field, deserialization between old and new versions will fail. Additionally, `MasterClient::Connect` performs a strict `server_version != client_version` check that blocks any mixed-version communication.
Proposal:
- Adopt `struct_pack::compatible<T, version>` for all new RPC struct fields. Older deserializers skip unknown `compatible` fields; newer deserializers treat missing ones as null. No transport-layer changes needed.
- Relax the version check: modify `ServiceReady` to return a structured response including `protocol_version` (a major-version integer). This is a one-time breaking change that establishes the compatibility baseline. Replace the strict equality check with a major-version check: a connection is allowed when both sides share the same major version.
- Define `MOONCAKE_PROTOCOL_VERSION` in `version.h.in`. Within the same major version, all struct changes must be additive-only; breaking changes require bumping the major version.
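The relaxed handshake reduces to a pure major-version comparison. A minimal sketch, assuming the server and client exchange semver strings; `MajorOf` and `IsCompatible` are hypothetical helper names, not the actual `MasterClient::Connect` code:

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: everything before the first '.' is the major version.
inline uint32_t MajorOf(const std::string& semver) {
  return static_cast<uint32_t>(std::stoul(semver.substr(0, semver.find('.'))));
}

// Connection is allowed iff both sides share the same major version;
// minor-version differences are tolerated thanks to compatible<> fields.
inline bool IsCompatible(const std::string& server_version,
                         const std::string& client_version) {
  return MajorOf(server_version) == MajorOf(client_version);
}
```

With this in place, a 2.0 client can talk to a 2.1 master during the rolling window, while a 3.x node is still rejected at connect time.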
2. Master Graceful StepDown API
Current HA failover relies on passive lease expiration. Rolling upgrades need active, controlled leader transfer.
Proposal:
Add `kSteppingDown` to `MasterRuntimeState` (rejects new writes, allows reads, drains in-flight writes). Expose via `POST /api/v1/admin/stepdown`.
Master Upgrade Flow: Start new-version Standby → wait for `IsReadyForPromotion()` → StepDown old Primary → new Standby wins election → Clients auto re-mount → remove old container.
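The request-gating behavior of the new state can be sketched as follows. The enum mirrors the lifecycle states listed under Current Capabilities plus the proposed `kSteppingDown`; the predicate names are illustrative, not the actual Mooncake definitions:

```cpp
// Sketch of how kSteppingDown could gate requests (illustrative names).
enum class MasterRuntimeState {
  kStarting, kStandby, kCandidate, kRecovering,
  kCatchingUp, kLeaderWarmup, kServing, kSteppingDown,
};

// Reads stay available while stepping down, so the cluster never loses
// metadata availability during the controlled leader transfer.
inline bool AllowsReads(MasterRuntimeState s) {
  return s == MasterRuntimeState::kServing ||
         s == MasterRuntimeState::kSteppingDown;
}

// New writes are rejected so in-flight writes can drain before the old
// Primary releases its lease.
inline bool AllowsNewWrites(MasterRuntimeState s) {
  return s == MasterRuntimeState::kServing;
}
```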
3. Version-Aware Drain Scheduling
During rolling upgrades, `SelectDrainTargetForKey` may migrate data to old-version nodes that will themselves be drained later, causing redundant data movement.
Proposal:
- Client version reporting: add `client_version` as a parameter to `MountSegment`. Master stores a `client_id → version` mapping.
- Version-aware target selection: add a `prefer_newer_targets` flag to `CreateDrainJobRequest`. When enabled, `SelectDrainTargetForKey` filters candidates by a utilization threshold, then prefers the highest `client_version` among eligible candidates, with fallback to lowest utilization.
Storage Client Upgrade Flow: Start new-version Client → Drain old node's segments (with `prefer_newer_targets=true`) → old node `GracefulShutdown` → remove old container → repeat.
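The proposed selection policy can be sketched as below. The `DrainCandidate` fields and the `SelectDrainTarget` signature are assumptions for illustration, not the actual Mooncake structs:

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Illustrative candidate record (field names are assumptions).
struct DrainCandidate {
  std::string segment_id;
  double utilization;       // 0.0 .. 1.0
  uint32_t client_version;  // e.g. encoded as major * 100 + minor
};

// Filter by utilization threshold, then prefer the highest client_version
// (ties broken by lower utilization); fall back to lowest utilization when
// prefer_newer_targets is off or nothing passes the threshold.
inline std::optional<std::string> SelectDrainTarget(
    std::vector<DrainCandidate> candidates,
    double utilization_threshold, bool prefer_newer_targets) {
  if (candidates.empty()) return std::nullopt;
  std::vector<DrainCandidate> eligible;
  for (const auto& c : candidates)
    if (c.utilization <= utilization_threshold) eligible.push_back(c);
  if (eligible.empty() || !prefer_newer_targets) {
    auto& pool = eligible.empty() ? candidates : eligible;
    auto it = std::min_element(
        pool.begin(), pool.end(),
        [](const auto& a, const auto& b) { return a.utilization < b.utilization; });
    return it->segment_id;
  }
  auto it = std::max_element(
      eligible.begin(), eligible.end(), [](const auto& a, const auto& b) {
        if (a.client_version != b.client_version)
          return a.client_version < b.client_version;
        return a.utilization > b.utilization;  // prefer lower utilization
      });
  return it->segment_id;
}
```

During an upgrade the flag steers replicas toward already-upgraded nodes, so a given replica is moved at most once instead of being re-drained off an old-version target later.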
4. Drain Progress Enhancement
`QueryJobResponse` lacks `total_units`, making progress tracking difficult.
Proposal: Add `total_units` and `progress_percent` to `QueryJobResponse`. `total_units` is computed at Job creation time and is stable (DRAINING segments reject new writes). Expose the existing `QuerySegmentStatus` RPC as `GET /api/v1/segments/status`.
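A minimal sketch of the two proposed fields, assuming a job tracks completed units against a total fixed at creation time (names illustrative):

```cpp
#include <cstdint>

// total_units is computed once at job creation and never grows, because
// DRAINING segments reject new writes; completed_units is advanced as
// Clients report via MarkTaskToComplete().
struct DrainJobProgress {
  uint64_t total_units = 0;
  uint64_t completed_units = 0;
};

// progress_percent as an integer 0..100; a zero-unit job is trivially done.
inline uint32_t ProgressPercent(const DrainJobProgress& p) {
  if (p.total_units == 0) return 100;
  return static_cast<uint32_t>(p.completed_units * 100 / p.total_units);
}
```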
5. Client Graceful Shutdown
No structured shutdown sequence exists for Storage Clients.
Proposal: Add `GracefulShutdown` to the `Client` class: stop `FetchTasks` → wait for DRAINING segments to reach DRAINED (with timeout) → wait for in-flight transfers → `UnmountSegment` → disconnect. For standalone processes, add `GET /healthz/ready` (200 OK / 503 draining).
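The "wait with timeout" step of the sequence above can be sketched as a bounded poll. `all_drained` is a hypothetical predicate over the client's segment states; a real implementation would likely hook into segment state transitions instead of polling:

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Returns true if all_drained() became true before the deadline, false on
// timeout (in which case the caller can escalate, e.g. force-unmount and
// rely on replica redundancy).
inline bool WaitUntilDrained(const std::function<bool()>& all_drained,
                             std::chrono::milliseconds timeout,
                             std::chrono::milliseconds poll_interval =
                                 std::chrono::milliseconds(100)) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!all_drained()) {
    if (std::chrono::steady_clock::now() >= deadline) return false;
    std::this_thread::sleep_for(poll_interval);
  }
  return true;
}
```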
6. Client Reconnection Robustness
When Master switches, all Clients simultaneously flood the new Master with `ReMountSegment` requests (thundering herd).
Proposal: Add jitter and exponential backoff to the reconnection logic when `NEED_REMOUNT` is detected. All parameters configurable via `ClientConfig`.
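A sketch of the delay computation, using capped exponential growth with full jitter so clients desynchronize; the parameter names only suggest what the `ClientConfig` entries might look like:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <random>

// Delay before re-mount attempt `attempt` (0-based). Ceiling grows as
// base * 2^attempt, capped at cap_ms; the actual delay is drawn uniformly
// from [0, ceiling] ("full jitter") to spread the herd.
inline std::chrono::milliseconds RemountBackoff(uint32_t attempt,
                                                std::mt19937& rng,
                                                uint64_t base_ms = 100,
                                                uint64_t cap_ms = 30000) {
  uint64_t ceiling = cap_ms;
  if (attempt < 63) ceiling = std::min(cap_ms, base_ms << attempt);
  std::uniform_int_distribution<uint64_t> dist(0, ceiling);
  return std::chrono::milliseconds(dist(rng));
}
```

With defaults, attempt 3 waits up to 800 ms and the ceiling saturates at 30 s, so even a large fleet spreads its `ReMountSegment` calls over the window instead of arriving in lockstep.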
Risks and Mitigations
| Risk | Mitigation |
| --- | --- |
| `compatible<T>` only supports additive changes | Enforce via code review; breaking changes require a major version bump |
| Data unavailability during Drain | DRAINING segments remain readable; `MoveTask` is atomic |
| Drain timeout or failure | Built-in retry (`kMaxDrainUnitRetries`); Cancel API for rollback |
| Thundering herd on Master failover | Reconnection backoff with jitter (Proposal 6) |
Rollback Strategy
- Master: StepDown the failed new Master (or let its lease expire); old-version Standby wins the next election.
- Storage Client: Drain data off the failed node (or rely on replica redundancy); start old-version replacement.
- Version mismatch: Master rejects connections with mismatched major version; Client logs error and does not retry.
Implementation Phases
- Phase 1 (Foundation): Proposal 1, forward-compatible RPC serialization (prerequisite for all others). Proposal 6, reconnection robustness, can be done in parallel.
- Phase 2 (Master Upgrade): Proposal 2, StepDown API.
- Phase 3 (Storage Client Upgrade): Proposals 3 (version-aware Drain), 4 (Drain progress), and 5 (graceful shutdown).
Non-Goals
- Cross-major-version rolling upgrades (e.g., 2.x → 3.x)
- Kubernetes Operator / Controller integration
- Transfer Engine wire protocol changes (assumed stable across minor versions)