[RFC]: Rolling Upgrade Support for Mooncake Store #1920

@dtcccc

Description

Changes proposed

Summary

Enable rolling upgrades of Mooncake Store clusters — replacing nodes one at a time without service interruption — across minor versions (e.g., 2.0 → 2.1).

Motivation

Mooncake Store clusters consist of Master nodes (metadata), Compute Clients (stateless, embedded in inference frameworks), and Storage Clients (KV cache data). Currently, upgrading any component requires a full cluster restart. Rolling upgrades are essential for production availability.

Scope

  • In scope: Mooncake codebase changes to enable rolling upgrades across minor versions.
  • Out of scope: Kubernetes Operator/Controller, cross-major-version upgrades, Transfer Engine wire protocol changes.

Current Capabilities

Master HA

  • Leader election via LeaderCoordinator (etcd and Redis backends).
  • HotStandbyService with OpLog replication and Snapshot bootstrap; Standby nodes catch up in real time.
  • Master lifecycle state machine: Starting → Standby → Candidate → Recovering → CatchingUp → LeaderWarmup → Serving.
  • Clients detect Master switches via ViewVersionId changes in PingResponse and auto re-mount when NEED_REMOUNT is returned (also triggered by Ping TTL expiration and first-time connection).

Compute Client

  • Stateless by design. Can be restarted directly without data migration.

Storage Client Data Migration (Drain)

  • Segment lifecycle: OK → DRAINING → DRAINED → UNMOUNTING. DRAINING segments remain readable but reject new allocations.
  • Drain Job system via HTTP API (/api/v1/drain_jobs): Master schedules REPLICA_MOVE tasks, Clients pull via FetchTasks() and report via MarkTaskToComplete().
  • Target selection: SelectDrainTargetForKey picks the segment with lowest utilization.

Serialization

  • Client-Master RPC uses coro_rpc + struct_pack (binary). coro_rpc only supports struct_pack as its codec.
  • HTTP API uses struct_json (yalantinglibs). OpLog uses jsoncpp. Both are JSON-based and naturally forward-compatible.

Proposed Changes

1. Forward-Compatible RPC Serialization

All RPC struct fields today are plain members — none use struct_pack::compatible<T>. If any version adds or reorders a field, deserialization between old and new versions will fail. Additionally, MasterClient::Connect performs a strict server_version != client_version check that blocks any mixed-version communication.

Proposal:

  1. Adopt struct_pack::compatible<T, version> for all new RPC struct fields. Older deserializers skip unknown compatible fields; newer deserializers see fields absent from older payloads as empty (std::nullopt). No transport-layer changes are needed.
  2. Relax version check: modify ServiceReady to return a structured response including protocol_version (major version integer). This is a one-time breaking change as part of establishing the compatibility baseline. Replace the strict equality check with a major-version check: connection is allowed when both sides share the same major version.
  3. Define MOONCAKE_PROTOCOL_VERSION in version.h.in. Within the same major version, all struct changes must be additive-only. Breaking changes require bumping the major version.
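A minimal sketch of the relaxed handshake check described above (kProtocolMajor and the ServiceReadyResponse field are illustrative stand-ins, not the actual Mooncake definitions):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-in for MOONCAKE_PROTOCOL_VERSION from version.h.in.
constexpr std::uint32_t kProtocolMajor = 2;

// Hypothetical structured ServiceReady response carrying the major version.
struct ServiceReadyResponse {
    std::uint32_t protocol_version;
};

// Old behavior: strict equality on the full version blocked any mixed-version
// communication. New behavior: allow the connection whenever both sides share
// the same major protocol version; minor versions may differ.
inline bool VersionCompatible(std::uint32_t client_major,
                              std::uint32_t server_major) {
    return client_major == server_major;
}
```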

2. Master Graceful StepDown API

Current HA failover relies on passive lease expiration. Rolling upgrades need active, controlled leader transfer.

Proposal:

Add kSteppingDown to MasterRuntimeState (rejects new writes, allows reads, drains in-flight writes). Expose via POST /api/v1/admin/stepdown.

Master Upgrade Flow: Start new-version Standby → wait for IsReadyForPromotion() → StepDown old Primary → new Standby wins election → Clients auto re-mount → remove old container.
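The gating behavior of the new state could look like the following sketch (enum values abbreviated; this is not the actual MasterRuntimeState definition):

```cpp
#include <cassert>

// Abbreviated sketch of the runtime state machine with the proposed state.
enum class MasterRuntimeState { kServing, kSteppingDown, kStandby };

// kSteppingDown rejects new writes but keeps serving reads while in-flight
// writes drain, so Clients see no read interruption during leader transfer.
inline bool AllowWrite(MasterRuntimeState s) {
    return s == MasterRuntimeState::kServing;
}

inline bool AllowRead(MasterRuntimeState s) {
    return s == MasterRuntimeState::kServing ||
           s == MasterRuntimeState::kSteppingDown;
}
```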

3. Version-Aware Drain Scheduling

During rolling upgrades, SelectDrainTargetForKey may migrate data to old-version nodes that will themselves be drained later, causing redundant data movement.

Proposal:

  1. Client version reporting: add client_version as a parameter to MountSegment. Master stores a client_id → version mapping.
  2. Version-aware target selection: add a prefer_newer_targets flag to CreateDrainJobRequest. When enabled, SelectDrainTargetForKey filters candidates by utilization threshold, then prefers higher client_version among eligible candidates, with fallback to lowest utilization.
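The selection policy above can be sketched as follows, assuming a hypothetical candidate descriptor (field names are illustrative, not the actual Mooncake types):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical candidate descriptor for a mounted segment.
struct SegmentCandidate {
    std::string segment_name;
    double utilization;            // 0.0 .. 1.0
    std::uint32_t client_version;  // reported via MountSegment
};

// Sketch of the proposed policy: among candidates under the utilization
// threshold, prefer the highest client_version (ties broken by lowest
// utilization); if nothing is eligible or the flag is off, fall back to
// the current behavior of picking the lowest-utilization candidate.
inline std::optional<SegmentCandidate> SelectDrainTarget(
    const std::vector<SegmentCandidate>& candidates,
    double utilization_threshold,
    bool prefer_newer_targets) {
    if (candidates.empty()) return std::nullopt;

    if (prefer_newer_targets) {
        std::vector<SegmentCandidate> eligible;
        for (const auto& c : candidates)
            if (c.utilization < utilization_threshold) eligible.push_back(c);
        if (!eligible.empty()) {
            return *std::min_element(
                eligible.begin(), eligible.end(),
                [](const SegmentCandidate& a, const SegmentCandidate& b) {
                    if (a.client_version != b.client_version)
                        return a.client_version > b.client_version;  // newer first
                    return a.utilization < b.utilization;            // then emptier
                });
        }
    }
    // Default / fallback: lowest utilization wins.
    return *std::min_element(
        candidates.begin(), candidates.end(),
        [](const SegmentCandidate& a, const SegmentCandidate& b) {
            return a.utilization < b.utilization;
        });
}
```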

Storage Client Upgrade Flow: Start new-version Client → Drain old node's segments (with prefer_newer_targets=true) → old node GracefulShutdown → remove old container → repeat.

4. Drain Progress Enhancement

QueryJobResponse lacks total_units, making progress tracking difficult.

Proposal: Add total_units and progress_percent to QueryJobResponse. total_units is computed at Job creation time and is stable (DRAINING segments reject new writes). Expose the existing QuerySegmentStatus RPC as GET /api/v1/segments/status.
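The derived progress field could be computed as in this sketch (struct and field names are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the proposed progress fields. total_units is fixed at job
// creation time (DRAINING segments reject new writes, so the unit count
// cannot grow); progress_percent is derived from completed units.
struct QueryJobProgress {
    std::uint64_t total_units;
    std::uint64_t completed_units;
};

inline double ProgressPercent(const QueryJobProgress& p) {
    if (p.total_units == 0) return 100.0;  // an empty job is trivially done
    return 100.0 * static_cast<double>(p.completed_units) /
           static_cast<double>(p.total_units);
}
```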

5. Client Graceful Shutdown

No structured shutdown sequence exists for Storage Clients.

Proposal: Add GracefulShutdown to the Client class: stop FetchTasks → wait for DRAINING segments to reach DRAINED (with timeout) → wait for in-flight transfers → UnmountSegment → disconnect. For standalone processes, add GET /healthz/ready (200 OK / 503 draining).
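The sequence could be wired roughly as follows; the step functions here are hypothetical stand-ins for the real Client internals named above:

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical hooks standing in for the Client internals in the proposal.
struct ShutdownSteps {
    std::function<void()> stop_fetch_tasks;
    std::function<bool(std::chrono::milliseconds)> wait_segments_drained;
    std::function<bool(std::chrono::milliseconds)> wait_inflight_transfers;
    std::function<void()> unmount_segments;
    std::function<void()> disconnect;
};

// Returns false when a wait step timed out; the caller can then decide to
// force-unmount or rely on replica redundancy.
inline bool GracefulShutdown(ShutdownSteps& s,
                             std::chrono::milliseconds drain_timeout,
                             std::chrono::milliseconds transfer_timeout) {
    s.stop_fetch_tasks();                                    // take no new drain work
    bool ok = s.wait_segments_drained(drain_timeout);        // DRAINING -> DRAINED
    ok = s.wait_inflight_transfers(transfer_timeout) && ok;  // let transfers finish
    s.unmount_segments();                                    // UnmountSegment on Master
    s.disconnect();
    return ok;
}
```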

6. Client Reconnection Robustness

When Master switches, all Clients simultaneously flood the new Master with ReMountSegment requests (thundering herd).

Proposal: Add jitter and exponential backoff to the reconnection logic when NEED_REMOUNT is detected. All parameters configurable via ClientConfig.
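A sketch of full-jitter exponential backoff; the BackoffConfig parameter names mirror what a ClientConfig might expose and are illustrative, not the actual Mooncake configuration keys:

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <random>

// Hypothetical backoff parameters, configurable via ClientConfig.
struct BackoffConfig {
    std::chrono::milliseconds base{100};
    std::chrono::milliseconds max{10'000};
    double multiplier = 2.0;
};

// Delay before the (attempt+1)-th retry: exponential growth capped at `max`,
// with full jitter (uniform in [0, cap]) so that Clients re-mounting after a
// Master switch do not all hit the new Master at the same instant.
inline std::chrono::milliseconds NextBackoff(const BackoffConfig& cfg,
                                             std::uint32_t attempt,
                                             std::mt19937& rng) {
    double exp_ms = static_cast<double>(cfg.base.count());
    for (std::uint32_t i = 0; i < attempt; ++i) exp_ms *= cfg.multiplier;
    exp_ms = std::min(exp_ms, static_cast<double>(cfg.max.count()));
    std::uniform_real_distribution<double> jitter(0.0, exp_ms);
    return std::chrono::milliseconds(static_cast<std::int64_t>(jitter(rng)));
}
```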

Risks and Mitigations

| Risk | Mitigation |
| --- | --- |
| compatible<T> only supports additive changes | Enforce via code review; breaking changes require a major version bump |
| Data unavailability during Drain | DRAINING segments remain readable; MoveTask is atomic |
| Drain timeout or failure | Built-in retry (kMaxDrainUnitRetries); Cancel API for rollback |
| Thundering herd on Master failover | Reconnection backoff with jitter (see Proposed Change 6) |

Rollback Strategy

  • Master: StepDown the failed new Master (or let its lease expire); old-version Standby wins the next election.
  • Storage Client: Drain data off the failed node (or rely on replica redundancy); start old-version replacement.
  • Version mismatch: Master rejects connections with mismatched major version; Client logs error and does not retry.

Implementation Phases

  • Phase 1 (Foundation): (1) forward-compatible RPC serialization, a prerequisite for all other changes; (6) reconnection robustness can proceed in parallel.
  • Phase 2 (Master Upgrade): (2) StepDown API.
  • Phase 3 (Storage Client Upgrade): (3) version-aware Drain scheduling, (4) Drain progress, and (5) graceful shutdown.

Non-Goals

  • Cross-major-version rolling upgrades (e.g., 2.x → 3.x)
  • Kubernetes Operator / Controller integration
  • Transfer Engine wire protocol changes (assumed stable across minor versions)
