SYSTEM DROP REPLICA fails during replica scale-down due to no retry — stale ZK metadata left behind #1943

@ihyeokss

Description

Summary

During replica scale-down (replicasCount: 2 → 1), the operator executes SYSTEM DROP REPLICA before the deleted replica's Keeper session has expired. The command fails with Code: 305, Can't drop replica ... because it's active, and the operator does not retry. This leaves stale replica metadata (znodes) in ClickHouse Keeper permanently.

Environment

  • Operator version: 0.26.0
  • ClickHouse version: 24.8 (Altinity Stable)
  • Database engine: Atomic with ReplicatedMergeTree tables (38 tables)
  • Keeper session_timeout_ms: 120000 (CHI spec)
  • Cluster layout: 2 shards × 2 replicas

Steps to Reproduce

  1. Deploy a CHI with replicasCount: 2
  2. Scale down to replicasCount: 1
  3. Observe operator logs
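For reference, a minimal CHI spec that matches the environment above might look roughly like this (field names follow the clickhouse-operator CRD; the metadata name and cluster name are inferred from the pod names in the logs and may differ from the real manifest):

```yaml
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "dev-botmanager"          # inferred from pod names like chi-dev-botmanager-bmch-0-1
spec:
  configuration:
    zookeeper:
      session_timeout_ms: 120000  # Keeper session timeout from this report
    clusters:
      - name: "bmch"
        layout:
          shardsCount: 2
          replicasCount: 2        # change to 1 to trigger the scale-down
```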

Observed Behavior

Kubernetes resource deletion (Pod, StatefulSet, Service) completes normally. Then:

07:03:43Z Drop replica: chi-dev-botmanager-bmch-0-1 at 0-0
07:03:43Z FAILED to drop replica on host 0-1, err: Code: 305, Can't drop replica ..., because it's active
07:03:44Z Drop replica: chi-dev-botmanager-bmch-1-1 at 1-0
07:03:44Z FAILED to drop replica on host 1-1, err: Code: 305, Can't drop replica ..., because it's active
07:03:44Z FAILED single try. No retries will be made
07:03:44Z processed replicas: 2

Timeline: pods were deleted between 07:02:08Z and 07:02:18Z, and DROP REPLICA was attempted at 07:03:43Z, approximately 86–95 seconds after pod termination and less than the configured session_timeout_ms of 120000.

Result: After reconcile completes, the deleted replicas become inactive in system.replicas, but their znodes remain in Keeper:

SELECT path, groupArray(name) FROM system.zookeeper
WHERE path LIKE '%/replicas'
  AND path LIKE '%botmanager%'
GROUP BY path;

-- Both deleted replica names (bmch-0-1, bmch-1-1) still present as children
-- 38 stale paths per deleted replica (one per ReplicatedMergeTree table)

system.replicas on the surviving replicas shows total_replicas = 2 instead of the expected 1.
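A quick way to list the affected tables from a surviving replica (total_replicas and active_replicas are standard system.replicas columns; filtering on their mismatch is just one way to spot the stale entries):

```sql
SELECT database, table, total_replicas, active_replicas
FROM system.replicas
WHERE total_replicas > active_replicas;
```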

Root Cause

In pkg/controller/chi/worker-deleter.go, dropZKReplicas() calls dropZKReplica() for each removed host. The underlying HostDropReplica() in pkg/model/chi/schemer/schemer.go uses SetRetry(false):

// Single attempt, no retry on failure
opts.SetRetry(false)

There is no wait for the replica to become inactive before executing SYSTEM DROP REPLICA. The regular scale-down path in dropZKReplicas iterates removed hosts and immediately attempts the drop.

reconcile.host.wait.replicas.delay does not apply to this code path — it controls replication lag wait, not post-deletion session cleanup.

Expected Behavior

The operator should either:

  1. Wait for replica_is_active = 0 (poll system.replicas) before executing SYSTEM DROP REPLICA, or
  2. Retry on Code: 305 (because it's active) with backoff, for at least session_timeout_ms plus a buffer

Workaround

Reducing session_timeout_ms from 120000 to 60000 in the CHI spec narrows the race window enough that the operator's drop timing falls after session expiry.

We verified this with live experiments on a dev cluster:

| session_timeout_ms | DROP REPLICA result |
| --- | --- |
| 120000 | `because it's active` (reproduced twice) |
| 60000 | ✅ Success (verified twice consecutively) |

For cases where the race is still hit, manual cleanup is required:

-- Run on surviving replica of each shard
SYSTEM DROP REPLICA 'chi-dev-botmanager-bmch-0-1';
SYSTEM DROP REPLICA 'chi-dev-botmanager-bmch-1-1';
