Summary
During replica scale-down (replicasCount: 2 → 1), the operator executes SYSTEM DROP REPLICA before the deleted replica's Keeper session has expired. The command fails with Code: 305, Can't drop replica ... because it's active, and the operator does not retry. This leaves stale replica metadata (znodes) in ClickHouse Keeper permanently.
Environment
- Operator version: 0.26.0
- ClickHouse version: 24.8 (Altinity Stable)
- Database engine: Atomic with ReplicatedMergeTree tables (38 tables)
- Keeper session_timeout_ms: 120000 (CHI spec)
- Cluster layout: 2 shards × 2 replicas
Steps to Reproduce
- Deploy a CHI with replicasCount: 2
- Scale down to replicasCount: 1
- Observe operator logs
Observed Behavior
Kubernetes resource deletion (Pod, StatefulSet, Service) completes normally. Then:
07:03:43Z Drop replica: chi-dev-botmanager-bmch-0-1 at 0-0
07:03:43Z FAILED to drop replica on host 0-1, err: Code: 305, Can't drop replica ..., because it's active
07:03:44Z Drop replica: chi-dev-botmanager-bmch-1-1 at 1-0
07:03:44Z FAILED to drop replica on host 1-1, err: Code: 305, Can't drop replica ..., because it's active
07:03:44Z FAILED single try. No retries will be made
07:03:44Z processed replicas: 2
Timeline: Pods deleted at 07:02:08Z–07:02:18Z, DROP REPLICA attempted at 07:03:43Z — approximately 86–95 seconds after pod termination, which is less than the configured session_timeout_ms of 120000.
Result: After reconcile completes, the deleted replicas become inactive in system.replicas, but their znodes remain in Keeper:
SELECT path, groupArray(name) FROM system.zookeeper
WHERE path LIKE '%/replicas'
AND path LIKE '%botmanager%'
GROUP BY path;
-- Both deleted replica names (bmch-0-1, bmch-1-1) still present as children
-- 38 stale paths per deleted replica (one per ReplicatedMergeTree table)
system.replicas shows total_replicas = 2 instead of the expected 1.
Root Cause
In pkg/controller/chi/worker-deleter.go, dropZKReplicas() calls dropZKReplica() for each removed host. The underlying HostDropReplica() in pkg/model/chi/schemer/schemer.go uses SetRetry(false):
// Single attempt, no retry on failure
opts.SetRetry(false)
There is no wait for the replica to become inactive before executing SYSTEM DROP REPLICA. The regular scale-down path in dropZKReplicas iterates over the removed hosts and immediately attempts the drop.
reconcile.host.wait.replicas.delay does not apply to this code path — it controls replication lag wait, not post-deletion session cleanup.
Expected Behavior
The operator should either:
- Wait for replica_is_active = 0 (poll system.replicas) before executing SYSTEM DROP REPLICA, or
- Retry on Code: 305 (because it's active) with backoff, for at least a session_timeout_ms + buffer duration
Workaround
Reducing session_timeout_ms from 120000 to 60000 in the CHI spec narrows the race window enough that the operator's drop timing falls after session expiry.
We verified this with live experiments on a dev cluster:
| session_timeout_ms | DROP REPLICA result |
|---|---|
| 120000 | ❌ because it's active (reproduced twice) |
| 60000 | ✅ Success (verified twice consecutively) |
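In CHI terms the workaround is a one-field change, assuming the standard zookeeper section of the Altinity CHI spec (the keeper hostname below is a placeholder):

```yaml
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "dev-botmanager"
spec:
  configuration:
    zookeeper:
      nodes:
        - host: keeper.example.svc   # placeholder
      # Reduced from 120000 so the session expires before the
      # operator's post-deletion SYSTEM DROP REPLICA attempt
      session_timeout_ms: 60000
```

Note this only narrows the race window; it does not eliminate it.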
For cases where the race is still hit, manual cleanup is required:
-- Run on surviving replica of each shard
SYSTEM DROP REPLICA 'chi-dev-botmanager-bmch-0-1';
SYSTEM DROP REPLICA 'chi-dev-botmanager-bmch-1-1';
Related
- ZooKeeper entries not cleaned up when a shard is removed #1927 — ZK entries not cleaned on shard removal (different code path: shardFunc is a no-op, whereas replica removal does invoke hostFunc but fails due to this timing issue)