SYSTEM DROP REPLICA fails during replica scale-down due to no retry — stale ZK metadata left behind #1943

@ihyeokss

Description

Summary

During replica scale-down (replicasCount: 2 → 1), the operator executes SYSTEM DROP REPLICA before the deleted replica's Keeper session has expired. The command fails with Code: 305, Can't drop replica ... because it's active, and the operator does not retry. This leaves stale replica metadata (znodes) in ClickHouse Keeper permanently.

Environment

  • Operator version: 0.26.0
  • ClickHouse version: 24.8 (Altinity Stable)
  • Database engine: Atomic with ReplicatedMergeTree tables (38 tables)
  • Keeper session_timeout_ms: 120000 (CHI spec)
  • Cluster layout: 2 shards × 2 replicas

Steps to Reproduce

  1. Deploy a CHI with replicasCount: 2
  2. Scale down to replicasCount: 1
  3. Observe operator logs
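For reference, a minimal CHI spec that matches the environment above might look roughly like this (field names follow the clickhouse-operator CRD; the metadata name and cluster name are inferred from the pod names in the logs and may differ from the real manifest):

```yaml
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "dev-botmanager"          # inferred from pod names like chi-dev-botmanager-bmch-0-1
spec:
  configuration:
    zookeeper:
      session_timeout_ms: 120000  # Keeper session timeout from this report
    clusters:
      - name: "bmch"
        layout:
          shardsCount: 2
          replicasCount: 2        # change to 1 to trigger the scale-down
```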

Observed Behavior

Kubernetes resource deletion (Pod, StatefulSet, Service) completes normally. Then:

07:03:43Z Drop replica: chi-dev-botmanager-bmch-0-1 at 0-0
07:03:43Z FAILED to drop replica on host 0-1, err: Code: 305, Can't drop replica ..., because it's active
07:03:44Z Drop replica: chi-dev-botmanager-bmch-1-1 at 1-0
07:03:44Z FAILED to drop replica on host 1-1, err: Code: 305, Can't drop replica ..., because it's active
07:03:44Z FAILED single try. No retries will be made
07:03:44Z processed replicas: 2

Timeline: pods were deleted between 07:02:08Z and 07:02:18Z, and DROP REPLICA was attempted at 07:03:43Z, approximately 86–95 seconds after pod termination and less than the configured session_timeout_ms of 120000.

Result: After reconcile completes, the deleted replicas become inactive in system.replicas, but their znodes remain in Keeper:

SELECT path, groupArray(name) FROM system.zookeeper
WHERE path LIKE '%/replicas'
  AND path LIKE '%botmanager%'
GROUP BY path;

-- Both deleted replica names (bmch-0-1, bmch-1-1) still present as children
-- 38 stale paths per deleted replica (one per ReplicatedMergeTree table)

system.replicas on the surviving replicas shows total_replicas = 2 instead of the expected 1.
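A quick way to list the affected tables from a surviving replica (total_replicas and active_replicas are standard system.replicas columns; filtering on their mismatch is just one way to spot the stale entries):

```sql
SELECT database, table, total_replicas, active_replicas
FROM system.replicas
WHERE total_replicas > active_replicas;
```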

Root Cause

In pkg/controller/chi/worker-deleter.go, dropZKReplicas() calls dropZKReplica() for each removed host. The underlying HostDropReplica() in pkg/model/chi/schemer/schemer.go uses SetRetry(false):

// Single attempt, no retry on failure
opts.SetRetry(false)

There is no wait for the replica to become inactive before executing SYSTEM DROP REPLICA. The regular scale-down path in dropZKReplicas iterates removed hosts and immediately attempts the drop.

reconcile.host.wait.replicas.delay does not apply to this code path — it controls replication lag wait, not post-deletion session cleanup.

Expected Behavior

The operator should either:

  1. Wait for replica_is_active = 0 (poll system.replicas) before executing SYSTEM DROP REPLICA, or
  2. Retry on Code: 305 (because it's active) with backoff, for at least session_timeout_ms plus a buffer

Workaround

Reducing session_timeout_ms from 120000 to 60000 in the CHI spec narrows the race window enough that the operator's drop timing falls after session expiry.

We verified this with live experiments on a dev cluster:

| session_timeout_ms | DROP REPLICA result |
| --- | --- |
| 120000 | `because it's active` (reproduced twice) |
| 60000 | ✅ Success (verified twice consecutively) |

For cases where the race is still hit, manual cleanup is required:

-- Run on surviving replica of each shard
SYSTEM DROP REPLICA 'chi-dev-botmanager-bmch-0-1';
SYSTEM DROP REPLICA 'chi-dev-botmanager-bmch-1-1';
