Summary
After a Velero-based restore (backup/restore), the ClickHouse operator sometimes stops making progress on a ClickHouseInstallation (CHI): the CHI remains InProgress with HOSTS-COMPLETED empty even though the ClickHouse pod is Running. Operator logs stop advancing. Restarting the operator pod restores normal reconciliation.
We also see Kubernetes client informer traces indicating slow event handlers and non-trivial DeltaFIFO depth (~51), which suggests the operator’s shared work queue may be backing up or wedged after the restore storm / transient RBAC failures.
We are reporting this upstream in case it is a known class of issues (informer backlog, error handling after Forbidden/RBAC, or reconciliation blocking the shared queue).
Steps to reproduce
- Take a backup with Velero
- Restore the backup into a target cluster/namespace.
- Observe restore ordering:
  - Operator pod comes up before the ServiceAccount / Role / RoleBinding for the operator in some cases (delay < ~1 minute).
  - During that window, the operator logs authentication/authorization errors (cannot list/watch CHI, STS, PVC, etc.).
  - After the RBAC objects appear, the errors stop and the ClickHouse pod from the CHI can become Running.
- Bug: The CHI STATUS stays InProgress; HOSTS-COMPLETED does not advance; no further meaningful operator logs for the reconciliation path; the operator appears stalled.
- Workaround: Delete/restart the operator pod; reconciliation resumes and the CHI completes as expected.
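For reference, the reproduction can be sketched roughly as follows (backup/restore names and the default namespace are placeholders from our environment; this requires a live cluster with Velero and the operator installed):

```shell
# 1. Back up the namespace containing the operator and the CHI with Velero
velero backup create chi-backup --include-namespaces default

# 2. Restore into the target cluster; Velero's default resource ordering can
#    create the operator pod before its ServiceAccount/Role/RoleBinding
velero restore create chi-restore --from-backup chi-backup

# 3. Watch the CHI and the pods; the bug is a CHI stuck InProgress with
#    HOSTS-COMPLETED empty even though the ClickHouse pod is Running
watch -n 2 'kubectl get chi; kubectl get po | grep clickhouse'
```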
Retest with changed restore ordering:
- In Velero, changed the restore ordering to create Role, RoleBinding, and ServiceAccount first.
- The operator came up after the Role/RoleBinding/ServiceAccount already existed, so it never hit the RBAC errors.
- However, the operator still appears stalled and the CHI stays InProgress.
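For the reordered-restore retest, we moved the RBAC resources ahead of pods via the Velero server's restore-priority flag; a sketch (the exact priority list below is from our setup, not a recommendation):

```shell
# Velero server flag controlling restore ordering (comma-separated, restored
# in listed order). Placing serviceaccounts, roles, and rolebindings before
# pods ensures the operator's RBAC exists before its pod starts.
velero server \
  --restore-resource-priorities=customresourcedefinitions,namespaces,serviceaccounts,roles.rbac.authorization.k8s.io,rolebindings.rbac.authorization.k8s.io,secrets,configmaps,persistentvolumes,persistentvolumeclaims,pods
```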
Expected behavior
- After transient RBAC or API errors during restore, the operator should recover and eventually mark the CHI complete without requiring a manual operator restart.
- If the work queue is temporarily overloaded, we would still expect progress or clear error/retry signals, not a silent stall with a stuck CHI status.
Actual behavior
NAME         STATUS       CLUSTERS   HOSTS   HOSTS-COMPLETED   AGE
clickhouse   InProgress   1          1                         ~10m+
- ClickHouse pod state (example): chi-clickhouse-default-0-0-0 1/1 Running
- Operator pod: 2/2 Running, but no further useful logs after the stall begins.
Operator logs
I0424 06:13:50.985141 1 worker-config-map.go:70] updateConfigMap():CHI:default/clickhouse:Update ConfigMap default/chi-clickhouse-common-usersd
I0424 06:13:58.594193 1 trace.go:236] Trace[1021980655]: "DeltaFIFO Pop Process" ID:default/temporal-frontend-headless-mqr5w,Depth:51,Reason:slow event handlers blocking the queue (24-Apr-2026 06:13:58.491) (total time: 102ms):
Trace[1021980655]: [102.62776ms] [102.62776ms] END
I0424 06:27:58.786132 1 trace.go:236] Trace[2079076590]: "DeltaFIFO Pop Process" ID:default/kube-state-metrics,Depth:51,Reason:slow event handlers blocking the queue (24-Apr-2026 06:27:58.683) (total time: 101ms):
Trace[2079076590]: [101.600123ms] [101.600123ms] END
I0424 06:27:58.887442 1 trace.go:236] Trace[580084308]: "DeltaFIFO Pop Process" ID:default/tia-plus.app-change-pmgcv,Depth:23,Reason:slow event handlers blocking the queue (24-Apr-2026 06:27:58.786) (total time: 100ms):
Trace[580084308]: [1
Every 2.0s: kubectl get chi; kubectl get po | grep clickhouse
NAME                                   STATUS       CLUSTERS   HOSTS   HOSTS-COMPLETED   AGE   SUSPEND
clickhouse                             InProgress   1          1                         10m

chi-clickhouse-default-0-0-0           1/1   Running   0   10m
clickhouse-operator-774bbf94dd-vkns9   2/2   Running   0   15m
Workaround
Restart the clickhouse-operator deployment pod (or rollout restart). After restart, the CHI reconciles and completes.
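Concretely (deployment and pod names are from our cluster; a plain pod delete works too, since the Deployment recreates it):

```shell
# Restart the operator; once it comes back up, reconciliation resumes
kubectl -n default rollout restart deployment/clickhouse-operator

# or simply delete the stuck pod
kubectl -n default delete pod clickhouse-operator-774bbf94dd-vkns9
```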
Ask
- Is this a known interaction with Velero restore ordering / transient RBAC?
- Any recommended settings (workers, resync, timeouts) or version that mitigates informer backlog?
- If this matches a bug on your side, we are happy to collect goroutine dumps, full operator logs, or CHI/STS manifests as you specify.
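For goroutine dumps, one approach we could use (assuming the operator container runs the Go binary as PID 1 with the default Go signal handling; note SIGQUIT also terminates the process, so this clears the stall much like the restart workaround):

```shell
# Send SIGQUIT to the operator; the Go runtime prints all goroutine stacks to stderr
kubectl -n default exec clickhouse-operator-774bbf94dd-vkns9 \
  -c clickhouse-operator -- kill -QUIT 1

# The stack dump shows up in the container logs of the terminated instance
kubectl -n default logs clickhouse-operator-774bbf94dd-vkns9 \
  -c clickhouse-operator --previous
```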