Summary
After a Velero-based restore (backup/restore), the ClickHouse operator sometimes stops making progress on a ClickHouseInstallation (CHI): the CHI remains InProgress with HOSTS-COMPLETED empty even though the ClickHouse pod is Running. Operator logs stop advancing. Restarting the operator pod restores normal reconciliation.
We also see Kubernetes client informer traces indicating slow event handlers and non-trivial DeltaFIFO depth (~51), which suggests the operator’s shared work queue may be backing up or wedged after the restore storm / transient RBAC failures.
We are reporting this upstream in case it is a known class of issues (informer backlog, error handling after Forbidden/RBAC, or reconciliation blocking the shared queue).
Steps to reproduce
- Take a backup with Velero
- Restore the backup into a target cluster/namespace.
- Observe restore ordering:
  - Operator pod comes up before the ServiceAccount / Role / RoleBinding for the operator in some cases (delay < ~1 minute).
  - During that window, the operator logs authentication/authorization errors (cannot list/watch CHI, STS, PVC, etc.).
  - After the RBAC objects appear, the errors stop and the ClickHouse pod from the CHI can become Running.
- Bug: The CHI STATUS stays InProgress; HOSTS-COMPLETED does not advance; no further meaningful operator logs for the reconciliation path; the operator appears stalled.
- Workaround: Delete/restart the operator pod; reconciliation resumes and the CHI completes as expected.
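For reference, the reproduction can be sketched roughly as follows (backup/restore names and the default namespace are placeholders from our environment; this requires a live cluster with Velero and the operator installed):

```shell
# 1. Back up the namespace containing the operator and the CHI with Velero
velero backup create chi-backup --include-namespaces default

# 2. Restore into the target cluster; Velero's default resource ordering can
#    create the operator pod before its ServiceAccount/Role/RoleBinding
velero restore create chi-restore --from-backup chi-backup

# 3. Watch the CHI and the pods; the bug is a CHI stuck InProgress with
#    HOSTS-COMPLETED empty even though the ClickHouse pod is Running
watch -n 2 'kubectl get chi; kubectl get po | grep clickhouse'
```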
Retest with changed restore ordering:
- In Velero, changed the restore ordering to create Role, RoleBinding, and ServiceAccount first.
- The operator came up after the Role/RoleBinding/ServiceAccount already existed, so it never hit the RBAC errors.
- However, the operator still appears stalled and the CHI stays InProgress.
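For the reordered-restore retest, we moved the RBAC resources ahead of pods via the Velero server's restore-priority flag; a sketch (the exact priority list below is from our setup, not a recommendation):

```shell
# Velero server flag controlling restore ordering (comma-separated, restored
# in listed order). Placing serviceaccounts, roles, and rolebindings before
# pods ensures the operator's RBAC exists before its pod starts.
velero server \
  --restore-resource-priorities=customresourcedefinitions,namespaces,serviceaccounts,roles.rbac.authorization.k8s.io,rolebindings.rbac.authorization.k8s.io,secrets,configmaps,persistentvolumes,persistentvolumeclaims,pods
```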
Expected behavior
- After transient RBAC or API errors during restore, the operator should recover and eventually mark the CHI complete without requiring a manual operator restart.
- If the work queue is temporarily overloaded, we would still expect progress or clear error/retry signals, not a silent stall with a stuck CHI status.
Actual behavior
NAME         STATUS       CLUSTERS   HOSTS   HOSTS-COMPLETED   AGE
clickhouse   InProgress   1          1                         ~10m+
- ClickHouse pod state (example): chi-clickhouse-default-0-0-0 1/1 Running
- Operator pod: 2/2 Running, but no further useful logs after the stall begins.
Operator logs
I0424 06:13:50.985141 1 worker-config-map.go:70] updateConfigMap():CHI:default/clickhouse:Update ConfigMap default/chi-clickhouse-common-usersd
I0424 06:13:58.594193 1 trace.go:236] Trace[1021980655]: "DeltaFIFO Pop Process" ID:default/temporal-frontend-headless-mqr5w,Depth:51,Reason:slow event handlers blocking the queue (24-Apr-2026 06:13:58.491) (total time: 102ms):
Trace[1021980655]: [102.62776ms] [102.62776ms] END
I0424 06:27:58.786132 1 trace.go:236] Trace[2079076590]: "DeltaFIFO Pop Process" ID:default/kube-state-metrics,Depth:51,Reason:slow event handlers blocking the queue (24-Apr-2026 06:27:58.683) (total time: 101ms):
Trace[2079076590]: [101.600123ms] [101.600123ms] END
I0424 06:27:58.887442 1 trace.go:236] Trace[580084308]: "DeltaFIFO Pop Process" ID:default/tia-plus.app-change-pmgcv,Depth:23,Reason:slow event handlers blocking the queue (24-Apr-2026 06:27:58.786) (total time: 100ms):
Trace[580084308]: [1
Every 2.0s: kubectl get chi; kubectl get po | grep clickhouse
NAME                                   STATUS       CLUSTERS   HOSTS   HOSTS-COMPLETED   AGE   SUSPEND
clickhouse                             InProgress   1          1                         10m

chi-clickhouse-default-0-0-0           1/1   Running   0   10m
clickhouse-operator-774bbf94dd-vkns9   2/2   Running   0   15m
Workaround
Restart the clickhouse-operator deployment pod (or rollout restart). After restart, the CHI reconciles and completes.
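Concretely (deployment and pod names are from our cluster; a plain pod delete works too, since the Deployment recreates it):

```shell
# Restart the operator; once it comes back up, reconciliation resumes
kubectl -n default rollout restart deployment/clickhouse-operator

# or simply delete the stuck pod
kubectl -n default delete pod clickhouse-operator-774bbf94dd-vkns9
```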
Ask
- Is this a known interaction with Velero restore ordering / transient RBAC?
- Any recommended settings (workers, resync, timeouts) or version that mitigates informer backlog?
- If this matches a bug on your side, we are happy to collect goroutine dumps, full operator logs, or CHI/STS manifests as you specify.
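For goroutine dumps, one approach we could use (assuming the operator container runs the Go binary as PID 1 with the default Go signal handling; note SIGQUIT also terminates the process, so this clears the stall much like the restart workaround):

```shell
# Send SIGQUIT to the operator; the Go runtime prints all goroutine stacks to stderr
kubectl -n default exec clickhouse-operator-774bbf94dd-vkns9 \
  -c clickhouse-operator -- kill -QUIT 1

# The stack dump shows up in the container logs of the terminated instance
kubectl -n default logs clickhouse-operator-774bbf94dd-vkns9 \
  -c clickhouse-operator --previous
```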