Description
Bug Description
When the ClickHouseKeeperInstallation (CHK) CR spec has not changed but the operator is restarted (e.g. due to an operator config change), the CHK reconciler detects that existing StatefulSets have different object version labels (indicating they need updating), but then skips the entire host reconciliation path — including PVC label/annotation reconciliation — because the ActionPlan has no actions.
This means PVC labels and annotations are never updated to match the desired state after an operator restart or config change, unless the CHK CR spec itself is also modified.
Steps to Reproduce
- Deploy a ClickHouseKeeperInstallation with PVCs (e.g. a 3-node keeper cluster)
- Modify the operator configuration (e.g. ClickHouseOperatorConfiguration) in a way that changes the desired labels/annotations on PVCs
- The operator pod restarts and begins reconciling
- Observe the keeper PVCs retain their old labels/annotations
Expected Behavior
After operator restart, the CHK reconciler should reconcile PVC labels and annotations to match the desired state, even when the CHK CR spec itself has not changed.
Actual Behavior
The CHK reconciler:
- Detects that StatefulSets are DIFFERENT based on object version labels:
  object-status.go:54 cur and new objects ARE DIFFERENT based on object version label: Update of the object is required. Object: clickhouse/keeper-langfuse-db-0
- But then aborts because the ActionPlan (which diffs the CR spec, not child resources) has no actions:
  worker-reconciler-chk.go:95 ActionPlan has no actions - abort reconcile
- PVC reconciliation never runs because it is only reachable via reconcileHostMain() → reconcileHostPVCs(), which is gated behind the ActionPlan check.
Root Cause Analysis
In worker-reconciler-chk.go (v0.26.0), the reconcileCR() function at line 91 checks:
case new.EnsureRuntime().ActionPlan.HasActionsToDo():
    w.a.M(new).F().Info("ActionPlan has actions - continue reconcile")
The ActionPlan is built by api.MakeActionPlan(cr.GetAncestorT(), cr), which compares the CR spec (ancestor vs current). If the CR spec has not changed, the ActionPlan has no actions, and the reconciler returns early at line 97.
The PVC reconciliation logic in storage-reconciler.go (specifically the reconcilePVC function at line 283 which calls TagPVC + UpdateOrCreate) is only reachable through the host reconciliation path: reconcile() → reconcileClusterShardsAndHosts() → reconcileHost() → reconcileHostMain() → reconcileHostPVCs(). This entire path is skipped when the ActionPlan gate returns early.
Note that the CHI (ClickHouseInstallation) reconciler appears to have the same structural pattern but may be less affected in practice because CHI spec changes are more common.
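To make the gating concrete, here is a deliberately simplified model of the control flow described in this section. It is a sketch, not the operator's code: the types keeperCR and actionPlan, their fields, and the label values are hypothetical stand-ins, and only the log strings and the function names reconcileCR, reconcileHostPVCs and HasActionsToDo are borrowed from the report above.

```go
package main

import "fmt"

// actionPlan is a hypothetical stand-in for the operator's ActionPlan:
// it only knows whether the CR spec differs from its ancestor.
type actionPlan struct {
	specChanged bool
}

func (p actionPlan) HasActionsToDo() bool { return p.specChanged }

// keeperCR is a simplified model of a CHK resource, not the operator's type.
type keeperCR struct {
	plan          actionPlan
	pvcLabels     map[string]string // labels currently on the PVC
	desiredLabels map[string]string // labels the operator config now asks for
}

// reconcileHostPVCs models the PVC tagging step at the end of the host path.
func reconcileHostPVCs(cr *keeperCR) {
	for k, v := range cr.desiredLabels {
		cr.pvcLabels[k] = v
	}
}

// reconcileCR models the gate: with no spec change it returns before
// the host path (and therefore PVC tagging) is ever reached.
func reconcileCR(cr *keeperCR) string {
	if !cr.plan.HasActionsToDo() {
		return "ActionPlan has no actions - abort reconcile"
	}
	reconcileHostPVCs(cr)
	return "ActionPlan has actions - continue reconcile"
}

func main() {
	cr := &keeperCR{
		pvcLabels:     map[string]string{"team": "old"},
		desiredLabels: map[string]string{"team": "new"},
	}
	// Operator config changed, CR spec did not: the PVC label stays stale.
	fmt.Println(reconcileCR(cr), "->", cr.pvcLabels["team"])

	// Only a spec change (e.g. the taskID workaround) opens the gate.
	cr.plan.specChanged = true
	fmt.Println(reconcileCR(cr), "->", cr.pvcLabels["team"])
}
```

In this model the first call leaves the label at "old" and the second updates it to "new", which mirrors the observed behavior: desired-state drift on child resources is invisible to a gate that only diffs the CR spec.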
Suggested Fix
Consider one of:
- Always run PVC label/annotation reconciliation regardless of ActionPlan status (as a lightweight pass)
- Factor child resource drift (StatefulSet version label mismatches, PVC label/annotation mismatches) into the ActionPlan decision
- Add a separate reconciliation path for "metadata-only" updates (labels, annotations) that runs unconditionally
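As a rough illustration of the first and third options, a metadata-only pass could compute the drift between current and desired labels/annotations and patch only what differs, independently of the ActionPlan. The sketch below assumes a simplified pvc type and hypothetical helpers (drift, merge, reconcilePVCMetadata); none of these are the operator's actual API, and a real implementation would go through the existing TagPVC and UpdateOrCreate logic instead.

```go
package main

import "fmt"

// drift returns the desired key/value pairs that are missing from, or
// different in, current. Keys present only in current are left alone so
// that metadata set by other controllers is not removed.
func drift(current, desired map[string]string) map[string]string {
	out := make(map[string]string)
	for k, want := range desired {
		if got, ok := current[k]; !ok || got != want {
			out[k] = want
		}
	}
	return out
}

// pvc is a hypothetical, minimal stand-in for a PersistentVolumeClaim.
type pvc struct {
	labels      map[string]string
	annotations map[string]string
}

// merge copies patch into dst, allocating dst if it is nil.
func merge(dst, patch map[string]string) map[string]string {
	if dst == nil {
		dst = make(map[string]string)
	}
	for k, v := range patch {
		dst[k] = v
	}
	return dst
}

// reconcilePVCMetadata applies only the drifted labels/annotations and
// reports whether anything changed, so callers can skip no-op updates.
func reconcilePVCMetadata(p *pvc, wantLabels, wantAnnotations map[string]string) bool {
	labelPatch := drift(p.labels, wantLabels)
	annotationPatch := drift(p.annotations, wantAnnotations)
	if len(labelPatch) == 0 && len(annotationPatch) == 0 {
		return false
	}
	p.labels = merge(p.labels, labelPatch)
	p.annotations = merge(p.annotations, annotationPatch)
	return true
}

func main() {
	p := &pvc{labels: map[string]string{"team": "old", "keep": "me"}}
	want := map[string]string{"team": "new"}

	// First pass fixes the drift; second pass is a no-op.
	fmt.Println(reconcilePVCMetadata(p, want, nil), p.labels)
	fmt.Println(reconcilePVCMetadata(p, want, nil), p.labels)
}
```

Returning a changed flag keeps the pass lightweight: when nothing has drifted, no update call needs to be issued, so running it on every reconcile (including the "no actions" case) should add little load.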
Environment
- Operator version: 0.26.0
- Kubernetes: AWS EKS
- Resource type: ClickHouseKeeperInstallation
Workaround
Patching the taskID field on the CHK CR forces a full reconcile cycle:
kubectl patch chk <name> -n <namespace> --type=merge -p '{"spec":{"taskID":"force-reconcile-<timestamp>"}}'