CHK reconciler skips PVC label/annotation reconciliation when ActionPlan has no spec-level changes #1934

@glichten

Bug Description

When the ClickHouseKeeperInstallation (CHK) CR spec has not changed but the operator is restarted (e.g. due to an operator config change), the CHK reconciler detects that existing StatefulSets carry different object version labels, which indicates they need updating. It then skips the entire host reconciliation path, including PVC label/annotation reconciliation, because the ActionPlan has no actions.

This means PVC labels and annotations are never updated to match the desired state after an operator restart or config change, unless the CHK CR spec itself is also modified.

Steps to Reproduce

  1. Deploy a ClickHouseKeeperInstallation with PVCs (e.g. a 3-node keeper cluster)
  2. Modify the operator configuration (e.g. ClickHouseOperatorConfiguration) in a way that changes the desired labels/annotations on PVCs
  3. The operator pod restarts and begins reconciling
  4. Observe the keeper PVCs retain their old labels/annotations

Expected Behavior

After operator restart, the CHK reconciler should reconcile PVC labels and annotations to match the desired state, even when the CHK CR spec itself has not changed.

Actual Behavior

The CHK reconciler:

  1. Detects that StatefulSets are DIFFERENT based on object version labels:
    object-status.go:54  cur and new objects ARE DIFFERENT based on object version label:
      Update of the object is required. Object: clickhouse/keeper-langfuse-db-0
    
  2. But then aborts because the ActionPlan (which diffs the CR spec, not child resources) has no actions:
    worker-reconciler-chk.go:95  ActionPlan has no actions - abort reconcile
    
  3. PVC reconciliation never runs because it is only reachable via reconcileHostMain() → reconcileHostPVCs(), which is gated behind the ActionPlan check.

Root Cause Analysis

In worker-reconciler-chk.go (v0.26.0), the reconcileCR() function at line 91 checks:

case new.EnsureRuntime().ActionPlan.HasActionsToDo():
    w.a.M(new).F().Info("ActionPlan has actions - continue reconcile")

The ActionPlan is built by api.MakeActionPlan(cr.GetAncestorT(), cr) which compares the CR spec (ancestor vs current). If the CR spec has not changed, the ActionPlan has no actions, and the reconciler returns early at line 97.

The PVC reconciliation logic in storage-reconciler.go (specifically the reconcilePVC function at line 283, which calls TagPVC + UpdateOrCreate) is only reachable through the host reconciliation path: reconcile() → reconcileClusterShardsAndHosts() → reconcileHost() → reconcileHostMain() → reconcileHostPVCs(). This entire path is skipped when the ActionPlan gate returns early.
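
To make the gating easier to see, here is a minimal, self-contained sketch of the control flow described above. It is not the operator's code: the actionPlan type, its specChanged field, and the step names recorded in the slice are hypothetical stand-ins chosen only to show that nothing after the ActionPlan check runs when the CR spec is unchanged.

```go
package main

import "fmt"

// actionPlan is a hypothetical stand-in for the operator's ActionPlan.
// It only knows whether the CR spec differs from its ancestor.
type actionPlan struct{ specChanged bool }

// HasActionsToDo mirrors the name of the method referenced in the issue.
func (p actionPlan) HasActionsToDo() bool { return p.specChanged }

// reconcileCR mimics the shape of the gate and returns the names of the
// steps that ran, so the skipped path is visible in the result.
func reconcileCR(plan actionPlan) []string {
	steps := []string{"buildActionPlan"}
	if !plan.HasActionsToDo() {
		// Early return: the host path below is never reached.
		return append(steps, "abort")
	}
	// Host path, abbreviated: ... -> reconcileHostMain -> reconcileHostPVCs.
	return append(steps, "reconcileHostMain", "reconcileHostPVCs")
}

func main() {
	// Operator restart with an unchanged CR spec.
	fmt.Println(reconcileCR(actionPlan{specChanged: false}))
	// CR spec changed (for example, taskID was patched).
	fmt.Println(reconcileCR(actionPlan{specChanged: true}))
}
```

In this sketch, child-resource drift (such as a StatefulSet version label mismatch) has no way to influence the decision, which matches the behavior reported here.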

Note that the CHI (ClickHouseInstallation) reconciler appears to have the same structural pattern but may be less affected in practice because CHI spec changes are more common.

Suggested Fix

Consider one of:

  1. Always run PVC label/annotation reconciliation regardless of ActionPlan status (as a lightweight pass)
  2. Factor child resource drift (StatefulSet version label mismatches, PVC label/annotation mismatches) into the ActionPlan decision
  3. Add a separate reconciliation path for "metadata-only" updates (labels, annotations) that runs unconditionally
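
As a rough illustration of options 1 and 3, the sketch below shows what an unconditional metadata-only pass could look like. Everything in it is an assumption made for illustration: pvcMeta, applyDesired, and reconcilePVCMetadata are hypothetical names, the label keys are made up, and the merge policy (desired keys win, keys that exist only on the live object are left alone) is one possible choice rather than the operator's documented behavior.

```go
package main

import "fmt"

// pvcMeta is a hypothetical, trimmed-down view of a PVC's metadata.
type pvcMeta struct {
	Labels      map[string]string
	Annotations map[string]string
}

// applyDesired copies every desired key into a copy of current and reports
// whether anything changed. Keys present only in current are preserved.
func applyDesired(current, desired map[string]string) (map[string]string, bool) {
	out := make(map[string]string, len(current)+len(desired))
	for k, v := range current {
		out[k] = v
	}
	changed := false
	for k, v := range desired {
		if old, ok := out[k]; !ok || old != v {
			out[k] = v
			changed = true
		}
	}
	return out, changed
}

// reconcilePVCMetadata would run on every reconcile, independent of the
// ActionPlan. It returns true when the PVC needs an update call.
func reconcilePVCMetadata(cur *pvcMeta, desired pvcMeta) bool {
	labels, labelsChanged := applyDesired(cur.Labels, desired.Labels)
	annotations, annotationsChanged := applyDesired(cur.Annotations, desired.Annotations)
	cur.Labels, cur.Annotations = labels, annotations
	return labelsChanged || annotationsChanged
}

func main() {
	// Hypothetical label keys, for illustration only.
	cur := &pvcMeta{Labels: map[string]string{"example.com/team": "old", "example.com/other": "x"}}
	desired := pvcMeta{Labels: map[string]string{"example.com/team": "new"}}

	fmt.Println(reconcilePVCMetadata(cur, desired)) // first pass: update needed
	fmt.Println(reconcilePVCMetadata(cur, desired)) // second pass: already in sync
}
```

Because the pass only issues an update when something actually differs, running it on every reconcile should stay cheap once the PVCs are in sync.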

Environment

  • Operator version: 0.26.0
  • Kubernetes: AWS EKS
  • Resource type: ClickHouseKeeperInstallation

Workaround

Patching the taskID field on the CHK CR forces a full reconcile cycle:

kubectl patch chk <name> -n <namespace> --type=merge -p '{"spec":{"taskID":"force-reconcile-<timestamp>"}}'
