Description
What steps did you take and what happened?
If a MachineSet is removed from the cluster (via force-deletion or while paused), its remaining child Machines may become stuck. If any of these Machines then receives a delete request, the Machine controller fails to process it because it requires the owner MachineSet to exist before initiating reconciliation, even for the deletion path.
In this case the Machines remain in the cluster indefinitely with a non-nil DeletionTimestamp.
This is a change in behavior (regression?) introduced with the refactor that moved the UpToDate condition to the Machine controller (PR #12959), which increased the Machine controller’s reliance on the owner MachineSet. The underlying issue is that the controller does not treat "owner MachineSet not found" as a case it can handle (e.g. during deletion), so deletion is blocked whenever the parent is already gone.
Steps to reproduce
- Create a Cluster and scale a MachineSet to 1 so that a Machine exists.
- Remove the MachineSet from the cluster, for example by:
  - Pausing the MachineSet and then deleting it, or
  - Force-deleting it: kubectl delete machineset --force --grace-period=0
- Trigger deletion of the Machine (e.g. kubectl delete machine or let the controller try to delete children).
- The Machine gets a DeletionTimestamp but is never removed from the cluster, because its finalizers are never removed.
What did you expect to happen?
When a Machine has a DeletionTimestamp set, the Machine controller should run the deletion flow (drain, delete node, delete infra/bootstrap refs, remove finalizers) so the Machine is removed from the cluster. If the parent MachineSet was already removed (e.g. force-deleted), the controller should treat the Machine as effectively orphaned and still complete its deletion, rather than failing indefinitely because the owner MachineSet can no longer be found.
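The expected behavior could be sketched roughly as follows. This is a simplified, self-contained model, not the actual controller code: the types machine, machineSet, and store, the sentinel errNotFound, and the helpers ownerOrNil and reconcile are all hypothetical stand-ins. In the real controller the check would presumably use apierrors.IsNotFound from k8s.io/apimachinery/pkg/api/errors on the error returned by the client.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for a Kubernetes NotFound API error (assumption for this sketch).
var errNotFound = errors.New("not found")

// machineSet and machine are simplified, hypothetical stand-ins for the real API types.
type machineSet struct{ name string }

type machine struct {
	name       string
	ownerName  string   // name of the owning MachineSet, empty if there is none
	deleting   bool     // models a non-nil DeletionTimestamp
	finalizers []string // models metadata.finalizers
}

// store is a hypothetical in-memory replacement for the API client.
type store map[string]*machineSet

func (s store) get(name string) (*machineSet, error) {
	ms, ok := s[name]
	if !ok {
		return nil, fmt.Errorf("machineset %q: %w", name, errNotFound)
	}
	return ms, nil
}

// ownerOrNil returns the owner MachineSet, or nil when the Machine has no
// owner or the owner no longer exists. Any other error is still returned.
func ownerOrNil(s store, m *machine) (*machineSet, error) {
	if m.ownerName == "" {
		return nil, nil
	}
	ms, err := s.get(m.ownerName)
	if errors.Is(err, errNotFound) {
		return nil, nil // treat the Machine as effectively orphaned
	}
	return ms, err
}

// reconcile reaches the deletion path even when the owner is missing.
func reconcile(s store, m *machine) error {
	owner, err := ownerOrNil(s, m)
	if err != nil {
		return fmt.Errorf("failed to retrieve owner MachineSet: %w", err)
	}
	if m.deleting {
		// Drain, node deletion, and infra/bootstrap cleanup are omitted here;
		// the sketch only models the final step of removing finalizers.
		m.finalizers = nil
		return nil
	}
	if owner == nil {
		// Orphaned but not being deleted: nothing owner-dependent to do in this sketch.
		return nil
	}
	// Owner-dependent work would go here.
	return nil
}

func main() {
	m := &machine{name: "m-1", ownerName: "ms-1", deleting: true,
		finalizers: []string{"example.com/finalizer"}}
	err := reconcile(store{}, m) // the owner "ms-1" does not exist in the store
	fmt.Println(err, len(m.finalizers))
}
```

With an empty store, the lookup for ms-1 yields a not-found error, which ownerOrNil converts into a nil owner, so the deletion branch still runs and clears the finalizers.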
Cluster API version
v1.12.0+
Kubernetes version
v1.34.0
Anything else you would like to add?
If a MachineSet is removed from the cluster (via force-deletion or while paused), any remaining child Machines may become stuck. If these Machines receive a subsequent delete request, the controller fails to process them because it requires the owner MachineSet to exist before initiating reconciliation, even for the deletion path.
In this case the Machines remain in the cluster indefinitely with a non-nil DeletionTimestamp.
The Machine controller builds a scope at the start of every reconciliation and always loads the owner MachineSet when the Machine has an OwnerReference to one. In internal/controllers/machine/machine_controller.go, around lines:
- 237–241: s.owningMachineSet, err = r.getOwnerMachineSet(ctx, s.machine); if err != nil, the function returns and no further reconciliation runs.
- 292–302: getOwnerMachineSet uses r.Client.Get(ctx, *machineSetKey, ms). If the MachineSet is gone (e.g. force-deleted), Get() returns NotFound and that error is returned without special handling.
- 272–279: The actual deletion logic (reconcileDelete) is only run after the scope is built. So when loading the owner MachineSet fails, the controller never reaches the deletion path.
So when the parent MachineSet no longer exists, the controller fails early on "failed to retrieve owner MachineSet" and never runs the logic that would remove finalizers and complete Machine deletion. The Machine is left with a DeletionTimestamp and never gets deleted.
Label(s) to be applied
/kind bug
/area machine
/area machineset