[Backport v0.11] Fix replica escalation during rolling updates with local recommender#342
Merged
gh-worker-dd-mergequeue-cf854d[bot] merged 1 commit into v0.11 on Apr 9, 2026
Conversation
Fix replica escalation during rolling updates with local recommender (#334)

During rolling updates with maxSurge >= 1, the WPA enters a +1 feedback loop that escalates replicas on every reconciliation cycle:

1. Status.Replicas = Spec.Replicas + 1 (surge pod)
2. WPA sends Status.Replicas as CurrentReplicas to the recommender
3. Recommender returns "hold at CurrentReplicas" (metric between watermarks)
4. WPA compares the response against Spec.Replicas, sees an upscale
5. Sets Spec.Replicas = Status.Replicas, K8s creates a new surge pod, loop repeats

Fix: Send Spec.Replicas (the intended replica count) instead of Status.Replicas (which includes transient surge pods) as CurrentReplicas to the recommender. This aligns the recommender's input with what adjustReplicaCount uses as its baseline (target.Spec.Replicas at line 319), breaking the feedback loop.

When there is no rolling update, Spec.Replicas == Status.Replicas, so this change has no effect. The recommender still receives ReadyReplicas via CurrentReadyReplicas if it needs the actual running pod count.

Fixes: #333

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: steven.blumenthal <steven.blumenthal@datadoghq.com>

(cherry picked from commit 70e8425)
sblumenthal
approved these changes
Apr 9, 2026
merged commit 9311548 into v0.11
45 of 61 checks passed
Backport 70e8425 from #334.
Summary
Fixes a feedback loop where the WPA escalates the replica count on every reconciliation cycle during rolling updates when using a local recommender. The WPA goes from steady state (e.g., 18 replicas) to `maxReplicas` (e.g., 40+) during routine deployments, even when the metric is stable and within watermark bounds.

Fixes #333
The Bug
During a rolling update with `maxSurge >= 1`, Kubernetes maintains `Status.Replicas = Spec.Replicas + maxSurge`. The WPA sends `Status.Replicas` as `CurrentReplicas` to the recommender, but `adjustReplicaCount` compares the recommender's response against `Spec.Replicas`. This creates a +1 escalation on every cycle.

Observed Production Impact (authenticator, us1.prod.dog, 2026-03-16)
Dashboards:
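The +1 escalation described above can be illustrated with a minimal simulation (a hypothetical, simplified model of the recommender interaction, not the actual controller code):

```go
package main

import "fmt"

// recommend models a local recommender that holds at the current
// replica count when the metric is between watermarks (hypothetical
// simplification of the real recommender behavior).
func recommend(currentReplicas int32) int32 {
	return currentReplicas // "hold": metric is within bounds
}

func main() {
	const maxSurge = 1
	spec := int32(18) // Spec.Replicas: the intended replica count

	for cycle := 1; cycle <= 5; cycle++ {
		// During a rolling update, Kubernetes reports the surge pod:
		status := spec + maxSurge // Status.Replicas

		// Buggy behavior: send Status.Replicas as CurrentReplicas.
		desired := recommend(status)

		// adjustReplicaCount compares against Spec.Replicas and
		// interprets desired > spec as a (false) upscale.
		if desired > spec {
			spec = desired
		}
		fmt.Printf("cycle %d: Spec.Replicas=%d\n", cycle, spec)
	}
	// Spec.Replicas climbs by one every cycle: 19, 20, 21, 22, 23.
}
```

Even with a perfectly stable metric, the simulated `Spec.Replicas` ratchets upward on every cycle, matching the escalation seen in production.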
The Fix
Send `Spec.Replicas` instead of `Status.Replicas` as `CurrentReplicas` to the recommender (`replica_calculator.go:283`). `Status.Replicas` includes transient surge pods that Kubernetes creates during rolling updates. `Spec.Replicas` is the intended replica count, and it is already what `adjustReplicaCount` uses as its baseline (line 319). Sending the same value to the recommender breaks the feedback loop.

When there is no rolling update in progress, `Spec.Replicas == Status.Replicas`, so this change has no effect in steady state. The recommender still receives the actual running pod count via `CurrentReadyReplicas` if it needs it.

Why not a post-hoc clamp?
An earlier approach added a clamp after `adjustReplicaCount` to suppress the false-upscale result. That approach fixed the symptom but left the recommender receiving stale, inflated input during every rollout. Sending `Spec.Replicas` is the correct fix at the source: the recommender gets accurate input and returns an accurate recommendation.
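The shape of the fix can be sketched as follows. The struct and function names here are hypothetical stand-ins, not the WPA's actual types; the real change is a one-line swap in `replica_calculator.go`:

```go
package main

import "fmt"

// deploymentState is a hypothetical stand-in for the fields the WPA
// reads from the target Deployment.
type deploymentState struct {
	SpecReplicas   int32 // Spec.Replicas: the intended replica count
	StatusReplicas int32 // Status.Replicas: includes transient surge pods
	ReadyReplicas  int32 // Status.ReadyReplicas: pods actually ready
}

// recommendationRequest is a hypothetical stand-in for the payload
// sent to the local recommender.
type recommendationRequest struct {
	CurrentReplicas      int32
	CurrentReadyReplicas int32
}

func buildRequest(d deploymentState) recommendationRequest {
	return recommendationRequest{
		// Before the fix this was d.StatusReplicas, which is inflated
		// by maxSurge during a rollout and feeds the +1 loop. After
		// the fix we send the intended count, matching the baseline
		// that adjustReplicaCount compares against.
		CurrentReplicas: d.SpecReplicas,
		// The recommender still sees the actual running pod count:
		CurrentReadyReplicas: d.ReadyReplicas,
	}
}

func main() {
	// Mid-rollout: one surge pod, so Status.Replicas = 19 while
	// Spec.Replicas = 18.
	req := buildRequest(deploymentState{
		SpecReplicas:   18,
		StatusReplicas: 19,
		ReadyReplicas:  18,
	})
	fmt.Printf("CurrentReplicas=%d CurrentReadyReplicas=%d\n",
		req.CurrentReplicas, req.CurrentReadyReplicas)
	// CurrentReplicas is 18, not 19: the recommender's "hold" answer
	// now agrees with Spec.Replicas, so no false upscale occurs.
}
```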