
Fix replica escalation during rolling updates with local recommender #334

Merged
gh-worker-dd-mergequeue-cf854d[bot] merged 1 commit into DataDog:main from
piyushjindal-dd:piyush.jindal/fix-rolling-update-replica-escalation on Apr 9, 2026

Conversation


@piyushjindal-dd piyushjindal-dd commented Mar 21, 2026

Summary

Fixes a feedback loop where the WPA escalates replica count on every reconciliation cycle during rolling updates when using a local recommender. The WPA goes from steady-state (e.g., 18 replicas) to maxReplicas (e.g., 40+) during routine deployments, even when the metric is stable and within watermark bounds.

Fixes #333

The Bug

During a rolling update with maxSurge >= 1, Kubernetes maintains Status.Replicas = Spec.Replicas + maxSurge. The WPA sends Status.Replicas as CurrentReplicas to the recommender, but adjustReplicaCount compares the recommender's response against Spec.Replicas. This creates a +1 escalation on every cycle:

1. Spec=18, Status=19 (surge pod)
2. WPA sends CurrentReplicas=19 to recommender
3. Metric is between watermarks → recommender returns "hold at 19"
4. adjustReplicaCount compares 19 (recommended) vs 18 (Spec) → upscale!
5. WPA sets Spec=19. K8s creates surge pod → Status=20
6. Next cycle: recommender says "hold at 20", 20 > 19 → upscale to 20
7. Repeat until maxReplicas
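
The cycle above can be sketched as a tiny simulation (illustrative names only, not the actual WPA code):

```go
package main

import "fmt"

// simulateBuggyLoop models the escalation described above: the recommender
// is fed Status.Replicas (Spec + surge) and answers "hold at CurrentReplicas",
// but the answer is then compared against Spec.Replicas, so every
// reconciliation cycle looks like a +1 upscale. Names are illustrative.
func simulateBuggyLoop(spec, maxSurge, maxReplicas int) int {
	for spec < maxReplicas {
		status := spec + maxSurge // rolling update keeps surge pods running
		recommended := status     // recommender: metric in bounds, hold here
		if recommended <= spec {  // a genuine hold would stop the loop...
			break
		}
		spec = recommended // ...but Spec ratchets up by maxSurge instead
	}
	return spec
}

func main() {
	fmt.Println(simulateBuggyLoop(18, 1, 40)) // ratchets all the way to maxReplicas: 40
}
```

With `maxSurge = 0` the loop exits immediately, which matches the observation that the bug only manifests during surge-based rollouts.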

Observed Production Impact (authenticator, us1.prod.dog, 2026-03-16)

  • Replicas escalated 18 → 40 in ~5 minutes during a routine deploy
  • Old-version pods scaled 18 → 50 via K8s proportional scaling — all throwaway
  • CPU was stable at ~15% (well within 10%–30% watermarks)
  • ~320 CPU cores wasted for 1+ hour (downscale delay)

Dashboards:

The Fix

Send Spec.Replicas instead of Status.Replicas as CurrentReplicas to the recommender (replica_calculator.go:283).

```diff
-CurrentReplicas: target.Status.Replicas,
+CurrentReplicas: target.Spec.Replicas,
```

Status.Replicas includes transient surge pods that Kubernetes creates during rolling updates. Spec.Replicas is the intended replica count — and it is already what adjustReplicaCount uses as its baseline (line 319). Sending the same value to the recommender breaks the feedback loop.

When there is no rolling update in progress, Spec.Replicas == Status.Replicas, so this change has no effect in steady state. The recommender still receives the actual running pod count via CurrentReadyReplicas if it needs it.
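
A minimal sketch of why this breaks the loop (again with illustrative names, not the actual WPA code): once `Spec.Replicas` is both the recommender's input and the comparison baseline, a "hold" recommendation is a true no-op, even while surge pods keep `Status.Replicas` inflated.

```go
package main

import "fmt"

// simulateFixedLoop models reconciliation after the fix: CurrentReplicas
// sent to the recommender is Spec.Replicas, the same baseline the response
// is compared against, so "hold" never reads as an upscale. Illustrative.
func simulateFixedLoop(spec, maxSurge, cycles int) int {
	for i := 0; i < cycles; i++ {
		status := spec + maxSurge // Status still includes the surge pod...
		_ = status                // ...but it is no longer the recommender input
		recommended := spec       // recommender: "hold at CurrentReplicas" (= Spec)
		if recommended > spec {   // never true for a hold
			spec = recommended
		}
	}
	return spec
}

func main() {
	fmt.Println(simulateFixedLoop(18, 1, 100)) // holds at 18 across the whole rollout
}
```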

Why not a post-hoc clamp?

An earlier approach added a clamp after adjustReplicaCount to suppress the false-upscale result. That approach fixed the symptom but left the recommender receiving stale/inflated input during every rollout. Sending Spec.Replicas is the correct fix at the source: the recommender gets accurate input and returns an accurate recommendation.
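
For comparison, the rejected clamp might have looked roughly like this (a hypothetical function reconstructed from the description above, not the actual earlier patch):

```go
package main

import "fmt"

// clampDuringRollout sketches the rejected post-hoc approach: after
// adjustReplicaCount runs, suppress an "upscale" that merely chases surge
// pods. It hides the symptom, but the recommender would still have received
// the inflated Status.Replicas as input. Hypothetical code.
func clampDuringRollout(proposed, spec, status int32) int32 {
	rollingUpdate := status > spec
	if rollingUpdate && proposed > spec && proposed <= status {
		return spec // drop the false upscale caused by surge pods
	}
	return proposed // genuine recommendations pass through unchanged
}

func main() {
	fmt.Println(clampDuringRollout(19, 18, 19)) // false upscale clamped back to 18
	fmt.Println(clampDuringRollout(25, 18, 19)) // genuine upscale preserved: 25
}
```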

During rolling updates with maxSurge >= 1, the WPA enters a +1 feedback
loop that escalates replicas on every reconciliation cycle:

1. Status.Replicas = Spec.Replicas + 1 (surge pod)
2. WPA sends Status.Replicas as CurrentReplicas to the recommender
3. Recommender returns "hold at CurrentReplicas" (metric between watermarks)
4. WPA compares response against Spec.Replicas, sees an upscale
5. Sets Spec.Replicas = Status.Replicas, K8s creates new surge pod, loop repeats

Fix: Send Spec.Replicas (the intended replica count) instead of
Status.Replicas (which includes transient surge pods) as CurrentReplicas
to the recommender. This aligns the recommender's input with what
adjustReplicaCount uses as its baseline (target.Spec.Replicas at line 319),
breaking the feedback loop.

When there is no rolling update, Spec.Replicas == Status.Replicas so
this change has no effect. The recommender still receives ReadyReplicas
via CurrentReadyReplicas if it needs the actual running pod count.

Fixes: DataDog#333

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@piyushjindal-dd
Contributor Author

@codex review

@chatgpt-codex-connector

To use Codex here, create a Codex account and connect it to GitHub.

```go
value.With(labelsWithMetricName).Set(float64(utilizationQuantity))

replicaCount, metricPos := adjustReplicaCount(logger, target.Spec.Replicas, currentReadyReplicas, wpa, int32(reco.Replicas), int32(reco.ReplicasLowerBound), int32(reco.ReplicasUpperBound))
recommendedReplicas := int32(reco.Replicas)
```
Contributor Author
Another, simpler option: but this would be a breaking change, since the contract with the clients would change. Not sure whether that is okay; if it is, we can do that instead.

Member
As mentioned on Slack, I can't find any reference to this contract in the docs or the git history... I think this was more likely an oversight, possibly because the original change, introduced over two years ago, wasn't tested with maxSurge configured.

I think this approach is cleaner, so we should proceed with this one.

Contributor Author
Thanks, reverted the PR to the simpler approach.

@piyushjindal-dd piyushjindal-dd marked this pull request as ready for review March 21, 2026 19:32
@piyushjindal-dd piyushjindal-dd requested a review from a team as a code owner March 21, 2026 19:32
@piyushjindal-dd piyushjindal-dd marked this pull request as draft March 23, 2026 14:53
@piyushjindal-dd
Copy link
Copy Markdown
Contributor Author

@clamoriniere @sblumenthal @vboulineau since you are listed as maintainers for the WPA, could I get feedback from you on this? Our deployments are currently a bit messy because of this issue.

@piyushjindal-dd piyushjindal-dd marked this pull request as ready for review March 26, 2026 15:14
@sblumenthal sblumenthal self-assigned this Apr 1, 2026
@piyushjindal-dd piyushjindal-dd force-pushed the piyush.jindal/fix-rolling-update-replica-escalation branch from 41113fa to a77d70d Compare April 8, 2026 14:37
@sblumenthal sblumenthal added the bug Something isn't working label Apr 9, 2026
@sblumenthal sblumenthal added this to the v0.11.0 milestone Apr 9, 2026
@sblumenthal sblumenthal modified the milestone: v0.11.0 Apr 9, 2026
@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d bot merged commit 70e8425 into DataDog:main Apr 9, 2026
26 of 32 checks passed
dd-octo-sts bot pushed a commit that referenced this pull request Apr 9, 2026
…334)

Fix replica escalation during rolling updates with local recommender


Co-authored-by: steven.blumenthal <[email protected]>
(cherry picked from commit 70e8425)
gh-worker-dd-mergequeue-cf854d bot pushed a commit that referenced this pull request Apr 9, 2026
…334) (#342)

Fix replica escalation during rolling updates with local recommender (#334)


Co-authored-by: steven.blumenthal <[email protected]>
(cherry picked from commit 70e8425)

Co-authored-by: piyushjindal-dd <[email protected]>
Co-authored-by: steven.blumenthal <[email protected]>

Successfully merging this pull request may close these issues.

WPA escalates replicas during rolling updates due to Status.Replicas vs Spec.Replicas mismatch
