
[Backport v0.11] Fix replica escalation during rolling updates with local recommender#342

Merged
gh-worker-dd-mergequeue-cf854d[bot] merged 1 commit into v0.11 from backport-334-to-v0.11
Apr 9, 2026
Conversation


dd-octo-sts bot commented Apr 9, 2026

Backport 70e8425 from #334.


Summary

Fixes a feedback loop where the WPA escalates replica count on every reconciliation cycle during rolling updates when using a local recommender. The WPA goes from steady-state (e.g., 18 replicas) to maxReplicas (e.g., 40+) during routine deployments, even when the metric is stable and within watermark bounds.

Fixes #333

The Bug

During a rolling update with maxSurge >= 1, Kubernetes maintains Status.Replicas = Spec.Replicas + maxSurge. The WPA sends Status.Replicas as CurrentReplicas to the recommender, but adjustReplicaCount compares the recommender's response against Spec.Replicas. This creates a +1 escalation on every cycle:

1. Spec=18, Status=19 (surge pod)
2. WPA sends CurrentReplicas=19 to recommender
3. Metric is between watermarks → recommender returns "hold at 19"
4. adjustReplicaCount compares 19 (recommended) vs 18 (Spec) → upscale!
5. WPA sets Spec=19. K8s creates surge pod → Status=20
6. Next cycle: recommender says "hold at 20", 20 > 19 → upscale to 20
7. Repeat until maxReplicas
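
Assuming the metric stays between watermarks (so the recommender always answers "hold at CurrentReplicas"), the loop above can be sketched as a small simulation. This is a hypothetical model, not the actual controller code; `simulateLoop` and its parameters are illustrative names:

```go
package main

import "fmt"

// simulateLoop models the WPA reconciliation cycle during a rolling
// update with a persistent surge pod. sendSpec selects which value is
// sent to the recommender as CurrentReplicas: Spec.Replicas (the fix)
// or Status.Replicas (the bug). Returns the final Spec.Replicas.
func simulateLoop(spec, maxReplicas, maxSurge int, sendSpec bool) int {
	for i := 0; i < 100; i++ { // bounded cycles for the sketch
		status := spec + maxSurge // K8s keeps a surge pod during the rollout
		current := status
		if sendSpec {
			current = spec // the fix: send the intended replica count
		}
		recommended := current // "hold at CurrentReplicas" recommendation
		if recommended > spec && spec < maxReplicas {
			spec = recommended // adjustReplicaCount sees a false upscale
		} else {
			break // steady state: recommendation matches the baseline
		}
	}
	return spec
}

func main() {
	fmt.Println("buggy:", simulateLoop(18, 40, 1, false)) // escalates to 40
	fmt.Println("fixed:", simulateLoop(18, 40, 1, true))  // holds at 18
}
```

With `sendSpec=false` the +1 escalation repeats until maxReplicas; with `sendSpec=true` the recommendation equals the baseline on the first cycle and the loop exits immediately.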

Observed Production Impact (authenticator, us1.prod.dog, 2026-03-16)

  • Replicas escalated 18 → 40 in ~5 minutes during a routine deploy
  • Old-version pods scaled 18 → 50 via K8s proportional scaling — all throwaway
  • CPU was stable at ~15% (well within 10%–30% watermarks)
  • ~320 CPU cores wasted for 1+ hour (downscale delay)


The Fix

Send Spec.Replicas instead of Status.Replicas as CurrentReplicas to the recommender (replica_calculator.go:283).

-CurrentReplicas: target.Status.Replicas,
+CurrentReplicas: target.Spec.Replicas,

Status.Replicas includes transient surge pods that Kubernetes creates during rolling updates. Spec.Replicas is the intended replica count — and it is already what adjustReplicaCount uses as its baseline (line 319). Sending the same value to the recommender breaks the feedback loop.

When there is no rolling update in progress, Spec.Replicas == Status.Replicas, so this change has no effect in steady state. The recommender still receives the actual running pod count via CurrentReadyReplicas if it needs it.
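
A minimal sketch of the fixed input selection, using stand-in types (the field names CurrentReplicas/CurrentReadyReplicas come from the PR; the surrounding struct and function are assumptions, not the real controller code):

```go
package main

import "fmt"

// target mirrors the replica fields involved; in the real controller
// these come from the scaled workload's spec and status.
type target struct {
	SpecReplicas   int32 // intended count, the adjustReplicaCount baseline
	StatusReplicas int32 // includes transient surge pods during rollouts
	ReadyReplicas  int32 // pods actually ready
}

// recommenderInput returns the values sent to the local recommender.
// After the fix, CurrentReplicas is Spec.Replicas rather than
// Status.Replicas; ready pods are still reported separately.
func recommenderInput(t target) (currentReplicas, currentReadyReplicas int32) {
	return t.SpecReplicas, t.ReadyReplicas
}

func main() {
	// Mid-rollout with one surge pod: Spec=18, Status=19.
	cur, ready := recommenderInput(target{SpecReplicas: 18, StatusReplicas: 19, ReadyReplicas: 18})
	fmt.Println(cur, ready) // 18 18: matches the baseline, no false upscale

	// Steady state: Spec == Status, so the change is a no-op.
	cur, ready = recommenderInput(target{SpecReplicas: 18, StatusReplicas: 18, ReadyReplicas: 18})
	fmt.Println(cur, ready)
}
```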

Why not a post-hoc clamp?

An earlier approach added a clamp after adjustReplicaCount to suppress the false-upscale result. That approach fixed the symptom but left the recommender receiving stale/inflated input during every rollout. Sending Spec.Replicas is the correct fix at the source: the recommender gets accurate input and returns an accurate recommendation.


Fix replica escalation during rolling updates with local recommender

During rolling updates with maxSurge >= 1, the WPA enters a +1 feedback
loop that escalates replicas on every reconciliation cycle:

1. Status.Replicas = Spec.Replicas + 1 (surge pod)
2. WPA sends Status.Replicas as CurrentReplicas to the recommender
3. Recommender returns "hold at CurrentReplicas" (metric between watermarks)
4. WPA compares response against Spec.Replicas, sees an upscale
5. Sets Spec.Replicas = Status.Replicas, K8s creates new surge pod, loop repeats

Fix: Send Spec.Replicas (the intended replica count) instead of
Status.Replicas (which includes transient surge pods) as CurrentReplicas
to the recommender. This aligns the recommender's input with what
adjustReplicaCount uses as its baseline (target.Spec.Replicas at line 319),
breaking the feedback loop.

When there is no rolling update, Spec.Replicas == Status.Replicas so
this change has no effect. The recommender still receives ReadyReplicas
via CurrentReadyReplicas if it needs the actual running pod count.

Fixes: #333

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-authored-by: steven.blumenthal <steven.blumenthal@datadoghq.com>
(cherry picked from commit 70e8425)
@dd-octo-sts dd-octo-sts bot requested a review from a team as a code owner April 9, 2026 15:26
