
Fix replica escalation during rolling updates with local recommender #334

Merged
gh-worker-dd-mergequeue-cf854d[bot] merged 1 commit into DataDog:main from
piyushjindal-dd:piyush.jindal/fix-rolling-update-replica-escalation on Apr 9, 2026

Conversation


@piyushjindal-dd piyushjindal-dd commented Mar 21, 2026

Summary

Fixes a feedback loop where the WPA escalates replica count on every reconciliation cycle during rolling updates when using a local recommender. The WPA goes from steady-state (e.g., 18 replicas) to maxReplicas (e.g., 40+) during routine deployments, even when the metric is stable and within watermark bounds.

Fixes #333

The Bug

During a rolling update with maxSurge >= 1, Kubernetes maintains Status.Replicas = Spec.Replicas + maxSurge. The WPA sends Status.Replicas as CurrentReplicas to the recommender, but adjustReplicaCount compares the recommender's response against Spec.Replicas. This creates a +1 escalation on every cycle:

1. Spec=18, Status=19 (surge pod)
2. WPA sends CurrentReplicas=19 to recommender
3. Metric is between watermarks → recommender returns "hold at 19"
4. adjustReplicaCount compares 19 (recommended) vs 18 (Spec) → upscale!
5. WPA sets Spec=19. K8s creates surge pod → Status=20
6. Next cycle: recommender says "hold at 20", 20 > 19 → upscale to 20
7. Repeat until maxReplicas
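
The cycle above can be sketched as a tiny simulation (illustrative names only, not the actual WPA code):

```go
package main

import "fmt"

// simulateBuggyLoop models the escalation described above: the recommender
// is fed Status.Replicas (Spec + surge) and answers "hold at CurrentReplicas",
// but the answer is then compared against Spec.Replicas, so every
// reconciliation cycle looks like a +1 upscale. Names are illustrative.
func simulateBuggyLoop(spec, maxSurge, maxReplicas int) int {
	for spec < maxReplicas {
		status := spec + maxSurge // rolling update keeps surge pods running
		recommended := status     // recommender: metric in bounds, hold here
		if recommended <= spec {  // a genuine hold would stop the loop...
			break
		}
		spec = recommended // ...but Spec ratchets up by maxSurge instead
	}
	return spec
}

func main() {
	fmt.Println(simulateBuggyLoop(18, 1, 40)) // ratchets all the way to maxReplicas: 40
}
```

With `maxSurge = 0` the loop exits immediately, which matches the observation that the bug only manifests during surge-based rollouts.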

Observed Production Impact (authenticator, us1.prod.dog, 2026-03-16)

  • Replicas escalated 18 → 40 in ~5 minutes during a routine deploy
  • Old-version pods scaled 18 → 50 via K8s proportional scaling — all throwaway
  • CPU was stable at ~15% (well within 10%–30% watermarks)
  • ~320 CPU cores wasted for 1+ hour (downscale delay)

Dashboards:

The Fix

Send Spec.Replicas instead of Status.Replicas as CurrentReplicas to the recommender (replica_calculator.go:283).

```diff
-CurrentReplicas: target.Status.Replicas,
+CurrentReplicas: target.Spec.Replicas,
```

Status.Replicas includes transient surge pods that Kubernetes creates during rolling updates. Spec.Replicas is the intended replica count — and it is already what adjustReplicaCount uses as its baseline (line 319). Sending the same value to the recommender breaks the feedback loop.

When there is no rolling update in progress, Spec.Replicas == Status.Replicas, so this change has no effect in steady state. The recommender still receives the actual running pod count via CurrentReadyReplicas if it needs it.
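
A minimal sketch of why this breaks the loop (again with illustrative names, not the actual WPA code): once `Spec.Replicas` is both the recommender's input and the comparison baseline, a "hold" recommendation is a true no-op, even while surge pods keep `Status.Replicas` inflated.

```go
package main

import "fmt"

// simulateFixedLoop models reconciliation after the fix: CurrentReplicas
// sent to the recommender is Spec.Replicas, the same baseline the response
// is compared against, so "hold" never reads as an upscale. Illustrative.
func simulateFixedLoop(spec, maxSurge, cycles int) int {
	for i := 0; i < cycles; i++ {
		status := spec + maxSurge // Status still includes the surge pod...
		_ = status                // ...but it is no longer the recommender input
		recommended := spec       // recommender: "hold at CurrentReplicas" (= Spec)
		if recommended > spec {   // never true for a hold
			spec = recommended
		}
	}
	return spec
}

func main() {
	fmt.Println(simulateFixedLoop(18, 1, 100)) // holds at 18 across the whole rollout
}
```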

Why not a post-hoc clamp?

An earlier approach added a clamp after adjustReplicaCount to suppress the false-upscale result. That approach fixed the symptom but left the recommender receiving stale/inflated input during every rollout. Sending Spec.Replicas is the correct fix at the source: the recommender gets accurate input and returns an accurate recommendation.
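
For comparison, the rejected clamp might have looked roughly like this (a hypothetical function reconstructed from the description above, not the actual earlier patch):

```go
package main

import "fmt"

// clampDuringRollout sketches the rejected post-hoc approach: after
// adjustReplicaCount runs, suppress an "upscale" that merely chases surge
// pods. It hides the symptom, but the recommender would still have received
// the inflated Status.Replicas as input. Hypothetical code.
func clampDuringRollout(proposed, spec, status int32) int32 {
	rollingUpdate := status > spec
	if rollingUpdate && proposed > spec && proposed <= status {
		return spec // drop the false upscale caused by surge pods
	}
	return proposed // genuine recommendations pass through unchanged
}

func main() {
	fmt.Println(clampDuringRollout(19, 18, 19)) // false upscale clamped back to 18
	fmt.Println(clampDuringRollout(25, 18, 19)) // genuine upscale preserved: 25
}
```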

During rolling updates with maxSurge >= 1, the WPA enters a +1 feedback
loop that escalates replicas on every reconciliation cycle:

1. Status.Replicas = Spec.Replicas + 1 (surge pod)
2. WPA sends Status.Replicas as CurrentReplicas to the recommender
3. Recommender returns "hold at CurrentReplicas" (metric between watermarks)
4. WPA compares response against Spec.Replicas, sees an upscale
5. Sets Spec.Replicas = Status.Replicas, K8s creates new surge pod, loop repeats

Fix: Send Spec.Replicas (the intended replica count) instead of
Status.Replicas (which includes transient surge pods) as CurrentReplicas
to the recommender. This aligns the recommender's input with what
adjustReplicaCount uses as its baseline (target.Spec.Replicas at line 319),
breaking the feedback loop.

When there is no rolling update, Spec.Replicas == Status.Replicas so
this change has no effect. The recommender still receives ReadyReplicas
via CurrentReadyReplicas if it needs the actual running pod count.

Fixes: DataDog#333

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@piyushjindal-dd
Contributor Author

@codex review

@chatgpt-codex-connector

To use Codex here, create a Codex account and connect it to GitHub.

```go
value.With(labelsWithMetricName).Set(float64(utilizationQuantity))

replicaCount, metricPos := adjustReplicaCount(logger, target.Spec.Replicas, currentReadyReplicas, wpa, int32(reco.Replicas), int32(reco.ReplicasLowerBound), int32(reco.ReplicasUpperBound))
recommendedReplicas := int32(reco.Replicas)
```
Contributor Author
Another, simpler option: but this would be a breaking change, since the contract with the clients would change. Not sure whether that is okay; if it is, we can do that instead.

Member
As mentioned on Slack, I can't find any reference to this contract in the docs or the git history... I think this was more likely an oversight, possibly because the original change, introduced over two years ago, wasn't tested with maxSurge configured.

I think this approach is cleaner, so we should proceed with this one.

Contributor Author
Thanks, reverted the PR to the simpler approach.

@piyushjindal-dd piyushjindal-dd marked this pull request as ready for review March 21, 2026 19:32
@piyushjindal-dd piyushjindal-dd requested a review from a team as a code owner March 21, 2026 19:32
@piyushjindal-dd piyushjindal-dd marked this pull request as draft March 23, 2026 14:53
@piyushjindal-dd
Copy link
Copy Markdown
Contributor Author

@clamoriniere @sblumenthal @vboulineau since you are listed as maintainers for the WPA, could I get feedback from you on this? Our deployments are currently a bit messy because of this issue.

@piyushjindal-dd piyushjindal-dd marked this pull request as ready for review March 26, 2026 15:14
@sblumenthal sblumenthal self-assigned this Apr 1, 2026
@piyushjindal-dd piyushjindal-dd force-pushed the piyush.jindal/fix-rolling-update-replica-escalation branch from 41113fa to a77d70d Compare April 8, 2026 14:37
@sblumenthal sblumenthal added the bug Something isn't working label Apr 9, 2026
@sblumenthal sblumenthal added this to the v0.11.0 milestone Apr 9, 2026
@sblumenthal sblumenthal modified the milestone: v0.11.0 Apr 9, 2026
@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d bot merged commit 70e8425 into DataDog:main Apr 9, 2026
26 of 32 checks passed
dd-octo-sts bot pushed a commit that referenced this pull request Apr 9, 2026
…334)

Fix replica escalation during rolling updates with local recommender


Co-authored-by: steven.blumenthal <[email protected]>
(cherry picked from commit 70e8425)
gh-worker-dd-mergequeue-cf854d bot pushed a commit that referenced this pull request Apr 9, 2026
…334) (#342)

Fix replica escalation during rolling updates with local recommender (#334)


Co-authored-by: steven.blumenthal <[email protected]>
(cherry picked from commit 70e8425)

Co-authored-by: piyushjindal-dd <[email protected]>
Co-authored-by: steven.blumenthal <[email protected]>

Successfully merging this pull request may close these issues.

WPA escalates replicas during rolling updates due to Status.Replicas vs Spec.Replicas mismatch
