
[Backport v0.11] Fix replica escalation during rolling updates with local recommender#342

Merged
gh-worker-dd-mergequeue-cf854d[bot] merged 1 commit into v0.11 from backport-334-to-v0.11
Apr 9, 2026
Conversation


dd-octo-sts bot commented Apr 9, 2026

Backport 70e8425 from #334.


Summary

Fixes a feedback loop where the WPA escalates replica count on every reconciliation cycle during rolling updates when using a local recommender. The WPA goes from steady-state (e.g., 18 replicas) to maxReplicas (e.g., 40+) during routine deployments, even when the metric is stable and within watermark bounds.

Fixes #333

The Bug

During a rolling update with maxSurge >= 1, Kubernetes maintains Status.Replicas = Spec.Replicas + maxSurge. The WPA sends Status.Replicas as CurrentReplicas to the recommender, but adjustReplicaCount compares the recommender's response against Spec.Replicas. This creates a +1 escalation on every cycle:

1. Spec=18, Status=19 (surge pod)
2. WPA sends CurrentReplicas=19 to recommender
3. Metric is between watermarks → recommender returns "hold at 19"
4. adjustReplicaCount compares 19 (recommended) vs 18 (Spec) → upscale!
5. WPA sets Spec=19. K8s creates surge pod → Status=20
6. Next cycle: recommender says "hold at 20", 20 > 19 → upscale to 20
7. Repeat until maxReplicas
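
Assuming the metric stays between watermarks (so the recommender always answers "hold at CurrentReplicas"), the loop above can be sketched as a small simulation. This is a hypothetical model, not the actual controller code; `simulateLoop` and its parameters are illustrative names:

```go
package main

import "fmt"

// simulateLoop models the WPA reconciliation cycle during a rolling
// update with a persistent surge pod. sendSpec selects which value is
// sent to the recommender as CurrentReplicas: Spec.Replicas (the fix)
// or Status.Replicas (the bug). Returns the final Spec.Replicas.
func simulateLoop(spec, maxReplicas, maxSurge int, sendSpec bool) int {
	for i := 0; i < 100; i++ { // bounded cycles for the sketch
		status := spec + maxSurge // K8s keeps a surge pod during the rollout
		current := status
		if sendSpec {
			current = spec // the fix: send the intended replica count
		}
		recommended := current // "hold at CurrentReplicas" recommendation
		if recommended > spec && spec < maxReplicas {
			spec = recommended // adjustReplicaCount sees a false upscale
		} else {
			break // steady state: recommendation matches the baseline
		}
	}
	return spec
}

func main() {
	fmt.Println("buggy:", simulateLoop(18, 40, 1, false)) // escalates to 40
	fmt.Println("fixed:", simulateLoop(18, 40, 1, true))  // holds at 18
}
```

With `sendSpec=false` the +1 escalation repeats until maxReplicas; with `sendSpec=true` the recommendation equals the baseline on the first cycle and the loop exits immediately.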

Observed Production Impact (authenticator, us1.prod.dog, 2026-03-16)

  • Replicas escalated 18 → 40 in ~5 minutes during a routine deploy
  • Old-version pods scaled 18 → 50 via K8s proportional scaling — all throwaway
  • CPU was stable at ~15% (well within 10%–30% watermarks)
  • ~320 CPU cores wasted for 1+ hour (downscale delay)


The Fix

Send Spec.Replicas instead of Status.Replicas as CurrentReplicas to the recommender (replica_calculator.go:283).

-CurrentReplicas: target.Status.Replicas,
+CurrentReplicas: target.Spec.Replicas,

Status.Replicas includes transient surge pods that Kubernetes creates during rolling updates. Spec.Replicas is the intended replica count — and it is already what adjustReplicaCount uses as its baseline (line 319). Sending the same value to the recommender breaks the feedback loop.

When there is no rolling update in progress, Spec.Replicas == Status.Replicas, so this change has no effect in steady state. The recommender still receives the actual running pod count via CurrentReadyReplicas if it needs it.
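
A minimal sketch of the fixed input selection, using stand-in types (the field names CurrentReplicas/CurrentReadyReplicas come from the PR; the surrounding struct and function are assumptions, not the real controller code):

```go
package main

import "fmt"

// target mirrors the replica fields involved; in the real controller
// these come from the scaled workload's spec and status.
type target struct {
	SpecReplicas   int32 // intended count, the adjustReplicaCount baseline
	StatusReplicas int32 // includes transient surge pods during rollouts
	ReadyReplicas  int32 // pods actually ready
}

// recommenderInput returns the values sent to the local recommender.
// After the fix, CurrentReplicas is Spec.Replicas rather than
// Status.Replicas; ready pods are still reported separately.
func recommenderInput(t target) (currentReplicas, currentReadyReplicas int32) {
	return t.SpecReplicas, t.ReadyReplicas
}

func main() {
	// Mid-rollout with one surge pod: Spec=18, Status=19.
	cur, ready := recommenderInput(target{SpecReplicas: 18, StatusReplicas: 19, ReadyReplicas: 18})
	fmt.Println(cur, ready) // 18 18: matches the baseline, no false upscale

	// Steady state: Spec == Status, so the change is a no-op.
	cur, ready = recommenderInput(target{SpecReplicas: 18, StatusReplicas: 18, ReadyReplicas: 18})
	fmt.Println(cur, ready)
}
```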

Why not a post-hoc clamp?

An earlier approach added a clamp after adjustReplicaCount to suppress the false-upscale result. That approach fixed the symptom but left the recommender receiving stale/inflated input during every rollout. Sending Spec.Replicas is the correct fix at the source: the recommender gets accurate input and returns an accurate recommendation.


Fix replica escalation during rolling updates with local recommender

During rolling updates with maxSurge >= 1, the WPA enters a +1 feedback
loop that escalates replicas on every reconciliation cycle:

1. Status.Replicas = Spec.Replicas + 1 (surge pod)
2. WPA sends Status.Replicas as CurrentReplicas to the recommender
3. Recommender returns "hold at CurrentReplicas" (metric between watermarks)
4. WPA compares response against Spec.Replicas, sees an upscale
5. Sets Spec.Replicas = Status.Replicas, K8s creates new surge pod, loop repeats

Fix: Send Spec.Replicas (the intended replica count) instead of
Status.Replicas (which includes transient surge pods) as CurrentReplicas
to the recommender. This aligns the recommender's input with what
adjustReplicaCount uses as its baseline (target.Spec.Replicas at line 319),
breaking the feedback loop.

When there is no rolling update, Spec.Replicas == Status.Replicas so
this change has no effect. The recommender still receives ReadyReplicas
via CurrentReadyReplicas if it needs the actual running pod count.

Fixes: #333

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-authored-by: steven.blumenthal <steven.blumenthal@datadoghq.com>
(cherry picked from commit 70e8425)
@dd-octo-sts dd-octo-sts bot requested a review from a team as a code owner April 9, 2026 15:26
