Fix target scaler fallback race condition on multi-worker plans#3190

Open
alrod wants to merge 3 commits into dev from alrod/fix/target-scaler-race-condition

Conversation

alrod (Member) commented Apr 2, 2026

Problem

On Elastic Premium plans, when a target scaler throws NotSupportedException (e.g. ServiceBus without Manage claim), the fallback to incremental scale monitoring fails if the Scale Controller and primary host run on different workers.

_targetScalersInError is a static in-process field. Only the admin/host/scale/status code path (called by the Scale Controller) invokes target scalers and populates this set. The ScaleMonitorService.TakeMetricsSamplesAsync() path (running on the primary host) reads the set but never populates it. When these run on different workers, the primary host's copy stays empty — it never discovers the failure, excludes the incremental monitor, collects no metrics, and produces no scale-in votes. The app stays stuck at max workers indefinitely.
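The divergence can be illustrated with a minimal sketch. This is illustration only: each worker process holds its own copy of the static set, which is modelled here as two independent instances (the Worker class is a hypothetical stand-in, not a host type):

```csharp
using System;
using System.Collections.Generic;

// Illustration only: each worker process has its own copy of the
// static _targetScalersInError set, modelled as separate instances.
class Worker
{
    public HashSet<string> TargetScalersInError { get; } = new();
}

class Program
{
    static void Main()
    {
        var scaleControllerWorker = new Worker();
        var primaryHostWorker = new Worker();

        // The admin/host/scale/status path runs on the Scale Controller's
        // worker and records the NotSupportedException there...
        scaleControllerWorker.TargetScalersInError.Add("servicebus-queue1");

        // ...but the primary host's copy never sees it, so it keeps
        // excluding the incremental monitor and never casts scale-in votes.
        Console.WriteLine(scaleControllerWorker.TargetScalersInError.Count); // 1
        Console.WriteLine(primaryHostWorker.TargetScalersInError.Count);     // 0
    }
}
```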

Fix

  • Extract TryGetScaleResultAsync — single method that calls GetScaleResultAsync, catches NotSupportedException, and adds the scaler to _targetScalersInError. Eliminates code duplication between GetTargetScalersResultAsync and the new probe in ScaleMonitorService.

  • Probe target scalers in ScaleMonitorService — TakeMetricsSamplesAsync now periodically calls TryGetScaleResultAsync on target scalers so the primary host independently discovers failures. Runs on the first tick, then every 10 minutes, to balance detection speed against overhead.

  • Thread safety — replaced HashSet<string> with ConcurrentDictionary<string, byte> for _targetScalersInError. The read path (ContainsKey in GetScalersToSample) and write path (TryAdd in TryGetScaleResultAsync) can now execute concurrently without locks.

  • Cleanup — removed redundant re-initialization of _targetScalersInError in the constructor (field is already initialized at declaration; ScaleManager is a singleton so the constructor only runs once, but the assignment was unnecessary).
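The first and third bullets can be sketched together. This is a simplified sketch, not the host's actual code: ITargetScaler and UnmanageableScaler are hypothetical stand-ins, and only the catch/record shape from the PR description is reproduced:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical stand-ins for the host's target-scaler types.
interface ITargetScaler
{
    string Id { get; }
    Task<int> GetScaleResultAsync();
}

class UnmanageableScaler : ITargetScaler
{
    public string Id => "servicebus-queue1";
    // Simulates e.g. Service Bus without the Manage claim.
    public Task<int> GetScaleResultAsync() =>
        throw new NotSupportedException("Manage claim required");
}

static class ScalerProbe
{
    // ConcurrentDictionary used as a concurrent set: TryAdd on the write
    // path, ContainsKey on the read path, no explicit locks needed.
    public static readonly ConcurrentDictionary<string, byte> TargetScalersInError = new();

    public static async Task<int?> TryGetScaleResultAsync(ITargetScaler scaler)
    {
        try
        {
            return await scaler.GetScaleResultAsync();
        }
        catch (NotSupportedException)
        {
            TargetScalersInError.TryAdd(scaler.Id, 0); // record the fallback
            return null;
        }
    }
}

class Program
{
    static async Task Main()
    {
        var result = await ScalerProbe.TryGetScaleResultAsync(new UnmanageableScaler());
        Console.WriteLine(result is null);                                              // True
        Console.WriteLine(ScalerProbe.TargetScalersInError.ContainsKey("servicebus-queue1")); // True
    }
}
```

Both call sites (the Scale Controller status path and the new ScaleMonitorService probe) can share this one method, so the fallback is recorded identically regardless of which worker hits the failure first.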

Testing

  • Updated existing GetScalersToSample_FallsBackToMonitor_OnTargetScalerError assertion to match updated log message format.
  • Added OnTimer_ProbesTargetScalers_FallsBackToMonitor — verifies that ScaleMonitorService independently detects a faulty target scaler and falls back to incremental monitoring.
  • All 27 scale tests pass (21 ScaleManager + 6 ScaleMonitorService).

alrod and others added 3 commits April 1, 2026 17:01
ScaleMonitorService.TakeMetricsSamplesAsync never invoked target scalers,
so on multi-worker EP plans where the Scale Controller and primary host
run on different workers, the primary host's _targetScalersInError stayed
empty. This caused it to exclude the incremental monitor, collect no
metrics, and produce no scale-in votes - leaving apps stuck at max workers.

- Extract TryGetScaleResultAsync to eliminate duplication of the
  NotSupportedException catch/fallback logic
- Periodically probe target scalers in ScaleMonitorService (first tick,
  then every 10 minutes) so the primary host independently discovers
  failures
- Remove redundant re-initialization of _targetScalersInError in
  constructor

Co-authored-by: Dobby <dobby@microsoft.com>
Reset _nextTargetScalerValidationTime to DateTime.MinValue in test setup
alongside _targetScalersInError.Clear() to prevent flaky test failures
from test execution order.

Co-authored-by: Dobby <dobby@microsoft.com>
…ersInError

The read path (Contains) in GetScalersToSample was unlocked while
the write path (Add) in TryGetScaleResultAsync was locked. With
two writer paths now (GetScaleStatus and ScaleMonitorService),
concurrent read+write risk increases. ConcurrentDictionary removes
the need for locks entirely.

Co-authored-by: Dobby <dobby@microsoft.com>
```csharp
// even when the Scale Controller runs on a different worker.
// Runs on first tick, then every 10 minutes to avoid per-tick overhead
// while still catching runtime permission changes (e.g. managed identity revocation).
if (targetScalersToSample.Any() && DateTime.UtcNow >= _nextTargetScalerValidationTime)
```
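The cadence behind that guard can be sketched in isolation. ProbeGate is an illustrative name; per the PR, the real field is _nextTargetScalerValidationTime, initialized to DateTime.MinValue so the first tick always probes:

```csharp
using System;

// Minimal sketch of the probe cadence: probe on the first tick,
// then at most once per interval.
class ProbeGate
{
    static readonly TimeSpan Interval = TimeSpan.FromMinutes(10); // cadence from the PR
    DateTime _next = DateTime.MinValue; // MinValue => first tick always probes

    public bool ShouldProbe(DateTime utcNow)
    {
        if (utcNow < _next)
        {
            return false;
        }
        _next = utcNow.Add(Interval);
        return true;
    }
}

class Program
{
    static void Main()
    {
        var gate = new ProbeGate();
        var t0 = new DateTime(2026, 4, 1, 0, 0, 0, DateTimeKind.Utc);
        Console.WriteLine(gate.ShouldProbe(t0));                // True  (first tick)
        Console.WriteLine(gate.ShouldProbe(t0.AddMinutes(5)));  // False (within the window)
        Console.WriteLine(gate.ShouldProbe(t0.AddMinutes(10))); // True  (cadence elapsed)
    }
}
```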
mathewc (Member) commented:

This method is all about taking metrics samples for incremental scaling. It doesn't make any sense to be making TBS calls here. Also, this approach makes a bunch of service calls whose results are just thrown away.

Instead, why don't we address the fundamental issue, which is that the _targetScalersInError collection, as you're using it, is supposed to be shared state across N workers, similar to metrics samples and snapshots. So one thing you could do is store that info in storage in the same way, so it's accessible by all instances.

alrod (Member, Author) replied Apr 9, 2026

Thanks @mathewc. A couple of points on the probe approach:

  1. Service call frequency — the probe wasn't per-tick. It ran on first tick, then cached for 10 minutes (_nextTargetScalerValidationTime). So it was one TBS call per scaler per 10 minutes, not "a bunch." For comparison, the concurrency snapshot (_concurrencyStatusRepository.ReadAsync) does a blob storage read on every SC call (~10s) with no caching.

  2. Storage-backed approach — I agree it's architecturally cleaner. I've implemented it in #3191 (Fix target scaler fallback race condition via shared storage), following the BlobStorageConcurrencyStatusRepository / NullConcurrencyStatusRepository pattern. It is a significantly bigger change (~700 lines vs ~150), but it fixes the root cause and keeps TBS calls out of the metrics sampling path.
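The storage-backed pattern might look roughly like this. The interface and the in-memory implementation below are hypothetical sketches in the spirit of the repository pattern named above, not the actual #3191 code:

```csharp
using System.Collections.Generic;
using System;
using System.Threading.Tasks;

// Hypothetical repository: persist the set of failed target scalers so
// every worker (Scale Controller and primary host alike) reads one shared state.
interface ITargetScalerStatusRepository
{
    Task WriteAsync(ISet<string> scalersInError);
    Task<ISet<string>> ReadAsync();
}

// In-memory stand-in for a blob-backed implementation, for illustration only.
class InMemoryTargetScalerStatusRepository : ITargetScalerStatusRepository
{
    private ISet<string> _state = new HashSet<string>();

    public Task WriteAsync(ISet<string> scalersInError)
    {
        // A real implementation would serialize this to blob storage.
        _state = new HashSet<string>(scalersInError);
        return Task.CompletedTask;
    }

    public Task<ISet<string>> ReadAsync() =>
        Task.FromResult<ISet<string>>(new HashSet<string>(_state));
}

class Program
{
    static async Task Main()
    {
        ITargetScalerStatusRepository repo = new InMemoryTargetScalerStatusRepository();

        // "Worker A" (Scale Controller path) records a failed scaler...
        await repo.WriteAsync(new HashSet<string> { "servicebus-queue1" });

        // ..."Worker B" (primary host) sees it without any in-process shared field.
        var seen = await repo.ReadAsync();
        Console.WriteLine(seen.Contains("servicebus-queue1")); // True
    }
}
```

With state shared through storage, the probe in the metrics sampling path becomes unnecessary, which is what makes this the root-cause fix.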

