Fix target scaler fallback race condition on multi-worker plans#3190

Open
alrod wants to merge 3 commits into dev from alrod/fix/target-scaler-race-condition

Conversation

alrod (Member) commented Apr 2, 2026

Problem

On Elastic Premium plans, when a target scaler throws NotSupportedException (e.g. ServiceBus without Manage claim), the fallback to incremental scale monitoring fails if the Scale Controller and primary host run on different workers.

_targetScalersInError is a static in-process field. Only the admin/host/scale/status code path (called by the Scale Controller) invokes target scalers and populates this set. The ScaleMonitorService.TakeMetricsSamplesAsync() path (running on the primary host) reads the set but never populates it. When these run on different workers, the primary host's copy stays empty — it never discovers the failure, excludes the incremental monitor, collects no metrics, and produces no scale-in votes. The app stays stuck at max workers indefinitely.
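The divergence can be illustrated with a minimal sketch. This is illustration only: each worker process holds its own copy of the static set, which is modelled here as two independent instances (the Worker class is a hypothetical stand-in, not a host type):

```csharp
using System;
using System.Collections.Generic;

// Illustration only: each worker process has its own copy of the
// static _targetScalersInError set, modelled as separate instances.
class Worker
{
    public HashSet<string> TargetScalersInError { get; } = new();
}

class Program
{
    static void Main()
    {
        var scaleControllerWorker = new Worker();
        var primaryHostWorker = new Worker();

        // The admin/host/scale/status path runs on the Scale Controller's
        // worker and records the NotSupportedException there...
        scaleControllerWorker.TargetScalersInError.Add("servicebus-queue1");

        // ...but the primary host's copy never sees it, so it keeps
        // excluding the incremental monitor and never casts scale-in votes.
        Console.WriteLine(scaleControllerWorker.TargetScalersInError.Count); // 1
        Console.WriteLine(primaryHostWorker.TargetScalersInError.Count);     // 0
    }
}
```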

Fix

  • Extract TryGetScaleResultAsync — single method that calls GetScaleResultAsync, catches NotSupportedException, and adds the scaler to _targetScalersInError. Eliminates code duplication between GetTargetScalersResultAsync and the new probe in ScaleMonitorService.

  • Probe target scalers in ScaleMonitorService — TakeMetricsSamplesAsync now periodically calls TryGetScaleResultAsync on target scalers so the primary host independently discovers failures. Runs on the first tick, then every 10 minutes, to balance detection speed against overhead.

  • Thread safety — replaced HashSet<string> with ConcurrentDictionary<string, byte> for _targetScalersInError. The read path (ContainsKey in GetScalersToSample) and write path (TryAdd in TryGetScaleResultAsync) can now execute concurrently without locks.

  • Cleanup — removed redundant re-initialization of _targetScalersInError in the constructor (field is already initialized at declaration; ScaleManager is a singleton so the constructor only runs once, but the assignment was unnecessary).
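The first and third bullets can be sketched together. This is a simplified sketch, not the host's actual code: ITargetScaler and UnmanageableScaler are hypothetical stand-ins, and only the catch/record shape from the PR description is reproduced:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical stand-ins for the host's target-scaler types.
interface ITargetScaler
{
    string Id { get; }
    Task<int> GetScaleResultAsync();
}

class UnmanageableScaler : ITargetScaler
{
    public string Id => "servicebus-queue1";
    // Simulates e.g. Service Bus without the Manage claim.
    public Task<int> GetScaleResultAsync() =>
        throw new NotSupportedException("Manage claim required");
}

static class ScalerProbe
{
    // ConcurrentDictionary used as a concurrent set: TryAdd on the write
    // path, ContainsKey on the read path, no explicit locks needed.
    public static readonly ConcurrentDictionary<string, byte> TargetScalersInError = new();

    public static async Task<int?> TryGetScaleResultAsync(ITargetScaler scaler)
    {
        try
        {
            return await scaler.GetScaleResultAsync();
        }
        catch (NotSupportedException)
        {
            TargetScalersInError.TryAdd(scaler.Id, 0); // record the fallback
            return null;
        }
    }
}

class Program
{
    static async Task Main()
    {
        var result = await ScalerProbe.TryGetScaleResultAsync(new UnmanageableScaler());
        Console.WriteLine(result is null);                                              // True
        Console.WriteLine(ScalerProbe.TargetScalersInError.ContainsKey("servicebus-queue1")); // True
    }
}
```

Both call sites (the Scale Controller status path and the new ScaleMonitorService probe) can share this one method, so the fallback is recorded identically regardless of which worker hits the failure first.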

Testing

  • Updated existing GetScalersToSample_FallsBackToMonitor_OnTargetScalerError assertion to match updated log message format.
  • Added OnTimer_ProbesTargetScalers_FallsBackToMonitor — verifies that ScaleMonitorService independently detects a faulty target scaler and falls back to incremental monitoring.
  • All 27 scale tests pass (21 ScaleManager + 6 ScaleMonitorService).

alrod and others added 3 commits April 1, 2026 17:01
ScaleMonitorService.TakeMetricsSamplesAsync never invoked target scalers,
so on multi-worker EP plans where the Scale Controller and primary host
run on different workers, the primary host's _targetScalersInError stayed
empty. This caused it to exclude the incremental monitor, collect no
metrics, and produce no scale-in votes - leaving apps stuck at max workers.

- Extract TryGetScaleResultAsync to eliminate duplication of the
  NotSupportedException catch/fallback logic
- Periodically probe target scalers in ScaleMonitorService (first tick,
  then every 10 minutes) so the primary host independently discovers
  failures
- Remove redundant re-initialization of _targetScalersInError in
  constructor

Co-authored-by: Dobby <dobby@microsoft.com>
Reset _nextTargetScalerValidationTime to DateTime.MinValue in test setup
alongside _targetScalersInError.Clear() to prevent flaky test failures
from test execution order.

Co-authored-by: Dobby <dobby@microsoft.com>
…ersInError

The read path (Contains) in GetScalersToSample was unlocked while
the write path (Add) in TryGetScaleResultAsync was locked. With
two writer paths now (GetScaleStatus and ScaleMonitorService),
concurrent read+write risk increases. ConcurrentDictionary removes
the need for locks entirely.

Co-authored-by: Dobby <dobby@microsoft.com>
```csharp
// even when the Scale Controller runs on a different worker.
// Runs on first tick, then every 10 minutes to avoid per-tick overhead
// while still catching runtime permission changes (e.g. managed identity revocation).
if (targetScalersToSample.Any() && DateTime.UtcNow >= _nextTargetScalerValidationTime)
```
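The cadence behind that guard can be sketched in isolation. ProbeGate is an illustrative name; per the PR, the real field is _nextTargetScalerValidationTime, initialized to DateTime.MinValue so the first tick always probes:

```csharp
using System;

// Minimal sketch of the probe cadence: probe on the first tick,
// then at most once per interval.
class ProbeGate
{
    static readonly TimeSpan Interval = TimeSpan.FromMinutes(10); // cadence from the PR
    DateTime _next = DateTime.MinValue; // MinValue => first tick always probes

    public bool ShouldProbe(DateTime utcNow)
    {
        if (utcNow < _next)
        {
            return false;
        }
        _next = utcNow.Add(Interval);
        return true;
    }
}

class Program
{
    static void Main()
    {
        var gate = new ProbeGate();
        var t0 = new DateTime(2026, 4, 1, 0, 0, 0, DateTimeKind.Utc);
        Console.WriteLine(gate.ShouldProbe(t0));                // True  (first tick)
        Console.WriteLine(gate.ShouldProbe(t0.AddMinutes(5)));  // False (within the window)
        Console.WriteLine(gate.ShouldProbe(t0.AddMinutes(10))); // True  (cadence elapsed)
    }
}
```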
mathewc (Member) commented:

This method is all about taking metrics samples for incremental scaling. It doesn't make any sense to be making TBS calls here. Also, this approach makes a bunch of service calls whose results are just thrown away.

Instead, why don't we address the fundamental issue, which is that the _targetScalersInError collection, as you're using it, is supposed to be shared state across N workers, similar to metrics samples and snapshots. So one thing you could do is store that info in storage in the same way, so it's accessible by all instances.

alrod (Member, Author) replied Apr 9, 2026

Thanks @mathewc. A couple of points on the probe approach:

  1. Service call frequency — the probe wasn't per-tick. It ran on first tick, then cached for 10 minutes (_nextTargetScalerValidationTime). So it was one TBS call per scaler per 10 minutes, not "a bunch." For comparison, the concurrency snapshot (_concurrencyStatusRepository.ReadAsync) does a blob storage read on every SC call (~10s) with no caching.

  2. Storage-backed approach — I agree it's architecturally cleaner. I've implemented it in #3191 (Fix target scaler fallback race condition via shared storage), following the BlobStorageConcurrencyStatusRepository / NullConcurrencyStatusRepository pattern. It is a significantly bigger change (~700 lines vs ~150), but it fixes the root cause and keeps TBS calls out of the metrics sampling path.
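The storage-backed pattern might look roughly like this. The interface and the in-memory implementation below are hypothetical sketches in the spirit of the repository pattern named above, not the actual #3191 code:

```csharp
using System.Collections.Generic;
using System;
using System.Threading.Tasks;

// Hypothetical repository: persist the set of failed target scalers so
// every worker (Scale Controller and primary host alike) reads one shared state.
interface ITargetScalerStatusRepository
{
    Task WriteAsync(ISet<string> scalersInError);
    Task<ISet<string>> ReadAsync();
}

// In-memory stand-in for a blob-backed implementation, for illustration only.
class InMemoryTargetScalerStatusRepository : ITargetScalerStatusRepository
{
    private ISet<string> _state = new HashSet<string>();

    public Task WriteAsync(ISet<string> scalersInError)
    {
        // A real implementation would serialize this to blob storage.
        _state = new HashSet<string>(scalersInError);
        return Task.CompletedTask;
    }

    public Task<ISet<string>> ReadAsync() =>
        Task.FromResult<ISet<string>>(new HashSet<string>(_state));
}

class Program
{
    static async Task Main()
    {
        ITargetScalerStatusRepository repo = new InMemoryTargetScalerStatusRepository();

        // "Worker A" (Scale Controller path) records a failed scaler...
        await repo.WriteAsync(new HashSet<string> { "servicebus-queue1" });

        // ..."Worker B" (primary host) sees it without any in-process shared field.
        var seen = await repo.ReadAsync();
        Console.WriteLine(seen.Contains("servicebus-queue1")); // True
    }
}
```

With state shared through storage, the probe in the metrics sampling path becomes unnecessary, which is what makes this the root-cause fix.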

