Fix target scaler fallback race condition on multi-worker plans#3190
Conversation
`ScaleMonitorService.TakeMetricsSamplesAsync` never invoked target scalers, so on multi-worker EP plans where the Scale Controller and primary host run on different workers, the primary host's `_targetScalersInError` stayed empty. This caused it to exclude the incremental monitor, collect no metrics, and produce no scale-in votes, leaving apps stuck at max workers.

- Extract `TryGetScaleResultAsync` to eliminate duplication of the `NotSupportedException` catch/fallback logic
- Periodically probe target scalers in `ScaleMonitorService` (first tick, then every 10 minutes) so the primary host independently discovers failures
- Remove redundant re-initialization of `_targetScalersInError` in constructor

Co-authored-by: Dobby <dobby@microsoft.com>
Reset _nextTargetScalerValidationTime to DateTime.MinValue in test setup alongside _targetScalersInError.Clear() to prevent flaky test failures from test execution order. Co-authored-by: Dobby <dobby@microsoft.com>
…ersInError The read path (Contains) in GetScalersToSample was unlocked while the write path (Add) in TryGetScaleResultAsync was locked. With two writer paths now (GetScaleStatus and ScaleMonitorService), concurrent read+write risk increases. ConcurrentDictionary removes the need for locks entirely. Co-authored-by: Dobby <dobby@microsoft.com>
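The lock-free pattern this commit describes can be sketched as follows. The field name and the read/write call sites mirror the PR description; the wrapper class and the `MarkInError`/`IsInError` helper names are illustrative only.

```csharp
using System.Collections.Concurrent;

// Minimal sketch of using ConcurrentDictionary as a lock-free set.
// _targetScalersInError matches the PR; everything else is elided.
public class ScaleManagerSketch
{
    // byte is a throwaway value type; the dictionary is used purely as a set.
    private static readonly ConcurrentDictionary<string, byte> _targetScalersInError =
        new ConcurrentDictionary<string, byte>();

    // Write path (TryGetScaleResultAsync in the PR): atomic, idempotent add.
    public static void MarkInError(string functionId) =>
        _targetScalersInError.TryAdd(functionId, 0);

    // Read path (GetScalersToSample in the PR): membership check without a lock.
    public static bool IsInError(string functionId) =>
        _targetScalersInError.ContainsKey(functionId);
}
```

`TryAdd` and `ContainsKey` are individually thread-safe, so the previously unlocked read path and locked write path can run concurrently with no explicit synchronization.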
```csharp
// even when the Scale Controller runs on a different worker.
// Runs on first tick, then every 10 minutes to avoid per-tick overhead
// while still catching runtime permission changes (e.g. managed identity revocation).
if (targetScalersToSample.Any() && DateTime.UtcNow >= _nextTargetScalerValidationTime)
```
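The throttle guarding this probe can be sketched as below. The field starts at `DateTime.MinValue` so the check passes on the first tick, then is pushed 10 minutes into the future. `ShouldProbe` and the wrapper class are hypothetical names; the PR performs this check inline in `TakeMetricsSamplesAsync`.

```csharp
using System;

// Sketch of the first-tick-then-every-10-minutes throttle (illustrative names).
public class ProbeThrottleSketch
{
    private static readonly TimeSpan ValidationInterval = TimeSpan.FromMinutes(10);

    // DateTime.MinValue guarantees the first comparison succeeds.
    private DateTime _nextTargetScalerValidationTime = DateTime.MinValue;

    public bool ShouldProbe()
    {
        if (DateTime.UtcNow < _nextTargetScalerValidationTime)
        {
            return false;
        }

        // Defer the next probe; the probe itself runs at most once per interval.
        _nextTargetScalerValidationTime = DateTime.UtcNow.Add(ValidationInterval);
        return true;
    }
}
```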
This method is all about taking metrics samples for incremental scaling. It doesn't make any sense to be making TBS calls here. Also, this approach makes a bunch of service calls whose results are just thrown away.
Instead, why don't we address the fundamental issue, which is that the _targetScalersInError collection, as you're using it, is supposed to be shared state across N workers, similar to metrics samples and snapshots. So one thing you could do is store that info in storage in the same way so it's accessible by all instances.
Thanks @mathewc. A couple of points on the probe approach:

- **Service call frequency**: the probe wasn't per-tick. It ran on first tick, then cached for 10 minutes (`_nextTargetScalerValidationTime`). So it was one TBS call per scaler per 10 minutes, not "a bunch." For comparison, the concurrency snapshot (`_concurrencyStatusRepository.ReadAsync`) does a blob storage read on every SC call (~10s) with no caching.
- **Storage-backed approach**: I agree it's architecturally cleaner. I've implemented it in Fix target scaler fallback race condition via shared storage #3191, following the `BlobStorageConcurrencyStatusRepository`/`NullConcurrencyStatusRepository` pattern. It is a significantly bigger change (~700 lines vs ~150), but it fixes the root cause and keeps TBS calls out of the metrics sampling path.
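The storage-backed approach mentioned above might take roughly the following shape, mirroring the repository pattern it cites. Every name below is illustrative, not the actual #3191 API: the point is only that the error set becomes shared state read and written through storage rather than a per-process field.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical repository abstraction for the shared error set, sketched
// after the BlobStorageConcurrencyStatusRepository / NullConcurrencyStatusRepository
// split: a blob-backed implementation plus a no-op fallback when storage
// is unavailable. Names are assumptions for illustration.
public interface ITargetScalerStatusRepository
{
    // Read the shared set of target scalers known to be in error.
    Task<ISet<string>> ReadAsync(CancellationToken cancellationToken);

    // Persist the set so every worker observes the same fallback state.
    Task WriteAsync(ISet<string> targetScalersInError, CancellationToken cancellationToken);
}
```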
Problem

On Elastic Premium plans, when a target scaler throws `NotSupportedException` (e.g. ServiceBus without the Manage claim), the fallback to incremental scale monitoring fails if the Scale Controller and primary host run on different workers. `_targetScalersInError` is a static in-process field. Only the `admin/host/scale/status` code path (called by the Scale Controller) invokes target scalers and populates this set. The `ScaleMonitorService.TakeMetricsSamplesAsync()` path (running on the primary host) reads the set but never populates it. When these run on different workers, the primary host's copy stays empty: it never discovers the failure, excludes the incremental monitor, collects no metrics, and produces no scale-in votes. The app stays stuck at max workers indefinitely.

Fix

- **Extract `TryGetScaleResultAsync`**: a single method that calls `GetScaleResultAsync`, catches `NotSupportedException`, and adds the scaler to `_targetScalersInError`. Eliminates code duplication between `GetTargetScalersResultAsync` and the new probe in `ScaleMonitorService`.
- **Probe target scalers in `ScaleMonitorService`**: `TakeMetricsSamplesAsync` now periodically calls `TryGetScaleResultAsync` on target scalers so the primary host independently discovers failures. Runs on the first tick, then every 10 minutes to balance detection speed against overhead.
- **Thread safety**: replaced `HashSet<string>` with `ConcurrentDictionary<string, byte>` for `_targetScalersInError`. The read path (`ContainsKey` in `GetScalersToSample`) and write path (`TryAdd` in `TryGetScaleResultAsync`) can now execute concurrently without locks.
- **Cleanup**: removed the redundant re-initialization of `_targetScalersInError` in the constructor (the field is already initialized at declaration; `ScaleManager` is a singleton so the constructor only runs once, but the assignment was unnecessary).

Testing

- Updated the `GetScalersToSample_FallsBackToMonitor_OnTargetScalerError` assertion to match the updated log message format.
- Added `OnTimer_ProbesTargetScalers_FallsBackToMonitor`: verifies that `ScaleMonitorService` independently detects a faulty target scaler and falls back to incremental monitoring.
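The extracted fallback helper described under Fix can be sketched as follows. The `ITargetScaler`/`TargetScalerContext`/`TargetScalerResult` types are the WebJobs target-based-scaling abstractions; the exact method signature in the PR may differ from this sketch.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs.Host.Scale;

// Sketch of TryGetScaleResultAsync: one place that invokes a target scaler
// and records a NotSupportedException so callers fall back to incremental
// monitoring. Surrounding class members are elided.
private static readonly ConcurrentDictionary<string, byte> _targetScalersInError =
    new ConcurrentDictionary<string, byte>();

private async Task<TargetScalerResult> TryGetScaleResultAsync(
    ITargetScaler scaler, TargetScalerContext context)
{
    try
    {
        return await scaler.GetScaleResultAsync(context);
    }
    catch (NotSupportedException)
    {
        // Record the failure so GetScalersToSample excludes this target scaler
        // and falls back to the incremental scale monitor. TryAdd is safe under
        // concurrent calls from GetScaleStatus and ScaleMonitorService.
        _targetScalersInError.TryAdd(scaler.TargetScalerDescriptor.FunctionId, 0);
        return null;
    }
}
```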