Serialize concurrent ConfigureIndexesAsync with distributed lock + cache marker #244
Conversation
…k + cache marker

Prevent redundant Elasticsearch admin API calls when multiple distributed processes call `ConfigureIndexesAsync` simultaneously on startup.
Pull request overview
Adds concurrency protection to Elasticsearch index configuration to prevent redundant admin calls when many distributed processes call ConfigureIndexesAsync at startup, using a distributed lock plus a short-lived cache marker.
Changes:
- Serialize `ElasticConfiguration.ConfigureIndexesAsync` with a distributed lock and a "configured recently" cache marker.
- Clear the cache marker after index delete, maintenance, and reindex operations so subsequent configure runs re-validate.
- Convert several methods to `async` and `await` the underlying `Task.WhenAll` to support the new post-work cache invalidation.
```diff
 await Task.WhenAll(tasks).AnyContext();

+await Cache.RemoveAsync(ConfigureIndexesCacheKey).AnyContext();
```
Cache marker removal only runs if all DeleteAsync tasks complete successfully. If deletion partially succeeds and one task faults, Task.WhenAll will throw and the marker won't be cleared, so later ConfigureIndexesAsync calls may skip even though the index set changed. Consider removing the marker in a finally block (or before deletions start) to guarantee invalidation on failures too.
Suggested change:

```diff
-await Task.WhenAll(tasks).AnyContext();
-await Cache.RemoveAsync(ConfigureIndexesCacheKey).AnyContext();
+try
+{
+    await Task.WhenAll(tasks).AnyContext();
+}
+finally
+{
+    await Cache.RemoveAsync(ConfigureIndexesCacheKey).AnyContext();
+}
```
```diff
 if (await Cache.ExistsAsync(ConfigureIndexesCacheKey).AnyContext())
     return;

 await using var configLock = await _lockProvider.AcquireAsync(ConfigureIndexesCacheKey,
     TimeSpan.FromMinutes(1), TimeSpan.FromMinutes(1)).AnyContext();

 if (configLock is not null)
 {
     if (await Cache.ExistsAsync(ConfigureIndexesCacheKey).AnyContext())
         return;
 }
 else
 {
     _logger.LogWarning("Unable to acquire configure-indexes lock, continuing without lock");
 }

 var allTasks = new List<Task>();
 foreach (var idx in Indexes)
     allTasks.Add(ConfigureIndexInternalAsync(idx, beginReindexingOutdated));

 await Task.WhenAll(allTasks).AnyContext();

-return Task.WhenAll(tasks);
+await Cache.SetAsync(ConfigureIndexesCacheKey, true, TimeSpan.FromMinutes(5)).AnyContext();
```
ConfigureIndexesAsync now depends on distributed cache operations (ExistsAsync/SetAsync). If the cache client throws (e.g., Redis outage), ConfigureIndexesAsync will fail even though configuring indexes is otherwise idempotent and should be able to proceed. The PR description says this should fall back to the pre-lock behavior when cache is down—consider wrapping the marker check/set in try/catch (log warning) and treating cache failures as a cache miss / best-effort set.
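A best-effort wrapper in the spirit of that suggestion might look like the following sketch. It assumes the class's existing `Cache` client, `_logger`, and `AnyContext()` extension; `SafeCacheExistsAsync` is a hypothetical helper name, not part of this PR:

```csharp
// Sketch: treat any cache failure as a cache miss so a Redis outage
// degrades to the pre-lock behavior instead of failing ConfigureIndexesAsync.
// Hypothetical helper; name and placement are assumptions, not from the diff.
private async Task<bool> SafeCacheExistsAsync(string key)
{
    try
    {
        return await Cache.ExistsAsync(key).AnyContext();
    }
    catch (Exception ex)
    {
        _logger.LogWarning(ex, "Error checking configure-indexes cache marker, treating as miss");
        return false; // miss => proceed with configuration
    }
}
```

Callers would then write `if (await SafeCacheExistsAsync(ConfigureIndexesCacheKey)) return;`, so a cache outage never blocks startup; the `SetAsync` on the way out would get the same best-effort treatment.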
```diff
 var tasks = new List<Task>();
 foreach (var idx in indexes)
     tasks.Add(idx.MaintainAsync());

-return Task.WhenAll(tasks);
+await Task.WhenAll(tasks).AnyContext();
+
+await Cache.RemoveAsync(ConfigureIndexesCacheKey).AnyContext();
```
Cache marker removal only runs if all MaintainAsync tasks complete successfully. If any index maintenance fails (or partially succeeds), the marker may remain set and subsequent ConfigureIndexesAsync calls could incorrectly skip reconfiguration. Consider clearing the cache marker in a finally block (or before starting maintenance) so it's removed even when Task.WhenAll throws.
…idation

Wrap all cache operations (`ExistsAsync`, `SetAsync`, `RemoveAsync`) in try/catch so cache outages degrade gracefully to pre-lock behavior instead of failing `ConfigureIndexesAsync` entirely. Move cache marker removal in `DeleteIndexesAsync` and `MaintainIndexesAsync` to `finally` blocks so partial failures still invalidate the marker. Extract an `InvalidateConfigureIndexesCacheAsync` helper for consistent best-effort cache removal.
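The `InvalidateConfigureIndexesCacheAsync` helper named in that commit could plausibly look like this sketch; the body is inferred from the commit message and the catch handlers below, not copied from the diff:

```csharp
// Best-effort removal of the "configured recently" marker; never throws,
// so the delete/maintain/reindex paths cannot fail on a cache outage.
// Body is an assumption reconstructed from the commit message.
private async Task InvalidateConfigureIndexesCacheAsync()
{
    try
    {
        await Cache.RemoveAsync(ConfigureIndexesCacheKey).AnyContext();
    }
    catch (Exception ex)
    {
        _logger.LogWarning(ex, "Error removing configure-indexes cache marker");
    }
}
```

Calling this from the `finally` blocks keeps the invalidation guarantee even when `Task.WhenAll` throws.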
The new catch handlers across these call sites:

```csharp
catch (Exception ex)
{
    _logger.LogWarning(ex, "Error checking configure-indexes cache marker, will configure indexes");
}

catch (Exception ex)
{
    _logger.LogWarning(ex, "Error acquiring configure-indexes lock, continuing without lock");
}

catch (Exception ex)
{
    _logger.LogWarning(ex, "Error checking configure-indexes cache marker after lock, will configure indexes");
}

catch (Exception ex)
{
    _logger.LogWarning(ex, "Error setting configure-indexes cache marker");
}

catch (Exception ex)
{
    _logger.LogWarning(ex, "Error removing configure-indexes cache marker");
}
```
TLDR
When multiple distributed processes (pods, workers, migration runners) call `ConfigureIndexesAsync` on startup, only the first one does real work; everyone else waits and skips. The cache marker is cleared on delete, reindex, and maintenance so the next configure call re-validates.
Problem
`ConfigureIndexesAsync` had no concurrency protection. Every caller ran the full configure pass (index-exists checks, create/update settings, put mappings, alias management, reindex detection) regardless of whether another process was already doing the same work. At scale (100 nodes, 100+ indexes), this means thousands of redundant Elasticsearch admin API calls on every deployment, plus racing reindex work-item enqueuing.
Solution
Added a distributed double-checked lock + cache marker pattern to `ElasticConfiguration.ConfigureIndexesAsync`.
The cache marker is explicitly cleared by `DeleteIndexesAsync`, `ReindexAsync`, and `MaintainIndexesAsync` so the next configure call re-runs after any structural change.
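Condensed from the diff in this PR, the resulting flow is roughly the following sketch (the per-index configuration body is elided; timeouts and key names match the diff):

```csharp
public async Task ConfigureIndexesAsync(bool beginReindexingOutdated = true)
{
    // Fast path: another process configured indexes within the last 5 minutes.
    if (await Cache.ExistsAsync(ConfigureIndexesCacheKey).AnyContext())
        return;

    // Serialize configuration across processes; waiters block up to 1 minute.
    await using var configLock = await _lockProvider.AcquireAsync(ConfigureIndexesCacheKey,
        TimeSpan.FromMinutes(1), TimeSpan.FromMinutes(1)).AnyContext();

    // Double-check: the first lock holder likely set the marker while we waited.
    if (configLock is not null && await Cache.ExistsAsync(ConfigureIndexesCacheKey).AnyContext())
        return;

    // ... configure each index (exists checks, settings, mappings, aliases) ...

    // Mark success so other processes skip for the next 5 minutes.
    await Cache.SetAsync(ConfigureIndexesCacheKey, true, TimeSpan.FromMinutes(5)).AnyContext();
}
```

The second `ExistsAsync` check is what makes this "double-checked": waiters that queued behind the first holder re-read the marker under the lock instead of repeating the work.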
Behavior at scale
Failure modes
Design decisions
Changes