
Serialize concurrent ConfigureIndexesAsync with distributed lock + cache marker#244

Open
niemyjski wants to merge 2 commits into `main` from `feature/serialize-index-configuration`

Conversation

@niemyjski
Member

TL;DR

When multiple distributed processes (pods, workers, migration runners) call `ConfigureIndexesAsync` on startup, only the first one does real work. Everyone else waits and skips. The cache marker is cleared on delete, reindex, and maintenance so the next configure call re-validates.

Problem

`ConfigureIndexesAsync` had no concurrency protection. Every caller ran the full configure pass (index exists checks, create/update settings, put mappings, alias management, reindex detection) regardless of whether another process was already doing the same work. At scale (100 nodes, 100+ indexes), this meant thousands of redundant Elasticsearch admin API calls on every deployment, plus racing reindex work item enqueuing.

Solution

Added a distributed double-checked lock + cache marker pattern to `ElasticConfiguration.ConfigureIndexesAsync`:

  1. Cache check -- if the `configure-indexes` key exists in the distributed cache, return immediately (zero ES calls, zero lock overhead)
  2. Distributed lock -- acquire via `CacheLockProvider` (1-minute lock duration, 1-minute wait timeout)
  3. Double-check -- after acquiring the lock, check the cache again (another process may have finished while we waited)
  4. Configure -- run the full `ConfigureAsync` + `MaintainAsync` on all indexes in parallel
  5. Set cache marker -- 5-minute TTL, only set after successful completion (partial failures retry)
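The five steps above can be sketched as follows. This is an illustrative fragment, not the exact PR code: it assumes the Foundatio-style `Cache`, `_lockProvider`, `_logger`, and `Indexes` members visible in the diff, and omits the error-handling added in the second commit.

```csharp
// Sketch of the double-checked lock + cache marker flow (illustrative).
private const string ConfigureIndexesCacheKey = "configure-indexes";

public async Task ConfigureIndexesAsync()
{
    // 1. Fast path: a recent successful configure leaves the marker set.
    if (await Cache.ExistsAsync(ConfigureIndexesCacheKey))
        return;

    // 2. One configurer at a time across all processes (null on timeout).
    await using var configLock = await _lockProvider.AcquireAsync(
        ConfigureIndexesCacheKey, TimeSpan.FromMinutes(1), TimeSpan.FromMinutes(1));

    // 3. Double-check: the previous holder may have finished while we waited.
    if (configLock is not null && await Cache.ExistsAsync(ConfigureIndexesCacheKey))
        return;

    // 4. Real work, all indexes in parallel.
    await Task.WhenAll(Indexes.Select(i => i.ConfigureAsync()));

    // 5. Marker is only set after success; an exception above skips this line,
    //    so the next caller retries the whole pass. The 5-minute TTL bounds staleness.
    await Cache.SetAsync(ConfigureIndexesCacheKey, true, TimeSpan.FromMinutes(5));
}
```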

The cache marker is explicitly cleared by `DeleteIndexesAsync`, `ReindexAsync`, and `MaintainIndexesAsync` so the next configure call re-runs after any structural change.

Behavior at scale

| Caller | Cache lookups | Lock ops | ES API calls |
| --- | --- | --- | --- |
| Winner (first process) | 2 | acquire + release | Full configure |
| Waiter (during configure) | 2 | acquire + release | 0 |
| Late arrival (after configure) | 1 | 0 | 0 |

Failure modes

  • Cache down: Falls back to current behavior (everyone configures). Idempotent, just redundant.
  • Lock timeout: Proceeds with warning. Safe because \ConfigureAsync\ is idempotent.
  • Partial failure: Exception propagates, cache marker never set, next caller retries.
  • Manual ES changes: 5-minute TTL safety net ensures reconfiguration.
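The cache-down fallback in the first bullet can be sketched as a best-effort existence check. `TryMarkerExistsAsync` is a hypothetical helper name, not something in the PR; it assumes the same `Cache` and `_logger` members shown in the diff:

```csharp
// Best-effort marker check: a cache outage is treated as a miss, so callers
// fall through to the (idempotent) full configure pass instead of failing
// startup. Illustrative sketch only.
private async Task<bool> TryMarkerExistsAsync()
{
    try
    {
        return await Cache.ExistsAsync(ConfigureIndexesCacheKey);
    }
    catch (Exception ex)
    {
        _logger.LogWarning(ex, "Error checking configure-indexes cache marker, will configure indexes");
        return false; // treat a failing cache as a miss
    }
}
```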

Design decisions

  • No lock renewal: Even 100+ indexes in parallel complete in seconds. 1-minute lock is generous. If it takes longer, ES is unhealthy.
  • Explicit `indexes` parameter bypasses lock+cache: Callers who pass specific indexes know what they want.
  • No interface changes: The `IElasticConfiguration` signature is unchanged.

Changes

  • `ElasticConfiguration.ConfigureIndexesAsync` -- distributed lock + cache marker
  • `ElasticConfiguration.DeleteIndexesAsync` -- clears cache marker after deletion
  • `ElasticConfiguration.MaintainIndexesAsync` -- clears cache marker after maintenance
  • `ElasticConfiguration.ReindexAsync` -- clears cache marker after reindexing

…k + cache marker

Prevent redundant Elasticsearch admin API calls when multiple distributed processes call ConfigureIndexesAsync simultaneously on startup.
Contributor

Copilot AI left a comment


Pull request overview

Adds concurrency protection to Elasticsearch index configuration to prevent redundant admin calls when many distributed processes call ConfigureIndexesAsync at startup, using a distributed lock plus a short-lived cache marker.

Changes:

  • Serialize ElasticConfiguration.ConfigureIndexesAsync with a distributed lock and “configured recently” cache marker.
  • Clear the cache marker after index delete, maintenance, and reindex operations so subsequent configure runs re-validate.
  • Convert several methods to async and await the underlying Task.WhenAll to support the new post-work cache invalidation.
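The async conversion in the last bullet matters because a method that returns the `Task.WhenAll` task directly has no continuation point for post-work code. A simplified before/after sketch (method name from the PR, body illustrative):

```csharp
// Before: the combined task is returned directly, so nothing in this method
// can run after maintenance completes.
//   public Task MaintainIndexesAsync() =>
//       Task.WhenAll(Indexes.Select(i => i.MaintainAsync()));

// After: awaiting creates a point where the cache marker can be cleared
// once all maintenance work has finished.
public async Task MaintainIndexesAsync()
{
    await Task.WhenAll(Indexes.Select(i => i.MaintainAsync()));
    await Cache.RemoveAsync(ConfigureIndexesCacheKey);
}
```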


Comment on lines +215 to +217
```csharp
await Task.WhenAll(tasks).AnyContext();

await Cache.RemoveAsync(ConfigureIndexesCacheKey).AnyContext();
```

Copilot AI Mar 24, 2026


Cache marker removal only runs if all DeleteAsync tasks complete successfully. If deletion partially succeeds and one task faults, Task.WhenAll will throw and the marker won't be cleared, so later ConfigureIndexesAsync calls may skip even though the index set changed. Consider removing the marker in a finally block (or before deletions start) to guarantee invalidation on failures too.

Suggested change

```diff
-await Task.WhenAll(tasks).AnyContext();
-await Cache.RemoveAsync(ConfigureIndexesCacheKey).AnyContext();
+try
+{
+    await Task.WhenAll(tasks).AnyContext();
+}
+finally
+{
+    await Cache.RemoveAsync(ConfigureIndexesCacheKey).AnyContext();
+}
```

Comment on lines +138 to +160
```diff
 if (await Cache.ExistsAsync(ConfigureIndexesCacheKey).AnyContext())
     return;

 await using var configLock = await _lockProvider.AcquireAsync(ConfigureIndexesCacheKey,
     TimeSpan.FromMinutes(1), TimeSpan.FromMinutes(1)).AnyContext();

 if (configLock is not null)
 {
     if (await Cache.ExistsAsync(ConfigureIndexesCacheKey).AnyContext())
         return;
 }
 else
 {
     _logger.LogWarning("Unable to acquire configure-indexes lock, continuing without lock");
 }

 var allTasks = new List<Task>();
 foreach (var idx in Indexes)
     allTasks.Add(ConfigureIndexInternalAsync(idx, beginReindexingOutdated));

 await Task.WhenAll(allTasks).AnyContext();

-return Task.WhenAll(tasks);
 await Cache.SetAsync(ConfigureIndexesCacheKey, true, TimeSpan.FromMinutes(5)).AnyContext();
```

Copilot AI Mar 24, 2026


ConfigureIndexesAsync now depends on distributed cache operations (ExistsAsync/SetAsync). If the cache client throws (e.g., Redis outage), ConfigureIndexesAsync will fail even though configuring indexes is otherwise idempotent and should be able to proceed. The PR description says this should fall back to the pre-lock behavior when cache is down—consider wrapping the marker check/set in try/catch (log warning) and treating cache failures as a cache miss / best-effort set.

Comment on lines +197 to +203
```diff
 var tasks = new List<Task>();
 foreach (var idx in indexes)
     tasks.Add(idx.MaintainAsync());

-return Task.WhenAll(tasks);
 await Task.WhenAll(tasks).AnyContext();

 await Cache.RemoveAsync(ConfigureIndexesCacheKey).AnyContext();
```

Copilot AI Mar 24, 2026


Cache marker removal only runs if all MaintainAsync tasks complete successfully. If any index maintenance fails (or partially succeeds), the marker may remain set and subsequent ConfigureIndexesAsync calls could incorrectly skip reconfiguration. Consider clearing the cache marker in a finally block (or before starting maintenance) so it's removed even when Task.WhenAll throws.

…idation

Wrap all cache operations (ExistsAsync, SetAsync, RemoveAsync) in try/catch so cache outages degrade gracefully to pre-lock behavior instead of failing ConfigureIndexesAsync entirely. Move cache marker removal in DeleteIndexesAsync and MaintainIndexesAsync to finally blocks so partial failures still invalidate the marker. Extract InvalidateConfigureIndexesCacheAsync helper for consistent best-effort cache removal.
Comment on lines +143 to +146

```csharp
catch (Exception ex)
{
    _logger.LogWarning(ex, "Error checking configure-indexes cache marker, will configure indexes");
}
```

Comment on lines +154 to +157

```csharp
catch (Exception ex)
{
    _logger.LogWarning(ex, "Error acquiring configure-indexes lock, continuing without lock");
}
```

Comment on lines +168 to +171

```csharp
catch (Exception ex)
{
    _logger.LogWarning(ex, "Error checking configure-indexes cache marker after lock, will configure indexes");
}
```

Comment on lines +188 to +191

```csharp
catch (Exception ex)
{
    _logger.LogWarning(ex, "Error setting configure-indexes cache marker");
}
```

Comment on lines +306 to +309

```csharp
catch (Exception ex)
{
    _logger.LogWarning(ex, "Error removing configure-indexes cache marker");
}
```
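Based on the commit message's mention of an extracted `InvalidateConfigureIndexesCacheAsync` helper, the last catch block above suggests the helper looks roughly like this. This is a sketch inferred from the fragments shown, not the exact committed code:

```csharp
// Best-effort invalidation: removal failures are logged rather than thrown,
// because the 5-minute TTL already bounds how long a stale marker can
// suppress reconfiguration.
private async Task InvalidateConfigureIndexesCacheAsync()
{
    try
    {
        await Cache.RemoveAsync(ConfigureIndexesCacheKey).AnyContext();
    }
    catch (Exception ex)
    {
        _logger.LogWarning(ex, "Error removing configure-indexes cache marker");
    }
}
```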
