Skip to content

Fix CI flakiness: async race conditions, disposal crash, and stdout truncation#9939

Open
mmitche wants to merge 38 commits intomainfrom
feature/ci-flakiness
Open

Fix CI flakiness: async race conditions, disposal crash, and stdout truncation#9939
mmitche wants to merge 38 commits intomainfrom
feature/ci-flakiness

Conversation

@mmitche
Copy link
Member

@mmitche mmitche commented Mar 7, 2026

Summary

Fixes ~63% CI failure rate on the main branch caused by three independent race conditions.

Root Causes Identified

1. Parallel.For async void anti-pattern (TemplatePackageManager.cs)

Parallel.For(0, N, async (i) => ...) converts the lambda to async void. Parallel.For returns immediately after the synchronous portion of each iteration completes (up to the first await), not after the async work finishes. This leaves scanResults[] partially populated with nulls, causing OrderOfScanningIsCorrect test failures (expected sample99, got random values like sample69).

Fix: Replace with await Task.WhenAll(Enumerable.Range(...).Select(async ...))

2. GlobalSettings.FileChanged disposal race condition (GlobalSettings.cs)

FileChanged is an async void callback from a file watcher. The original Dispose() called _watcher.Dispose() before setting _disposed = true. The callback could fire between these two operations, hitting ObjectDisposedException on the SettingsChanged event handler. Since the exception is unhandled in async void, it crashes the entire xunit test runner process ΓÇö killing ALL Edge.UnitTests, not just the file watcher test.

Fix: Set _disposed = true before watcher disposal, add disposed check and ObjectDisposedException catch in FileChanged

3. Console output truncation in CLI commands (ExecutableCommand.cs)

SimpleConsoleLogger uses a background queue thread for console output. LoggerFactory.Dispose() drains the queue but the Console.Out pipe buffer may not be flushed before process exit. This causes mid-line output truncation in ValidateCommand_BasicTest integration tests.

Fix: Explicit Console.Out.Flush()/Console.Error.Flush() after logger factory disposal

Validation

10 consecutive CI passes on this branch (builds 1324248-1324572), compared to ~63% failure rate (20/32 failures) on the main branch over the same period.

Metric Before After
Failure rate ~63% (20/32) 0% (0/10)
Consecutive passes 1-2 max 10+
Test runner crashes Frequent None

Files Changed

  • src/Microsoft.TemplateEngine.Edge/Settings/TemplatePackageManager.cs ΓÇö Parallel.For to Task.WhenAll
  • src/Microsoft.TemplateEngine.Edge/BuiltInManagedProvider/GlobalSettings.cs ΓÇö Disposal ordering + FileChanged guards
  • test/Microsoft.TemplateEngine.Edge.UnitTests/GlobalSettingsTests.cs ΓÇö Robustified TestFileWatcher event handler
  • tools/Microsoft.TemplateEngine.Authoring.CLI/Commands/ExecutableCommand.cs ΓÇö Console flush after logger disposal

…runcation

Root Cause Analysis of ~63% CI failure rate on main branch:

1. Parallel.For async void anti-pattern (TemplatePackageManager.cs)
   - Parallel.For(0, N, async (i) => ...) converts lambda to async void
   - Parallel.For returns after synchronous portion, before awaits complete
   - scanResults[] left partially null, causing OrderOfScanningIsCorrect failures
   - Fix: Replace with await Task.WhenAll(Enumerable.Range(...).Select(async ...))

2. GlobalSettings.FileChanged disposal race condition (GlobalSettings.cs)
   - FileChanged is async void callback from file watcher
   - Original Dispose() called _watcher.Dispose() before setting _disposed = true
   - Callback could fire between these, hitting ObjectDisposedException
   - Unhandled exception in async void crashes entire xunit test runner process
   - Fix: Set _disposed = true before watcher disposal, add disposed checks
     and try-catch in FileChanged callback

3. Console output truncation in CLI commands (ExecutableCommand.cs)
   - SimpleConsoleLogger uses background queue thread for output
   - LoggerFactory.Dispose() drains queue but Console.Out pipe buffer not flushed
   - Process exit before flush causes mid-line truncation in integration tests
   - Fix: Explicit Console.Out.Flush()/Console.Error.Flush() after disposal

Validation: 10 consecutive CI passes (builds 1324248-1324572) on this branch,
compared to ~63% failure rate (20/32 failures) on main branch.

Co-authored-by: Copilot <[email protected]>
@mmitche mmitche requested a review from a team as a code owner March 7, 2026 00:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant