Skip to content

fix(cloudwatchlogs): Prevent Publish from deadlocking getDest#2034

Closed
the-mann wants to merge 1 commit intomainfrom
fix/publish-deadlock
Closed

fix(cloudwatchlogs): Prevent Publish from deadlocking getDest#2034
the-mann wants to merge 1 commit intomainfrom
fix/publish-deadlock

Conversation

@the-mann
Copy link
Contributor

@the-mann the-mann commented Feb 20, 2026

Summary

Fix deadlock where Publish blocks on a full queue while holding the cwDest mutex, preventing getDest from acquiring the same mutex and starving all log destinations.

Problem

PR #1848 added Lock()/Unlock() to cwDest.Publish() to prevent sending on a closed channel. However, Publish calls AddEvent which blocks when the queue channel (capacity 100) is full. When the sender is stuck in retry backoff (e.g., persistent AccessDenied), the queue fills, and Publish blocks indefinitely while holding the lock.

The deadlock chain:

  1. runSrcToDest goroutine calls dest.Publish() → acquires cwDest.Lock()
  2. PublishaddEventqueue.AddEvent → blocks on full eventsCh channel
  3. Lock held indefinitely while AddEvent blocks
  4. Run loop calls FindLogSrcCreateDestgetDestd.Lock() on same cwDestdeadlock
  5. Run loop is single-threaded → ALL log destinations starved

Solution

Change cwDest from sync.Mutex to sync.RWMutex. Publish takes a read lock only briefly to check the stopped flag, then releases it before calling addEvent. This prevents the deadlock since getDest's write lock is not blocked by a long-held read lock. Stop/getDest/NotifySourceStopped still use write locks for safe refCount and shutdown management.

The eventsCh channel is never closed by queue.Stop() (it signals via a separate stopCh), so there is no risk of panic from sending on a closed channel after the read lock is released.

Changes

  • cwDest: sync.Mutexsync.RWMutex
  • Publish(): Lock/UnlockRLock/RUnlock (only around stopped check, released before addEvent)
  • Add TestPublishDoesNotBlockGetDest unit test

Testing

Unit Test

TestPublishDoesNotBlockGetDest reproduces the deadlock:

  1. Creates a cwDest with a full queue (channel capacity 1, pre-filled)
  2. Calls Publish in a goroutine — blocks on AddEvent (previously while holding lock)
  3. Calls getDest on the same destination — must return within 2s
  4. Before fix: DEADLOCK — getDest blocks forever on Lock()
  5. After fix: PASS — getDest acquires write lock immediately since Publish only holds RLock briefly

Manual EC2 Reproduction

Reproduced on c7i.xlarge AL2023 with CWA v1.300064.0 (before fix) and v1.300064.1-11-ge628dc79 (after fix):

Config: 2 files (denied_a.log, denied_b.log) sharing same denied destination + 1 allowed file. IAM deny policy blocks the denied log group. denied_b.log created ~12s after startup to trigger getDest on the already-locked destination.

Metric Before (v1.300064.0) After (fix branch)
Run loop ticker Stopped at 18:37:35Z Still ticking at 19:19:12Z ✅
denied_b piped Never (getDest deadlocked) 19:18:03Z ✅
Allowed group publishing Stopped at 18:37:55Z Published through 19:18:24Z ✅
Observed deadlock duration 3+ minutes, no recovery No deadlock ✅

All Unit Tests

CGO_ENABLED=0 go test -timeout 15m -failfast ./... — all pass

Related

@the-mann the-mann force-pushed the fix/publish-deadlock branch from cafecd7 to e628dc7 Compare February 20, 2026 19:08
When a destination's queue is full (e.g., due to persistent AccessDenied
errors), Publish blocks on AddEvent while holding the cwDest mutex. If
the Run loop then calls getDest for a second file mapping to the same
destination, it deadlocks waiting for the same mutex. Since the Run loop
is single-threaded, ALL log destinations get starved.

Fix: Change cwDest from sync.Mutex to sync.RWMutex. Publish takes a
read lock only briefly to check the stopped flag, then releases it
before calling addEvent. This prevents the deadlock since getDest's
write lock is not blocked by a long-held read lock. Stop/getDest/
NotifySourceStopped still use write locks for safe refCount and
shutdown management.
@the-mann the-mann force-pushed the fix/publish-deadlock branch from e628dc7 to 3de9901 Compare February 20, 2026 19:50
@the-mann the-mann closed this Feb 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant