fix(cloudwatchlogs): Prevent Publish from deadlocking getDest#2034
Closed
fix(cloudwatchlogs): Prevent Publish from deadlocking getDest#2034
Conversation
cafecd7 to
e628dc7
Compare
When a destination's queue is full (e.g., due to persistent AccessDenied errors), Publish blocks on AddEvent while holding the cwDest mutex. If the Run loop then calls getDest for a second file mapping to the same destination, it deadlocks waiting for the same mutex. Since the Run loop is single-threaded, ALL log destinations get starved. Fix: Change cwDest from sync.Mutex to sync.RWMutex. Publish takes a read lock only briefly to check the stopped flag, then releases it before calling addEvent. This prevents the deadlock since getDest's write lock is not blocked by a long-held read lock. Stop/getDest/ NotifySourceStopped still use write locks for safe refCount and shutdown management.
e628dc7 to
3de9901
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fix deadlock where
Publishblocks on a full queue while holding thecwDestmutex, preventinggetDestfrom acquiring the same mutex and starving all log destinations.Problem
PR #1848 added
Lock()/Unlock()tocwDest.Publish()to prevent sending on a closed channel. However,PublishcallsAddEventwhich blocks when the queue channel (capacity 100) is full. When the sender is stuck in retry backoff (e.g., persistent AccessDenied), the queue fills, andPublishblocks indefinitely while holding the lock.The deadlock chain:
runSrcToDestgoroutine callsdest.Publish()→ acquirescwDest.Lock()Publish→addEvent→queue.AddEvent→ blocks on fulleventsChchannelAddEventblocksFindLogSrc→CreateDest→getDest→d.Lock()on samecwDest→ deadlockSolution
Change
cwDestfromsync.Mutextosync.RWMutex.Publishtakes a read lock only briefly to check thestoppedflag, then releases it before callingaddEvent. This prevents the deadlock sincegetDest's write lock is not blocked by a long-held read lock.Stop/getDest/NotifySourceStoppedstill use write locks for safe refCount and shutdown management.The
eventsChchannel is never closed byqueue.Stop()(it signals via a separatestopCh), so there is no risk of panic from sending on a closed channel after the read lock is released.Changes
cwDest:sync.Mutex→sync.RWMutexPublish():Lock/Unlock→RLock/RUnlock(only around stopped check, released beforeaddEvent)TestPublishDoesNotBlockGetDestunit testTesting
Unit Test
TestPublishDoesNotBlockGetDestreproduces the deadlock:cwDestwith a full queue (channel capacity 1, pre-filled)Publishin a goroutine — blocks onAddEvent(previously while holding lock)getDeston the same destination — must return within 2sgetDestblocks forever onLock()getDestacquires write lock immediately sincePublishonly holds RLock brieflyManual EC2 Reproduction
Reproduced on c7i.xlarge AL2023 with CWA v1.300064.0 (before fix) and v1.300064.1-11-ge628dc79 (after fix):
Config: 2 files (
denied_a.log,denied_b.log) sharing same denied destination + 1 allowed file. IAM deny policy blocks the denied log group.denied_b.logcreated ~12s after startup to triggergetDeston the already-locked destination.denied_bpipedAll Unit Tests
CGO_ENABLED=0 go test -timeout 15m -failfast ./...— all passRelated