replay: fix race between FlushEnd and refreshMetrics #5822
xxmplus wants to merge 1 commit into cockroachdb:master
Conversation
sumeerbhola
left a comment
I am looking at Runner.nextCompactionCompletes and related replay logic for the first time, and this is quite confusing. At minimum it needs some comments describing state transitions (AI is good at that).
Also, I suspect the fix here is not to add timeouts, but instead to step back and examine the assumptions that motivated this code. Presumably we wanted to recalculate metrics periodically when something interesting happened, and compactions ending was one such trigger. Since the quiescing logic also looks at flushes, what if we recompute metrics on both compactions and flushes completing, and redo the code to track both {flush,compaction}x{begin,end}. So the data-structure would be:
compactionOrFlushMu struct {
	sync.Mutex
	ch        chan struct{}
	started   int64
	completed int64
}
and we will also have an event handler for FlushBegin.
And I don't understand the existing logic:
func (r *Runner) compactionsAppearQuiesced(m *pebble.Metrics) bool {
	r.compactionMu.Lock()
	defer r.compactionMu.Unlock()
	if m.Flush.NumInProgress > 0 {
		return false
	} else if m.Compact.NumInProgress > 0 && r.compactionMu.started != r.compactionMu.completed {
		return false
	}
	return true
}
I would have expected a disjunction in the second condition.
In that case, with what I am suggesting:
func (r *Runner) compactionsAppearQuiesced(m *pebble.Metrics) bool {
	r.compactionOrFlushMu.Lock()
	defer r.compactionOrFlushMu.Unlock()
	if m.Flush.NumInProgress > 0 {
		return false
	} else if m.Compact.NumInProgress > 0 {
		return false
	} else if r.compactionOrFlushMu.started != r.compactionOrFlushMu.completed {
		return false
	}
	return true
}
xxmplus
left a comment
Thanks Sumeer. I asked Claude to summarize the changes below.
You're right that tracking both flushes and compactions in the counter is a much cleaner fix. It eliminates the race at the source rather than working around it with a timeout. I've updated the PR to rename compactionMu to compactionOrFlushMu, add a FlushBegin handler that increments started, and update FlushEnd to also increment completed, matching CompactionEnd. The timeout in the first select is removed since it's no longer needed.
I've also added state transition comments to the compactionOrFlushMu struct definition documenting which events update which fields and how the channel lifecycle works:
// State transitions:
// FlushBegin / CompactionBegin: started++
// FlushEnd / CompactionEnd: completed++; close(ch); ch = nil
//
// The channel ch is created on-demand by nextCompactionOrFlushCompletes
// and closed by the End handlers to wake any goroutine waiting for
// activity to finish. The started/completed counters allow
// nextCompactionOrFlushCompletes to detect events that occurred between
// the caller's last observation and the current call, without relying
// solely on the channel.
Re the conjunction in compactionsAppearQuiesced: the original "&&" was intentional to handle stale metrics (when metrics still show NumInProgress > 0 but the counter already reflects the completion, we can skip an extra iteration). However, the conjunction has a subtle gap in the other direction: if a flush starts between r.d.Metrics() and the quiescence check, metrics are stale low (NumInProgress = 0) but the counter correctly shows started > completed. The conjunction doesn't catch this because neither Flush.NumInProgress > 0 nor Compact.NumInProgress > 0 triggers, so it falls through to return true. The disjunction catches it via the independent started != completed check. In practice the 1 second quiescence wait recovers either way, but the disjunction is more defensively correct. I've adopted your suggestion.
xxmplus force-pushed from d56af99 to e600ab3
xxmplus force-pushed from cbf9e08 to 8b142d6
527ea2b added a FlushEnd handler that closes compactionMu.ch to wake refreshMetrics when a flush completes. Unlike CompactionEnd, the handler does not increment the completed counter because that counter only tracks compactions. This creates a gap: nextCompactionCompletes has no way to detect that a flush notification was already delivered, so it can create a new channel that nobody will ever close.

The race occurs in refreshMetrics between r.d.Metrics() and nextCompactionCompletes(). If a flush completes in this window:
1. r.d.Metrics() acquires d.mu, observes flushing=true (Flush.NumInProgress=1), releases d.mu.
2. The flush goroutine acquires d.mu, completes, fires FlushEnd (under d.mu). The replay handler closes compactionMu.ch and nils it.
3. nextCompactionCompletes sees ch==nil, creates a new channel. With no counter increment to detect the flush, it returns alreadyOccurred=false.
4. compactionsAppearQuiesced uses the stale metrics from step 1 (NumInProgress=1) and returns false.
5. The loop re-enters the first select with a channel nobody will close and stepsApplied==nil (blocks forever). Permanent hang.

Fix this with three changes:
1. Track both flushes and compactions in the started/completed counters. Rename compactionMu to compactionOrFlushMu and add a FlushBegin handler that increments started. The FlushEnd handler now also increments completed, matching CompactionEnd. This allows nextCompactionOrFlushCompletes to detect flush completions through the counter, eliminating the race.
2. Switch compactionsAppearQuiesced to use only the started/completed counter (started == completed) instead of checking DB.Metrics().NumInProgress. There is a scheduling window between AddInProgressLocked (which increments NumInProgress under d.mu) and CompactionBegin (which fires in a separate goroutine that must re-acquire d.mu). During this window NumInProgress > 0 but started == completed. Using NumInProgress would block quiescence detection during cascading compactions. The counter does not have this window, and the 1-second quiescence confirmation handles any false positives from compactions that are scheduled but have not yet fired CompactionBegin.
3. Fix a pre-existing tight loop in refreshMetrics: when nextCompactionOrFlushCompletes detects a completion via the counter (alreadyOccurred=true), the old code skipped the first select and immediately re-acquired d.mu to collect metrics, even though the quiescence check would be skipped anyway. Under heavy compaction load this tight loop contends with event handlers for d.mu. Fix this by catching up the counter without collecting metrics when alreadyOccurred is true, then falling through to collect metrics once caught up.

Fixes cockroachdb#5820.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
xxmplus force-pushed from 8b142d6 to 20d2c08
xxmplus
left a comment
Scratch the "Re the conjunction in ..." paragraph above; I ran into more failures after that fix. Here is the new fix and updated comment.
For compactionsAppearQuiesced, I went a step further and switched to a counter-only check (started == completed) rather than a disjunction. While investigating the CI failures I discovered a scheduling window in Pebble between AddInProgressLocked (which increments NumInProgress under d.mu) and CompactionBegin (which fires in a separate goroutine after re-acquiring d.mu). During this window NumInProgress > 0 but started == completed. Any check that uses NumInProgress (whether conjunction or disjunction) interacts poorly with cascading compactions under the race detector, because each completion schedules the next compaction through this window. The counter alone does not have this problem, and the existing 1-second quiescence confirmation handles any false positives from compactions that are scheduled but have not yet fired CompactionBegin.
I also fixed a pre-existing tight loop: when nextCompactionOrFlushCompletes detected a completion via the counter (alreadyOccurred=true), the old code skipped the first select and immediately re-acquired d.mu to collect metrics, even though the quiescence check would be skipped anyway. Under heavy compaction load this contends with event handlers for d.mu. The fix catches up the counter without collecting metrics, then falls through to collect metrics once caught up.