
replay: fix race between FlushEnd and refreshMetrics#5822

Open
xxmplus wants to merge 1 commit into cockroachdb:master from xxmplus:test-flush-wait-fail-5820

Conversation

@xxmplus
Contributor

@xxmplus xxmplus commented Feb 27, 2026

527ea2b added a FlushEnd handler that closes compactionMu.ch to wake refreshMetrics when a flush completes. Unlike CompactionEnd, the handler does not increment the completed counter because that counter only tracks compactions. This creates a gap: nextCompactionCompletes has no way to detect that a flush notification was already delivered, so it can create a new channel that nobody will ever close.

The race occurs in refreshMetrics between r.d.Metrics() and nextCompactionCompletes(). If a flush completes in this window:

  1. r.d.Metrics() acquires d.mu, observes flushing=true (Flush.NumInProgress=1), releases d.mu.
  2. The flush goroutine acquires d.mu, completes, fires FlushEnd (under d.mu). The replay handler closes compactionMu.ch and nils it.
  3. nextCompactionCompletes sees ch==nil, creates a new channel. With no counter increment to detect the flush, it returns alreadyOccurred=false.
  4. compactionsAppearQuiesced uses the stale metrics from step 1 (NumInProgress=1) and returns false.
  5. The loop re-enters the first select with a channel nobody will close and stepsApplied==nil (blocks forever). Permanent hang.

Fix this with three changes:

  1. Track both flushes and compactions in the started/completed counters. Rename compactionMu to compactionOrFlushMu and add a FlushBegin handler that increments started. The FlushEnd handler now also increments completed, matching CompactionEnd. This allows nextCompactionOrFlushCompletes to detect flush completions through the counter, eliminating the race.

  2. Switch compactionsAppearQuiesced to use only the started/completed counter (started == completed) instead of checking DB.Metrics().NumInProgress. There is a scheduling window between AddInProgressLocked (which increments NumInProgress under d.mu) and CompactionBegin (which fires in a separate goroutine that must re-acquire d.mu). During this window NumInProgress > 0 but started == completed. Using NumInProgress would block quiescence detection during cascading compactions. The counter does not have this window, and the 1-second quiescence confirmation handles any false positives from compactions that are scheduled but have not yet fired CompactionBegin.

  3. Fix a pre-existing tight loop in refreshMetrics: when nextCompactionOrFlushCompletes detects a completion via the counter (alreadyOccurred=true), the old code skipped the first select and immediately re-acquired d.mu to collect metrics, even though the quiescence check would be skipped anyway. Under heavy compaction load this tight loop contends with event handlers for d.mu. Fix this by catching up the counter without collecting metrics when alreadyOccurred is true, then falling through to collect metrics once caught up.

Fixes #5820.

@xxmplus xxmplus requested a review from a team as a code owner February 27, 2026 17:56
@xxmplus xxmplus requested a review from sumeerbhola February 27, 2026 17:56

Collaborator

@sumeerbhola sumeerbhola left a comment


I am looking at Runner.nextCompactionCompletes and related replay logic for the first time, and this is quite confusing. At minimum it needs some comments describing state transitions (AI is good at that).
Also, I suspect the fix here is not to add timeouts but instead to step back and examine the assumptions that motivated this code. Presumably we wanted to calculate metrics periodically when something interesting happened. Doing it on compactions ending was one such trigger. Since the quiescing logic also looks at flushes, what if we recompute metrics on both compactions and flushes completing, and redo the code to track both {flush,compaction}x{begin,end}? So the data structure would be:

	compactionOrFlushMu struct {
		sync.Mutex
		ch        chan struct{}
		started   int64
		completed int64
	}

and we will also have an event handler for FlushBegin.

And I don't understand the existing logic:

func (r *Runner) compactionsAppearQuiesced(m *pebble.Metrics) bool {
	r.compactionMu.Lock()
	defer r.compactionMu.Unlock()
	if m.Flush.NumInProgress > 0 {
		return false
	} else if m.Compact.NumInProgress > 0 && r.compactionMu.started != r.compactionMu.completed {
		return false
	}
	return true
}

I would have expected a disjunction in the second condition.

In that case, with what I am suggesting:

func (r *Runner) compactionsAppearQuiesced(m *pebble.Metrics) bool {
	r.compactionOrFlushMu.Lock()
	defer r.compactionOrFlushMu.Unlock()
	if m.Flush.NumInProgress > 0 {
		return false
	} else if m.Compact.NumInProgress > 0 {
		return false
	} else if r.compactionOrFlushMu.started != r.compactionOrFlushMu.completed {
		return false
	}
	return true
}

@sumeerbhola made 1 comment.
Reviewable status: 0 of 1 files reviewed, all discussions resolved.

Contributor Author

@xxmplus xxmplus left a comment


Thanks Sumeer. I asked Claude to summarize the changes below.

You're right that tracking both flushes and compactions in the counter is a much cleaner fix. It eliminates the race at the source rather than working around it with a timeout. I've updated the PR to rename compactionMu to compactionOrFlushMu, add a FlushBegin handler that increments started, and update FlushEnd to also increment completed, matching CompactionEnd. The timeout in the first select is removed since it's no longer needed.

I've also added state transition comments to the compactionOrFlushMu struct definition documenting which events update which fields and how the channel lifecycle works:

// State transitions:
//   FlushBegin / CompactionBegin: started++
//   FlushEnd   / CompactionEnd:   completed++; close(ch); ch = nil
//
// The channel ch is created on-demand by nextCompactionOrFlushCompletes
// and closed by the End handlers to wake any goroutine waiting for
// activity to finish. The started/completed counters allow
// nextCompactionOrFlushCompletes to detect events that occurred between
// the caller's last observation and the current call, without relying
// solely on the channel.

Re the conjunction in compactionsAppearQuiesced: the original "&&" was intentional, to handle stale metrics: when metrics still show NumInProgress > 0 but the counter already reflects the completion, we can skip an extra iteration. However, the conjunction has a subtle gap in the other direction: if a flush starts between r.d.Metrics() and the quiescence check, the metrics are stale in the low direction (NumInProgress = 0) while the counter correctly shows started > completed. The conjunction doesn't catch this because neither Flush.NumInProgress > 0 nor Compact.NumInProgress > 0 triggers, so it falls through to return true. The disjunction catches it via the independent started != completed check. In practice the 1-second quiescence wait recovers either way, but the disjunction is more defensively correct. I've adopted your suggestion.


@xxmplus xxmplus force-pushed the test-flush-wait-fail-5820 branch 3 times, most recently from d56af99 to e600ab3, on March 4, 2026 22:19
@xxmplus xxmplus marked this pull request as draft March 4, 2026 22:31
@xxmplus xxmplus force-pushed the test-flush-wait-fail-5820 branch 2 times, most recently from cbf9e08 to 8b142d6, on March 5, 2026 01:41
@xxmplus xxmplus force-pushed the test-flush-wait-fail-5820 branch from 8b142d6 to 20d2c08, on March 5, 2026 01:57
Contributor Author

@xxmplus xxmplus left a comment


Scratch the "Re the conjunction in ..." paragraph above; I ran into more failures after that fix. Here is the new fix and the updated comment.

For compactionsAppearQuiesced, I went a step further and switched to a counter-only check (started == completed) rather than a disjunction. While investigating the CI failures I discovered a scheduling window in Pebble between AddInProgressLocked (which increments NumInProgress under d.mu) and CompactionBegin (which fires in a separate goroutine after re-acquiring d.mu). During this window NumInProgress > 0 but started == completed. Any check that uses NumInProgress (whether conjunction or disjunction) interacts poorly with cascading compactions under the race detector, because each completion schedules the next compaction through this window. The counter alone does not have this problem, and the existing 1-second quiescence confirmation handles any false positives from compactions that are scheduled but have not yet fired CompactionBegin.

I also fixed a pre-existing tight loop: when nextCompactionOrFlushCompletes detected a completion via the counter (alreadyOccurred=true), the old code skipped the first select and immediately re-acquired d.mu to collect metrics, even though the quiescence check would be skipped anyway. Under heavy compaction load this contends with event handlers for d.mu. The fix catches up the counter without collecting metrics, then falls through to collect metrics once caught up.


@xxmplus xxmplus marked this pull request as ready for review March 5, 2026 02:15


Development

Successfully merging this pull request may close these issues.

master: TestFlushEndNotifiesRefreshMetrics failed (cond never satisfied)
