Skip to content

[SPARK-57711][SQL][TEST] Deflake SQLAppStatusListenerMemoryLeakSuite "no memory leak"#56790

Open
HyukjinKwon wants to merge 1 commit into
apache:masterfrom
HyukjinKwon:ci-fix/agent5-sqlappstatus-leak
Open

[SPARK-57711][SQL][TEST] Deflake SQLAppStatusListenerMemoryLeakSuite "no memory leak"#56790
HyukjinKwon wants to merge 1 commit into
apache:masterfrom
HyukjinKwon:ci-fix/agent5-sqlappstatus-leak

Conversation

@HyukjinKwon

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

Wrap the final noLiveData() assertion in SQLAppStatusListenerMemoryLeakSuite
("no memory leak") in an eventually(...) block so it polls for the asynchronous
cleanup instead of asserting once immediately.

Why are the changes needed?

The test is flaky on the SBT sql - other tests job, failing with
noLiveData() was false:

assert(statusStore.listener.get.noLiveData())

SQLAppStatusListener removes entries from liveExecutions / stageMetrics
only once the per-execution end-event count reaches jobs.size + 1. The final
increment + removal happens in the metrics-aggregation callback triggered by
SparkListenerSQLExecutionEnd (onExecutionEnd -> kvstore.doAsync { ... }).
Asserting noLiveData() right after sc.listenerBus.waitUntilEmpty() is racy:
the listener bus queue can be drained before that cleanup callback has fully
removed the live entries, so the test intermittently observes leftover live data.
Polling with eventually lets the cleanup settle without weakening the
assertion.

Does this PR introduce any user-facing change?

No, test only.

How was this patch tested?

Existing test, run repeatedly. This reproduces under SBT, so the standard fork
"Build" (SBT) exercises it.

CI evidence

Was this patch authored or co-authored using generative AI tooling?

Yes, drafted with assistance from Isaac.

… "no memory leak"

### What changes were proposed in this pull request?
Wrap the final `noLiveData()` assertion in `SQLAppStatusListenerMemoryLeakSuite`
in an `eventually(...)` block.

### Why are the changes needed?
The test fails intermittently on the SBT `sql - other tests` job (e.g. master
runs 28000645341, 28004073554) with `noLiveData() was false`. The cleanup that
removes entries from `liveExecutions`/`stageMetrics` is finalized by the metrics
aggregation triggered on `SparkListenerSQLExecutionEnd`; asserting immediately
after `waitUntilEmpty()` is racy. Polling with `eventually` removes the race
without weakening the assertion.

### Does this PR introduce any user-facing change?
No, test only.

### How was this patch tested?
Existing test, repeated.

Co-authored-by: Isaac
@HyukjinKwon HyukjinKwon changed the title [DO-NOT-MERGE][SQL][TEST] Deflake SQLAppStatusListenerMemoryLeakSuite "no memory leak" [SPARK-57711][SQL][TEST] Deflake SQLAppStatusListenerMemoryLeakSuite "no memory leak" Jun 26, 2026
@HyukjinKwon HyukjinKwon marked this pull request as ready for review June 26, 2026 05:57
assert(statusStore.planGraphCount() <= 50)
// No live data should be left behind after all executions end.
assert(statusStore.listener.get.noLiveData())
// No live data should be left behind after all executions end. The cleanup of live

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The added comment / PR body pin the flake on the kvstore.doAsync cleanup completing after the bus drains, but the test sets ASYNC_TRACKING_ENABLED=false, which makes doAsync run synchronously on the listener thread (sameThreadExecutor); so waitUntilEmpty() should already cover that path. The stated root cause likely doesn't hold under this config; the real residual (e.g. stageMetrics cleanup) isn't identified. The fix is still safe, but please clarify the actual race (or confirm the CI env doesn't apply the config) so this isn't masking an undiagnosed condition. Tighten the inline comment accordingly.

// executions/stage metrics is finalized when the metrics aggregation triggered by the
// SQLExecutionEnd event completes, so wait for the listener to drain rather than asserting
// immediately to avoid a timing race.
eventually(timeout(10.seconds), interval(10.milliseconds)) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: stacking a 10s eventually on top of the existing 10s waitUntilEmpty() doubles worst-case hang to ~20s on a genuinely broken system.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants