[SPARK-57711][SQL][TEST] Deflake SQLAppStatusListenerMemoryLeakSuite "no memory leak"#56790
[SPARK-57711][SQL][TEST] Deflake SQLAppStatusListenerMemoryLeakSuite "no memory leak"#56790HyukjinKwon wants to merge 1 commit into
Conversation
… "no memory leak" ### What changes were proposed in this pull request? Wrap the final `noLiveData()` assertion in `SQLAppStatusListenerMemoryLeakSuite` in an `eventually(...)` block. ### Why are the changes needed? The test fails intermittently on the SBT `sql - other tests` job (e.g. master runs 28000645341, 28004073554) with `noLiveData() was false`. The cleanup that removes entries from `liveExecutions`/`stageMetrics` is finalized by the metrics aggregation triggered on `SparkListenerSQLExecutionEnd`; asserting immediately after `waitUntilEmpty()` is racy. Polling with `eventually` removes the race without weakening the assertion. ### Does this PR introduce any user-facing change? No, test only. ### How was this patch tested? Existing test, repeated. Co-authored-by: Isaac
| assert(statusStore.planGraphCount() <= 50) | ||
| // No live data should be left behind after all executions end. | ||
| assert(statusStore.listener.get.noLiveData()) | ||
| // No live data should be left behind after all executions end. The cleanup of live |
There was a problem hiding this comment.
The added comment / PR body pin the flake on the kvstore.doAsync cleanup completing after the bus drains, but the test sets ASYNC_TRACKING_ENABLED=false, which makes doAsync run synchronously on the listener thread (sameThreadExecutor); so waitUntilEmpty() should already cover that path. The stated root cause likely doesn't hold under this config; the real residual (e.g. stageMetrics cleanup) isn't identified. The fix is still safe, but please clarify the actual race (or confirm the CI env doesn't apply the config) so this isn't masking an undiagnosed condition. Tighten the inline comment accordingly.
| // executions/stage metrics is finalized when the metrics aggregation triggered by the | ||
| // SQLExecutionEnd event completes, so wait for the listener to drain rather than asserting | ||
| // immediately to avoid a timing race. | ||
| eventually(timeout(10.seconds), interval(10.milliseconds)) { |
There was a problem hiding this comment.
Note: stacking a 10s eventually on top of the existing 10s waitUntilEmpty() doubles worst-case hang to ~20s on a genuinely broken system.
What changes were proposed in this pull request?
Wrap the final
noLiveData()assertion inSQLAppStatusListenerMemoryLeakSuite("no memory leak") in an
eventually(...)block so it polls for the asynchronouscleanup instead of asserting once immediately.
Why are the changes needed?
The test is flaky on the SBT
sql - other testsjob, failing withnoLiveData() was false:SQLAppStatusListenerremoves entries fromliveExecutions/stageMetricsonly once the per-execution end-event count reaches
jobs.size + 1. The finalincrement + removal happens in the metrics-aggregation callback triggered by
SparkListenerSQLExecutionEnd(onExecutionEnd->kvstore.doAsync { ... }).Asserting
noLiveData()right aftersc.listenerBus.waitUntilEmpty()is racy:the listener bus queue can be drained before that cleanup callback has fully
removed the live entries, so the test intermittently observes leftover live data.
Polling with
eventuallylets the cleanup settle without weakening theassertion.
Does this PR introduce any user-facing change?
No, test only.
How was this patch tested?
Existing test, run repeatedly. This reproduces under SBT, so the standard fork
"Build" (SBT) exercises it.
CI evidence
sql - other tests—noLiveData() was false)Was this patch authored or co-authored using generative AI tooling?
Yes, drafted with assistance from Isaac.