[SPARK-57710][YARN][TESTS][FOLLOW-UP] Reduce YarnClusterSuite flakiness from CI runner contention #56785
Open
HyukjinKwon wants to merge 1 commit into
…s from runner contention

SPARK-57650 fixed the deterministic ACCEPTED-state hang in BaseYarnClusterSuite (maximum-am-resource-percent). The master Build/Java21 and Build/Java25 `yarn` lanes still go red ~50% of runs: YarnClusterSuite tests intermittently time out (`handle.getState().isFinal() was false`) because the AM/executor containers fail to connect to the driver's RPC server on localhost (Connection refused). The in-JVM mini RM+NM, the driver subprocess and the AM/executor JVMs all contend for CPU on a single CI runner, so the driver's accept loop occasionally stalls; an executor that loses the race exits after the default 3 connection retries, and the application can then never finish.

Two test-only mitigations in BaseYarnClusterSuite:

- Give the mini NodeManager 8GB (and matching max-allocation) so executor allocation is never starved once the ~1.4GB AM is running.
- Raise the executor->driver connection retry budget (spark.rpc.io.maxRetries=10, retryWait=2s) so a transient accept stall does not permanently fail the executor. Individual tests can still override.

Co-authored-by: Isaac
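The retry-budget reasoning in the commit message can be illustrated with a small simulation. This is a generic bounded-retry loop, not Spark's RPC client code: the names `RetryBudgetSketch`, `connectWithRetries`, `stalledDriver`, and `Outcome` are hypothetical, the waits are accumulated rather than slept, and treating `maxRetries` as "retries after the first attempt" is an assumption made for the sketch.

```scala
// Illustrative only: a generic bounded-retry loop, not Spark's RPC client code.
object RetryBudgetSketch {
  /** Whether the simulated connection succeeded, how many attempts ran, and total simulated wait. */
  final case class Outcome(connected: Boolean, attempts: Int, waitedMs: Long)

  /**
   * Tries `connect` once, then up to `maxRetries` more times (assumed semantics).
   * `retryWaitMs` is only added to a counter, never slept, so the sketch runs instantly.
   */
  def connectWithRetries(maxRetries: Int, retryWaitMs: Long)(connect: Int => Boolean): Outcome = {
    var attempt = 0
    var waited = 0L
    while (attempt <= maxRetries) {
      if (connect(attempt)) {
        return Outcome(connected = true, attempts = attempt + 1, waitedMs = waited)
      }
      // Wait only if another attempt will follow.
      if (attempt < maxRetries) waited += retryWaitMs
      attempt += 1
    }
    Outcome(connected = false, attempts = attempt, waitedMs = waited)
  }

  /** Hypothetical driver whose accept loop refuses the first `stalledAttempts` attempts. */
  def stalledDriver(stalledAttempts: Int): Int => Boolean =
    attempt => attempt >= stalledAttempts

  def main(args: Array[String]): Unit = {
    val driver = stalledDriver(stalledAttempts = 5)
    // A budget of 3 retries runs out while the driver is still stalled.
    println(connectWithRetries(maxRetries = 3, retryWaitMs = 2000L)(driver))
    // A budget of 10 retries outlasts the same stall.
    println(connectWithRetries(maxRetries = 10, retryWaitMs = 2000L)(driver))
  }
}
```

With a stall covering five attempts, the smaller budget gives up while the larger one connects on the sixth attempt. How long a real stall lasts on a contended runner is not something this sketch models.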
uros-b
reviewed
Jun 26, 2026
// the application unable to finish and the suite times out. Give the executor->driver
// connection a larger retry budget so a transient stall does not permanently fail the app.
// These are defaults; individual tests can still override them via extraConf below.
props.setProperty("spark.rpc.io.maxRetries", "10")
Member
Note: the new spark.rpc.io.* defaults are set via setProperty BEFORE the loop that copies spark.* JVM system properties, so a -Dspark.rpc.io.maxRetries flag would silently override them; the comment claims only extraConf can override. Moving the two setProperty calls to just before extraConf.foreach removes the ambiguity.
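The ordering concern can be sketched with plain `java.util.Properties`, where a later `setProperty` on the same key wins. This is a simplified model of the conf-building logic as the comment describes it, not the suite's actual code: `PropertyPrecedenceSketch`, `defaultsFirst`, and `defaultsBeforeExtraConf` are hypothetical names, and the system properties are passed in as a `Map` so the example stays deterministic.

```scala
import java.util.Properties

// Simplified model of the ordering discussed in the review comment (hypothetical names).
object PropertyPrecedenceSketch {
  private val Key = "spark.rpc.io.maxRetries"

  private def copySparkProps(props: Properties, sysProps: Map[String, String]): Unit =
    sysProps.foreach { case (k, v) => if (k.startsWith("spark.")) props.setProperty(k, v) }

  /** Defaults set first: both a -Dspark.* system property and extraConf can overwrite them. */
  def defaultsFirst(sysProps: Map[String, String], extraConf: Map[String, String]): Properties = {
    val props = new Properties()
    props.setProperty(Key, "10")
    copySparkProps(props, sysProps)
    extraConf.foreach { case (k, v) => props.setProperty(k, v) }
    props
  }

  /** Defaults set just before extraConf: only extraConf can overwrite them. */
  def defaultsBeforeExtraConf(
      sysProps: Map[String, String],
      extraConf: Map[String, String]): Properties = {
    val props = new Properties()
    copySparkProps(props, sysProps)
    props.setProperty(Key, "10")
    extraConf.foreach { case (k, v) => props.setProperty(k, v) }
    props
  }

  def main(args: Array[String]): Unit = {
    // Stands in for running the tests with -Dspark.rpc.io.maxRetries=3.
    val sysProps = Map(Key -> "3")
    println(defaultsFirst(sysProps, Map.empty).getProperty(Key))           // system property wins
    println(defaultsBeforeExtraConf(sysProps, Map.empty).getProperty(Key)) // suite default wins
  }
}
```

In both orderings an `extraConf` entry is applied last and takes precedence; the two variants differ only in whether a JVM system property can also replace the suite default.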
uros-b
approved these changes
Jun 26, 2026
What changes were proposed in this pull request?
Follow-up to SPARK-57650, which fixed the deterministic "AM stuck in ACCEPTED" hang in `BaseYarnClusterSuite`. Two further test-only changes to reduce the remaining flakiness of `YarnClusterSuite`:

- Give the mini `NodeManager` 8GB (`yarn.nodemanager.resource.memory-mb` + `yarn.scheduler.maximum-allocation-mb`) so executor allocation is never starved once the ~1.4GB AM is running.
- Raise the executor->driver connection retry budget (`spark.rpc.io.maxRetries=10`, `spark.rpc.io.retryWait=2s`) so a transient RPC-accept stall does not permanently fail an executor. These are defaults that individual tests can still override via `extraConf`.

Why are the changes needed?
Even after SPARK-57650, the scheduled `Build / Java21` and `Build / Java25` master lanes fail in the `yarn` module in roughly 50% of runs (e.g. fork run `28151220075` PASS vs `28151247521` FAIL: same commit, 40s apart). All failures are the same six `YarnClusterSuite` tests timing out after 3 minutes (`The code passed to eventually never returned normally ... handle.getState().isFinal() was false`).

From the `yarn-app-log/unit-tests-log` artifacts, the AM/driver comes up, but the executor (and sometimes the AM) intermittently fails to connect back to the driver's RPC server on `localhost` (`java.io.IOException: Failed to connect to localhost/127.0.0.1:<port>`, connection refused). The in-JVM mini RM+NM, the driver subprocess and the AM/executor JVMs all contend for CPU on a single CI runner, so the driver's accept loop occasionally stalls; an executor that loses this race exits after the default 3 connection retries, and the application can then never reach a final state.

Does this PR introduce any user-facing change?
No. Test-only.
How was this patch tested?
`YarnClusterSuite` was previously failing ~50% of the time. With this change the `yarn` module job was run 6 times on the fork; all 6 passed, with `YarnClusterSuite` reporting `tests=30, failures=0, skipped=0` (the 6 formerly-failing tests now pass):

- `28148781009` (Build / Java21): 6 `YarnClusterSuite` timeouts.
- `28162182834`, `28162247111`, `28162249819`, `28175759262`, `28175762257`, `28175765871`: `yarn` job green in all six.

Was this patch authored or co-authored using generative AI tooling?
Yes, Generated-by: Claude Code.