[SPARK-57710][YARN][TESTS][FOLLOW-UP] Reduce YarnClusterSuite flakiness from CI runner contention by HyukjinKwon · Pull Request #56785 · apache/spark

HyukjinKwon · 2026-06-25T22:11:45Z

What changes were proposed in this pull request?

Follow-up to SPARK-57650, which fixed the deterministic "AM stuck in ACCEPTED" hang in BaseYarnClusterSuite. Two further test-only changes to reduce the remaining flakiness of YarnClusterSuite:

- Give the single mini NodeManager 8GB (yarn.nodemanager.resource.memory-mb + yarn.scheduler.maximum-allocation-mb) so executor allocation is never starved once the ~1.4GB AM is running.
- Raise the executor→driver connection retry budget for the launched apps (spark.rpc.io.maxRetries=10, spark.rpc.io.retryWait=2s) so a transient RPC-accept stall does not permanently fail an executor. These are defaults that individual tests can still override via extraConf.
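As a rough illustration of the two changes listed above, the settings could be written out as plain key/value pairs. This is a minimal sketch, not the actual BaseYarnClusterSuite code: the object name and the use of plain maps are hypothetical, and 8192 MB is an assumed encoding of the "8GB" mentioned in the description.

```scala
object YarnSuiteDefaultsSketch {
  // Assumption: "8GB" is expressed as 8192 MB; the PR text does not give the literal value.
  val nodeManagerMemoryMb: Int = 8 * 1024

  // YARN-side settings for the single mini NodeManager. The scheduler maximum
  // matches the NodeManager memory, as described in the first item.
  val yarnSettings: Map[String, String] = Map(
    "yarn.nodemanager.resource.memory-mb" -> nodeManagerMemoryMb.toString,
    "yarn.scheduler.maximum-allocation-mb" -> nodeManagerMemoryMb.toString
  )

  // Spark-side defaults for the launched apps, as described in the second item.
  val sparkDefaults: Map[String, String] = Map(
    "spark.rpc.io.maxRetries" -> "10",
    "spark.rpc.io.retryWait" -> "2s"
  )
}
```

In the real suite these values would be applied to the mini cluster's YARN configuration and to the properties handed to the launched application.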

Why are the changes needed?

Even after SPARK-57650, the scheduled Build / Java21 and Build / Java25 master lanes fail in the yarn module in roughly 50% of runs (e.g. fork run 28151220075 PASS vs 28151247521 FAIL on the same commit, 40s apart). All failures are the same six YarnClusterSuite tests timing out after 3 minutes (The code passed to eventually never returned normally ... handle.getState().isFinal() was false).

From the yarn-app-log / unit-tests-log artifacts, the AM/driver comes up, but the executor (and sometimes the AM) intermittently fails to connect back to the driver's RPC server on localhost (java.io.IOException: Failed to connect to localhost/127.0.0.1:<port>, connection refused). The in-JVM mini RM+NM, the driver subprocess and the AM/executor JVMs all contend for CPU on a single CI runner, so the driver's accept loop occasionally stalls; an executor that loses this race exits after the default 3 connection retries, and the application can then never reach a final state.
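The effect of the retry budget on this race can be illustrated with a small simulation. This is a sketch under stated assumptions, not Spark's RPC client: `connectWithRetries` and `stalledDriver` are hypothetical helpers, and the model assumes one initial attempt followed by up to `maxRetries` retries with a fixed wait between them.

```scala
object RetryBudgetSketch {
  // Makes one initial attempt plus up to `maxRetries` retries, sleeping
  // `retryWaitMs` between attempts. Returns the 0-based index of the attempt
  // that connected, or None if the whole budget was used up.
  def connectWithRetries(maxRetries: Int, retryWaitMs: Long)(connect: Int => Boolean): Option[Int] =
    (0 to maxRetries).find { attempt =>
      val connected = connect(attempt)
      if (!connected && attempt < maxRetries) Thread.sleep(retryWaitMs)
      connected
    }

  // Hypothetical driver whose accept loop refuses the first `stalledAttempts` connection attempts.
  def stalledDriver(stalledAttempts: Int): Int => Boolean =
    attempt => attempt >= stalledAttempts

  def main(args: Array[String]): Unit = {
    // A stall spanning 6 attempts exhausts a budget of 3 retries...
    println(connectWithRetries(maxRetries = 3, retryWaitMs = 1)(stalledDriver(6)))  // None
    // ...but is absorbed by a budget of 10 retries.
    println(connectWithRetries(maxRetries = 10, retryWaitMs = 1)(stalledDriver(6))) // Some(6)
  }
}
```

The wait is set to 1 ms here only to keep the simulation fast; the stall length of 6 attempts is an arbitrary illustrative value, not a measurement from the CI logs.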

Does this PR introduce any user-facing change?

No. Test-only.

How was this patch tested?

YarnClusterSuite was previously failing ~50% of the time. With this change the yarn module job was run 6 times on the fork; all 6 passed, with YarnClusterSuite reporting tests=30, failures=0, skipped=0 (the 6 formerly-failing tests now pass):

- Before (master, failing): apache/spark run 28148781009 (Build / Java21), 6 YarnClusterSuite timeouts.
- After (this branch): HyukjinKwon/spark runs 28162182834, 28162247111, 28162249819, 28175759262, 28175762257, 28175765871, yarn job green in all six.

Was this patch authored or co-authored using generative AI tooling?

Yes, Generated-by: Claude Code.

…s from runner contention

SPARK-57650 fixed the deterministic ACCEPTED-state hang in BaseYarnClusterSuite (maximum-am-resource-percent). The master Build/Java21 and Build/Java25 `yarn` lanes still go red ~50% of runs: YarnClusterSuite tests intermittently time out (`handle.getState().isFinal() was false`) because the AM/executor containers fail to connect to the driver's RPC server on localhost (Connection refused).

The in-JVM mini RM+NM, the driver subprocess and the AM/executor JVMs all contend for CPU on a single CI runner, so the driver's accept loop occasionally stalls; an executor that loses the race exits after the default 3 connection retries, and the application can then never finish.

Two test-only mitigations in BaseYarnClusterSuite:
- Give the mini NodeManager 8GB (and matching max-allocation) so executor allocation is never starved once the ~1.4GB AM is running.
- Raise the executor->driver connection retry budget (spark.rpc.io.maxRetries=10, retryWait=2s) so a transient accept stall does not permanently fail the executor. Individual tests can still override.

Co-authored-by: Isaac

uros-b · 2026-06-26T07:15:36Z

+    // the application unable to finish and the suite times out. Give the executor->driver
+    // connection a larger retry budget so a transient stall does not permanently fail the app.
+    // These are defaults; individual tests can still override them via extraConf below.
+    props.setProperty("spark.rpc.io.maxRetries", "10")


Note: the new spark.rpc.io.* defaults are set via setProperty BEFORE the loop that copies spark.* JVM system properties, so a -Dspark.rpc.io.maxRetries flag would silently override them; the comment claims only extraConf can override. Moving the two setProperty calls to just before extraConf.foreach removes the ambiguity.
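The ordering concern in this comment can be reproduced with a small sketch. It is not the suite's code: `ConfPrecedenceSketch.build` is a hypothetical helper, and system properties and extraConf are modelled as plain maps. Because each later `setProperty` call wins, where the defaults sit relative to the `spark.*` copy loop decides whether a `-D` flag can override them.

```scala
import java.util.Properties

object ConfPrecedenceSketch {
  // Builds the app properties in three steps; later writes overwrite earlier ones.
  def build(
      sysProps: Map[String, String],
      extraConf: Map[String, String],
      defaultsBeforeSysProps: Boolean): Properties = {
    val props = new Properties()

    def setDefaults(): Unit = {
      props.setProperty("spark.rpc.io.maxRetries", "10")
      props.setProperty("spark.rpc.io.retryWait", "2s")
    }

    // Placement described in the review comment: defaults first.
    if (defaultsBeforeSysProps) setDefaults()

    // Copy spark.* entries from the (simulated) JVM system properties.
    sysProps
      .filter { case (k, _) => k.startsWith("spark.") }
      .foreach { case (k, v) => props.setProperty(k, v) }

    // Suggested placement: defaults just before extraConf is applied.
    if (!defaultsBeforeSysProps) setDefaults()

    // Per-test overrides are applied last in both variants.
    extraConf.foreach { case (k, v) => props.setProperty(k, v) }
    props
  }
}
```

With `sysProps = Map("spark.rpc.io.maxRetries" -> "1")` and an empty `extraConf`, the first placement yields `1` and the second yields `10`; an `extraConf` entry for the same key wins in both cases.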

HyukjinKwon changed the title ~~[SPARK-57650][YARN][TESTS][FOLLOWUP] Reduce YarnClusterSuite flakiness from CI runner contention~~ [DO-NOT-MERGE][YARN][TESTS][FOLLOWUP] Reduce YarnClusterSuite flakiness from CI runner contention Jun 25, 2026

HyukjinKwon mentioned this pull request Jun 26, 2026

[SPARK-57650][YARN][TESTS][FOLLOW-UP] Reduce YarnClusterSuite flakiness from runner contention #56800

Closed

HyukjinKwon marked this pull request as ready for review June 26, 2026 05:52

HyukjinKwon changed the title ~~[DO-NOT-MERGE][YARN][TESTS][FOLLOWUP] Reduce YarnClusterSuite flakiness from CI runner contention~~ [DO-NOT-MERGE][YARN][TESTS][FOLLOW-UP] Reduce YarnClusterSuite flakiness from CI runner contention Jun 26, 2026

HyukjinKwon marked this pull request as draft June 26, 2026 05:53

HyukjinKwon changed the title ~~[DO-NOT-MERGE][YARN][TESTS][FOLLOW-UP] Reduce YarnClusterSuite flakiness from CI runner contention~~ [SPARK-57710][YARN][TESTS][FOLLOW-UP] Reduce YarnClusterSuite flakiness from CI runner contention Jun 26, 2026

HyukjinKwon marked this pull request as ready for review June 26, 2026 05:57

uros-b reviewed Jun 26, 2026


uros-b approved these changes Jun 26, 2026


HyukjinKwon wants to merge 1 commit into apache:master from HyukjinKwon:ci-fix/agent8-yarn-cluster-flaky

2 participants
