Fix symmetric-run teardown errors on Slurm when processes exit as zombies #62591

Open
ssam18 wants to merge 3 commits into ray-project:master from ssam18:fix/symmetric-run-teardown-errors


@ssam18 ssam18 commented Apr 14, 2026

When running ray symmetric-run on Slurm, Ray processes exit as zombies during teardown because ray stop is not their parent process, so psutil.wait_procs() cannot reap them via os.waitpid(). This causes a false "Stopped 0 out of N" warning even though the job completed cleanly. Separately, worker nodes always hit a CalledProcessError on normal shutdown because ray start --block exits non-zero when GCS goes down, which is expected behavior and not an error. This PR fixes both issues: zombies are now counted as already stopped before the signal loop, and check=True is removed from the worker's blocking ray start call. Fixes #62390.
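For readers following the thread, here is a minimal sketch of the "count zombies as pre-stopped" idea. It is not the code in this PR. The function name split_zombies is hypothetical, and the process handles are assumed to expose a status() method the way psutil.Process does. The string "zombie" is the value of psutil.STATUS_ZOMBIE; the sketch compares against the string directly so that it runs without psutil installed.

```python
from typing import Iterable, List, Tuple

# psutil.STATUS_ZOMBIE has this string value; hard-coded here so the
# sketch does not need psutil to run.
ZOMBIE = "zombie"


def split_zombies(procs: Iterable) -> Tuple[List, List]:
    """Partition process handles into (already_exited, still_alive).

    Hypothetical helper: each handle only needs a status() method,
    as psutil.Process provides. A zombie has already exited and is
    only waiting for its parent to reap it, so it is counted as
    stopped rather than signalled again.
    """
    already_exited: List = []
    still_alive: List = []
    for proc in procs:
        if proc.status() == ZOMBIE:
            already_exited.append(proc)
        else:
            still_alive.append(proc)
    return already_exited, still_alive
```

Only the still_alive list would then go through the signal-and-wait loop, and the final "Stopped X out of N" tally would add len(already_exited) to whatever that loop reports. Real code would also need to handle handles that disappear between listing and status() (psutil raises psutil.NoSuchProcess in that case), which this sketch omits.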

@ssam18 ssam18 requested a review from a team as a code owner April 14, 2026 02:16
@ssam18 ssam18 force-pushed the fix/symmetric-run-teardown-errors branch from 0887b9c to 2154977 Compare April 14, 2026 02:17

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the process termination logic to include zombie processes and modifies worker node startup by removing strict error checking. Review feedback highlights that the zombie process counting is over-inclusive and should be filtered to Ray-specific processes. Additionally, removing the error check in the worker startup command may silently ignore failures, suggesting a need for explicit return code verification or better documentation.
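The "explicit return code verification" suggestion could look roughly like the sketch below. This is an illustration under assumptions, not the PR's code: run_blocking_worker is a hypothetical name, and the command passed in stands for whatever blocking command the worker runs. The only library behavior relied on is that subprocess.run without check=True returns a CompletedProcess instead of raising CalledProcessError.

```python
import subprocess
import sys
from typing import List


def run_blocking_worker(cmd: List[str]) -> int:
    """Run a blocking command and report, rather than raise on, failure.

    Hypothetical helper. Without check=True, subprocess.run does not
    raise CalledProcessError on a non-zero exit; the caller inspects
    returncode and decides whether the exit was expected.
    """
    result = subprocess.run(cmd)
    if result.returncode != 0:
        # Surface the exit code so genuine failures are not silent.
        print(
            f"worker command exited with code {result.returncode}",
            file=sys.stderr,
        )
    return result.returncode
```

A caller can try it with a stand-in command such as run_blocking_worker([sys.executable, "-c", "import sys; sys.exit(1)"]), which returns 1 without raising. Which non-zero codes count as a normal shutdown is left to the caller, since the thread does not specify them.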


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


Reviewed by Cursor Bugbot for commit 2154977.

@ssam18 ssam18 force-pushed the fix/symmetric-run-teardown-errors branch from 2154977 to 7207ffd Compare April 14, 2026 02:22
@ray-gardener ray-gardener bot added the core (Issues that should be addressed in Ray Core) and community-contribution (Contributed by the community) labels Apr 14, 2026

Labels

community-contribution: Contributed by the community
core: Issues that should be addressed in Ray Core

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Core] Symmetric run raise error when terminating

1 participant