Fix symmetric-run teardown errors on Slurm when processes exit as zombies #62591
Open
ssam18 wants to merge 3 commits into ray-project:master
0887b9c to 2154977
Code Review
This pull request updates the process termination logic to include zombie processes and modifies worker node startup by removing strict error checking. Review feedback highlights that the zombie process counting is over-inclusive and should be filtered to Ray-specific processes. Additionally, removing the error check in the worker startup command may silently ignore failures, suggesting a need for explicit return code verification or better documentation.
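The explicit return-code check suggested in the review could look roughly like the sketch below. This is not the code in the PR: `run_blocking_worker` is a hypothetical helper, and the command used in the demo is a stand-in child process rather than the real `ray start --block` invocation. The idea is to call `subprocess.run` without `check=True`, so a non-zero exit no longer raises `CalledProcessError`, while still logging the exit code so that failures are not silently ignored.

```python
import logging
import subprocess
import sys
from typing import Sequence

logger = logging.getLogger(__name__)


def run_blocking_worker(cmd: Sequence[str]) -> int:
    """Run a blocking command and report, rather than raise on, a non-zero exit.

    Hypothetical helper for illustration only; not part of Ray.
    """
    # No check=True here: a non-zero exit does not raise CalledProcessError.
    result = subprocess.run(list(cmd))
    if result.returncode != 0:
        # Surface the exit code as a warning so it stays visible in logs.
        logger.warning("Command %s exited with code %d", cmd[0], result.returncode)
    return result.returncode


if __name__ == "__main__":
    # Stand-in child that exits non-zero, as a blocked worker might on shutdown.
    code = run_blocking_worker([sys.executable, "-c", "import sys; sys.exit(3)"])
    print(code)
```

Returning the code instead of raising leaves the policy to the caller, which can decide whether a given non-zero exit is an expected shutdown or a startup failure.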
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 2154977.
… on worker block Signed-off-by: Samaresh Kumar Singh <[email protected]>
…de as warning Signed-off-by: Samaresh Kumar Singh <[email protected]>
2154977 to 7207ffd

When running `ray symmetric-run` on Slurm, Ray processes exit as zombies during teardown because `ray stop` is not their parent process, so `psutil.wait_procs()` cannot reap them via `os.waitpid()`. This causes a false "Stopped 0 out of N" warning even though the job completed cleanly. Separately, worker nodes always hit a `CalledProcessError` on normal shutdown because `ray start --block` exits non-zero when GCS goes down, which is expected behavior and not an error. Both issues are fixed: zombies are now counted as pre-stopped before the signal loop, and `check=True` is removed from the worker's blocking `ray start` call. Fixes #62390.
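A minimal sketch of the "count zombies as pre-stopped" idea follows. It is an illustration under stated assumptions, not the implementation in this PR: `split_zombies` and `FakeProc` are hypothetical names, and `FakeProc` stands in for `psutil.Process` so the sketch runs without psutil installed. The only psutil detail relied on is that `Process.status()` returns the string `"zombie"` (`psutil.STATUS_ZOMBIE`) for a zombie process.

```python
from typing import Iterable, List, Tuple

# Same string value as psutil.STATUS_ZOMBIE.
STATUS_ZOMBIE = "zombie"


def split_zombies(procs: Iterable) -> Tuple[List, List]:
    """Partition process-like objects into (zombies, still_running).

    Each object only needs a .status() method, as psutil.Process has.
    Hypothetical helper for illustration only; not part of Ray.
    """
    zombies, running = [], []
    for proc in procs:
        if proc.status() == STATUS_ZOMBIE:
            # Already exited; only its parent can reap it, so do not signal it.
            zombies.append(proc)
        else:
            running.append(proc)
    return zombies, running


class FakeProc:
    """Stand-in for psutil.Process so the sketch runs without psutil."""

    def __init__(self, pid: int, status: str) -> None:
        self.pid = pid
        self._status = status

    def status(self) -> str:
        return self._status


if __name__ == "__main__":
    procs = [FakeProc(1, "zombie"), FakeProc(2, "sleeping"), FakeProc(3, "zombie")]
    zombies, running = split_zombies(procs)
    stopped = len(zombies)  # zombies count as already stopped
    print(f"Pre-stopped {stopped} of {len(procs)}; {len(running)} left to signal")
```

Only the `running` list would then go through the signal-and-wait loop. As the review notes, the input list should already be restricted to Ray-specific processes so that unrelated zombies are not counted.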