Skip to content

fix shutdown ordering race condition, prevent use-after-free crash#3004

Open
prasanna-amd wants to merge 5 commits intodevelopfrom
users/prasanna-amd/fix-shutdown-ordering-race
Open

fix shutdown ordering race condition, prevent use-after-free crash#3004
prasanna-amd wants to merge 5 commits intodevelopfrom
users/prasanna-amd/fix-shutdown-ordering-race

Conversation

@prasanna-amd
Copy link
Contributor

Motivation

Fix segmentation fault in rccl tests related to use-after-free race condition

Technical Details

During process exit, HIP runtime's static destructor (amd::RuntimeTearDown) begins
destroying GPU device objects while RCCL's proxy threads are still running. When the
proxy threads call hipFree/hipHostFree to clean up connections, they access
already-destroyed device objects (dev->svmFree(ptr)), causing a use-after-free
crash.

Fix: Register an atexit handler that sets a shutdown flag. Since atexit handlers run
before static destructors, RCCL's free functions can check this flag and skip the
free during shutdown—the OS reclaims all memory at process exit anyway.

JIRA ID

ROCM-1896

Test Plan

tested with all_gather_perf on a 4N system where the segmentation fault can be recreated with one of the nodes having insufficient file descriptors, causing an exit, the main thread goes to exit processing, while the proxy threads are still running, causing the initial segmentation fault. After the fix, the test exits with proper error (when NCCL_DEBUG is enabled), instead of segfault.

@prasanna-amd prasanna-amd force-pushed the users/prasanna-amd/fix-shutdown-ordering-race branch from d55e667 to 2e0b811 Compare February 3, 2026 16:07
Copy link
Contributor

@corey-derochie-amd corey-derochie-amd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with minor suggestions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants