Disable task sync in CUDA#65

Merged
utkinis merged 41 commits intomainfrom
lr/custream-sync
Apr 7, 2025

Conversation


@luraess luraess commented Feb 17, 2025

This PR implements the changes needed to support opting out of syncing on task switching for the CUDA backend (JuliaGPU/CUDA.jl#2662).

It thus avoids serialisation of async operations and brings back async execution:
[Screenshot 2025-02-17 at 23:47:36]

I am unsure whether this is the best way of addressing things, though...

Requires the CUDA.jl branch vc/unsafe_stream_switching for testing.


luraess commented Feb 20, 2025

According to our latest discussion, we may rework the communication / computation overlap in Chmy. By serialising all the compute (Outer -> BCs -> Inner) we could then spawn a new task only for doing IO, i.e., MPI communication preceded by copying from the array view into buffers, and followed by copying back into the array upon completion.

As I understand, this would only require opting out from implicit sync using e.g. unsafe_disable_task_sync! on the buffer arrays without needing further engineering.

It would still be nice to update CUDA.jl's proposed unsafe_disable_task_sync!(arr) to return the array instead of the status, so that one could e.g. use the following construct:

```julia
a = CUDA.rand(2, 1) |> CUDA.unsafe_disable_task_sync!
```
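To illustrate where that construct would be used, here is a rough sketch of the overlap strategy described above. The buffer names and the halo-exchange details are hypothetical; it assumes the unsafe_disable_task_sync! API proposed in JuliaGPU/CUDA.jl#2662 (returning the array, as suggested) and the keyword form of MPI.jl's Sendrecv!:

```julia
using CUDA, MPI

# Buffers used only for halo exchange; opt them out of CUDA.jl's
# implicit task synchronisation (API proposed in JuliaGPU/CUDA.jl#2662).
send_buf = CUDA.zeros(Float64, n) |> CUDA.unsafe_disable_task_sync!
recv_buf = CUDA.zeros(Float64, n) |> CUDA.unsafe_disable_task_sync!

# All compute stays serialised on the main task: Outer -> BCs -> Inner
# (kernel launches elided).

# Spawn a task only for IO: pack view -> buffer, MPI comm, unpack back.
comm_task = Threads.@spawn begin
    copyto!(send_buf, view(A, :, 1))   # copy from the array view into the buffer
    MPI.Sendrecv!(send_buf, recv_buf, comm; dest = nb, source = nb)
    copyto!(view(A, :, 1), recv_buf)   # copy back into the array on completion
end
wait(comm_task)
```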


luraess commented Mar 24, 2025

@utkinis what if we use the above strategy to circumvent the implicit sync issue with CUDA.jl? This would make the launch_with_bc function "unsafe" wrt CUDA.jl, but would confine the unsafeness to this function and not interfere with other parts of the code. Testing on ALPS, the approach seems to work, giving good performance and overlap:

[Screenshot 2025-03-24 at 23:12:36]

Happy to get your feedback and, if you are positive about going this way, potential hints on improving my prototype attempt.
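A minimal sketch of how the unsafeness could be confined to launch_with_bc (hypothetical structure; unsafe_disable_task_sync! is the API proposed in JuliaGPU/CUDA.jl#2662, its re-enabling counterpart and the arrays helper are assumed names):

```julia
# Confine the task-sync opt-out to launch_with_bc: disable on entry,
# restore on exit, so the rest of the code sees default CUDA.jl behaviour.
function launch_with_bc(args...)
    for a in arrays(args)               # hypothetical: collect reachable GPU arrays
        CUDA.unsafe_disable_task_sync!(a)
    end
    try
        # Outer kernels -> boundary conditions -> Inner kernels,
        # with MPI communication overlapped on a separate task.
    finally
        for a in arrays(args)
            CUDA.unsafe_enable_task_sync!(a)  # assumed counterpart, restores sync
        end
    end
end
```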


utkinis commented Mar 24, 2025

Thanks @luraess for digging further into this. Yes, checking the arguments at runtime could be a good way to solve the issue. One needs to recursively check all Julia composite types that could be passed to kernels, namely structs and tuples (not sure about arrays and refs; I guess only bits types can be passed to GPU kernels, but I might be wrong).

Also, the implicit sync happens at the array level, so the recursion must descend to the level of arrays and not just Fields, as the user might pass a regular GPU array. I left specific comments in the PR. Otherwise, I'm positive about making it work this way.
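As a sketch, such a recursive runtime check could look like the following (hypothetical helper name; it walks structs and tuples down to the array leaves, treating bits types as terminals since they cannot carry array references):

```julia
# Recursively visit every array reachable from a kernel argument,
# descending into structs and tuples (bits types carry no arrays).
foreach_array(f, x::AbstractArray) = f(x)
foreach_array(f, x::Tuple) = foreach(el -> foreach_array(f, el), x)
function foreach_array(f, x::T) where {T}
    isbitstype(T) && return          # plain bits values: nothing to visit
    isstructtype(T) || return        # skip anything that is not a struct
    for i in 1:fieldcount(T)
        foreach_array(f, getfield(x, i))
    end
end

# Usage sketch: disable implicit task sync on every array reachable from
# the kernel arguments (API proposed in JuliaGPU/CUDA.jl#2662):
# foreach_array(CUDA.unsafe_disable_task_sync!, kernel_args)
```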

test/runtests.jl Outdated
if backend != "CPU"
Pkg.add(backend)
end
# tmp fix to have the disable/enable task sync feature until merged in CUDA.jl

Needs to be removed before merge


luraess commented Apr 2, 2025

And the ALPS check confirms that things overlap as they should:
[Screenshot 2025-04-02 at 12:16:16]


luraess commented Apr 2, 2025

EDITED

From my side this could be ready to go. Before merging, one needs to:


luraess commented Apr 5, 2025

Upon the final rework of JuliaGPU/CUDA.jl#2662, the approach still works and successfully overlaps:
[Screenshot 2025-04-06 at 00:24:09]


luraess commented Apr 7, 2025

This should be ready to go from my side.

@luraess luraess requested a review from utkinis April 7, 2025 08:10
@utkinis utkinis merged commit 26c92d4 into main Apr 7, 2025
9 checks passed
@utkinis utkinis deleted the lr/custream-sync branch April 7, 2025 15:58
@luraess luraess mentioned this pull request Apr 15, 2025
