Disable task sync in CUDA#65

Merged
utkinis merged 41 commits intomainfrom
lr/custream-sync
Apr 7, 2025

Conversation


@luraess luraess commented Feb 17, 2025

This PR implements the changes needed to support opting out of syncing on task switching for the CUDA backend (JuliaGPU/CUDA.jl#2662).

It thus avoids serialisation of async operations and brings back async execution:
[Screenshot 2025-02-17 at 23:47:36]

I am unsure whether this is the best way of addressing things, though...

Requires the CUDA.jl branch vc/unsafe_stream_switching for testing.


luraess commented Feb 20, 2025

According to our latest discussion, we may rework the communication / computation overlap in Chmy. By serialising all the compute (Outer -> BCs -> Inner) we could then spawn a new task only for doing IO, i.e., MPI communication preceded by copying from the array view into buffers, and followed by copying back into the array upon completion.

As I understand, this would only require opting out from implicit sync using e.g. unsafe_disable_task_sync! on the buffer arrays without needing further engineering.

It would still be nice to update CUDA.jl's proposed unsafe_disable_task_sync!(arr) to return the array instead of the status, so that one could e.g. use the following construct:

```julia
a = CUDA.rand(2, 1) |> CUDA.unsafe_disable_task_sync!
```
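To illustrate where that construct would be used, here is a rough sketch of the overlap strategy described above. The buffer names and the halo-exchange details are hypothetical; it assumes the unsafe_disable_task_sync! API proposed in JuliaGPU/CUDA.jl#2662 (returning the array, as suggested) and the keyword form of MPI.jl's Sendrecv!:

```julia
using CUDA, MPI

# Buffers used only for halo exchange; opt them out of CUDA.jl's
# implicit task synchronisation (API proposed in JuliaGPU/CUDA.jl#2662).
send_buf = CUDA.zeros(Float64, n) |> CUDA.unsafe_disable_task_sync!
recv_buf = CUDA.zeros(Float64, n) |> CUDA.unsafe_disable_task_sync!

# All compute stays serialised on the main task: Outer -> BCs -> Inner
# (kernel launches elided).

# Spawn a task only for IO: pack view -> buffer, MPI comm, unpack back.
comm_task = Threads.@spawn begin
    copyto!(send_buf, view(A, :, 1))   # copy from the array view into the buffer
    MPI.Sendrecv!(send_buf, recv_buf, comm; dest = nb, source = nb)
    copyto!(view(A, :, 1), recv_buf)   # copy back into the array on completion
end
wait(comm_task)
```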


luraess commented Mar 24, 2025

@utkinis what if we use the above strategy to circumvent the implicit sync issue with CUDA.jl? This would make the launch_with_bc function "unsafe" wrt CUDA.jl, but would confine the unsafeness to this function and not interfere with other parts of the code. Testing on ALPS, the approach seems to work, giving good performance and overlap:

[Screenshot 2025-03-24 at 23:12:36]

Happy to get your feedback and, if you are positive about going this way, potential hints on improving my prototype attempt.
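A minimal sketch of how the unsafeness could be confined to launch_with_bc (hypothetical structure; unsafe_disable_task_sync! is the API proposed in JuliaGPU/CUDA.jl#2662, its re-enabling counterpart and the arrays helper are assumed names):

```julia
# Confine the task-sync opt-out to launch_with_bc: disable on entry,
# restore on exit, so the rest of the code sees default CUDA.jl behaviour.
function launch_with_bc(args...)
    for a in arrays(args)               # hypothetical: collect reachable GPU arrays
        CUDA.unsafe_disable_task_sync!(a)
    end
    try
        # Outer kernels -> boundary conditions -> Inner kernels,
        # with MPI communication overlapped on a separate task.
    finally
        for a in arrays(args)
            CUDA.unsafe_enable_task_sync!(a)  # assumed counterpart, restores sync
        end
    end
end
```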


utkinis commented Mar 24, 2025

Thanks @luraess for digging further into this. Yes, checking the arguments at runtime could be a good way to solve the issue. One needs to recursively check all Julia composite types that could be passed to kernels, namely structs and tuples (not sure about arrays and refs; I guess only bits types can be passed to GPU kernels, but I might be wrong).

Also, the implicit sync happens at the array level, so the recursion must descend to the level of arrays and not just Fields, as the user might pass a regular GPU array. I left specific comments in the PR. Otherwise, I'm positive about making it work this way.
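As a sketch, such a recursive runtime check could look like the following (hypothetical helper name; it walks structs and tuples down to the array leaves, treating bits types as terminals since they cannot carry array references):

```julia
# Recursively visit every array reachable from a kernel argument,
# descending into structs and tuples (bits types carry no arrays).
foreach_array(f, x::AbstractArray) = f(x)
foreach_array(f, x::Tuple) = foreach(el -> foreach_array(f, el), x)
function foreach_array(f, x::T) where {T}
    isbitstype(T) && return          # plain bits values: nothing to visit
    isstructtype(T) || return        # skip anything that is not a struct
    for i in 1:fieldcount(T)
        foreach_array(f, getfield(x, i))
    end
end

# Usage sketch: disable implicit task sync on every array reachable from
# the kernel arguments (API proposed in JuliaGPU/CUDA.jl#2662):
# foreach_array(CUDA.unsafe_disable_task_sync!, kernel_args)
```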

test/runtests.jl Outdated
if backend != "CPU"
Pkg.add(backend)
end
# tmp fix to have the disable/enable task sync feature until merged in CUDA.jl

Needs to be removed before merge


luraess commented Apr 2, 2025

And the ALPS check confirms that things overlap as they should:
[Screenshot 2025-04-02 at 12:16:16]


luraess commented Apr 2, 2025

EDITED

From my side this could be ready to go. Before merging, one needs to:


luraess commented Apr 5, 2025

Upon the final rework of JuliaGPU/CUDA.jl#2662, the approach still works and successfully overlaps:
[Screenshot 2025-04-06 at 00:24:09]


luraess commented Apr 7, 2025

This should be ready to go from my side.

@luraess luraess requested a review from utkinis April 7, 2025 08:10
@utkinis utkinis merged commit 26c92d4 into main Apr 7, 2025
9 checks passed
@utkinis utkinis deleted the lr/custream-sync branch April 7, 2025 15:58
@luraess luraess mentioned this pull request Apr 15, 2025
