According to our latest discussion, we may rework the communication/computation overlap in Chmy. By serialising all the compute (Outer -> BCs -> Inner) we could then spawn a new task only for doing IO, i.e., the MPI communication preceded by copying from the array view into buffers, and copying back into the array upon completion. As I understand it, this would only require opting out of the implicit sync (e.g. via JuliaGPU/CUDA.jl#2662). It would still be nice to update CUDA.jl's proposed changes.
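The scheme above can be sketched as a runnable CPU mock. All names here (`pack!`, `unpack!`, the halo layout, the shape of `step!`) are hypothetical illustrations, not Chmy API, and the MPI exchange is replaced by a local copy so the sketch is self-contained:

```julia
# Hypothetical sketch of the proposed scheme: serialise compute on the main
# task, and spawn one task that only does IO (pack -> exchange -> unpack).
pack!(buf, A)   = copyto!(buf, @view(A[:, end-1]))  # array view -> send buffer
unpack!(A, buf) = copyto!(@view(A[:, end]), buf)    # recv buffer -> halo region

function step!(A)
    n = size(A, 1)
    sendbuf = zeros(n); recvbuf = zeros(n)
    # "outer" compute first, so the column adjacent to the halo is ready to send
    A[:, end-1] .+= 1.0
    io = Threads.@spawn begin          # IO-only task, overlapping with inner compute
        pack!(sendbuf, A)
        copyto!(recvbuf, sendbuf)      # stands in for the MPI exchange (e.g. MPI.Sendrecv!)
        unpack!(A, recvbuf)
    end
    A[:, 2:end-2] .+= 1.0              # "inner" compute, overlapped with the IO task
    wait(io)
    return A
end

A = zeros(4, 5)
step!(A)
```

After `step!`, the halo column holds the freshly updated interior column, while the inner update ran concurrently with the pack/exchange/unpack task.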
@utkinis what if we use the above strategy to circumvent the implicit sync issue with CUDA.jl? This would actually make the approach work. Happy to get your feedback and, if you are positive about going this way, potential hints on improving my prototype attempt.
Thanks @luraess for digging further into this. Yes, checking the arguments at runtime could be a good way to solve the issue. One would need to recursively check all Julia composite types that could be passed to kernels, namely structs and tuples (I'm not sure about arrays and refs; I guess only bits types can be passed to GPU kernels, but I might be wrong). Also, since the implicit sync happens at the array level, the recursion must descend to the level of individual arrays and not just stop at the top-level container.
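A minimal sketch of such a recursive runtime check, descending through structs and tuples down to the individual arrays. The function name `needs_sync` is hypothetical, and matching on `AbstractArray` stands in for matching the backend's device array type (e.g. `CuArray`):

```julia
# Reached an array: this is the level at which the implicit sync applies.
needs_sync(x::AbstractArray) = true
# Plain bits-like leaves carry no arrays.
needs_sync(x::Union{Number,Symbol,Nothing}) = false
# Tuples: recurse into every element.
needs_sync(x::Tuple) = any(needs_sync, x)
# Generic composite types: recurse into every field.
function needs_sync(x::T) where {T}
    isstructtype(T) || return false
    any(needs_sync(getfield(x, f)) for f in fieldnames(T))
end

# Example composite argument, as might be passed to a kernel:
struct Fields
    u::Matrix{Float64}
    params::Tuple{Float64,Int}
end

needs_sync(Fields(zeros(2, 2), (1.0, 3)))  # true: recursion found an array field
needs_sync((1.0, 2))                       # false: only bits-type leaves
```

The key point is that the recursion bottoms out at each array, not at the top-level struct, so a nested `CuArray` buried inside a tuple-of-structs would still be found.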
test/runtests.jl
Outdated
```julia
if backend != "CPU"
    Pkg.add(backend)
end
# tmp fix to have the disable/enable task sync feature until merged in CUDA.jl
```
Needs to be removed before merge
EDITED: From my side this could be ready to go. Before merging, one needs to:

- remove the temporary task-sync fix from test/runtests.jl once the feature is merged in CUDA.jl
Upon final rework of JuliaGPU/CUDA.jl#2662, the approach still works and successfully overlaps communication and computation.
This should be ready to go from my side.



This PR implements the changes needed to support opting out of syncing on task switching for the CUDA backend JuliaGPU/CUDA.jl#2662.
It thus avoids serialisation of async operations and brings back async execution.

I am unsure whether this is the best way of addressing things, though...
Requires CUDA#vc/unsafe_stream_switching for testing.