Initial nccl implementation -- experimental #1
Open
aniabrown-nvidia wants to merge 8 commits into alejandrogallo:cuda from
Conversation
…ingle user managed workspace rather than reallocating each iteration
…s are resident on gpu
alejandrogallo pushed a commit that referenced this pull request on Apr 28, 2023
Initial NCCL implementation using only the default stream. It contains some temporary fixes to get the code to compile and some temporary experimentation with MPI performance -- it will require cleanup later. I suggest merging into a separate nccl branch for now.

Requires linking additional libraries:
-lnvToolsExt -lnccl

Some extra context for the commits:
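As a sketch, the extra flags would be appended to the existing link line. The compiler invocation, object files, and other libraries below are illustrative only, not taken from this repository's build system:

```shell
# Illustrative link line: NVTX (profiling annotations) and NCCL are linked
# in addition to whatever CUDA/MPI libraries the build already uses.
nvcc -o atrip main.o comms.o \
    -lnvToolsExt -lnccl \
    -lcublas -lmpi
```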
01a6df2, e77d120 -- very temporary fixes to compile-time and run-time errors outside of the dgemms -- needs looking into
5c4ce5b -- Bug fix: source memory was not being allocated. At the time of writing the commit I had assumed that memory for sources was allocated every iteration. This is not the case, but we may still want to allocate a pool of memory for atrip early in program execution, particularly to address the following commit.
5c4ce5b -- This allocation was taking a similar amount of time as the dgemm for No=50. Need to check whether it takes non-negligible time for larger sizes.
46c56b9 -- removes a host-device transfer that is not needed in the GPU source version
e95ca45 -- this 'warm up' was for experimentation only -- used to test point-to-point handle creation at application start
7878a14 -- switches the point-to-point comms to use NCCL
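For context on the last commit, a point-to-point exchange over NCCL on the default stream might look like the sketch below. This is not the PR's actual code: the function name, buffer types, and peer arguments are hypothetical, and it assumes a `ncclComm_t` has already been created (e.g. via `ncclCommInitRank`):

```cpp
#include <nccl.h>
#include <cuda_runtime.h>

// Hedged sketch of an NCCL point-to-point exchange on the default stream.
// sendBuf/recvBuf are device pointers; comm is an initialized communicator.
void exchangeSlices(double* sendBuf, double* recvBuf, size_t count,
                    int sendPeer, int recvPeer, ncclComm_t comm) {
  // Grouping the send and recv lets NCCL progress both together,
  // avoiding a deadlock when every rank posts its send first.
  ncclGroupStart();
  ncclSend(sendBuf, count, ncclDouble, sendPeer, comm, /*stream=*/0);
  ncclRecv(recvBuf, count, ncclDouble, recvPeer, comm, /*stream=*/0);
  ncclGroupEnd();
  // Default stream only, as in this PR: communication serializes with
  // kernels until dedicated streams are introduced later.
  cudaStreamSynchronize(0);
}
```

Using the default stream keeps the change minimal, at the cost of overlapping nothing with the dgemms; moving comms to their own stream would be a natural follow-up.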