========================================
 GPU extension for NBODY6 ver. 2.1
 22 Aug. 2010, 20 Jan. 2011
 Keigo Nitadori ([email protected])
 and
 Sverre Aarseth ([email protected])
========================================

1. About
  This is a new library package for enhancing the performance of NBODY6.
The regular force calculation is accelerated on the GPU, while the
irregular force calculation is accelerated with the SSE instruction set
of the x86 CPU and with OpenMP for efficient use of multiple cores.
Some important subroutines of the original NBODY6 are modified, and
libraries written in C++ or CUDA are added.

2. System requirements
  This module is intended for use on a Linux x86_64 PC with a CUDA-enabled GPU.
However, one may also try to run it on Macintosh, 32-bit Linux, or Windows.

 2-1. Hardware
  CPU : x86_64 with SSE3 support (the 90 nm generation onwards for Intel/AMD).
  GPU : NVIDIA GeForce/Tesla with CUDA support, GT200 or Fermi generation.
 2-2. Software
  OS : Any Linux for x86_64 on which the CUDA environment can be installed.
  Compiler : GCC 4.1.2 or later with OpenMP enabled.
             C++ and FORTRAN support is required.
  CUDA : Toolkit and SDK 3.0 or later. The SDK is only needed for 'cutil.h'.

For reference, we list the environment used for development.
  CPU : Core i7 920
  GPU : Two GeForce GTX 470
  OS : CentOS 5.5 for x86_64
  Compiler : GCC 4.1.2 (with C++ and FORTRAN)
  CUDA : Toolkit and SDK 3.1

  Note that one CPU socket (4 or 6 cores) per (Fermi-generation) GPU appears
to be a well-balanced combination. The regular force part is easily
accelerated on the GPU. The irregular force part, however, is not easily
accelerated on the GPU and is now performed on the host with the help of
SSE and OpenMP.
  Some suggestions for hardware configuration:
  Entry choice ($1,000): One Core i7 870 (Lynnfield) + one GeForce GTX 460 (GF104)
  High-end choice ($10,000): Two Xeon X5680 (Westmere-EP) + two Tesla C2050 (GF100)

3. Installation
  Extract the file 'gpu2.tar.gz' in the directory 'Nbody6'.
After extraction, the directory structure should look as follows:

Nbody6 +
       |-Ncode : Original NBODY6
       |-GPU2 + : Makefile and FORTRAN files
       |      |-lib : Regular-force library (in CUDA)
       |      |-irrlib : Irregular-force library (in C++)
       |      |-run
       |-Nchain
       |-...

Go to the directory 'GPU2',
  > cd GPU2
and execute the shell script
  > ./install.sh
to create symbolic links to 'params.h' and 'common6.h'.
Then build with
  > make gpu
to obtain the executable 'nbody6.gpu' in the directory 'run'.

4. GPU-dependent parameters
  Some parameters need to be set at compile time to adjust the number of
thread blocks to the physical number of processors of the GPU in use.
These values are defined in the file 'GPU2/lib/gpunb.reduce.cu'; some
examples are:
  for GeForce GTX 460/470 or Tesla C2050,
    #define NJBLOCK 14
    #define NXREDUCE 16
  or,
    #define NJBLOCK 28
    #define NXREDUCE 32
  for GeForce GTX 280 or Tesla C1060,
    #define NJBLOCK 30
    #define NXREDUCE 32
  etc.
  One compiler option needs to be modified, depending on the GPU generation.
Edit the file 'GPU2/Makefile' on the line
  NVCC += -arch sm_20 -Xptxas -dlcm=cg
For GTX 280 or C1060, it should be 'sm_13', and for GTX 460, 'sm_21'.

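The name-to-flag mapping above can be captured in a small helper. The function below is a hypothetical convenience (not part of the package); it only encodes the cases the text lists and defaults to 'sm_20' for other Fermi-class cards.

```shell
# Hypothetical helper: map a GPU model name to the NVCC -arch flag
# suggested above. Only the cards named in this README are covered.
gpu_arch() {
    case "$1" in
        *"GTX 280"*|*C1060*) echo "sm_13" ;;  # GT200 generation
        *"GTX 460"*)         echo "sm_21" ;;  # GF104
        *"GTX 470"*|*C2050*) echo "sm_20" ;;  # GF100 (Fermi)
        *)                   echo "sm_20" ;;  # default to Fermi
    esac
}

gpu_arch "GeForce GTX 460"   # prints sm_21
```

The output could then be substituted into the NVCC line of the Makefile by hand or with sed.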
5. Environment variables
  By default, the library automatically finds and uses all installed GPUs
and all CPU threads. If you want to run multiple jobs on one PC with
multiple GPUs, you need to specify the list of GPUs and the number of CPU
threads for each job. For example, if 2 GPUs and 8 host threads are
available, two jobs can be run by:
  > GPU_LIST="0" OMP_NUM_THREADS=4 ../nbody6.gpu < in1 > out1 &
  > GPU_LIST="1" OMP_NUM_THREADS=4 ../nbody6.gpu < in2 > out2 &
The default is equivalent to:
  > GPU_LIST="0 1" OMP_NUM_THREADS=8 ../nbody6.gpu < in > out &

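For more GPUs, the same scheme can be generated mechanically. The sketch below assumes the 2-GPU, 8-thread configuration and the input file names 'in1'/'in2' from the example above; it prints the launch commands rather than executing them, splitting the host threads evenly between the jobs.

```shell
# Sketch: print one launch command per GPU (does not actually run jobs).
# Assumes NGPU GPUs, NTHREADS host threads, and inputs in1, in2, ...
NGPU=2
NTHREADS=8
PER_JOB=$((NTHREADS / NGPU))
for gpu in $(seq 0 $((NGPU - 1))); do
    job=$((gpu + 1))
    echo "GPU_LIST=\"$gpu\" OMP_NUM_THREADS=$PER_JOB ../nbody6.gpu < in$job > out$job &"
done
```

Removing the echo (and the surrounding quotes) would launch the jobs directly.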
6. Output
  The regular and irregular libraries each print messages to the screen
(stderr) when they are opened and closed. At close time, both libraries
also print profiling information. An example of the screen output is as
follows:

***********************
Initializing NBODY6/GPU library
#CPU 8, #GPU 2
 device: 0 1
 device 0: GeForce GTX 470
 device 1: GeForce GTX 470
***********************
***********************
Opened NBODY6/GPU library
#CPU 8, #GPU 2
 device: 0 1
 0 32768 65536
nbmax = 65536
***********************
****************************
Opening GPUIRR lib. SSE ver.
 nmax = 65546, lmax = 500
****************************
***********************
Closed NBODY6/GPU library
time send : 1.101684 sec
time grav : 17.795590 sec
time reduce : 0.387609 sec
1315.947465 Gflops (gravity part only)
***********************
****************************
Closing GPUIRR lib. CPU ver.
time pred : 8.766780 sec
time pact : 9.664287 sec
time grav : 19.304246 sec
time onep : 0.000000 sec

perf grav : 20.604658 Gflops
perf pred : 8.123547 nsec
perf pact : 125.747625 nsec
****************************

7. Performance tuning using HUGEPAGE
  The default page size of the x86 architecture is 4 kB, which sometimes
causes a performance loss in HPC (high-performance computing) applications.
Recent x86 CPUs and Linux kernels support a larger page size (huge pages)
of 2 MB.
  We have found that the use of huge pages improves performance, and we
describe the setup procedure below.

(i) Install the 'libhugetlbfs' package
  For CentOS 5.5, just type (as superuser),
  > yum install libhugetlbfs*

(ii) Allocate huge pages
  For NBODY6, 512 pages (= 1 GB) should be a sufficient allocation if N < 100k.
To allocate them, type as superuser,
  > echo 512 > /proc/sys/vm/nr_hugepages
  It is recommended to put this line in the boot sequence script
(/etc/rc.local), so that the pages are reserved before memory becomes
fragmented.
  You can check the allocation status with the command
  > grep Huge /proc/meminfo
which gives, for example,
  HugePages_Total: 512
  HugePages_Free: 512
  HugePages_Rsvd: 0
  Hugepagesize: 2048 kB

(iii) Mount the filesystem
  You need to mount hugetlbfs on some mount point (any mount point will do).
Here, we just show an example line for 'fstab':
  hugetlbfs /libhugetlbfs hugetlbfs mode=0777 0 0

(iv) Compiling NBODY6
  NBODY6 needs to be re-linked to place the common arrays on huge pages.
Uncomment the following line in 'GPU2/Makefile_gpu' and link again:
  #LD_GPU += -B /usr/share/libhugetlbfs -Wl,--hugetlbfs-link=B

(v) Environment variable
  The following environment variable should be set so that dynamically
allocated memory is also taken from the huge pages:
  HUGETLB_MORECORE=yes

(vi) Run and watch
  While the program is running, you can watch the use of huge pages with
  > grep Huge /proc/meminfo

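The 'grep Huge' output can be turned into a usage figure directly, since each page is 2 MB as shown in step (ii). The snippet below works on a hard-coded sample snapshot (an assumption, not live data); on a real system, replace the variable with the output of 'grep Huge /proc/meminfo'.

```shell
# Sketch: compute huge-page usage from 'grep Huge /proc/meminfo'-style text.
# The sample values below are illustrative only.
meminfo='HugePages_Total:     512
HugePages_Free:      384
HugePages_Rsvd:        0
Hugepagesize:       2048 kB'

total=$(printf '%s\n' "$meminfo" | awk '/HugePages_Total/ {print $2}')
free=$(printf '%s\n' "$meminfo" | awk '/HugePages_Free/ {print $2}')
used=$((total - free))
echo "huge pages in use: $used ($((used * 2)) MB)"   # prints: huge pages in use: 128 (256 MB)
```

A steadily non-zero "in use" count during a run confirms that the relinked binary is actually drawing from the huge-page pool.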
8. Conclusion
  A significant speed-up has been achieved over the previously released
version. The bottleneck of the old version, which accelerated only the
regular force calculation on the GPU, was the remaining part performed on
the host CPU. The irregular force, predictor and corrector parts have now
been fine-tuned with SSE and OpenMP.
  Provisional timing tests have been carried out. Compared with the previous
version (Aug. 2010), the speed-up is at least a factor of 2 for N = 64k or
N = 100k. On the reference system presented in section 2 with 2 GPUs, the
typical wall-clock time from T = 0 to T = 2 (excluding initialization) is
74 sec for N = 64k and 160 sec for N = 100k. For predictions at regular
force times we use the scalar version of cxvpred.cpp, which is nearly twice
as fast as the SSE version.