========================================
 GPU extension for NBODY6 ver. 2.1
 22 Aug. 2010, 20 Jan. 2011
 Keigo Nitadori ([email protected])
 and
 Sverre Aarseth ([email protected])
========================================

1. About
  This is a new library package for enhancing the performance of NBODY6.
The regular force calculation is accelerated on the GPU, while the
irregular force calculation is accelerated with the SSE instruction set
of the x86 CPU and with OpenMP for efficient use of multiple cores.
Some important subroutines of the original NBODY6 are modified, and
libraries written in C++ or CUDA are added.

2. System requirements
  This module is intended for use on a Linux x86_64 PC with a CUDA-enabled GPU.
However, one may also try to run it on Macintosh, 32-bit Linux, or Windows.

 2-1. Hardware
  CPU : x86_64 with SSE3 support (the 90 nm generation onwards for Intel/AMD).
  GPU : NVIDIA GeForce/Tesla with CUDA support, GT200 or Fermi generation.
 2-2. Software
  OS : Any Linux for x86_64 on which the CUDA environment can be installed.
  Compiler : GCC 4.1.2 or later with OpenMP enabled.
             C++ and FORTRAN support is required.
  CUDA : Toolkit and SDK 3.0 or later. The SDK is only needed for 'cutil.h'.

For reference, we list the environment used for development.
  CPU : Core i7 920
  GPU : Two GeForce GTX 470
  OS : CentOS 5.5 for x86_64
  Compiler : GCC 4.1.2 (with C++ and FORTRAN)
  CUDA : Toolkit and SDK 3.1

  Note that one CPU socket (4 or 6 cores) per (Fermi-generation) GPU appears
to be a well-balanced combination. The regular force part is easily
accelerated on the GPU. The irregular force part, however, is not easily
accelerated on the GPU and is now performed on the host with the help of
SSE and OpenMP.
  Some suggestions for hardware configuration:
  Entry choice ($1,000): One Core i7 870 (Lynnfield) + one GeForce GTX 460 (GF104)
  High-end choice ($10,000): Two Xeon X5680 (Westmere-EP) + two Tesla C2050 (GF100)

3. Installation
  Extract the file 'gpu2.tar.gz' in the directory 'Nbody6'.
After extraction, the directory structure should look as follows:

Nbody6 +
       |-Ncode : Original NBODY6
       |-GPU2 + : Makefile and FORTRAN files
       |      |-lib : Regular-force library (in CUDA)
       |      |-irrlib : Irregular-force library (in C++)
       |      |-run
       |-Nchain
       |-...

Go to the directory 'GPU2',
  > cd GPU2
and execute the shell script
  > ./install.sh
to create symbolic links to 'params.h' and 'common6.h'.
Then build with
  > make gpu
to obtain the executable 'nbody6.gpu' in the directory 'run'.

4. GPU-dependent parameters
  Some parameters need to be set at compile time to adjust the number of
thread blocks to the physical number of processors of the GPU in use.
These values are defined in the file 'GPU2/lib/gpunb.reduce.cu'; some
examples are:
  for GeForce GTX 460/470 or Tesla C2050,
    #define NJBLOCK 14
    #define NXREDUCE 16
  or,
    #define NJBLOCK 28
    #define NXREDUCE 32
  for GeForce GTX 280 or Tesla C1060,
    #define NJBLOCK 30
    #define NXREDUCE 32
  etc.
  One compiler option needs to be modified, depending on the GPU generation.
Edit the file 'GPU2/Makefile' on the line
  NVCC += -arch sm_20 -Xptxas -dlcm=cg
For GTX 280 or C1060, it should be 'sm_13', and for GTX 460, 'sm_21'.

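The name-to-flag mapping above can be captured in a small helper. The function below is a hypothetical convenience (not part of the package); it only encodes the cases the text lists and defaults to 'sm_20' for other Fermi-class cards.

```shell
# Hypothetical helper: map a GPU model name to the NVCC -arch flag
# suggested above. Only the cards named in this README are covered.
gpu_arch() {
    case "$1" in
        *"GTX 280"*|*C1060*) echo "sm_13" ;;  # GT200 generation
        *"GTX 460"*)         echo "sm_21" ;;  # GF104
        *"GTX 470"*|*C2050*) echo "sm_20" ;;  # GF100 (Fermi)
        *)                   echo "sm_20" ;;  # default to Fermi
    esac
}

gpu_arch "GeForce GTX 460"   # prints sm_21
```

The output could then be substituted into the NVCC line of the Makefile by hand or with sed.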
5. Environment variables
  By default, the library automatically finds and uses all installed GPUs
and all CPU threads. If you want to run multiple jobs on one PC with
multiple GPUs, you need to specify the list of GPUs and the number of CPU
threads for each job. For example, if 2 GPUs and 8 host threads are
available, two jobs can be run by:
  > GPU_LIST="0" OMP_NUM_THREADS=4 ../nbody6.gpu < in1 > out1 &
  > GPU_LIST="1" OMP_NUM_THREADS=4 ../nbody6.gpu < in2 > out2 &
The default is equivalent to:
  > GPU_LIST="0 1" OMP_NUM_THREADS=8 ../nbody6.gpu < in > out &

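For more GPUs, the same scheme can be generated mechanically. The sketch below assumes the 2-GPU, 8-thread configuration and the input file names 'in1'/'in2' from the example above; it prints the launch commands rather than executing them, splitting the host threads evenly between the jobs.

```shell
# Sketch: print one launch command per GPU (does not actually run jobs).
# Assumes NGPU GPUs, NTHREADS host threads, and inputs in1, in2, ...
NGPU=2
NTHREADS=8
PER_JOB=$((NTHREADS / NGPU))
for gpu in $(seq 0 $((NGPU - 1))); do
    job=$((gpu + 1))
    echo "GPU_LIST=\"$gpu\" OMP_NUM_THREADS=$PER_JOB ../nbody6.gpu < in$job > out$job &"
done
```

Removing the echo (and the surrounding quotes) would launch the jobs directly.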
6. Output
  The regular and irregular libraries each print messages to the screen
(stderr) when they are opened and closed. At close time, both libraries
also print profiling information. An example of the screen output is as
follows:

***********************
Initializing NBODY6/GPU library
#CPU 8, #GPU 2
 device: 0 1
 device 0: GeForce GTX 470
 device 1: GeForce GTX 470
***********************
***********************
Opened NBODY6/GPU library
#CPU 8, #GPU 2
 device: 0 1
 0 32768 65536
nbmax = 65536
***********************
****************************
Opening GPUIRR lib. SSE ver.
 nmax = 65546, lmax = 500
****************************
***********************
Closed NBODY6/GPU library
time send : 1.101684 sec
time grav : 17.795590 sec
time reduce : 0.387609 sec
1315.947465 Gflops (gravity part only)
***********************
****************************
Closing GPUIRR lib. CPU ver.
time pred : 8.766780 sec
time pact : 9.664287 sec
time grav : 19.304246 sec
time onep : 0.000000 sec

perf grav : 20.604658 Gflops
perf pred : 8.123547 nsec
perf pact : 125.747625 nsec
****************************

7. Performance tuning using HUGEPAGE
  The default page size of the x86 architecture is 4 kB, which sometimes
causes a performance loss in HPC (high-performance computing) applications.
Recent x86 CPUs and Linux kernels support a larger page size (huge pages)
of 2 MB.
  We have found that the use of huge pages improves performance, and we
describe the setup procedure below.

(i) Install the 'libhugetlbfs' package
  For CentOS 5.5, just type (as superuser),
  > yum install libhugetlbfs*

(ii) Allocate huge pages
  For NBODY6, 512 pages (= 1 GB) should be a sufficient allocation if N < 100k.
To allocate them, type as superuser,
  > echo 512 > /proc/sys/vm/nr_hugepages
  It is recommended to put this line in the boot sequence script
(/etc/rc.local), so that the pages are reserved before memory becomes
fragmented.
  You can check the allocation status with the command
  > grep Huge /proc/meminfo
which gives, for example,
  HugePages_Total: 512
  HugePages_Free: 512
  HugePages_Rsvd: 0
  Hugepagesize: 2048 kB

(iii) Mount the filesystem
  You need to mount hugetlbfs on some mount point (any mount point will do).
Here, we just show an example line for 'fstab':
  hugetlbfs /libhugetlbfs hugetlbfs mode=0777 0 0

(iv) Compiling NBODY6
  NBODY6 needs to be re-linked to place the common arrays on huge pages.
Uncomment the following line in 'GPU2/Makefile_gpu' and link again:
  #LD_GPU += -B /usr/share/libhugetlbfs -Wl,--hugetlbfs-link=B

(v) Environment variable
  The following environment variable should be set so that dynamically
allocated memory is also taken from the huge pages:
  HUGETLB_MORECORE=yes

(vi) Run and watch
  While the program is running, you can watch the use of huge pages with
  > grep Huge /proc/meminfo

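The 'grep Huge' output can be turned into a usage figure directly, since each page is 2 MB as shown in step (ii). The snippet below works on a hard-coded sample snapshot (an assumption, not live data); on a real system, replace the variable with the output of 'grep Huge /proc/meminfo'.

```shell
# Sketch: compute huge-page usage from 'grep Huge /proc/meminfo'-style text.
# The sample values below are illustrative only.
meminfo='HugePages_Total:     512
HugePages_Free:      384
HugePages_Rsvd:        0
Hugepagesize:       2048 kB'

total=$(printf '%s\n' "$meminfo" | awk '/HugePages_Total/ {print $2}')
free=$(printf '%s\n' "$meminfo" | awk '/HugePages_Free/ {print $2}')
used=$((total - free))
echo "huge pages in use: $used ($((used * 2)) MB)"   # prints: huge pages in use: 128 (256 MB)
```

A steadily non-zero "in use" count during a run confirms that the relinked binary is actually drawing from the huge-page pool.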
8. Conclusion
  A significant speed-up has been achieved over the previously released
version. The bottleneck of the old version, which accelerated only the
regular force calculation on the GPU, was the remaining part performed on
the host CPU. The irregular force, predictor and corrector parts have now
been fine-tuned with SSE and OpenMP.
  Provisional timing tests have been carried out. Compared with the previous
version (Aug. 2010), the speed-up is at least a factor of 2 for N = 64k or
N = 100k. On the reference system presented in section 2 with 2 GPUs, the
typical wall-clock time from T = 0 to T = 2 (excluding initialization) is
74 sec for N = 64k and 160 sec for N = 100k. For predictions at regular
force times we use the scalar version of cxvpred.cpp, which is nearly twice
as fast as the SSE version.