Commit bc1d1d7

added support for gamma=0.25 Dehnen model
1 parent 85e30c3 commit bc1d1d7

50 files changed
Lines changed: 6648 additions & 191 deletions

GPU2/README

Lines changed: 110 additions & 0 deletions
INSTALLATION NOTES FOR LATEST NBODY6

1. Updates
We added support for AVX.
The 'Makefile*' files were changed and cleaned up.
2. Installation
(a) For users with AVX support:
Just type
$ make gpu
to generate the executable ./run/nbody6.gpu, whose regular force part is
accelerated by the GPU, or
$ make avx
to generate the executable ./run/nbody6.avx, which runs without a GPU but
with both the regular and irregular force parts tuned for AVX.
In both versions, AVX is used for the irregular force part.

(b) For users without AVX support:
Edit the first line of Makefile to comment it out:
#avx = enable
Then type
$ make gpu
to generate the executable ./run/nbody6.gpu, or
$ make sse
to generate the executable ./run/nbody6.sse
3. Modifications
If you are not lucky, you will need some modifications of Makefile.
(1) If the CUDA version is less than 5, comment out the line
    NVCC += -DWITH_CUDA5
(2) SDK_PATH needs to be set such that 'cutil.h' ('helper_cuda.h' in
    CUDA 5) is found.
(3) If your GPU generation is older than Kepler, the option -arch=sm_30
    needs to be changed to a relevant value.
(4) If 'libhugetlbfs' is not installed or you do not want to use huge
    pages, comment out the line in Makefile.ncode:
    LD_GPU += -B /usr/share/libhugetlbfs -Wl,--hugetlbfs-align
    For older versions of libhugetlbfs, the linker option may be
    LD_GPU += -B /usr/share/libhugetlbfs -Wl,--hugetlbfs-link=B
    If you use huge pages, set three environment variables:
    HUGETLB_VERBOSE='2'
    HUGETLB_ELFMAP='W'
    HUGETLB_MORECORE='yes'
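For convenience, the three variables can be set from a small wrapper script. The sketch below is only an illustration: the HUGETLB_* values are the ones listed above, while the launch line is a hypothetical placeholder for your usual invocation.

```shell
# Hypothetical wrapper; the three HUGETLB_* settings come from the notes
# above, the launch line is a placeholder for your usual invocation.
export HUGETLB_VERBOSE='2'     # verbosity of libhugetlbfs diagnostics
export HUGETLB_ELFMAP='W'      # map writable segments on huge pages
export HUGETLB_MORECORE='yes'  # serve malloc() from huge pages
echo "HUGETLB_MORECORE=$HUGETLB_MORECORE"
# ./run/nbody6.gpu < input > output
```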
4. For big N simulations
We added the '-fPIC' option for successful compilation when a big NMAX
(> 100k) is set in 'params.h'.
A big N can cause a stack overflow and segmentation fault. In such cases,
you can increase the stack size with the 'ulimit -s' command in BASH or
'limit stacksize' in TCSH.
Just type
$ ulimit -s
which returns the current value in KB (maybe 10240). You can increase it by
$ ulimit -s 20480
or alternatively 'limit stacksize 20480' or more for TCSH.
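The check can be scripted before launching a run. This is a minimal BASH sketch: the 20480 KB figure repeats the value above, and raising the soft limit may still fail if the hard limit is lower.

```shell
# Check the soft stack limit (KB) and raise it if below 20480 KB.
# Raising beyond the hard limit fails, hence the fallback message.
cur=$(ulimit -s)
echo "current stack limit: $cur KB"
if [ "$cur" != "unlimited" ] && [ "$cur" -lt 20480 ]; then
    ulimit -s 20480 2>/dev/null || echo "could not raise the limit"
fi
echo "stack limit now: $(ulimit -s)"
```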
5. Current environment at IoA Cambridge:
CPU      : Core i5-3570 (4 cores, 3.4 GHz)
GPU      : One GeForce GTX 660 Ti (7 SMXs, 1344 CUDA cores)
OS       : CentOS 6.4 for x86_64
Compiler : GCC 4.4.7 (default of CentOS)
CUDA     : CUDA 5.0 Production Release

Here is a screen shot of this configuration integrating 64k stars
from t=0 to t=2.

***********************
Initializing NBODY6/GPU library
#CPU 4, #GPU 1
device: 0
device 0: GeForce GTX 660 Ti
***********************
***********************
Opened NBODY6/GPU library
#CPU 4, #GPU 1
device: 0
0 65546
nbmax = 65546
***********************
****************************
Opening GPUIRR lib. AVX ver.
nmax = 65546, lmax = 500
****************************

***********************
Closed NBODY6/GPU library
time send   : 0.875775 sec
time grav   : 20.004446 sec
time reduce : 0.271385 sec
time regtot : 21.151606 sec
1164.470114 Gflops (gravity part only)
***********************
****************************
Closing GPUIRR lib. CPU ver.
time grav : 17.729413 sec

perf grav : 21.715221 Gflops
perf grav : 163.998749 usec
<#NB>     : 76.406419
****************************


Keigo Nitadori
April 2013

GPU2/README.2011

Lines changed: 197 additions & 0 deletions
========================================
GPU extension for NBODY6 ver. 2.1
22 Aug. 2010, 20 Jan. 2011
Keigo Nitadori ([email protected])
and
Sverre Aarseth ([email protected])
========================================
1. About
This is a new library package for enhancing the performance of NBODY6.
The regular force calculation is accelerated by the GPU, whereas the
irregular force calculation is accelerated by the SSE instruction set of
the x86 CPU and by OpenMP for the efficient use of multiple cores.
Some of the important subroutines of the original NBODY6 are modified,
and libraries written in C++ or CUDA are added.
2. System requirement
This module is intended for use on a Linux x86_64 PC with a CUDA-enabled GPU.
However, one can try to run it on Macintosh/Linux-32bit/Windows.

2-1. Hardware
CPU : x86_64 with SSE3 support (from the 90nm generation for Intel/AMD).
GPU : NVIDIA GeForce/Tesla with CUDA support, GT200 or Fermi generation.
2-2. Software
OS       : Any Linux for x86_64 where you can install the CUDA environment.
Compiler : GCC 4.1.2 or later with OpenMP enabled.
           C++ and FORTRAN support is required.
CUDA     : Toolkit and SDK 3.0 or later. The SDK is only needed for 'cutil.h'.

For reference, we list the environment used for development.
CPU      : Core i7 920
GPU      : Two GeForce GTX 470
OS       : CentOS 5.5 for x86_64
Compiler : GCC 4.1.2 (with C++ and FORTRAN)
CUDA     : Toolkit and SDK 3.1
Note that one CPU socket (4 or 6 cores) per (Fermi-generation) GPU looks
like a well-balanced combination. The regular force part is easily
accelerated with the GPU. However, the irregular force part is not easily
accelerated on the GPU and is now performed on the host with the help of
SSE and OpenMP.
Some suggestions for hardware configuration:
Entry choice   ($1,000) : One Core i7 870 (Lynnfield) + one GeForce GTX 460 (GF104)
High-end choice($10,000): Two Xeon X5680 (Westmere-EP) + two Tesla C2050 (GF100)
3. Installation
Extract the file 'gpu2.tar.gz' in the directory 'Nbody6'.
After extracting, the directory structure should look as follows:

Nbody6 +
       |-Ncode    : Original NBODY6
       |-GPU2 +   : Makefile and FORTRAN files
       |      |-lib    : Regular-force library (in CUDA)
       |      |-irrlib : Irregular-force library (in C++)
       |      |-run
       |-Nchain
       |-...

Go to the directory 'GPU2',
> cd GPU2
and execute the shell script
> ./install.sh
to make symbolic links to 'params.h' and 'common6.h'.
Then just type
> make gpu
to obtain the executable 'nbody6.gpu' in the directory 'run'.
4. GPU-dependent parameters
Some parameters need to be given at compile time to adjust the number of
thread blocks to the physical number of processors of the GPU used.
These values are defined in the file 'GPU2/lib/gpunb.reduce.cu'; some
examples are:
for GeForce GTX 460/470 or Tesla C2050,
#define NJBLOCK  14
#define NXREDUCE 16
or,
#define NJBLOCK  28
#define NXREDUCE 32
for GeForce GTX 280 or Tesla C1060,
#define NJBLOCK  30
#define NXREDUCE 32
etc.
One compiler option needs to be modified depending on the GPU generation.
Edit the file 'GPU2/Makefile' on the line
NVCC += -arch sm_20 -Xptxas -dlcm=cg
For GTX 280 or C1060 it should be 'sm_13', and for GTX 460, 'sm_21'.
5. Environment variables
By default, the library automatically finds and uses all the installed GPUs
and all the CPU threads. In case you want to run multiple jobs on one PC
with multiple GPUs, you need to specify the list of GPUs and the number
of CPU threads. For example, if 2 GPUs and 8 host threads are available,
two jobs can be run by:
> GPU_LIST="0" OMP_NUM_THREADS=4 ../nbody6.gpu < in1 > out1 &
> GPU_LIST="1" OMP_NUM_THREADS=4 ../nbody6.gpu < in2 > out2 &
The default is equivalent to:
> GPU_LIST="0 1" OMP_NUM_THREADS=8 ../nbody6.gpu < in > out &
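The two-job example can be wrapped in a small launcher script. This is only a sketch: the launch() helper, the DRYRUN switch and the file names are invented for illustration; only GPU_LIST and OMP_NUM_THREADS are interpreted by the library.

```shell
# Hypothetical launcher: one job per GPU, host threads split evenly.
# With DRYRUN=1 (the default here) the commands are only printed.
DRYRUN=${DRYRUN:-1}
launch() {    # usage: launch <gpu-list> <threads> <infile> <outfile>
    cmd="GPU_LIST=\"$1\" OMP_NUM_THREADS=$2 ../nbody6.gpu < $3 > $4 &"
    if [ "$DRYRUN" = 1 ]; then echo "$cmd"; else eval "$cmd"; fi
}
launch 0 4 in1 out1
launch 1 4 in2 out2
```

Setting DRYRUN=0 would actually start both background jobs.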
6. Output
Each regular or irregular library outputs some messages to the screen
(stderr) when it is opened and closed. At close time, both libraries give
some profiling information. An example of the screen output follows:

***********************
Initializing NBODY6/GPU library
#CPU 8, #GPU 2
device: 0 1
device 0: GeForce GTX 470
device 1: GeForce GTX 470
***********************
***********************
Opened NBODY6/GPU library
#CPU 8, #GPU 2
device: 0 1
0 32768 65536
nbmax = 65536
***********************
****************************
Opening GPUIRR lib. SSE ver.
nmax = 65546, lmax = 500
****************************
***********************
Closed NBODY6/GPU library
time send   : 1.101684 sec
time grav   : 17.795590 sec
time reduce : 0.387609 sec
1315.947465 Gflops (gravity part only)
***********************
****************************
Closing GPUIRR lib. CPU ver.
time pred : 8.766780 sec
time pact : 9.664287 sec
time grav : 19.304246 sec
time onep : 0.000000 sec

perf grav : 20.604658 Gflops
perf pred : 8.123547 nsec
perf pact : 125.747625 nsec
****************************
7. Performance tuning using HUGEPAGE
The default page size of the x86 architecture is 4 kB, which sometimes
causes a performance loss in HPC (high-performance computing) applications.
Recent x86 CPUs and Linux kernels support a larger page (huge page) whose
size is 2 MB. We have seen that the use of huge pages improves the
performance, and we describe the way to set it up.

(i) Install the 'libhugetlbfs' package
For CentOS 5.5, just type (as superuser):
> yum install libhugetlbfs*
(ii) Allocate huge pages
For NBODY6, 512 pages = 1 GB would be a sufficient allocation if N < 100k.
To allocate, type as superuser:
> echo 512 > /proc/sys/vm/nr_hugepages
It is recommended to put this in a boot-sequence script (/etc/rc.local)
for safe allocation.
You can check the allocation status with the command
> grep Huge /proc/meminfo
to obtain:
HugePages_Total:   512
HugePages_Free:    512
HugePages_Rsvd:      0
Hugepagesize:     2048 kB
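The same two /proc/meminfo fields can be combined into a quick capacity check. This is a sketch assuming a Linux kernel with hugepage support; the fields default to 0 when absent.

```shell
# Report the free huge-page memory in MB: HugePages_Free is a page
# count, Hugepagesize is in KB, so MB = pages * page_kb / 1024.
free_pages=$(awk '/^HugePages_Free/ {print $2}' /proc/meminfo 2>/dev/null)
page_kb=$(awk '/^Hugepagesize/ {print $2}' /proc/meminfo 2>/dev/null)
echo "huge pages free: $(( ${free_pages:-0} * ${page_kb:-0} / 1024 )) MB"
```

With the full 512-page allocation free, this would report 1024 MB.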
166+
167+
(iii) Mount the filesystem
168+
You need to mount hugetlbfs to some mount point (any mount point will do).
169+
Here, we just show an example line for 'fstab'.
170+
hugetlbfs /libhugetlbfs huge tlbfs mode=0777 0 0
171+
172+
(iv) Compiling NBODY6
173+
NBODY6 needs to be re-linked to put the common arrays on the huge-page.
174+
Comment out the following line from 'GPU2/Makefile_gpu' and link again.
175+
#LD_GPU += -B /usr/share/libhugetlbfs -Wl,--hugetlbfs-link=B
176+
177+
(v) Environment variable
178+
The following environment variable should be set to get the dynamically
179+
allocated memory from the huge pages:
180+
HUGETLB_MORECORE=yes
181+
182+
(vi) Run and watch
183+
While the program is running, you can watch the use of huge pages:
184+
> grep Huge /proc/meminfo
8. Conclusion
A significant speed-up has been achieved over the previously released
version. The bottleneck of the old version, with the regular force
calculation accelerated on the GPU, was the remaining part performed on
the host CPU. The irregular force, predictor and corrector parts have
been fine-tuned with SSE and OpenMP.
Provisional timing tests have been carried out. Compared with the previous
version (Aug. 2010), the speed-up is at least a factor of 2 for N=64k or
N=100k. On the development system presented in section 2 with 2 GPUs, the
typical wall-clock time from T=0 to T=2 (excluding the initialization) is
74 sec for N=64k and 160 sec for N=100k. For predictions at regular force
times we use the scalar version of cxvpred.cpp, which is nearly twice as
fast as the SSE version.

GPU2/adjust.f

Lines changed: 15 additions & 1 deletion
@@ -6,6 +6,9 @@ SUBROUTINE ADJUST
 *
       INCLUDE 'common6.h'
       COMMON/ECHAIN/ ECH
+*     Jpetts - added access to common galaxy variables here.
+      COMMON/GALAXY/ GMG,RG(3),VG(3),FG(3),FGD(3),TG,
+     &               OMEGA,DISK,A,B,V02,RL2,GMB,AR,GAM,ZDUM(7)
       SAVE DTOFF
       DATA DTOFF /100.0D0/
 *
@@ -168,6 +171,13 @@ SUBROUTINE ADJUST
 *
 *     Jpetts - recalculate dynamical friction variables
       CALL DYNFVARS
+
+!TEMP - Print out DYNFCOEF in to some file
+
+      WRITE(101,43) TTOT, DYNFCOEF
+   43 FORMAT (F10.5,F10.5)
+
+!TEMP
 *
 *     Scale average & maximum core density by the mean value.
       RHOD = 4.0*TWOPI*RHOD*RSCALE**3/(3.0*ZMASS)
@@ -283,6 +293,9 @@ SUBROUTINE ADJUST
      &  ' TC =',0P,I5,' DELTA =',1P,E9.1,' E(3) =',0P,F10.6,
      &  ' DETOT =',F10.6,' WTIME =',I4,2I3)
       CALL FLUSH(6)
+
+
+
 *
 *     Perform automatic error control (RETURN on restart with KZ(2) > 1).
       CALL CHECK(DE)
@@ -299,7 +312,8 @@ SUBROUTINE ADJUST
       END IF
 *
 *     Check correction for c.m. displacements.
-      IF (KZ(31).GT.0) THEN
+*     Jpetts - Don't recentre when object unbound
+      IF (KZ(31).GT.0 .AND. MASSCL .NE. 0) THEN
       CALL CMCORR
       END IF
 *

GPU2/debug.pdf

39.4 KB
Binary file not shown.
