A high-performance, fully parallelized 3D N-body gravitational simulation implementing the Barnes-Hut algorithm entirely in CUDA with real-time Vulkan visualization. Benchmarked on Africa's fastest supercomputer, Toubkal.
This project implements the Barnes-Hut algorithm for N-body gravitational simulations in 3D with complete GPU parallelization. The Barnes-Hut algorithm reduces the computational complexity of N-body simulations from
The simulation runs entirely on the GPU through a series of highly optimized CUDA kernels, with real-time visualization rendered through Vulkan. An ImGui interface provides interactive control over simulation parameters, rendering settings, and displays real-time performance metrics.
- Fully Parallel GPU Implementation: All simulation steps execute on the GPU using custom CUDA kernels optimized for modern GPU architectures
- Real-Time Vulkan Rendering: Seamless CUDA-Vulkan interoperability, keeping all simulation data on the GPU throughout execution
- Interactive Controls: ImGui-based interface for adjusting simulation parameters, visualization settings, and monitoring performance metrics in real time
- Optimized Octree Construction: Implementation based on Tero Karras's research on maximizing parallelism in the construction of trees, achieving significant performance improvements over the traditional dynamic parallelism approaches
- CPU Baseline Implementation: All-pairs brute force CPU version for performance comparison and correctness validation
- Benchmark Ready: Tested on consumer hardware (laptop RTX 3050 Ti) and HPC systems (NVIDIA A100 on Toubkal supercomputer)
Performance measurements demonstrate the scalability of the implementation across different GPU architectures. The benchmarks were conducted using NVIDIA NSight Compute for detailed kernel profiling.
Benchmark conducted with Barnes-Hut theta parameter = 0.5
| Particle Count | RTX 3050 Ti (Laptop) | NVIDIA A100 (Toubkal) | Speedup |
|---|---|---|---|
| 100,000 | 4.3 ms | 1.7 ms | x2.53 |
| 500,000 | 21.3 ms | 4.4 ms | x4.84 |
| 1,000,000 | 35.4 ms | 8.5 ms | x4.16 |
| 5,000,000 | 182.0 ms | 38.3 ms | x4.75 |
| 10,000,000 | 370.1 ms | 68.8 ms | x5.37 |
| 50,000,000 | (Insufficient Memory) | 380.5 ms | ____ |
Benchmark conducted with 1,000,000 particles on both architectures
| Theta (θ) | RTX 3050 Ti | NVIDIA A100 | Description |
|---|---|---|---|
| 0.25 | 183.1 ms | 29.6 ms | Higher accuracy, slower performance |
| 0.50 | 35.4 ms | 8.5 ms | Balanced accuracy and performance |
| 0.75 | 20.1 ms | 5.1 ms | Lower accuracy, faster performance |
| 1.00 | 11.8 ms | 3.2 ms | Aggressive approximations |
Note
Lower theta values result in more accurate simulations but require more computation. Theta = 0.5 is typically used as a good balance between accuracy and performance.
Each simulation frame executes a sequence of steps, with each step implemented through one or more optimized CUDA kernels:
-
Bounding Box Computation: Parallel reduction across all particles to compute the axis-aligned bounding box that defines the octree's root node spatial extent.
-
Morton Code Generation: Convert each particle's 3D position into a 30-bit Morton code (Z-order curve) to create a 1D ordering that preserves spatial locality.
-
Particle Sorting: Sort particles by Morton code using parallel radix sort, grouping spatially adjacent particles together in memory for efficient octree construction.
-
Morton Code Fusion: Identify and merge particles with duplicate Morton codes to prevent degenerate cases.
-
Octree Construction: Build the hierarchical octree structure using Tero Karras's fully parallel algorithm, enabling construction without sequential dependencies.
-
Force Computation and Integration: Traverse the octree for each particle to compute gravitational forces using the Barnes-Hut approximation criterion, then integrate to update velocities and positions. This step dominates execution time despite aggressive optimizations (adapted from this paper) due to the irregular, data-dependent nature of tree traversal.
Note
This project currently supports Linux only.
The following dependencies are required to build the project:
- CUDA Toolkit
- Vulkan SDK
- GLFW3
- CMake 3.20 or higher
- C++17 compatible compiler
Clone the repository:
git clone https://github.com/MouadHaikal/CUDA-Barnes-Hut
cd CUDA-Barnes-HutA build script is available for convenience:
./build.shRun the GPU simulation:
./bin/gpu/NBodyGPU
