CUDA Barnes-Hut Simulation

A high-performance, fully parallelized 3D N-body gravitational simulation implementing the Barnes-Hut algorithm entirely in CUDA with real-time Vulkan visualization. Benchmarked on Africa's fastest supercomputer, Toubkal.

Simulation Demo

Overview

This project implements the Barnes-Hut algorithm for N-body gravitational simulations in 3D with complete GPU parallelization. The Barnes-Hut algorithm reduces the computational complexity of N-body simulations from $O(N^2)$ to $O(N \log N)$ by approximating distant particle clusters as single masses using an octree spatial hierarchy.

The simulation runs entirely on the GPU through a series of highly optimized CUDA kernels, with real-time visualization rendered through Vulkan. An ImGui interface provides interactive control over simulation parameters, rendering settings, and displays real-time performance metrics.

Features

Fully Parallel GPU Implementation: All simulation steps execute on the GPU using custom CUDA kernels optimized for modern GPU architectures
Real-Time Vulkan Rendering: Seamless CUDA-Vulkan interoperability, keeping all simulation data on the GPU throughout execution
Interactive Controls: ImGui-based interface for adjusting simulation parameters, visualization settings, and monitoring performance metrics in real time
Optimized Octree Construction: Implementation based on Tero Karras's research on maximizing parallelism in the construction of trees, achieving significant performance improvements over the traditional dynamic parallelism approaches
CPU Baseline Implementation: All-pairs brute force CPU version for performance comparison and correctness validation
Benchmark Ready: Tested on consumer hardware (laptop RTX 3050 Ti) and HPC systems (NVIDIA A100 on Toubkal supercomputer)

Performance Benchmarks

Performance measurements demonstrate the scalability of the implementation across different GPU architectures. The benchmarks were conducted using NVIDIA NSight Compute for detailed kernel profiling.

Performance Comparison: RTX 3050 Ti vs NVIDIA A100

Benchmark conducted with Barnes-Hut theta parameter = 0.5

Particle Count	RTX 3050 Ti (Laptop)	NVIDIA A100 (Toubkal)	Speedup
100,000	4.3 ms	1.7 ms	x2.53
500,000	21.3 ms	4.4 ms	x4.84
1,000,000	35.4 ms	8.5 ms	x4.16
5,000,000	182.0 ms	38.3 ms	x4.75
10,000,000	370.1 ms	68.8 ms	x5.37
50,000,000	(Insufficient Memory)	380.5 ms	____

Theta Parameter Impact on Performance

Benchmark conducted with 1,000,000 particles on both architectures

Theta (θ)	RTX 3050 Ti	NVIDIA A100	Description
0.25	183.1 ms	29.6 ms	Higher accuracy, slower performance
0.50	35.4 ms	8.5 ms	Balanced accuracy and performance
0.75	20.1 ms	5.1 ms	Lower accuracy, faster performance
1.00	11.8 ms	3.2 ms	Aggressive approximations

Note

Lower theta values result in more accurate simulations but require more computation. Theta = 0.5 is typically used as a good balance between accuracy and performance.

Simulation Pipeline

Each simulation frame executes a sequence of steps, with each step implemented through one or more optimized CUDA kernels:

Bounding Box Computation: Parallel reduction across all particles to compute the axis-aligned bounding box that defines the octree's root node spatial extent.
Morton Code Generation: Convert each particle's 3D position into a 30-bit Morton code (Z-order curve) to create a 1D ordering that preserves spatial locality.
Particle Sorting: Sort particles by Morton code using parallel radix sort, grouping spatially adjacent particles together in memory for efficient octree construction.
Morton Code Fusion: Identify and merge particles with duplicate Morton codes to prevent degenerate cases.
Octree Construction: Build the hierarchical octree structure using Tero Karras's fully parallel algorithm, enabling construction without sequential dependencies.
Force Computation and Integration: Traverse the octree for each particle to compute gravitational forces using the Barnes-Hut approximation criterion, then integrate to update velocities and positions. This step dominates execution time despite aggressive optimizations (adapted from this paper) due to the irregular, data-dependent nature of tree traversal.

Build Instructions

Note

This project currently supports Linux only.

Prerequisites

The following dependencies are required to build the project:

CUDA Toolkit
Vulkan SDK
GLFW3
CMake 3.20 or higher
C++17 compatible compiler

Build Steps

Clone the repository:

git clone https://github.com/MouadHaikal/CUDA-Barnes-Hut
cd CUDA-Barnes-Hut

A build script is available for convenience:

./build.sh

Run the GPU simulation:

./bin/gpu/NBodyGPU

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
assets/fonts		assets/fonts
screenshots		screenshots
shaders		shaders
src		src
vendor		vendor
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md
build.sh		build.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CUDA Barnes-Hut Simulation

Table of Contents

Overview

Features

Performance Benchmarks

Performance Comparison: RTX 3050 Ti vs NVIDIA A100

Theta Parameter Impact on Performance

Simulation Pipeline

Build Instructions

Prerequisites

Build Steps

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

CUDA Barnes-Hut Simulation

Table of Contents

Overview

Features

Performance Benchmarks

Performance Comparison: RTX 3050 Ti vs NVIDIA A100

Theta Parameter Impact on Performance

Simulation Pipeline

Build Instructions

Prerequisites

Build Steps

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages