Skip to content

MouadHaikal/CUDA-Barnes-Hut

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

42 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CUDA Barnes-Hut Simulation

A high-performance, fully parallelized 3D N-body gravitational simulation implementing the Barnes-Hut algorithm entirely in CUDA with real-time Vulkan visualization. Benchmarked on Africa's fastest supercomputer, Toubkal.

Simulation Demo

Table of Contents

Overview

This project implements the Barnes-Hut algorithm for N-body gravitational simulations in 3D with complete GPU parallelization. The Barnes-Hut algorithm reduces the computational complexity of N-body simulations from $O(N^2)$ to $O(N \log N)$ by approximating distant particle clusters as single masses using an octree spatial hierarchy.

The simulation runs entirely on the GPU through a series of highly optimized CUDA kernels, with real-time visualization rendered through Vulkan. An ImGui interface provides interactive control over simulation parameters, rendering settings, and displays real-time performance metrics.

Features

  • Fully Parallel GPU Implementation: All simulation steps execute on the GPU using custom CUDA kernels optimized for modern GPU architectures
  • Real-Time Vulkan Rendering: Seamless CUDA-Vulkan interoperability, keeping all simulation data on the GPU throughout execution
  • Interactive Controls: ImGui-based interface for adjusting simulation parameters, visualization settings, and monitoring performance metrics in real time
  • Optimized Octree Construction: Implementation based on Tero Karras's research on maximizing parallelism in the construction of trees, achieving significant performance improvements over the traditional dynamic parallelism approaches
  • CPU Baseline Implementation: All-pairs brute force CPU version for performance comparison and correctness validation
  • Benchmark Ready: Tested on consumer hardware (laptop RTX 3050 Ti) and HPC systems (NVIDIA A100 on Toubkal supercomputer)

Performance Benchmarks

Performance measurements demonstrate the scalability of the implementation across different GPU architectures. The benchmarks were conducted using NVIDIA NSight Compute for detailed kernel profiling.

Performance Comparison: RTX 3050 Ti vs NVIDIA A100

Benchmark conducted with Barnes-Hut theta parameter = 0.5

Particle Count RTX 3050 Ti (Laptop) NVIDIA A100 (Toubkal) Speedup
100,000 4.3 ms 1.7 ms x2.53
500,000 21.3 ms 4.4 ms x4.84
1,000,000 35.4 ms 8.5 ms x4.16
5,000,000 182.0 ms 38.3 ms x4.75
10,000,000 370.1 ms 68.8 ms x5.37
50,000,000 (Insufficient Memory) 380.5 ms ____

Theta Parameter Impact on Performance

Benchmark conducted with 1,000,000 particles on both architectures

Theta (θ) RTX 3050 Ti NVIDIA A100 Description
0.25 183.1 ms 29.6 ms Higher accuracy, slower performance
0.50 35.4 ms 8.5 ms Balanced accuracy and performance
0.75 20.1 ms 5.1 ms Lower accuracy, faster performance
1.00 11.8 ms 3.2 ms Aggressive approximations

Note

Lower theta values result in more accurate simulations but require more computation. Theta = 0.5 is typically used as a good balance between accuracy and performance.

Simulation Pipeline

Each simulation frame executes a sequence of steps, with each step implemented through one or more optimized CUDA kernels:

  1. Bounding Box Computation: Parallel reduction across all particles to compute the axis-aligned bounding box that defines the octree's root node spatial extent.

  2. Morton Code Generation: Convert each particle's 3D position into a 30-bit Morton code (Z-order curve) to create a 1D ordering that preserves spatial locality.

  3. Particle Sorting: Sort particles by Morton code using parallel radix sort, grouping spatially adjacent particles together in memory for efficient octree construction.

  4. Morton Code Fusion: Identify and merge particles with duplicate Morton codes to prevent degenerate cases.

  5. Octree Construction: Build the hierarchical octree structure using Tero Karras's fully parallel algorithm, enabling construction without sequential dependencies.

  6. Force Computation and Integration: Traverse the octree for each particle to compute gravitational forces using the Barnes-Hut approximation criterion, then integrate to update velocities and positions. This step dominates execution time despite aggressive optimizations (adapted from this paper) due to the irregular, data-dependent nature of tree traversal.

Build Instructions

Note

This project currently supports Linux only.

Prerequisites

The following dependencies are required to build the project:

  • CUDA Toolkit
  • Vulkan SDK
  • GLFW3
  • CMake 3.20 or higher
  • C++17 compatible compiler

Build Steps

Clone the repository:

git clone https://github.com/MouadHaikal/CUDA-Barnes-Hut
cd CUDA-Barnes-Hut

A build script is available for convenience:

./build.sh

Run the GPU simulation:

./bin/gpu/NBodyGPU

Releases

No releases published

Packages

 
 
 

Contributors