This module covers advanced GPU programming, focusing on the fundamental parallel algorithm patterns and optimization techniques essential for high-performance computing.
- Master fundamental parallel algorithm patterns (reduction, scan, sorting)
- Understand algorithm complexity trade-offs (work vs depth complexity)
- Implement efficient stencil and convolution computations
- Apply modern CUDA cooperative groups programming model
- Design scalable algorithms for large datasets
- Analyze and optimize parallel algorithm performance
- Completion of Module 1 (GPU Programming Foundations)
- Completion of Module 2 (Advanced Memory Management)
- Understanding of basic parallel computing concepts
- Familiarity with algorithm complexity analysis
- content.md - Comprehensive guide covering all advanced algorithm patterns
- Naive parallel reduction implementation
- Optimized reduction with shared memory (sketched below)
- Modern warp-level reductions using cooperative groups
- Multi-pass reduction for large datasets
- Performance comparison and analysis
Key Concepts:
- Thread divergence optimization
- Bank conflict avoidance
- Warp-level primitives
- Work complexity vs depth complexity
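As a concrete illustration of the divergence and bank-conflict points above, here is a minimal block-level reduction sketch using sequential addressing; the kernel name and `BLOCK_SIZE` are illustrative, not the module's actual code:

```cuda
#define BLOCK_SIZE 256  // illustrative block size

__global__ void sharedMemReduce(const float *in, float *out, int n) {
    __shared__ float sdata[BLOCK_SIZE];
    unsigned tid = threadIdx.x;
    unsigned i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;  // pad out-of-range threads with the identity
    __syncthreads();

    // Sequential addressing: active threads stay contiguous, so whole warps
    // retire together (no divergence) and shared-memory loads stay conflict-free
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];  // one partial sum per block
}
```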
- Naive inclusive/exclusive scan
- Hillis-Steele scan (O(n log n) work, O(log n) depth)
- Blelloch scan (O(n) work, O(log n) depth; sketched below)
- Segmented scan for multiple sequences
- Large array scanning strategies
- AMD wavefront-optimized implementations
Key Concepts:
- Work-efficient vs step-efficient algorithms
- Up-sweep and down-sweep phases
- Multi-level scanning approaches
- Wavefront vs warp optimizations
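To make the up-sweep/down-sweep structure concrete, here is the classic single-block Blelloch exclusive scan, a sketch assuming n is a power of two, one thread per element pair, and `n * sizeof(float)` bytes of dynamic shared memory:

```cuda
__global__ void blellochScan(float *data, int n) {
    extern __shared__ float temp[];
    int tid = threadIdx.x;
    temp[2 * tid]     = data[2 * tid];
    temp[2 * tid + 1] = data[2 * tid + 1];

    // Up-sweep (reduce) phase: build partial sums in place; O(n) total adds
    int offset = 1;
    for (int d = n >> 1; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset <<= 1;
    }
    if (tid == 0) temp[n - 1] = 0;  // clear the root: exclusive scan

    // Down-sweep phase: push prefixes back down the implicit tree; O(log n) steps
    for (int d = 1; d < n; d <<= 1) {
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            float t  = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();
    data[2 * tid]     = temp[2 * tid];
    data[2 * tid + 1] = temp[2 * tid + 1];
}

// Launch (single block): blellochScan<<<1, n / 2, n * sizeof(float)>>>(data, n);
```

Arrays larger than one block follow the multi-level pattern listed above: scan each block, scan the per-block sums, then add each scanned sum back into its block.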
- Bitonic sorting network (compare-exchange step sketched below)
- Parallel comparison-based sorting
- Radix sort implementation
- Odd-even sort for small arrays
- Performance comparison with CPU algorithms
Key Concepts:
- Sorting networks
- Data-parallel sorting strategies
- Algorithm scalability analysis
- Cross-platform optimizations
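The bitonic network reduces to a single compare-exchange kernel launched once per (k, j) stage; this sketch assumes a power-of-two n with one thread per element (names are illustrative):

```cuda
// One compare-exchange step of the bitonic sorting network
__global__ void bitonicStep(float *data, int j, int k) {
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned partner = i ^ j;               // index to compare against
    if (partner > i) {
        bool ascending = ((i & k) == 0);    // direction of this bitonic subsequence
        if ((data[i] > data[partner]) == ascending) {
            float t = data[i]; data[i] = data[partner]; data[partner] = t;
        }
    }
}

// Host side drives the network, O(log^2 n) stages in total:
//     for (int k = 2; k <= n; k <<= 1)
//         for (int j = k >> 1; j > 0; j >>= 1)
//             bitonicStep<<<n / 256, 256>>>(data, j, k);
```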
- 1D stencil computations with shared memory
- 2D convolution with halo region handling
- 3D stencil operations
- Separable convolution optimization (row pass sketched below)
- Boundary condition management
Key Concepts:
- Halo region loading
- Memory coalescing in stencil patterns
- Shared memory tiling strategies
- Separable filters for performance
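To show why separable filters pay off (a k×k 2D filter becomes two k-tap 1D passes), here is a hypothetical row-pass kernel with clamp-to-edge boundaries; `RADIUS` and `d_kernel` are illustrative assumptions, and a matching column pass completes the filter:

```cuda
#define RADIUS 4  // illustrative filter radius

__constant__ float d_kernel[2 * RADIUS + 1];  // 1D filter taps, uploaded by the host

__global__ void convolveRows(const float *in, float *out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int k = -RADIUS; k <= RADIUS; ++k) {
        int xx = min(max(x + k, 0), width - 1);  // clamp-to-edge boundary handling
        sum += in[y * width + xx] * d_kernel[k + RADIUS];
    }
    out[y * width + x] = sum;
}
```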
- Tiled matrix multiplication with shared memory (sketched below)
- Matrix transpose with bank conflict avoidance
- Matrix-vector multiplication using cooperative groups
- Performance analysis and GFLOPS calculation
Key Concepts:
- Shared memory tiling patterns
- Memory bandwidth optimization
- Cooperative groups programming
- Performance metrics calculation
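A minimal sketch of the shared-memory tiling pattern for matrix multiply; `TILE` is illustrative, and N is assumed to be a multiple of TILE:

```cuda
#define TILE 16  // illustrative tile width

__global__ void tiledMatMul(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Stage one TILE x TILE block of A and B per iteration: each global
    // element is then loaded N/TILE times instead of N times
    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```

For the GFLOPS calculation, an N×N multiply performs 2·N³ floating-point operations, so GFLOPS = 2·N³ / (elapsed seconds × 10⁹).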
- Breadth-First Search (BFS) with frontier queues (one level sketched below)
- Single-Source Shortest Path (SSSP)
- Connected Components labeling
- Triangle counting
- PageRank power iteration
Key Concepts:
- CSR graph representation
- Frontier-based graph traversal
- Atomic operations for graph updates
- Load balancing in irregular computations
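A sketch of one frontier-expansion step over a CSR graph; the structure (a level array doubling as a visited set, atomics building the next frontier) mirrors the concepts above, though the names and signature are illustrative:

```cuda
// Expand the current BFS frontier by one level
__global__ void bfsLevel(const int *rowPtr, const int *colIdx,
                         const int *frontier, int frontierSize,
                         int *level, int curLevel,
                         int *nextFrontier, int *nextSize) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontierSize) return;

    int v = frontier[i];
    for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
        int u = colIdx[e];
        // Atomically claim unvisited neighbors (level == -1) so each
        // vertex is enqueued exactly once despite concurrent discoverers
        if (atomicCAS(&level[u], -1, curLevel + 1) == -1) {
            int pos = atomicAdd(nextSize, 1);
            nextFrontier[pos] = u;
        }
    }
}
```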
- Modern CUDA cooperative groups API
- Warp-level primitives and shuffle operations
- Block-level and grid-level coordination (grid-wide sync sketched below)
- Advanced synchronization patterns
- Multi-GPU cooperative kernels
Key Concepts:
- Cooperative groups hierarchy
- Warp-level programming
- Modern CUDA synchronization
- Multi-device coordination
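Beyond the warp level, cooperative groups expose grid-wide synchronization; a minimal sketch, assuming the kernel is launched with cudaLaunchCooperativeKernel on a device that supports cooperative launch:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Two dependent phases in one kernel instead of two kernel launches
__global__ void twoPhase(float *in, float *out, int n) {
    cg::grid_group grid = cg::this_grid();
    auto i = grid.thread_rank();

    if (i < n) in[i] *= 2.0f;                 // phase 1: scale in place
    grid.sync();                              // every block in the grid waits here
    if (i < n)
        out[i] = in[i] + in[(i + 1) % n];     // phase 2: safely reads phase-1 results
}
```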
- HIP Examples: Cross-platform implementations supporting both AMD and NVIDIA GPUs
- Platform-specific optimizations for wavefront (64 threads) vs warp (32 threads) architectures (a portable reduction is sketched below)
- Complete HIP implementations for reduction, scan, sorting, and convolution algorithms
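A sketch of a wavefront/warp-portable reduction in HIP: the built-in `warpSize` resolves to 32 on NVIDIA and 64 on AMD GCN/CDNA hardware, so the same loop works on both (names are illustrative):

```cpp
#include <hip/hip_runtime.h>

__global__ void portableWarpReduce(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float value = (i < n) ? in[i] : 0.0f;

    // Halve the shuffle distance each step until lane 0 holds the partial sum;
    // warpSize adapts the loop to warp (32) or wavefront (64) execution
    for (int offset = warpSize / 2; offset > 0; offset >>= 1)
        value += __shfl_down(value, offset);

    if ((threadIdx.x % warpSize) == 0)
        atomicAdd(out, value);  // one atomic per warp/wavefront
}
```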
```bash
# Build all CUDA examples
make cuda

# Build all HIP examples (requires HIP compiler)
make hip

# Build specific example
make 01_reduction_algorithms_cuda
```

```bash
# Run all tests
make test

# Run performance benchmarks
make test_performance

# Test compilation only
make test_compile
```

```bash
# Reduction algorithms
make reduction
./01_reduction_algorithms_cuda

# Scan algorithms
make scan
./02_scan_prefix_sum_cuda

# Sorting algorithms
make sorting
./03_sorting_algorithms_cuda

# Convolution/Stencil
make convolution
./04_convolution_stencil_cuda

# Matrix operations
make matrix
./05_matrix_operations_cuda

# Graph algorithms
make graph
./06_graph_algorithms_cuda

# Cooperative groups
make cooperative
./07_cooperative_groups_cuda

# HIP examples (cross-platform)
make scan_hip
./02_scan_prefix_sum_hip
```

| Algorithm | Naive | Optimized | Work Complexity | Depth Complexity |
|---|---|---|---|---|
| Reduction | O(n²) | O(n) | O(n) | O(log n) |
| Scan (Hillis-Steele) | O(n²) | O(n log n) | O(n log n) | O(log n) |
| Scan (Blelloch) | - | O(n) | O(n) | O(log n) |
| Bitonic Sort | - | O(n log² n) | O(n log² n) | O(log² n) |
| Matrix Multiply | O(n³) | O(n³) | O(n³) | O(log n) |
| BFS | O(V + E) | O(V + E) | O(V + E) | O(D) |
| Triangle Count | O(V³) | O(E^1.5) | O(E^1.5) | O(log V) |
Use the provided benchmarking targets:

```bash
# Run performance analysis
make test_performance

# Show profiling commands
make profile_algorithms

# Algorithm complexity analysis
make analyze_complexity
```

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void modernReduction(float *data) {
    auto block = cg::this_thread_block();
    auto warp = cg::tiled_partition<32>(block);

    // Each thread contributes one element
    float value = data[block.group_index().x * block.size() + block.thread_rank()];

    // Warp-level reduction using shuffle
    for (int offset = 16; offset > 0; offset >>= 1) {
        value += warp.shfl_down(value, offset);
    }
    // Lane 0 of each warp now holds that warp's partial sum
}
```

```cuda
__global__ void optimizedStencil(const float *input, float *output) {
    __shared__ float shared[BLOCK_SIZE + 2 * RADIUS];
    int gid = blockIdx.x * blockDim.x + threadIdx.x, lid = threadIdx.x + RADIUS;
    shared[lid] = input[gid];                            // load this block's data
    if (threadIdx.x < RADIUS) {                          // plus the halo regions
        shared[lid - RADIUS] = input[gid - RADIUS];      // (interior blocks; boundary
        shared[lid + BLOCK_SIZE] = input[gid + BLOCK_SIZE];  // handling omitted)
    }
    __syncthreads();
    float result = 0.0f;                                 // apply the stencil from shared
    for (int i = -RADIUS; i <= RADIUS; ++i)              // memory, minimizing global
        result += shared[lid + i];                       // memory accesses
    output[gid] = result;
}
```

```cuda
// Handle large datasets with multiple kernel launches: each pass leaves one
// partial sum per block, so n shrinks by a factor of THREADS_PER_BLOCK
// until a single value remains
void largeArrayReduction(float *data, int n) {
    while (n > 1) {
        int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
        reductionKernel<<<blocks, THREADS_PER_BLOCK>>>(data, n);
        n = blocks;
    }
}
```

```bash
# Throughput analysis
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed ./01_reduction_algorithms_cuda

# Memory analysis
ncu --metrics dram__throughput.avg.pct_of_peak_sustained_elapsed ./04_convolution_stencil_cuda

# Shared memory bank conflicts
ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum ./02_scan_prefix_sum_cuda
```

- Work Complexity: Total operations performed
- Depth Complexity: Critical path length
- Memory Complexity: Space requirements
- Communication Complexity: Data movement costs
- Memory Access Patterns: Coalescing and caching
- Thread Divergence: Minimizing warp divergence
- Resource Utilization: Occupancy optimization
- Algorithm Selection: Choosing appropriate algorithms for problem size
- Large Dataset Handling: Multi-pass algorithms
- Multi-GPU Scaling: Algorithm decomposition
- Memory Hierarchy: Utilizing all levels of GPU memory
- Compilation Errors
  - Ensure the CUDA/HIP toolkit is properly installed
  - Check compute capability requirements
  - Verify C++11 support
- Runtime Errors
  - Check array bounds and memory allocation
  - Verify kernel launch parameters
  - Use the CUDA_CHECK macro for error handling (a typical definition is sketched below)
- Performance Issues
  - Profile memory access patterns
  - Check for thread divergence
  - Analyze occupancy and resource usage
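For reference, CUDA_CHECK-style macros are typically defined along these lines; the module's actual macro may differ in name or behavior:

```cuda
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc(&d_ptr, bytes));
```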
- CUDA Examples: NVIDIA GPU with compute capability 5.0+
- HIP Examples: AMD GPU (GCN architecture) or NVIDIA GPU
- Memory: Examples scale with available GPU memory
- CUDA Programming Guide (Algorithm Patterns section)
- "Programming Massively Parallel Processors" by Kirk & Hwu
- NVIDIA Performance Tuning Guide
- AMD ROCm Documentation for HIP programming
- Implement Custom Algorithms: Create your own reduction or scan variants
- Performance Optimization: Optimize existing examples for your specific hardware
- Algorithm Comparison: Benchmark different algorithmic approaches
- Multi-GPU Scaling: Extend examples to use multiple GPUs
- Real-World Applications: Apply patterns to specific problem domains
Note: This module provides both educational implementations (showing algorithm progression) and optimized versions. Focus on understanding the concepts before optimizing for specific use cases.