This module focuses on mastering the GPU memory hierarchy and optimizing memory performance: shared memory tiling, memory coalescing, texture/read-only memory usage, unified memory, and bandwidth optimization.
After completing this module, you will be able to:
- Organize GPU threads in multidimensional grids
- Map threads efficiently to data structures
- Implement image processing algorithms on GPU
- Create optimized matrix multiplication kernels
- Handle boundary conditions in multidimensional algorithms
- content.md - Complete module content
- examples/ - Practical code examples
- NVIDIA GPU with CUDA support OR AMD GPU with ROCm support
- CUDA Toolkit 13.0+ or ROCm 7.0+ (Docker images provide CUDA 13.0.1 and ROCm 7.0)
- C/C++ compiler (GCC, Clang, or MSVC)
Recommended: use our Docker dev environment.

```bash
./docker/scripts/run.sh --auto
cd modules/module2/examples
make   # auto-detects your GPU and builds accordingly

# Run a few examples (binaries in build/)
./build/01_shared_memory_transpose_cuda   # or _hip on AMD
./build/02_memory_coalescing_cuda         # or _hip on AMD
./build/04_unified_memory_cuda
```

- 2D and 3D thread block configurations
- Grid size calculations for arbitrary data sizes
- Thread-to-data mapping strategies
- Coalesced vs strided access
- Structure of Arrays vs Array of Structures
- Read-only/texture cache benefits
- Tiled transpose with bank-conflict avoidance
- Block-level cooperation and synchronization
- Padding strategies to avoid bank conflicts
- Unified memory prefetch and advice
- Measuring and optimizing memory bandwidth
- Analyzing profiler metrics for memory performance
Duration: 6-8 hours
Difficulty: Beginner-Intermediate
Prerequisites: Module 1 completion