CUDA programming practice project: hands-on CUDA kernels and performance optimization, covering GEMM, FlashAttention, Tensor Cores, CUTLASS, quantization, KV cache, NCCL, and profiling.
Updated May 11, 2026 - Cuda
A simple profiler to count NVIDIA PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.
A GPU performance prediction toolkit for CUDA programs
A Python script for plotting roofline analyses. Intel Advisor style.
Code Comprehension Assistance for Evidence-Based performance Tuning
High-performance Sobel edge detection using CUDA with CPU vs GPU benchmarking, roofline analysis, and Nsight profiling.
Fork of the CS Roofline Toolkit from Berkeley Lab
CLI tool for estimating compute, memory bandwidth, and operational intensity of transformer models from Hugging Face configuration files. Ideal for performance and hardware deployment analysis.
Repository for the research paper, "District-scale life cycle costing of climate-neutral retrofits, based on automatic envelope detection workflows and LOD3 3D city model".
High-performance CUDA matrix multiplication kernels - shared memory tiling, register blocking, Roofline Model analysis. Benchmarked against cuBLAS.
A practical model (with math + Python) to tell if you’re compute-, memory-, or network-bound, and what to buy next
Roofline analysis of BitNet b1.58 2B4T inference across H100, MI300X, Groq LPU, Cerebras WSE-3, and two hypothetical ternary chips. Five phases covering gate counts, precision sweeps, memory bandwidth, hybrid activation Pareto, and prefill regime, plus empirical weight-distribution validation against the published microsoft/bitnet weights.
CPU microbenchmark suite for x86-64 AVX2/FMA and AArch64 NEON. Measures latency, bandwidth, peak FLOPS, builds roofline models, and predicts GEMM tile sizes.
Python lab for exploring memory bandwidth, cache effects, and locality in accelerator workloads
Profiling, Benchmarking and Analysis of Numerical Kernels derived from common Supervised Machine Learning Algorithms on consumer grade computing hardware.