I work across model efficiency, training systems, and GPU kernels to build hardware-aware LLMs under real-world compute constraints.
- Hardware-aware LLM training (modeling, numerics, systems)
- GPU kernel optimization (CUDA, CuTe, Triton)
- Distributed training with TorchTitan, Megatron-LM, Transformer Engine, and FlashAttention
- 📌 Megatron-LM PR #3345: Improved the fused linear cross-entropy path to reduce training-efficiency bottlenecks caused by materializing the full logits tensor and the memory traffic that follows (a minimal sketch of the general idea is below).
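
The sketch below is a hedged, PyTorch-level illustration of the general chunking idea, not the PR's implementation or Megatron-LM's API: compute the output projection and the cross-entropy loss one chunk of tokens at a time, so the full `[tokens, vocab]` logits tensor is never materialized at once. All function names, shapes, and sizes here are illustrative assumptions.

```python
# Illustrative sketch only: chunked linear + cross-entropy that avoids building
# the full [tokens, vocab] logits tensor in one shot. Names and sizes are made up.
import torch
import torch.nn.functional as F


def chunked_linear_cross_entropy(hidden, weight, labels, chunk_size=1024):
    """Mean cross-entropy of (hidden @ weight.T) against labels,
    materializing logits only one chunk of rows at a time."""
    total_loss = hidden.new_zeros(())
    n_tokens = hidden.shape[0]
    for start in range(0, n_tokens, chunk_size):
        end = min(start + chunk_size, n_tokens)
        # Only a [chunk, vocab] slice of logits exists at any point.
        logits = hidden[start:end] @ weight.T
        total_loss += F.cross_entropy(logits, labels[start:end], reduction="sum")
    return total_loss / n_tokens


# Usage with illustrative (hypothetical) sizes.
hidden = torch.randn(8192, 1024)            # [tokens, hidden_dim]
weight = torch.randn(50257, 1024)           # [vocab, hidden_dim]
labels = torch.randint(0, 50257, (8192,))   # [tokens]
loss = chunked_linear_cross_entropy(hidden, weight, labels)
```

Fusing the projection and the loss per chunk trades one large activation for several small, short-lived ones, which is where the memory-traffic savings come from.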


