My annotated papers, slides, and meeting recordings for the EleutherAI ML Scalability & Performance research paper reading group.
Sessions:
- Intro to GPU architecture, CUDA, NCCL, and common ML performance bottlenecks
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (the blockwise-softmax idea behind it is sketched after this list)
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
- Ring Attention with Blockwise Transformers for Near-Infinite Context Length
- Efficient Memory Management for Large Language Model Serving with PagedAttention
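
The FlashAttention and Ring Attention sessions both revolve around computing exact softmax attention one key/value block at a time, carrying a running max and normalizer so the full score matrix never has to be materialized. The NumPy sketch below illustrates only that blockwise online-softmax idea; the function name, shapes, and block size are my own illustrative assumptions, not code from either paper.

```python
import numpy as np

def blockwise_attention(q, k, v, block_size=128):
    """Exact softmax attention computed one key/value block at a time.

    q, k, v: arrays of shape (n, d). Only O(n * block_size) scores are
    materialized at once; a running max and running denominator keep the
    result identical to full softmax attention.
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, d))                    # running weighted sum of V rows
    running_max = np.full(n, -np.inf)         # per-query max score seen so far
    running_den = np.zeros(n)                 # per-query softmax denominator

    for start in range(0, n, block_size):
        kb = k[start:start + block_size]      # (b, d) key block
        vb = v[start:start + block_size]      # (b, d) value block
        scores = (q @ kb.T) * scale           # (n, b) scores against this block

        new_max = np.maximum(running_max, scores.max(axis=1))
        correction = np.exp(running_max - new_max)   # rescale old accumulators
        p = np.exp(scores - new_max[:, None])        # unnormalized block probs

        out = out * correction[:, None] + p @ vb
        running_den = running_den * correction + p.sum(axis=1)
        running_max = new_max

    return out / running_den[:, None]

if __name__ == "__main__":
    # Sanity check against attention computed with the full score matrix.
    rng = np.random.default_rng(0)
    q, k, v = rng.standard_normal((3, 512, 64))
    s = (q @ k.T) / np.sqrt(64)
    full = np.exp(s - s.max(axis=1, keepdims=True))
    full = (full / full.sum(axis=1, keepdims=True)) @ v
    assert np.allclose(blockwise_attention(q, k, v), full)
```

The same per-block rescaling is what Ring Attention leans on when key/value blocks are rotated between devices: each device only ever holds its local block while the accumulators are corrected on the fly.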