Wave February 2025 Release #362

Open · 8 of 58 tasks
harsh-nod opened this issue Jan 6, 2025 · 0 comments

Milestones

  • Integrate decode attention kernel into sglang
  • Performance on flash decoding
  • Performance on vanilla attention
  • Functionality & Performance on backward attention

FlashDecoding

  • Flash Decoding Kernel with paged attention
  • Flash Decoding Kernel without paged attention with 32x32x8 (tile size of 64 for K2 dimension) (Harsh)
  • Flash Decoding Kernel without paged attention with Dynamic Dims (Harsh)
  • Performance optimizations
  • Representing the dynamic dimension in mapping as an iterator (Ivan)
  • Sample kernel showing how paged memory works (PyTorch kernel with a fake page table and fake queries that we can use for functionality testing; how it's done purely in PyTorch; see the sketch after this list) (Harsh)
  • Add support for buffer loads (masked loads/stores, gather/scatter) (Ivan) (check if needed, based on BLOCK_DPE)
  • tkw conditional support
  • tkw set symbol and apply expr operators
  • Reordering loads/stores during promotion
  • SGLang integration (new Wave backend)
  • Make paged attention kernels have dynamic symbols to avoid recompiling
  • Prefill attention kernel
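
A minimal PyTorch sketch of the paged-memory item above (shapes, names, and the page-table layout are illustrative assumptions, not the actual Wave or sglang kernel): a fake page table maps each sequence's logical KV blocks to physical cache pages, and decode-time attention gathers those pages before the usual softmax(QK^T)V.

```python
import torch

# Illustrative sizes (assumptions, not the shapes used in Wave/sglang).
B, H, D = 2, 4, 64          # batch, heads, head dim
PAGE, N_PAGES = 16, 32      # tokens per page, physical pages in the cache
MAX_LEN = 48                # max sequence length covered by the fake page table

k_cache = torch.randn(N_PAGES, PAGE, H, D)
v_cache = torch.randn(N_PAGES, PAGE, H, D)
# Fake page table: which physical page holds each logical block of each sequence.
page_table = torch.randint(0, N_PAGES, (B, MAX_LEN // PAGE))
seq_lens = torch.tensor([37, 45])   # actual KV length per sequence
q = torch.randn(B, H, D)            # one new query token per sequence (decode step)

def paged_decode_reference(q, k_cache, v_cache, page_table, seq_lens):
    outs = []
    for b in range(q.shape[0]):
        n = int(seq_lens[b])
        pages = page_table[b, : (n + PAGE - 1) // PAGE]
        k = k_cache[pages].reshape(-1, H, D)[:n]   # gather pages, trim to true length
        v = v_cache[pages].reshape(-1, H, D)[:n]
        scores = torch.einsum("hd,nhd->hn", q[b], k) / D ** 0.5
        probs = scores.softmax(dim=-1)
        outs.append(torch.einsum("hn,nhd->hd", probs, v))
    return torch.stack(outs)

out = paged_decode_reference(q, k_cache, v_cache, page_table, seq_lens)
print(out.shape)  # torch.Size([2, 4, 64])
```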

EvoFormer

  • Larger global loads
  • Vectorized reads
  • FAv3 - FP8
  • FAv3 - Scheduling
  • FAv3 - Wave specialization (set_prio)
  • Scalar support
  • Transpose using Shuffles
  • Performance nightly CI (different machine; clarify what is being tested); add to iree-kernel-benchmark?
  • Backward Flash Attention

Multi-Buffering

  • Generate hipBLASLt kernels that are faster than Wave for the shapes of interest, to use as a reference
  • GEMM kernel working with double buffering (see the sketch after this list)
  • Language Integration
  • Fully functional multi-buffering approach for GEMMs
  • Performance evaluation
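
A minimal NumPy sketch of the double-buffering pattern referenced above (function name, tile size, and buffer handling are illustrative assumptions, not the Wave GEMM kernel): while the current K-tile feeds the matmul, the next K-tile is already staged in the other buffer, which is the mechanism that hides load latency on the GPU.

```python
import numpy as np

def gemm_double_buffered(A, B, tile_k=32):
    """Illustrative double-buffered GEMM over the K dimension (host-side model)."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    buf = [None, None]
    # Prologue: fill buffer 0 with the first K-tile.
    buf[0] = (A[:, :tile_k].copy(), B[:tile_k, :].copy())
    for i, k0 in enumerate(range(0, K, tile_k)):
        a_tile, b_tile = buf[i % 2]
        nxt = k0 + tile_k
        if nxt < K:
            # "Prefetch" the next K-tile into the other buffer before computing.
            buf[(i + 1) % 2] = (A[:, nxt:nxt + tile_k].copy(), B[nxt:nxt + tile_k, :].copy())
        C += a_tile @ b_tile   # compute on the current buffer
    return C

A = np.random.rand(64, 128).astype(np.float32)
B = np.random.rand(128, 96).astype(np.float32)
assert np.allclose(gemm_double_buffered(A, B), A @ B, atol=1e-3)
```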

Benchmarking

  • Benchmarking using GitHub Actions
  • Upgrading to the latest IREE version
  • Failures on iree-kernel-benchmark

Tech Debt

  • Rewrite expansion
  • Switch CI to use venv
  • IGEMM compilation failures

IGEMM

  • Shared memory data shuffle
  • Scalarizing the gather
  • Move gather from global to shared
  • Reduce the number of shared memory barriers
  • Improve test case coverage (different dtypes, mfma intrinsics, shapes, etc.)
  • Enable scheduling
  • BF16 MFMA Intrinsics
  • Fix failures on main
  • Perf-ci

De-Prioritized

  • Packed Shuffles
  • Linear offset has to be added (linear offset = 1.0 / max representable number in the fp format; see the worked example after this list)
  • Extend Attention (split-k vs warp reduction)
  • Prefill Attention
  • Update Paper
  • Debugger support (add breakpoints and inspect stack on GPU)
  • Profiling support
  • Ensure that mappings modify the index sequence
  • GEMM Non-temporal loads
  • GEMM + SiLU fusion kernel
  • MoE Kernel
  • Parallel compile and then run
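
A small worked example of the linear-offset formula quoted in the item above (the helper name is hypothetical; the formula 1.0 / max-representable-value is taken directly from the list item):

```python
import torch

# Hypothetical helper: linear offset = 1.0 / (max representable value of the fp format).
def linear_offset(dtype: torch.dtype) -> float:
    return 1.0 / torch.finfo(dtype).max

print(linear_offset(torch.float16))  # 1 / 65504    ~= 1.53e-05
print(linear_offset(torch.float32))  # 1 / 3.40e+38 ~= 2.94e-39
```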

Week 1

Week 2

Week 3

  • Finish conditional support, integrate it into the paged decode attention kernel, and check that it works for small sequence lengths
  • Land final changes for PDA
  • Performance benchmarking on medium and large sizes
  • Identify areas of optimization
  • Enable buffer loads

Goal: Performance improvements

Week 4

Goal: Replace the existing kernel in sglang
