Milestones
FlashDecoding
Flash Decoding kernel without paged attention, using the 32x32x8 intrinsic (tile size of 64 for the K2 dimension) (Harsh)
Flash Decoding kernel without paged attention, with dynamic dims (Harsh)
Performance optimizations
Representing the dynamic dimension in the mapping as an iterator (Ivan)
Sample kernel showing how paged memory works: a pure-PyTorch kernel with a fake page table and fake queries that we can use for functionality testing (see the sketch after this list) (Harsh)
Add support for buffer loads (masked loads/stores, gather/scatter) (Ivan) (check whether this is needed, based on BLOCK_DPE)
tkw conditional support
tkw set symbol and apply expr operators
Reordering loads/stores during promotion
SGLang integration (new Wave backend)
Make the paged attention kernels use dynamic symbols to avoid recompiling per shape
Prefill attention kernel
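A minimal sketch of what the pure-PyTorch paged-memory reference above could look like, assuming a decode-style setup; the page size, shapes, and tensor names here are illustrative placeholders, not the actual test kernel:

```python
# Illustrative pure-PyTorch reference for paged KV-cache attention (decode step).
# The page size, shapes, and names below are made-up assumptions for this sketch.
import torch

torch.manual_seed(0)
num_seqs, num_heads, head_dim = 2, 4, 64
page_size, num_pages = 16, 32
seq_lens = torch.tensor([40, 25])                     # KV length per sequence
max_pages_per_seq = int((seq_lens.max() + page_size - 1) // page_size)

# Physical paged KV cache: [num_pages, page_size, num_heads, head_dim].
k_cache = torch.randn(num_pages, page_size, num_heads, head_dim)
v_cache = torch.randn(num_pages, page_size, num_heads, head_dim)

# Fake page table: logical block i of sequence s lives in physical page page_table[s, i].
page_table = torch.randperm(num_pages)[: num_seqs * max_pages_per_seq]
page_table = page_table.reshape(num_seqs, max_pages_per_seq)

# Fake queries: one new token per sequence (decode).
q = torch.randn(num_seqs, num_heads, head_dim)

out = torch.empty_like(q)
for s in range(num_seqs):
    n = int(seq_lens[s])
    pages = page_table[s, : (n + page_size - 1) // page_size]
    # Gather this sequence's logical K/V from its physical pages.
    k = k_cache[pages].reshape(-1, num_heads, head_dim)[:n]  # [n, H, D]
    v = v_cache[pages].reshape(-1, num_heads, head_dim)[:n]
    scores = torch.einsum("hd,nhd->hn", q[s], k) / head_dim**0.5
    out[s] = torch.einsum("hn,nhd->hd", torch.softmax(scores, dim=-1), v)
```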
EvoFormer
Larger global loads
Vectorized reads
FAv3 - FP8
FAv3 - Scheduling
FAv3 - Wave specialization (set_prio)
Scalar support
Transpose using Shuffles
Performance nightly CI (different machine; what is being tested?); add to iree-kernel-benchmark?
Backward Flash Attention
Multi-Buffering
Generate hipBLASLt kernels that are faster than Wave for the shapes of interest, to use as a reference
GEMM kernel working with double buffering (see the schematic sketch after this list)
Language Integration
Fully functional multi-buffering approach for GEMMs
Performance evaluation
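For reference, this is the ping-pong schedule the double-buffered GEMM is after, sketched in plain NumPy rather than Wave: while tile k is being multiplied out of one buffer, tile k+1 is loaded into the other, so in the real kernel the global-to-shared copy overlaps with the MFMAs. The buffer layout and tile size here are illustrative assumptions.

```python
# Schematic double-buffered GEMM tile loop in NumPy (not the actual Wave kernel).
# Two buffers alternate: while tile k feeds the multiply, tile k+1 is loaded
# into the other buffer; on the GPU that load would overlap with compute.
import numpy as np

M, N, K, TK = 128, 128, 256, 32                       # illustrative sizes
A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)
C = np.zeros((M, N), dtype=np.float32)

a_buf = [np.empty((M, TK), np.float32) for _ in range(2)]   # ping-pong buffers
b_buf = [np.empty((TK, N), np.float32) for _ in range(2)]

def load_tile(k, slot):
    """Stands in for the global -> shared-memory copy of tile k into buffer `slot`."""
    a_buf[slot][:] = A[:, k * TK:(k + 1) * TK]
    b_buf[slot][:] = B[k * TK:(k + 1) * TK, :]

num_tiles = K // TK
load_tile(0, 0)                                       # prologue: prefetch the first tile
for k in range(num_tiles):
    cur = k % 2
    if k + 1 < num_tiles:
        load_tile(k + 1, 1 - cur)                     # prefetch next tile into the other buffer
    C += a_buf[cur] @ b_buf[cur]                      # consume the current buffer

assert np.allclose(C, A @ B, rtol=1e-4)
```

In the real kernel the prefetch is asynchronous and the two buffers live in shared memory; the serial NumPy loop only shows the indexing and the buffer-swap schedule.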
Benchmarking
Benchmarking using GitHub Actions
Upgrading to the latest IREE version
Failures on iree-kernel-benchmark
Tech Debt
Rewrite expansion
Switch CI to use venv
IGEMM compilation failures
IGEMM
Shared memory data shuffle
Scalarizing the gather
Move gather from global to shared
Reduce the number of shared memory barriers
Improve test case coverage (different dtypes, mfma intrinsics, shapes, etc.)
Enable scheduling
BF16 MFMA Intrinsics
Fix failures on main
Perf-ci
De-Prioritized
Packed Shuffles
A linear offset has to be added (linear offset = 1.0 / max representable number of the fp format); see the sketch after this list
Extend Attention (split-k vs warp reduction)
Prefill Attention
Update Paper
Debugger support (add breakpoints and inspect stack on GPU)
Profiling support
Ensure that mappings modify the index sequence
GEMM Non-temporal loads
GEMM + SiLU fusion kernel
MoE Kernel
Compile in parallel, then run
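A tiny example of the linear-offset formula noted above, reading the largest representable value via torch.finfo; the FP8 formats listed are example choices, not a claim about which formats are used here:

```python
# Linear offset = 1.0 / (largest representable value of the FP format).
# The formats below are illustrative examples only.
import torch

for dtype in (torch.float8_e4m3fn, torch.float8_e5m2, torch.float16):
    fmax = torch.finfo(dtype).max
    print(f"{dtype}: max = {fmax}, linear offset = {1.0 / fmax:.3e}")
# e.g. float8_e4m3fn has max 448, so its linear offset is 1/448 ≈ 2.232e-3.
```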
Week 1
Week 2
Week 3
Finish conditional support, integrate it into the paged decode attention kernel, and check that it works for small sequence lengths
Land final changes for PDA
Performance benchmarking on medium and large sizes
Identify areas of optimization
Enable buffer loads
Goal: Performance improvements
Week 4
Goal: Replacement for the existing kernel in SGLang