Wave February 2025 Release #362

Open · 8 of 58 tasks
harsh-nod opened this issue Jan 6, 2025 · 0 comments

Milestones

  • Integrate decode attention kernel into sglang
  • Performance on flash decoding
  • Performance on vanilla attention
  • Functionality & Performance on backward attention

FlashDecoding

  • Flash Decoding Kernel with paged attention
  • Flash Decoding Kernel without paged attention with 32x32x8 (tile size of 64 for K2 dimension) (Harsh)
  • Flash Decoding Kernel without paged attention with Dynamic Dims (Harsh)
  • Performance optimizations
  • Representing the dynamic dimension in mapping as an iterator (Ivan)
  • Sample kernel showing how paged memory works (PyTorch kernel with a fake page table and fake queries that we can use for functionality testing; how it's done purely in PyTorch; see the sketch after this list) (Harsh)
  • Add support for buffer loads (masked loads/stores, gather/scatter) (Ivan) (check if needed, based on BLOCK_DPE)
  • tkw conditional support
  • tkw set symbol and apply expr operators
  • Reordering loads/stores during promotion
  • SGLang integration (new Wave backend)
  • Make paged attention kernels have dynamic symbols to avoid recompiling
  • Prefill attention kernel
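
A minimal PyTorch sketch of the paged-memory item above (shapes, names, and the page-table layout are illustrative assumptions, not the actual Wave or sglang kernel): a fake page table maps each sequence's logical KV blocks to physical cache pages, and decode-time attention gathers those pages before the usual softmax(QK^T)V.

```python
import torch

# Illustrative sizes (assumptions, not the shapes used in Wave/sglang).
B, H, D = 2, 4, 64          # batch, heads, head dim
PAGE, N_PAGES = 16, 32      # tokens per page, physical pages in the cache
MAX_LEN = 48                # max sequence length covered by the fake page table

k_cache = torch.randn(N_PAGES, PAGE, H, D)
v_cache = torch.randn(N_PAGES, PAGE, H, D)
# Fake page table: which physical page holds each logical block of each sequence.
page_table = torch.randint(0, N_PAGES, (B, MAX_LEN // PAGE))
seq_lens = torch.tensor([37, 45])   # actual KV length per sequence
q = torch.randn(B, H, D)            # one new query token per sequence (decode step)

def paged_decode_reference(q, k_cache, v_cache, page_table, seq_lens):
    outs = []
    for b in range(q.shape[0]):
        n = int(seq_lens[b])
        pages = page_table[b, : (n + PAGE - 1) // PAGE]
        k = k_cache[pages].reshape(-1, H, D)[:n]   # gather pages, trim to true length
        v = v_cache[pages].reshape(-1, H, D)[:n]
        scores = torch.einsum("hd,nhd->hn", q[b], k) / D ** 0.5
        probs = scores.softmax(dim=-1)
        outs.append(torch.einsum("hn,nhd->hd", probs, v))
    return torch.stack(outs)

out = paged_decode_reference(q, k_cache, v_cache, page_table, seq_lens)
print(out.shape)  # torch.Size([2, 4, 64])
```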

EvoFormer

  • Larger global loads
  • Vectorized reads
  • FAv3 - FP8
  • FAv3 - Scheduling
  • FAv3 - Wave specialization (set_prio)
  • Scalar support
  • Transpose using Shuffles
  • Performance nightly CI (different machine; clarify what is being tested); add to iree-kernel-benchmark?
  • Backward Flash Attention

Multi-Buffering

  • Generate hipBLASLt kernels that are faster than Wave for the shapes of interest, to use as a reference
  • GEMM kernel working with double buffering (see the sketch after this list)
  • Language Integration
  • Fully functional multi-buffering approach for GEMMs
  • Performance evaluation
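
A minimal NumPy sketch of the double-buffering pattern referenced above (function name, tile size, and buffer handling are illustrative assumptions, not the Wave GEMM kernel): while the current K-tile feeds the matmul, the next K-tile is already staged in the other buffer, which is the mechanism that hides load latency on the GPU.

```python
import numpy as np

def gemm_double_buffered(A, B, tile_k=32):
    """Illustrative double-buffered GEMM over the K dimension (host-side model)."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    buf = [None, None]
    # Prologue: fill buffer 0 with the first K-tile.
    buf[0] = (A[:, :tile_k].copy(), B[:tile_k, :].copy())
    for i, k0 in enumerate(range(0, K, tile_k)):
        a_tile, b_tile = buf[i % 2]
        nxt = k0 + tile_k
        if nxt < K:
            # "Prefetch" the next K-tile into the other buffer before computing.
            buf[(i + 1) % 2] = (A[:, nxt:nxt + tile_k].copy(), B[nxt:nxt + tile_k, :].copy())
        C += a_tile @ b_tile   # compute on the current buffer
    return C

A = np.random.rand(64, 128).astype(np.float32)
B = np.random.rand(128, 96).astype(np.float32)
assert np.allclose(gemm_double_buffered(A, B), A @ B, atol=1e-3)
```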

Benchmarking

  • Benchmarking using GitHub Actions
  • Upgrading to the latest IREE version
  • Failures on iree-kernel-benchmark

Tech Debt

  • Rewrite expansion
  • Switch CI to use venv
  • IGEMM compilation failures

IGEMM

  • Shared memory data shuffle
  • Scalarizing the gather
  • Move gather from global to shared
  • Reduce the number of shared memory barriers
  • Improve test case coverage (different dtypes, mfma intrinsics, shapes, etc.)
  • Enable scheduling
  • BF16 MFMA Intrinsics
  • Fix failures on main
  • Perf-ci

De-Prioritized

  • Packed Shuffles
  • Linear offset has to be added (linear offset = 1.0 / max representable number in the fp format; see the worked example after this list)
  • Extend Attention (split-k vs warp reduction)
  • Prefill Attention
  • Update Paper
  • Debugger support (add breakpoints and inspect stack on GPU)
  • Profiling support
  • Ensure that mappings modify the index sequence
  • GEMM Non-temporal loads
  • GEMM + SiLU fusion kernel
  • MoE Kernel
  • Parallel compile and then run
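
A small worked example of the linear-offset formula quoted in the item above (the helper name is hypothetical; the formula 1.0 / max-representable-value is taken directly from the list item):

```python
import torch

# Hypothetical helper: linear offset = 1.0 / (max representable value of the fp format).
def linear_offset(dtype: torch.dtype) -> float:
    return 1.0 / torch.finfo(dtype).max

print(linear_offset(torch.float16))  # 1 / 65504    ~= 1.53e-05
print(linear_offset(torch.float32))  # 1 / 3.40e+38 ~= 2.94e-39
```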

Week 1

Week 2

Week 3

  • Finish conditional support, integrate it into the paged decode attention kernel, and check that it works for small sequence lengths
  • Land final changes for PDA
  • Performance benchmarking on medium and large sizes
  • Identify areas of optimization
  • Enable buffer loads

Goal: Performance improvements

Week 4

Goal: Replace the existing kernel in sglang
