forked from flashinfer-ai/flashinfer
perf: accelerate gqa performance (flashinfer-ai#356)
Changes:

1. Prefetch page indices. We had already applied this optimization to the decode kernels, but not to the append/prefill kernels, which are the ones used for GQA.
2. Unlock the 1x4 warp layout from flashinfer-ai#322. We had not enabled it because the resulting binary size was too large; we should further reduce some unnecessary template arguments.
3. Optimize `threadblock_sync_mdo_states` so that the attention states of multiple warps in a threadblock are merged efficiently. The previous implementation assumed a small shared memory size and interleaved shared memory reads/writes with computation, which is less efficient than bulk shared memory access.

After this PR, the GQA kernel execution time on H100 for the setting `batch_size=128, seq_len=1024, num_qo_heads=32, num_kv_heads=4, head_dim=128` improved from 133us to 103us (roughly a 23% reduction).
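To make the third change more concrete, here is a minimal host-side C++ sketch of what merging partial attention states can look like. It is not the kernel code from this commit: `AttnState`, `partial_state`, and `merge_states` are hypothetical names, and the sketch assumes each warp holds a running max `m`, a softmax denominator `d`, and an unnormalized output `o`. The merge does one pass over all partial states to find the global max and a second pass to rescale and accumulate, which loosely mirrors the "read everything in bulk, then compute" structure described above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical partial attention state, as one warp might hold it.
struct AttnState {
  float m;               // max logit seen so far
  float d;               // sum of exp(logit - m)
  std::vector<float> o;  // unnormalized output: sum of exp(logit - m) * value
};

// Build the state for one chunk of keys (a stand-in for one warp's work).
AttnState partial_state(const std::vector<float>& logits,
                        const std::vector<std::vector<float>>& values) {
  assert(!logits.empty() && logits.size() == values.size());
  const float m = *std::max_element(logits.begin(), logits.end());
  AttnState st{m, 0.0f, std::vector<float>(values[0].size(), 0.0f)};
  for (std::size_t k = 0; k < logits.size(); ++k) {
    const float w = std::exp(logits[k] - m);
    st.d += w;
    for (std::size_t i = 0; i < st.o.size(); ++i) st.o[i] += w * values[k][i];
  }
  return st;
}

// Merge partial states: pass 1 reads every max, pass 2 rescales and sums.
AttnState merge_states(const std::vector<AttnState>& parts) {
  assert(!parts.empty());
  const std::size_t dim = parts[0].o.size();
  float m = -std::numeric_limits<float>::infinity();
  for (const AttnState& p : parts) m = std::max(m, p.m);

  AttnState out{m, 0.0f, std::vector<float>(dim, 0.0f)};
  for (const AttnState& p : parts) {
    assert(p.o.size() == dim);
    if (p.d == 0.0f) continue;  // empty chunk: skip to avoid exp(-inf + inf)
    const float scale = std::exp(p.m - m);
    out.d += p.d * scale;
    for (std::size_t i = 0; i < dim; ++i) out.o[i] += p.o[i] * scale;
  }
  return out;
}
```

Dividing `o` by `d` after the merge gives the final attention output. Merging the states of two key chunks should match, up to floating-point rounding, the state computed over all keys at once, which is what makes the per-warp split safe.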
Showing 5 changed files with 84 additions and 52 deletions.