
Layernorm performance optimization and partition size pybind #3

Closed
wants to merge 8 commits

Conversation

@mawong-amd mawong-amd commented Mar 4, 2024

This PR does two things:

  1. Provides performance optimizations for the fused_add_rms_norm kernels (used in some layernorms, e.g. the input and post-attention layernorms of each Llama decoder layer). The gains come mainly from two optimizations: caching intermediate results in shared memory (rather than re-reading them from global memory), and using packed operations for FP16 inputs; a simplified sketch of both appears after this list. Loop unrolling was also attempted but did not measurably affect performance. Another attempted optimization was specializing blockReduceSum/warpReduceSum to the AMD wavesize of 64 (instead of the existing CUDA-compatible warp size of 32), which should in theory reduce the number of shuffles by a factor of (1024/32 * 5 + 5) / (1024/64 * 6 + 4) = 1.65, but this too did not measurably affect performance.

Typical improvements are on the order of 10% in average kernel runtime, as measured on Llama2-7B and Llama2-70B on MI300X. Performance could probably be improved further, but we are hitting diminishing returns since layernorm accounts for only a small share of total runtime. One interesting observation: on Llama2-70B the post-attention layernorm consistently takes 4-5 us longer to complete than the input layernorm (20 us vs 16 us), while this behavior is not observed on Llama2-7B. It is unclear why this is the case; it could be cache-related.

  2. Adds platform-specific paged attention v2 partition sizes and exposes them to Python; see the binding sketch after the kernel sketch below.
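
As an illustration of the first item, here is a minimal sketch (not the actual vLLM kernel) of a fused add + RMSNorm kernel that caches the fused intermediate in shared memory and processes FP16 inputs as packed `__half2` pairs. All names, the launch configuration, and the simple shared-memory reduction are assumptions for illustration: it assumes one thread block per token, a power-of-two `blockDim.x`, and `hidden2 * sizeof(__half2)` bytes of dynamic shared memory at launch.

```cuda
#include <cuda_fp16.h>

// Sketch only: one thread block per token; blockDim.x must be a power of two
// and <= 1024. "hidden2" is hidden_size / 2 (elements are packed as __half2).
__global__ void fused_add_rms_norm_sketch(
    __half2* __restrict__ input,         // [num_tokens, hidden2], overwritten with the normalized output
    __half2* __restrict__ residual,      // [num_tokens, hidden2], overwritten with input + residual
    const __half2* __restrict__ weight,  // [hidden2]
    const float epsilon,
    const int hidden2) {
  extern __shared__ __half2 s_cache[];   // cached fused result for this token
  __shared__ float s_sum[1024];          // scratch for the block reduction
  const int token = blockIdx.x;
  float sq_sum = 0.f;

  // Pass 1: fused add, accumulate the sum of squares, and cache the result in
  // shared memory so pass 2 does not have to re-read global memory.
  for (int i = threadIdx.x; i < hidden2; i += blockDim.x) {
    const __half2 z = __hadd2(input[token * hidden2 + i],
                              residual[token * hidden2 + i]);
    const float2 zf = __half22float2(z);
    sq_sum += zf.x * zf.x + zf.y * zf.y;
    s_cache[i] = z;
    residual[token * hidden2 + i] = z;
  }

  // Simple shared-memory block reduction of the per-thread sums of squares.
  s_sum[threadIdx.x] = sq_sum;
  __syncthreads();
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) s_sum[threadIdx.x] += s_sum[threadIdx.x + stride];
    __syncthreads();
  }
  const float inv_rms = rsqrtf(s_sum[0] / (2.f * hidden2) + epsilon);

  // Pass 2: normalize and scale from the shared-memory cache, two FP16
  // elements at a time via packed __half2 arithmetic.
  const __half2 scale = __float2half2_rn(inv_rms);
  for (int i = threadIdx.x; i < hidden2; i += blockDim.x) {
    input[token * hidden2 + i] = __hmul2(__hmul2(s_cache[i], scale), weight[i]);
  }
}
```

On ROCm the same structure applies through HIP, with the caveat that the wavefront size is 64; as noted above, specializing the reduction for that did not measurably change runtime.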
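For the second item, below is a hedged pybind11 sketch of how a platform-specific partition size could be selected at compile time and exposed to Python. The module name, function name, and the numeric values are placeholders, not the actual constants or API added by this PR.

```cpp
#include <pybind11/pybind11.h>

// Placeholder values: the real partition sizes are defined alongside the
// paged attention v2 kernels and differ per platform.
#ifdef USE_ROCM
constexpr int kPartitionSizeV2 = 256;  // hypothetical ROCm value
#else
constexpr int kPartitionSizeV2 = 512;  // hypothetical CUDA value
#endif

PYBIND11_MODULE(_attention_ops_sketch, m) {
  // Expose the compile-time constant so Python-side scheduling code can size
  // its intermediate buffers to match the kernel's partitioning.
  m.def("get_paged_attention_v2_partition_size",
        []() { return kPartitionSizeV2; },
        "Partition size used by the paged attention v2 kernel on this platform");
}
```

Python code can then query the binding instead of hard-coding the value, e.g. `_attention_ops_sketch.get_paged_attention_v2_partition_size()`.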

@mawong-amd mawong-amd requested a review from dllehr-amd March 4, 2024 19:03
@mawong-amd mawong-amd self-assigned this Mar 4, 2024
@dllehr-amd dllehr-amd requested a review from sanyalington March 4, 2024 20:28
@mawong-amd mawong-amd force-pushed the vllm_upstream_mattwong_expmtal branch from 73a9c85 to 85277d9 on March 4, 2024 21:04
@mawong-amd mawong-amd force-pushed the vllm_upstream_mattwong_expmtal branch from 85277d9 to 1396c2c on March 4, 2024 21:07
@mawong-amd mawong-amd force-pushed the vllm_upstream_mattwong_expmtal branch from 8c6423d to dcfc084 on March 5, 2024 21:13
@mawong-amd mawong-amd marked this pull request as draft March 8, 2024 19:26
@mawong-amd (Author) commented:
Pending work on the prefill side of things

AdrianAbeyta pushed a commit that referenced this pull request Mar 8, 2024
Rename remaining fp8_e5m2 to general fp8
@mawong-amd mawong-amd closed this Mar 27, 2024
@mawong-amd mawong-amd deleted the vllm_upstream_mattwong_expmtal branch March 27, 2024 19:50
gshtras pushed a commit that referenced this pull request Sep 27, 2024
Use 12660 and 12969 as base builds for v3 and v4 Dockerfiles