
Layernorm performance optimization and partition size pybind #3

Closed
wants to merge 8 commits

Conversation

@mawong-amd mawong-amd commented Mar 4, 2024

This PR does two things:

  1. Provides performance optimizations for the fused_add_rms_norm kernels (used in some layernorms, e.g. the input and post-attention layernorms of each Llama decoder layer). The gains come mainly from two optimizations: caching intermediate results in shared memory (rather than re-reading them from global memory), and using packed operations for FP16 inputs; a simplified sketch of both appears after this list. Loop unrolling was also attempted but did not measurably affect performance. Another attempted optimization was specializing blockReduceSum/warpReduceSum to the AMD wavesize of 64 (instead of the existing CUDA-compatible warp size of 32), which should in theory reduce the number of shuffles by a factor of (1024/32 * 5 + 5) / (1024/64 * 6 + 4) = 1.65, but this too did not measurably affect performance.

Typical improvements are on the order of 10% in average kernel runtime, as measured on Llama2-7B and Llama2-70B on MI300X. Performance could probably be improved further, but we are hitting diminishing returns since layernorm accounts for only a small share of total runtime. One interesting observation: on Llama2-70B the post-attention layernorm consistently takes 4-5 us longer to complete than the input layernorm (20 us vs 16 us), while this behavior is not observed on Llama2-7B. It is unclear why this is the case; it could be cache-related.

  2. Adds platform-specific paged attention v2 partition sizes and exposes them to Python; see the binding sketch after the kernel sketch below.
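
As an illustration of the first item, here is a minimal sketch (not the actual vLLM kernel) of a fused add + RMSNorm kernel that caches the fused intermediate in shared memory and processes FP16 inputs as packed `__half2` pairs. All names, the launch configuration, and the simple shared-memory reduction are assumptions for illustration: it assumes one thread block per token, a power-of-two `blockDim.x`, and `hidden2 * sizeof(__half2)` bytes of dynamic shared memory at launch.

```cuda
#include <cuda_fp16.h>

// Sketch only: one thread block per token; blockDim.x must be a power of two
// and <= 1024. "hidden2" is hidden_size / 2 (elements are packed as __half2).
__global__ void fused_add_rms_norm_sketch(
    __half2* __restrict__ input,         // [num_tokens, hidden2], overwritten with the normalized output
    __half2* __restrict__ residual,      // [num_tokens, hidden2], overwritten with input + residual
    const __half2* __restrict__ weight,  // [hidden2]
    const float epsilon,
    const int hidden2) {
  extern __shared__ __half2 s_cache[];   // cached fused result for this token
  __shared__ float s_sum[1024];          // scratch for the block reduction
  const int token = blockIdx.x;
  float sq_sum = 0.f;

  // Pass 1: fused add, accumulate the sum of squares, and cache the result in
  // shared memory so pass 2 does not have to re-read global memory.
  for (int i = threadIdx.x; i < hidden2; i += blockDim.x) {
    const __half2 z = __hadd2(input[token * hidden2 + i],
                              residual[token * hidden2 + i]);
    const float2 zf = __half22float2(z);
    sq_sum += zf.x * zf.x + zf.y * zf.y;
    s_cache[i] = z;
    residual[token * hidden2 + i] = z;
  }

  // Simple shared-memory block reduction of the per-thread sums of squares.
  s_sum[threadIdx.x] = sq_sum;
  __syncthreads();
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) s_sum[threadIdx.x] += s_sum[threadIdx.x + stride];
    __syncthreads();
  }
  const float inv_rms = rsqrtf(s_sum[0] / (2.f * hidden2) + epsilon);

  // Pass 2: normalize and scale from the shared-memory cache, two FP16
  // elements at a time via packed __half2 arithmetic.
  const __half2 scale = __float2half2_rn(inv_rms);
  for (int i = threadIdx.x; i < hidden2; i += blockDim.x) {
    input[token * hidden2 + i] = __hmul2(__hmul2(s_cache[i], scale), weight[i]);
  }
}
```

On ROCm the same structure applies through HIP, with the caveat that the wavefront size is 64; as noted above, specializing the reduction for that did not measurably change runtime.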
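For the second item, below is a hedged pybind11 sketch of how a platform-specific partition size could be selected at compile time and exposed to Python. The module name, function name, and the numeric values are placeholders, not the actual constants or API added by this PR.

```cpp
#include <pybind11/pybind11.h>

// Placeholder values: the real partition sizes are defined alongside the
// paged attention v2 kernels and differ per platform.
#ifdef USE_ROCM
constexpr int kPartitionSizeV2 = 256;  // hypothetical ROCm value
#else
constexpr int kPartitionSizeV2 = 512;  // hypothetical CUDA value
#endif

PYBIND11_MODULE(_attention_ops_sketch, m) {
  // Expose the compile-time constant so Python-side scheduling code can size
  // its intermediate buffers to match the kernel's partitioning.
  m.def("get_paged_attention_v2_partition_size",
        []() { return kPartitionSizeV2; },
        "Partition size used by the paged attention v2 kernel on this platform");
}
```

Python code can then query the binding instead of hard-coding the value, e.g. `_attention_ops_sketch.get_paged_attention_v2_partition_size()`.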

@mawong-amd mawong-amd requested a review from dllehr-amd March 4, 2024 19:03
@mawong-amd mawong-amd self-assigned this Mar 4, 2024
@dllehr-amd dllehr-amd requested a review from sanyalington March 4, 2024 20:28
@mawong-amd mawong-amd force-pushed the vllm_upstream_mattwong_expmtal branch from 73a9c85 to 85277d9 on March 4, 2024 21:04
@mawong-amd mawong-amd force-pushed the vllm_upstream_mattwong_expmtal branch from 85277d9 to 1396c2c on March 4, 2024 21:07
@mawong-amd mawong-amd force-pushed the vllm_upstream_mattwong_expmtal branch from 8c6423d to dcfc084 on March 5, 2024 21:13
@mawong-amd mawong-amd marked this pull request as draft March 8, 2024 19:26
@mawong-amd (Author) commented:
Pending work on the prefill side of things

AdrianAbeyta pushed a commit that referenced this pull request Mar 8, 2024
Rename remaining fp8_e5m2 to general fp8
@mawong-amd mawong-amd closed this Mar 27, 2024
@mawong-amd mawong-amd deleted the vllm_upstream_mattwong_expmtal branch March 27, 2024 19:50
gshtras pushed a commit that referenced this pull request Sep 27, 2024
Use 12660 and 12969 as base builds for v3 and v4 Dockerfiles