[ROCm] Apply FP8 weights padding to values not divisible by 512 bytes on ROCm #13231

gshtras · 2025-02-13T16:48:19Z

A performance improvement for ROCm that works around a hardware limitation.

On MI300, GEMM can suffer significant Tagram channel hotspotting when the stride of a matrix is a multiple of 512 bytes. This is especially true for TN transpose cases, where it can increase the latency of VMEM instructions and cause a significant drop in performance. Where possible (and where it makes sense), the application can pad strides when allocating memory for the matrices so that no stride is a multiple of 512 bytes (for example, for TN F16 GEMM, lda = M + 128 when M % 256 == 0).
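As a minimal sketch of the stride arithmetic in that paragraph: the helper names `row_stride_bytes` and `padded_leading_dim` are hypothetical and are not part of vLLM or ROCm, and the 256-byte pad is an assumption chosen so that a stride that was a multiple of 512 bytes lands exactly halfway between two multiples. With 2-byte F16 elements this reproduces the lda = M + 128 example.

```python
HOTSPOT_BYTES = 512  # stride multiple to avoid, per the description above
PAD_BYTES = 256      # assumed pad size: half of 512 bytes


def row_stride_bytes(leading_dim: int, element_size: int) -> int:
    """Byte distance between consecutive rows for a given leading dimension."""
    return leading_dim * element_size


def padded_leading_dim(leading_dim: int, element_size: int) -> int:
    """Return a leading dimension whose byte stride is not a multiple of 512."""
    if row_stride_bytes(leading_dim, element_size) % HOTSPOT_BYTES != 0:
        return leading_dim  # already safe, leave untouched
    return leading_dim + PAD_BYTES // element_size


# F16 (2 bytes per element): M = 1024 gives a 2048-byte stride, so pad by 128.
lda = padded_leading_dim(1024, element_size=2)
```

For FP8 (1 byte per element) the same helper pads by 256 elements, which matches `num_pad = 256 // weight.element_size()` in the diff under review.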

One requirement for this is that w8a8_block_fp8_matmul supports non-contiguous weights, which it already appears to do, so the leftover assertion is obsolete.
With correctness unchanged, this gives the following latency improvements on ROCm:
amd/Llama-3.1-8B-Instruct-FP8-KV bs=64 in=512 out=512 tp=1:
5.95s -> 5.7s (4%)
amd/Llama-3.1-70B-Instruct-FP8-KV bs=64 in=512 out=512 tp=1:
25.6s -> 24.3s (5%)
deepseek-ai/DeepSeek-R1 bs=64 in=256 out=256 tp=8:
26.1s -> 24.9s (5%)
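The quoted percentages can be recomputed from the before and after latencies. This is only an arithmetic check on the numbers reported above; `latency_reduction_pct` is a hypothetical helper name.

```python
def latency_reduction_pct(before_s: float, after_s: float) -> float:
    """Relative latency reduction, in percent of the original latency."""
    return 100.0 * (before_s - after_s) / before_s


# (before, after) latencies in seconds, copied from the PR description.
runs = {
    "Llama-3.1-8B-Instruct-FP8-KV": (5.95, 5.7),
    "Llama-3.1-70B-Instruct-FP8-KV": (25.6, 24.3),
    "DeepSeek-R1": (26.1, 24.9),
}

reductions = {name: latency_reduction_pct(b, a) for name, (b, a) in runs.items()}
```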

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

github-actions · 2025-02-13T16:48:33Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

NickLucche · 2025-02-18T14:31:10Z

vllm/model_executor/layers/quantization/fp8.py

+                and (weight.stride(-2) * weight.element_size()) % 512 == 0):
+            num_pad = 256 // weight.element_size()
+            weight = F.pad(weight, (0, num_pad), "constant", 0)[..., :-num_pad]
+            torch.cuda.empty_cache()


Is empty_cache really necessary here?

Without it, double the memory may stay allocated, depending on the allocator's behavior.
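To illustrate what the pad-then-slice line in the diff does to the tensor's layout, here is a pure-Python model of the shape and stride bookkeeping. It does not call PyTorch, and `View2D`, `pad_last_dim`, and `drop_last_cols` are hypothetical names used only for this sketch. Padding the last dimension produces a new, wider contiguous buffer; slicing the pad columns back off restores the logical shape but keeps the wider row stride.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View2D:
    rows: int
    cols: int
    row_stride: int      # elements between consecutive rows, like stride(-2)
    col_stride: int = 1  # elements between consecutive columns, like stride(-1)


def pad_last_dim(v: View2D, num_pad: int) -> View2D:
    """Model of F.pad(w, (0, num_pad)): a new contiguous buffer with wider rows."""
    cols = v.cols + num_pad
    return View2D(v.rows, cols, row_stride=cols)


def drop_last_cols(v: View2D, n: int) -> View2D:
    """Model of w[..., :-n]: same storage and strides, fewer logical columns."""
    return View2D(v.rows, v.cols - n, v.row_stride, v.col_stride)


element_size = 1  # FP8 weights use one byte per element
weight = View2D(rows=4096, cols=4096, row_stride=4096)  # 4096-byte rows: a multiple of 512
num_pad = 256 // element_size
padded = drop_last_cols(pad_last_dim(weight, num_pad), num_pad)
# padded keeps the 4096 x 4096 shape, but its row stride is now 4352 elements.
```

In this model the padded buffer is a separate allocation from the original weight, so both exist until the original is released; that is the situation the reply above is concerned with.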

NickLucche · 2025-02-18T14:42:25Z

Thanks for contributing! 🙏🏻
I only had a few comments to add while actual review from code owners is pending.

robertgshaw2-redhat · 2025-02-21T15:46:21Z

vllm/model_executor/layers/quantization/utils/fp8_utils.py

@@ -477,7 +477,7 @@ def w8a8_block_fp8_matmul(
    assert triton.cdiv(A.shape[-1], block_k) == As.shape[-1]
    M = A.numel() // A.shape[-1]

-    assert B.ndim == 2 and B.is_contiguous() and Bs.ndim == 2
+    assert B.ndim == 2 and Bs.ndim == 2


Are we sure this is okay?

The kernel works just fine with a padded non-contiguous tensor. And in any scenario other than with padding it should be contiguous already, so no existing workflow is supposed to break.

One other option is just to call weight.contiguous() after we pad it in process_weights_after_loading?

This would remove the padding, reverting the F.pad action.

Sorry, that was a dumb comment by me.

@gshtras I agree contiguous here was overly strict. But should we still check that the stride is 1 for the last dimension? B.stride(-1) == 1?
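The distinction in that last question, full contiguity versus a unit stride in the last dimension, can be sketched in plain Python. This is a simplified model of row-major contiguity, not PyTorch's implementation, and both function names are hypothetical. The example strides assume a 4096 x 4096 FP8 weight padded by 256 elements per row.

```python
from typing import Sequence


def is_contiguous(shape: Sequence[int], strides: Sequence[int]) -> bool:
    """Row-major check: each stride must equal the product of the later dims."""
    expected = 1
    for dim, stride in zip(reversed(shape), reversed(strides)):
        if dim != 1 and stride != expected:
            return False
        expected *= dim
    return True


def has_unit_last_stride(strides: Sequence[int]) -> bool:
    """Weaker condition: elements within a row are adjacent in memory."""
    return strides[-1] == 1


# Padded weight: rows are 4352 elements apart, columns are still adjacent.
padded_ok = (not is_contiguous((4096, 4096), (4352, 1))) and has_unit_last_stride((4352, 1))

# A transposed view fails both checks, so a stride(-1) == 1 check would still reject it.
transposed_rejected = not has_unit_last_stride((1, 4096))
```

Under this model, an assertion on the last-dimension stride accepts the padded weight while still rejecting layouts such as a transposed view, which the removed `is_contiguous()` check also rejected.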

robertgshaw2-redhat · 2025-02-21T15:48:02Z

Nice work!

gshtras added 2 commits February 12, 2025 23:47

Apply FP8 weights padding to 256 bytes on ROCm

bbab81f

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

Removing the contiguous requirement, as the kernel supports arbitrary…

2205c07

… strides Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

gshtras requested review from mgoin, robertgshaw2-redhat and tlrmchlsmth as code owners February 13, 2025 16:48

gshtras changed the title ~~[ROCm] Apply FP8 weights padding to 256 bytes on ROCm~~ [ROCm] Apply FP8 weights padding to values not divisible by 512 bytes on ROCm Feb 13, 2025

hongxiayang added the rocm Related to AMD ROCm label Feb 13, 2025

NickLucche reviewed Feb 18, 2025

mgoin reviewed Feb 18, 2025

Change the order of the checks

f3da192

Co-authored-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

gshtras force-pushed the fp8_padding_upstream branch from 6106325 to f3da192 Compare February 18, 2025 17:35

robertgshaw2-redhat reviewed Feb 21, 2025

robertgshaw2-redhat added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 21, 2025

robertgshaw2-redhat enabled auto-merge (squash) February 21, 2025 15:52

robertgshaw2-redhat disabled auto-merge February 21, 2025 15:53

robertgshaw2-redhat enabled auto-merge (squash) February 21, 2025 16:42

robertgshaw2-redhat approved these changes Feb 21, 2025

Merge remote-tracking branch 'origin/main' into fp8_padding_upstream

d3bb507

simon-mo merged commit c904fdd into vllm-project:main Feb 22, 2025
42 of 46 checks passed
