Naive Support for Hopper FP8 Prefill Kernel with Per-Head Quantization #869
Conversation
I'm good with the PR in general and let's merge it first and then iterate.
}  // namespace flashinfer

#endif  // FLASHINFER_ATTENTION_HOPPER_VARIANTS_CUH_
Can you use pre-commit to format the code?
Sounds good. Done.
Summary
This PR introduces naive FP8 tensor core computation following FA3's implementation. The main code modifications are located in `include/flashinfer/attention/hopper/quantization`, with test cases in `src/fp8-dev`. The primary changes include:

In-Kernel V Transpose
Since `wgmma.fp8` requires K-major layouts for both operands, and Q/K/V are all head_dim-major, V must be transposed before it is fed into the tensor cores. We therefore provide an in-kernel transpose in shared memory using `ldmatrix`/`stmatrix` (kernel_traits.cuh#L54).
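Below is a minimal sketch of the transpose idea only, assuming a plain element-wise shared-memory copy; the actual implementation in kernel_traits.cuh uses `ldmatrix`/`stmatrix` with swizzled layouts, and the function and template names here are illustrative.

```cuda
#include <cuda_fp8.h>

// Illustrative sketch: transpose a KV_TILE x HEAD_DIM fp8 V tile held in
// shared memory from head_dim-major to K-major, as required by wgmma.fp8.
// The real kernel uses ldmatrix/stmatrix and swizzling instead of this
// plain element-wise indexing.
template <int KV_TILE, int HEAD_DIM>
__device__ void transpose_v_tile(const __nv_fp8_e4m3* __restrict__ v_in,  // [KV_TILE][HEAD_DIM]
                                 __nv_fp8_e4m3* __restrict__ v_out) {     // [HEAD_DIM][KV_TILE]
  // All threads of the block cooperatively copy elements to their
  // transposed positions.
  for (int idx = threadIdx.x; idx < KV_TILE * HEAD_DIM; idx += blockDim.x) {
    int kv = idx / HEAD_DIM;  // position along the KV dimension
    int d  = idx % HEAD_DIM;  // position along head_dim
    v_out[d * KV_TILE + kv] = v_in[kv * HEAD_DIM + d];
  }
  __syncthreads();
}
```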
P Requantization
After the `Q * K` multiplication and before `P * V`, P is requantized per tensor using an oracle scale: `p_scale = std::numeric_limits::max();`. This is based on the observation that the maximum value of P in online softmax is 1. This strategy follows the approach in SageAttention and increases the utilization of the 8-bit range compared to a direct cast.
Fused Dequantization
To reduce CUDA core computation overhead, both QK and PV dequantization steps are fused into existing online softmax operations: QK dequantization is fused into `sm_scale` (code reference), and PV dequantization is fused into the `finalize` step, where the denominator is applied to the output.
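A rough sketch of the bookkeeping behind this fusion; all scale names (`q_scale`, `k_scale`, `v_scale`, `p_scale`) are illustrative rather than the PR's actual variables:

```cuda
// Illustrative: instead of dequantizing the QK^T and PV accumulators with
// extra CUDA-core work, the dequantization factors are folded into
// multiplications the kernel already performs:
//   - q_scale * k_scale is folded into sm_scale (applied before softmax);
//   - v_scale / p_scale is folded into the finalize step, together with the
//     softmax denominator that normalizes the output.
struct FusedScales {
  float sm_scale_fused;  // sm_scale * q_scale * k_scale
  float o_scale_fused;   // v_scale / p_scale
};

inline FusedScales fuse_dequant_scales(float sm_scale, float q_scale, float k_scale,
                                       float v_scale, float p_scale) {
  return FusedScales{sm_scale * q_scale * k_scale, v_scale / p_scale};
}
```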
Remaining Work
- Support for differing `head_dim_qk` and `head_dim_v`
- `sparse_mainloop.cuh`
Perf Benchmarks on H100
FlashInfer-FP8 on average provides a 20-30% throughput boost compared to FP16. However, there is still a performance gap relative to FA3-FP8, calling for further optimization. Refer to the scripts.
Correctness (MSE)
To validate accuracy, we compute the MSE between the outputs of the different FP8 implementations and the output of FP16 FlashInfer. Refer to the scripts. Our implementation is slightly better.
