[TKW] Scalarized local reduction for faster Max #335

raikonenfnu · 2024-12-16T11:28:12Z

Max reduction can use v_max3_f32, which is much faster than the regular v_max_f32. To enable this, however, we need to scalarize the local reduction further so that the LLVM compiler can perform this optimization more effectively. This drops latency from 4.5 ms to ~4.3 ms on dispatch146 (B0: 2, B1: 20, (M, K2): 1024, K1: 64).
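To make the idea concrete, here is a minimal Python sketch of why a scalar chain of max operations is a good target for a three-input max. It is not the code in this PR and does not use any TKW or LLVM API. The function names and sample values are hypothetical, and the `ops` counters only model instruction counts under the assumption that one two-input max maps to one v_max_f32 and one three-input max maps to one v_max3_f32.

```python
import math


def local_max_scalarized(acc, values):
    """Fold every element into the accumulator with a two-input max.

    Models a scalarized local reduction lowered to one v_max_f32 per element.
    """
    ops = 0
    for v in values:
        acc = max(acc, v)
        ops += 1
    return acc, ops


def local_max_fused3(acc, values):
    """Same reduction, with consecutive pairs fused into a three-input max.

    Models a backend pairing up the scalar chain into v_max3_f32 operations.
    """
    ops = 0
    i = 0
    while i + 1 < len(values):
        # One three-input max consumes the accumulator and two elements.
        acc = max(acc, values[i], values[i + 1])
        ops += 1
        i += 2
    if i < len(values):
        # Odd element left over: fall back to a two-input max.
        acc = max(acc, values[i])
        ops += 1
    return acc, ops


# Hypothetical per-thread local values, chosen only for illustration.
values = [0.5, -1.0, 3.25, 2.0, 7.5, -4.0, 1.0, 6.0]
ref, binary_ops = local_max_scalarized(-math.inf, values)
fused, fused_ops = local_max_fused3(-math.inf, values)
print(ref, binary_ops, fused, fused_ops)
```

In this model both variants return the same maximum, while the fused variant issues roughly half as many max operations. Whether the compiler actually forms the three-input instruction depends on how the reduction is presented to it, which is the motivation for scalarizing.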

Signed-off-by: Stanley Winata <stanley.winata@amd.com>

harsh-nod

lgtm! thanks!

raikonenfnu requested a review from harsh-nod December 16, 2024 11:28

harsh-nod approved these changes Dec 16, 2024

raikonenfnu merged commit 142c8a5 into iree-org:main Dec 16, 2024
10 checks passed
