Peel away conditional check from TileAndFuse padding kernel inner K loop #19276

jerryyin · 2024-11-22T21:54:10Z

With padding enabled (see #18858), we should peel away the conditional check, which turns out to be necessary only in boundary conditions.

%cst = arith.constant 0.000000e+00 : f32
scf.for %arg11 = %c0 to %c64 step %c64 ...
...
%41 = scf.if %39 -> (tensor<1x1xf32>) {
  %generated = tensor.generate  {
  ^bb0(%arg13: index, %arg14: index):
    tensor.yield %cst : f32
  } : tensor<1x1xf32>
  scf.yield %generated : tensor<1x1xf32>
} else {
  %extracted_slice_9 = tensor.extract_slice %extracted_slice[%30, %35] [%32, %37] [1, 1] : tensor<?x577xf32> to tensor<?x?xf32>
  %padded = tensor.pad %extracted_slice_9 low[0, 0] high[%34, %40] {
  ^bb0(%arg13: index, %arg14: index):
    tensor.yield %cst : f32
  } : tensor<?x?xf32> to tensor<?x?xf32>
  %cast = tensor.cast %padded : tensor<?x?xf32> to tensor<1x1xf32>
  scf.yield %cast : tensor<1x1xf32>
}
%extracted_slice_8 = tensor.extract_slice %arg12[%29#0, %29#1] [1, 1] [1, 1] : tensor<16x4xf32> to tensor<1x1xf32>
%42 = linalg.copy {lowering_config = #iree_gpu.derived_thread_config} ins(%41 : tensor<1x1xf32>) outs(%extracted_slice_8 : tensor<1x1xf32>) -> tensor<1x1xf32>
%inserted_slice = tensor.insert_slice %42 into %arg12[%29#0, %29#1] [1, 1] [1, 1] : tensor<1x1xf32> into tensor<16x4xf32>
...
scf.yield %inserted_slice ...
}

In the above example, the scf.if condition is satisfied only when the access is out of bounds, in which case the branch yields the zero constant %cst.

Instead, we could potentially emit a fully static inner K loop and keep the condition only where it is needed, in the following situations:

For GEMMs that need padding in the K dimension, the if/else condition check should be peeled completely out of the loop. One additional, deterministic extract_slice after the K loop finishes then performs one more MFMA with padded zeros.
For GEMMs that need padding in the M or N dimension, the majority of workgroups don't need the condition check; only the tiles that happen to lie on the boundary do. We should think carefully about how to generate the kernel code so that the conditional check is avoided.
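
The K-dimension case can be illustrated with a minimal sketch in plain Python rather than MLIR. It is only a model of the idea, not IREE's implementation: the function names `dot_checked` and `dot_peeled`, the tile size, and the use of a 1-D dot product in place of an MFMA tile are all assumptions made for illustration. The first function guards every access inside the K loop, as the scf.if above does. The second runs the full tiles unguarded and handles the remainder once, after the loop, with a zero-padded tail tile.

```python
def dot_checked(a, b, tile):
    """Baseline: every access inside the K loop is guarded by a bounds check."""
    k = len(a)
    acc = 0.0
    for k0 in range(0, k, tile):
        for i in range(tile):
            idx = k0 + i
            # Per-iteration check, analogous to the scf.if in the IR above:
            # out-of-bounds positions contribute a zero.
            x = a[idx] if idx < k else 0.0
            y = b[idx] if idx < k else 0.0
            acc += x * y
    return acc


def dot_peeled(a, b, tile):
    """Peeled: full tiles run without checks; one padded tail tile follows."""
    k = len(a)
    full = (k // tile) * tile  # extent covered by complete tiles
    acc = 0.0
    # Fully static inner loop: no condition, every index is in bounds.
    for k0 in range(0, full, tile):
        for i in range(tile):
            acc += a[k0 + i] * b[k0 + i]
    # Peeled remainder: slice once, pad with zeros, do one more full tile.
    if full < k:
        pad = tile - (k - full)
        a_tail = list(a[full:]) + [0.0] * pad
        b_tail = list(b[full:]) + [0.0] * pad
        for i in range(tile):
            acc += a_tail[i] * b_tail[i]
    return acc
```

Both functions accumulate the same products in the same order, so they return identical results. The difference is that the peeled version evaluates the boundary condition once per call instead of once per element.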

jerryyin self-assigned this Nov 22, 2024
