Peel away conditional check from TileAndFuse padding kernel inner K loop #19276

Open
jerryyin opened this issue Nov 22, 2024 · 0 comments
jerryyin commented Nov 22, 2024

With padding enabled in #18858, we should peel the conditional check out of the inner K loop, since it turns out to be necessary only at the boundaries.

%cst = arith.constant 0.000000e+00 : f32
scf.for %arg11 = %c0 to %c64 step %c64 ... {
...
%41 = scf.if %39 -> (tensor<1x1xf32>) {
  %generated = tensor.generate {
  ^bb0(%arg13: index, %arg14: index):
    tensor.yield %cst : f32
  } : tensor<1x1xf32>
  scf.yield %generated : tensor<1x1xf32>
} else {
  %extracted_slice_9 = tensor.extract_slice %extracted_slice[%30, %35] [%32, %37] [1, 1] : tensor<?x577xf32> to tensor<?x?xf32>
  %padded = tensor.pad %extracted_slice_9 low[0, 0] high[%34, %40] {
  ^bb0(%arg13: index, %arg14: index):
    tensor.yield %cst : f32
  } : tensor<?x?xf32> to tensor<?x?xf32>
  %cast = tensor.cast %padded : tensor<?x?xf32> to tensor<1x1xf32>
  scf.yield %cast : tensor<1x1xf32>
}
%extracted_slice_8 = tensor.extract_slice %arg12[%29#0, %29#1] [1, 1] [1, 1] : tensor<16x4xf32> to tensor<1x1xf32>
%42 = linalg.copy {lowering_config = #iree_gpu.derived_thread_config} ins(%41 : tensor<1x1xf32>) outs(%extracted_slice_8 : tensor<1x1xf32>) -> tensor<1x1xf32>
%inserted_slice = tensor.insert_slice %42 into %arg12[%29#0, %29#1] [1, 1] [1, 1] : tensor<1x1xf32> into tensor<16x4xf32>
...
scf.yield %inserted_slice ...
}

In the example above, the scf.if condition is satisfied only when the access is out of bounds, in which case the branch yields a tile filled with %cst (zero).
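
To make the cost of that check concrete, here is a rough Python model of the per-thread 1x1 load. It is only a sketch: `guarded_load` and `count_padded_loads` are hypothetical names, the exact predicate behind `%39` is not shown in the IR, and the sizes are made up for illustration.

```python
def guarded_load(src, row, col, pad=0.0):
    """Hypothetical model of the scf.if above for a single 1x1 tile."""
    # "then" branch: the element is out of bounds, so yield the pad value.
    if row >= len(src) or col >= len(src[0]):
        return pad
    # "else" branch: the element is in bounds, so read it from the source.
    return src[row][col]


def count_padded_loads(src, padded_rows, padded_cols):
    """Count how many loads over the padded iteration space hit the pad branch."""
    padded = 0
    for row in range(padded_rows):
        for col in range(padded_cols):
            if row >= len(src) or col >= len(src[0]):
                padded += 1
    return padded


if __name__ == "__main__":
    # Assumed sizes: a 2x3 source iterated over a 4x4 padded space.
    src = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    print(guarded_load(src, 1, 2), guarded_load(src, 2, 0))
    print(count_padded_loads(src, 4, 4), "of", 4 * 4, "loads are padding")
```

Every load evaluates the predicate, even though only the loads past the edge of the source ever take the pad branch.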

Instead, we can potentially generate a fully static inner K loop and deal with the condition only in the following situations:

  1. For GEMMs that need padding in the K dimension, the if/else check should be peeled out of the loop entirely. One additional, deterministic extract_slice after the K loop finishes would then feed one more MFMA with zero-padded operands.
  2. For GEMMs that need padding in the M or N dimension, the majority of workgroups do not need the check; only the tiles that fall on the boundary do. We need to think carefully about how to generate the kernel code so that the interior tiles avoid the conditional check.
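
A rough Python sketch of both ideas, to illustrate the intent rather than the actual codegen. All names (`gemm_guarded`, `gemm_peeled`, `tile_needs_guard`) and sizes are hypothetical, and a scalar multiply-accumulate stands in for the MFMA. `gemm_guarded` pays a bounds check on every inner-K iteration; `gemm_peeled` runs the full K tiles check-free and handles the remainder once after the loop with zero-padded operands. `tile_needs_guard` shows that for M or N only the last tile along a dimension can cross the boundary.

```python
def gemm_guarded(a, b, m, n, k, tile_k):
    """Baseline: iterate over K rounded up to tile_k, checking bounds each step."""
    k_padded = -(-k // tile_k) * tile_k  # ceil(k / tile_k) * tile_k
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for kk in range(k_padded):
                in_bounds = kk < k  # the per-iteration conditional
                a_val = a[i][kk] if in_bounds else 0.0
                b_val = b[kk][j] if in_bounds else 0.0
                acc += a_val * b_val
            c[i][j] = acc
    return c


def gemm_peeled(a, b, m, n, k, tile_k):
    """Peeled: full tiles need no check; one zero-padded tail tile follows."""
    k_full = (k // tile_k) * tile_k
    tail = k - k_full
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            # Static trip count, no conditional inside the loop body.
            for k0 in range(0, k_full, tile_k):
                for kk in range(k0, k0 + tile_k):
                    acc += a[i][kk] * b[kk][j]
            # Loop-invariant: for a static K this is decided at compile time.
            if tail:
                a_tail = a[i][k_full:k] + [0.0] * (tile_k - tail)
                b_tail = [b[kk][j] for kk in range(k_full, k)]
                b_tail += [0.0] * (tile_k - tail)
                for t in range(tile_k):
                    acc += a_tail[t] * b_tail[t]
            c[i][j] = acc
    return c


def tile_needs_guard(tile_index, tile_size, extent):
    """True only for a tile that extends past `extent` along M or N."""
    return (tile_index + 1) * tile_size > extent


if __name__ == "__main__":
    a = [[1.0, 2.0, 3.0, 4.0, 5.0], [6.0, 7.0, 8.0, 9.0, 10.0]]
    b = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(gemm_guarded(a, b, 2, 2, 5, 2) == gemm_peeled(a, b, 2, 2, 5, 2))
    print([tile_needs_guard(t, 2, 5) for t in range(3)])
```

The tail branch in `gemm_peeled` depends only on the problem size, not on the loop index, which is what makes it possible to hoist it out of the K loop.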
@jerryyin jerryyin self-assigned this Nov 22, 2024