'func.func' op uses 873872 bytes of shared memory; exceeded the limit of 65536 bytes using LLVMGPUSIMT #18905

Open
nirvedhmeshram opened this issue Oct 25, 2024 · 7 comments
Labels: codegen (Shared code generation infrastructure and dialects)



nirvedhmeshram commented Oct 25, 2024

For this matmul-like + elementwise IR, we go down the LLVMGPUSIMT pipeline (see the dump here). Today TileAndFuse Vectorize can handle this case correctly, but ideally we want this to be handled by the TileAndFuse Matmul pipeline.

Compile command for SIMT

iree-compile --iree-hal-target-backends=rocm --iree-hip-target=gfx942 input_ir.mlir -o output.vmfb --mlir-print-ir-before-all --mlir-disable-threading &> simt.mlir

Compile command for TileandFuse Vectorize

iree-compile --iree-hal-target-backends=rocm --iree-hip-target=gfx942 input_ir.mlir -o output.vmfb --mlir-print-ir-before-all --mlir-disable-threading --iree-codegen-llvmgpu-test-tile-and-fuse-vectorize &> tileandfuse.mlir

nirvedhmeshram commented Nov 4, 2024

I am realizing that the matmul-like op shared in the gist in this issue is quite edge-casey, so we might want to look at the whole model and think about why we have this matmul-like op and whether we should have done something differently in pre-processing to avoid reaching it. Here is the front-end program causing this shape:

module {
  func.func @main_graph(%arg1: !torch.vtensor<[1,64,112,112],f32> , %arg2: !torch.vtensor<[1,128,56,56],f32>, %arg3: !torch.vtensor<[64,64,3,3],f32>, %arg4: !torch.vtensor<[64],f32> , %arg5: !torch.vtensor<[128,64,1,1],f32>, %arg6: !torch.vtensor<[128],f32>) -> !torch.vtensor<[1,128,56,56],f32>   attributes {torch.onnx_meta.ir_version = 8 : si64, torch.onnx_meta.opset_version = 21 : si64, torch.onnx_meta.opset_versions = {ai.onnx.contrib = 1 : si64, ai.onnx.ml = 4 : si64, ai.onnx.preview.training = 1 : si64, ai.onnx.training = 1 : si64, com.microsoft = 1 : si64, com.microsoft.experimental = 1 : si64, com.microsoft.nchwc = 1 : si64, org.pytorch.aten = 1 : si64}, torch.onnx_meta.producer_name = "vai_q_onnx", torch.onnx_meta.producer_version = "1.17.0+43059a7"} {
    %1 = torch.operator "onnx.Constant"() {torch.onnx.value = dense<0> : tensor<si8>} : () -> !torch.vtensor<[],si8> 
    %2 = torch.operator "onnx.Constant"() {torch.onnx.value = dense<1.562500e-02> : tensor<f32>} : () -> !torch.vtensor<[],f32> 
    %3 = torch.operator "onnx.QuantizeLinear"(%arg2, %2, %1) : (!torch.vtensor<[1,128,56,56],f32>, !torch.vtensor<[],f32>, !torch.vtensor<[],si8>) -> !torch.vtensor<[1,128,56,56],si8> 
    %4 = torch.operator "onnx.Conv"(%arg1, %arg3, %arg4) {torch.onnx.dilations = [1 : si64, 1 : si64], torch.onnx.group = 1 : si64, torch.onnx.kernel_shape = [3 : si64, 3 : si64], torch.onnx.pads = [1 : si64, 1 : si64, 1 : si64, 1 : si64], torch.onnx.strides = [2 : si64, 2 : si64]} : (!torch.vtensor<[1,64,112,112],f32>, !torch.vtensor<[64,64,3,3],f32>, !torch.vtensor<[64],f32>) -> !torch.vtensor<[1,64,56,56],f32> 
    %5 = torch.operator "onnx.DequantizeLinear"(%3, %2, %1) : (!torch.vtensor<[1,128,56,56],si8>, !torch.vtensor<[],f32>, !torch.vtensor<[],si8>) -> !torch.vtensor<[1,128,56,56],f32> 
    %6 = torch.operator "onnx.Relu"(%4) : (!torch.vtensor<[1,64,56,56],f32>) -> !torch.vtensor<[1,64,56,56],f32> 
    %7 = torch.operator "onnx.Conv"(%6, %arg5, %arg6) {torch.onnx.dilations = [1 : si64, 1 : si64], torch.onnx.group = 1 : si64, torch.onnx.kernel_shape = [1 : si64, 1 : si64], torch.onnx.pads = [0 : si64, 0 : si64, 0 : si64, 0 : si64], torch.onnx.strides = [1 : si64, 1 : si64]} : (!torch.vtensor<[1,64,56,56],f32>, !torch.vtensor<[128,64,1,1],f32>, !torch.vtensor<[128],f32>) -> !torch.vtensor<[1,128,56,56],f32> 
    %8 = torch.operator "onnx.Constant"() {torch.onnx.value = dense<0> : tensor<si8>} : () -> !torch.vtensor<[],si8> 
    %9 = torch.operator "onnx.Constant"() {torch.onnx.value = dense<1.562500e-02> : tensor<f32>} : () -> !torch.vtensor<[],f32> 
    %10 = torch.operator "onnx.QuantizeLinear"(%7, %9, %8) : (!torch.vtensor<[1,128,56,56],f32>, !torch.vtensor<[],f32>, !torch.vtensor<[],si8>) -> !torch.vtensor<[1,128,56,56],si8> 
    %11 = torch.operator "onnx.DequantizeLinear"(%10, %9, %8) : (!torch.vtensor<[1,128,56,56],si8>, !torch.vtensor<[],f32>, !torch.vtensor<[],si8>) -> !torch.vtensor<[1,128,56,56],f32> 
    %12 = torch.operator "onnx.Add"(%11, %5) : (!torch.vtensor<[1,128,56,56],f32>, !torch.vtensor<[1,128,56,56],f32>) -> !torch.vtensor<[1,128,56,56],f32> 
    %13 = torch.operator "onnx.Relu"(%12) : (!torch.vtensor<[1,128,56,56],f32>) -> !torch.vtensor<[1,128,56,56],f32> 
    return %13 : !torch.vtensor<[1,128,56,56],f32>
  }
}

cc @MaheshRavishankar @IanWood1


IanWood1 commented Nov 4, 2024

The matmul is coming from %7 = torch.operator "onnx.Conv" which gets converted to (compile-to=input):

%13 = linalg.conv_2d_nchw_fchw {dilations = dense<1> : vector<2xi64>, strides = dense<1> : vector<2xi64>} ins(%12, %4 : tensor<1x64x56x56xf32>, tensor<128x64x1x1xf32>) outs(%broadcasted_3 : tensor<1x128x56x56xf32>) -> tensor<1x128x56x56xf32>

And then generalized to the linalg.generic matmul-like op (from #18736). @nirvedhmeshram I'm not sure how this could be better represented or what the issue is, however.
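
For reference, here is a minimal sketch (not the exact IR from the dump; the leading batch dim of 1 and the fused quantize/dequantize consumers are omitted, and the function/value names are made up) of what the generalized matmul-like linalg.generic could look like shape-wise, with the two 56-sized spatial dims kept separate:

#mapA = affine_map<(d0, d1, d2, d3) -> (d0, d3)>
#mapB = affine_map<(d0, d1, d2, d3) -> (d3, d1, d2)>
#mapC = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>
// d0 = 128 (output channels), d1 x d2 = 56 x 56 (spatial), d3 = 64 (reduction).
func.func @matmul_like(%a: tensor<128x64xf32>, %b: tensor<64x56x56xf32>,
                       %c: tensor<128x56x56xf32>) -> tensor<128x56x56xf32> {
  %0 = linalg.generic {
      indexing_maps = [#mapA, #mapB, #mapC],
      iterator_types = ["parallel", "parallel", "parallel", "reduction"]}
      ins(%a, %b : tensor<128x64xf32>, tensor<64x56x56xf32>)
      outs(%c : tensor<128x56x56xf32>) {
  ^bb0(%in_a: f32, %in_b: f32, %out: f32):
    %m = arith.mulf %in_a, %in_b : f32
    %s = arith.addf %out, %m : f32
    linalg.yield %s : f32
  } -> tensor<128x56x56xf32>
  return %0 : tensor<128x56x56xf32>
}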

nirvedhmeshram (author) replied:

> The matmul is coming from %7 = torch.operator "onnx.Conv" which gets converted to (compile-to=input):
>
> %13 = linalg.conv_2d_nchw_fchw {dilations = dense<1> : vector<2xi64>, strides = dense<1> : vector<2xi64>} ins(%12, %4 : tensor<1x64x56x56xf32>, tensor<128x64x1x1xf32>) outs(%broadcasted_3 : tensor<1x128x56x56xf32>) -> tensor<1x128x56x56xf32>
>
> And then generalized to the linalg.generic matmul-like op (from #18736). @nirvedhmeshram I'm not sure how this could be better represented or what the issue is, however.

Thanks for taking a look. I don't think there is any reason it can't be supported; it probably just needs new configuration logic (and then we see if anything breaks), so we can take a look at adding that.


jerryyin commented Nov 5, 2024

So far my progress is:

  1. Understood the matmul-like IR. It is a combination of:
    • A matmul-like op, except that the gemmB argument has one additional trailing dimension. If I concat [d1, d2] together, this is exactly a matmul (see the sketch at the end of this comment).
    • Trailing elementwise ops consisting of two independent quantize-dequantize routes, with their results added together.
  2. I can confirm that I reproduced the context of this ticket:
    • The baseline SIMT path hits a failure at GPUCheckResourceUsagePass. It is quite understandable why this lowering exceeds the LDS size limit: the IR attempts to naively load a fair number of tensors from global memory to LDS.
    • --iree-codegen-llvmgpu-test-tile-and-fuse-matmul yields the same kind of failure as baseline SIMT.
    • --iree-codegen-llvmgpu-test-tile-and-fuse-vectorize passes the resource check and compiles successfully.
  3. Per @nirvedhmeshram's pointer, I started looking at why TileAndFuse matmul doesn't succeed here. If I force it through, the failure is that it fails to set a root config.
  4. Per the discussion above, it seems reasonable that we yield a matmul-like op from this particular convolution:
    • The convolution is: filter: 128x64x1x1; input: 1x64x56x56; output: 1x128x56x56
    • The matmul-like op is: gemmA: 128x64; gemmB: 64x56x56; gemmC: 128x56x56
    • The matmul-like snippet seems to map 1:1 to the ONNX code provided by @nirvedhmeshram, except for the first conv + relu. I believe the high-level pipeline must have hoisted the ops below out of the function, as the 3x3 conv + relu may be a common pattern in high-level graph execution.
    %4 = torch.operator "onnx.Conv"(%arg1, %arg3, %arg4) {torch.onnx.dilations = [1 : si64, 1 : si64], torch.onnx.group = 1 : si64, torch.onnx.kernel_shape = [3 : si64, 3 : si64], torch.onnx.pads = [1 : si64, 1 : si64, 1 : si64, 1 : si64], torch.onnx.strides = [2 : si64, 2 : si64]} : (!torch.vtensor<[1,64,112,112],f32>, !torch.vtensor<[64,64,3,3],f32>, !torch.vtensor<[64],f32>) -> !torch.vtensor<[1,64,56,56],f32> 
    %6 = torch.operator "onnx.Relu"(%4) : (!torch.vtensor<[1,64,56,56],f32>) -> !torch.vtensor<[1,64,56,56],f32> 

Next steps:

  • Compare TileAndFuse matmul and TileAndFuse vectorize, looking in particular at how the two options impact LLVMGPUSelectLoweringStrategyPass, and understand why one passes but the other fails.
  • Remove the trailing ops from the matmul-like IR and see whether TileAndFuse matmul is able to handle it.
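
As a rough illustration of the "concat [d1, d2]" observation in item 1 above (a hedged sketch: the fused elementwise ops and the leading batch dim of 1 are omitted, and the function/value names are made up), collapsing the trailing 56x56 dims of gemmB and gemmC turns the op into a plain matmul, and 3136 = 56*56 is divisible by 16:

func.func @collapsed_matmul(%a: tensor<128x64xf32>, %b: tensor<64x56x56xf32>,
                            %c: tensor<128x56x56xf32>) -> tensor<128x3136xf32> {
  // Fold the two 56-sized spatial dims of gemmB and gemmC into one 3136-sized dim.
  %b2 = tensor.collapse_shape %b [[0], [1, 2]]
      : tensor<64x56x56xf32> into tensor<64x3136xf32>
  %c2 = tensor.collapse_shape %c [[0], [1, 2]]
      : tensor<128x56x56xf32> into tensor<128x3136xf32>
  // With the collapsed operands this is an ordinary 128x3136x64 matmul.
  %mm = linalg.matmul ins(%a, %b2 : tensor<128x64xf32>, tensor<64x3136xf32>)
                      outs(%c2 : tensor<128x3136xf32>) -> tensor<128x3136xf32>
  return %mm : tensor<128x3136xf32>
}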


IanWood1 commented Nov 5, 2024

> A matmul-like op, except that the gemmB argument has one additional trailing dimension. If I concat [d1, d2] together, this is exactly a matmul.

Yeah, this is something that should be handled by compiler/src/iree/compiler/DispatchCreation/CollapseDimensions.cpp, but it's currently blocked by some codegen issues that should be resolved shortly (#18822):

// TODO(#17948) GPU codegen fails when we collapse the dimensions of softmax.
if (llvm::any_of(genericOp.getDpsInputOperands(),
                 [&](OpOperand *operand) -> bool {
                   auto genericOperand =
                       operand->get().getDefiningOp<linalg::GenericOp>();
                   if (!genericOperand)
                     return false;
                   if (genericOperand.getNumReductionLoops() == 0)
                     return false;
                   return genericOp.getMatchingIndexingMap(operand)
                       .isProjectedPermutation();
                 })) {
  return false;
}

I quickly commented this check out and was able to get it to compile successfully.


jerryyin commented Nov 5, 2024

@IanWood1 Thanks for putting in the extra effort to get it to compile. It is not immediately clear to me which line you commented out to make it work, but I assume you made it so that 56x56 is collapsed into a single dimension. With that single dimension, TileAndFuse matmul would work correctly (@nirvedhmeshram, by contrast, pointed out that he thinks 56x56 not being divisible by 64 could be the problem, and kindly provided the tracking ticket for making that work, here). Assuming I understand the background correctly, this means that once #18822 gets merged, the symptom in this ticket will go away, as you've demonstrated. That's good news and reassures me that this ticket will be resolved even if I don't do anything about it :-p

I chatted with @nirvedhmeshram a moment ago; a few things still aren't fully explained and are worth following through on:

  • Between TileAndFuse and TileAndFuse matmul, the IR before SelectLoweringStrategy is exactly the same. I pulled that module out and ran the standalone pass through iree-opt, and it finishes without any problem.
    • This is counterintuitive; I need to get to the bottom of why the pass is able to compile when pulled out and run on its own.
  • Per @nirvedhmeshram's suggestion, I should manually test 64x64 and see if full compilation can finish.
  • Finish the next-steps plan above to confirm that the reason the existing compilation fails is indeed what we assume it is.


jerryyin commented Nov 6, 2024

Summary and conclusions:

  1. Why was TileAndFuse matmul failing?
    It fails to deduce a valid MMA schedule (setMatmulLoweringConfig()). In particular, the GEMM dimensions are [m: 56, n: 128, k: 64], and 56 is not divisible by the 16x16x4 MFMA intrinsic's tile (56 % 16 = 8), so per the check below it gives up:

    if (mustBeAligned && (problem.mSizes.back() % intrinsic.mSizes[0] != 0 ||
                          problem.nSizes.back() % intrinsic.nSizes[0] != 0 ||
                          problem.kSizes.back() % intrinsic.kSizes[0] != 0)) {

  2. Why does TileAndFuse vectorize pass?
    It goes down a different routine (setTileAndFuseLoweringConfig()). For the problem size of [128, 56, 56], the picked workgroup tile size is [4, 8, 8] and the per-thread tile size is [1, 1, 4]. (Side topic:) that is (4*8*8)/(1*1*4) = 64 threads, and since we have 64 threads per wave, this configuration uses one wave per workgroup. From a performance point of view, the occupancy of the generated kernel seems quite poor.

  3. Why was I able to get TileAndFuse matmul to pass by pulling out the IR before the SelectLoweringStrategy pass?
    Because I forgot the --iree-codegen-llvmgpu-test-tile-and-fuse-matmul argument when invoking iree-opt, so it fell back to setVectorDistributionConfig(). The part I don't understand is how to make sense of the lowering_config<tile_sizes = [[1, 64, 128, 64]]> for the dimension sizes [128, 56, 56, 64]; I'd like to understand more about how the conventional SIMT pass allocates tiles.

  4. What is the minimal change to matmullike.mlir to make it pass?
    The minimal change is to make the fastest-changing dimension of the gemmA tensor shape 64 instead of 56, so that it is divisible by 16. The new shape is [k: 64, batch: 56, m: 64]. The updated IR that passes is here. I can confirm that f32.16x16x4f32 MFMA instructions are generated in the LLVM dialect. However, if I instead make the non-fastest-changing dimension 64, it still fails to compile. I think this is a bug that should be addressed: any dimension divisible by 16 should be permitted in the existing TileAndFuse matmul pipeline.

For next steps, this ticket is on hold; resolving either item below will make the problem in this ticket go away:

  • The workaround pointed out by @IanWood1: collapse the last two dimensions together, which makes the result divisible by 16.
  • Padding support: ideally we should pick the single leftover dimension that requires the least amount of padding to make it divisible by 16 (a sketch of one possible pad follows below).
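
A minimal sketch of the padding idea (purely illustrative; which dimension to pad, where the pad is introduced, and how the result is sliced back afterwards are all open questions), padding a 56-sized dimension of gemmB up to 64 so it becomes divisible by 16:

func.func @pad_gemm_dim(%b: tensor<64x56x56xf32>) -> tensor<64x56x64xf32> {
  %cst = arith.constant 0.000000e+00 : f32
  // Pad the trailing 56-sized dim by 8 (56 -> 64); the matching dim of gemmC
  // would be padded the same way, and the result sliced back to 56 afterwards.
  %padded = tensor.pad %b low[0, 0, 0] high[0, 0, 8] {
  ^bb0(%i: index, %j: index, %k: index):
    tensor.yield %cst : f32
  } : tensor<64x56x56xf32> to tensor<64x56x64xf32>
  return %padded : tensor<64x56x64xf32>
}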

jerryyin added the codegen (Shared code generation infrastructure and dialects) label on Nov 6, 2024
zjgarvey added this to the Scalability on AMD GPU milestone on Nov 18, 2024