'func.func' op uses 873872 bytes of shared memory; exceeded the limit of 65536 bytes using LLVMGPUSIMT #18905

Open
nirvedhmeshram opened this issue Oct 25, 2024 · 7 comments
Labels: codegen (Shared code generation infrastructure and dialects)



nirvedhmeshram commented Oct 25, 2024

For this matmul-like + elementwise IR, we go down the LLVMGPUSIMT pipeline (see the dump here). Today TileAndFuse Vectorize can handle this case correctly, but ideally we want this to be handled by the TileAndFuse Matmul pipeline.

Compile command for SIMT

iree-compile --iree-hal-target-backends=rocm --iree-hip-target=gfx942 input_ir.mlir -o output.vmfb --mlir-print-ir-before-all --mlir-disable-threading &> simt.mlir

Compile command for TileandFuse Vectorize

iree-compile --iree-hal-target-backends=rocm --iree-hip-target=gfx942 input_ir.mlir -o output.vmfb --mlir-print-ir-before-all --mlir-disable-threading --iree-codegen-llvmgpu-test-tile-and-fuse-vectorize &> tileandfuse.mlir

nirvedhmeshram commented Nov 4, 2024

I am realizing that the matmul-like op shared in the gist in this issue is quite edge-casey, so we might want to look at the whole model and think about why we have this matmul-like op and whether we should have done something differently in pre-processing to avoid reaching it. Here is the front-end program causing this shape:

module {
  func.func @main_graph(%arg1: !torch.vtensor<[1,64,112,112],f32> , %arg2: !torch.vtensor<[1,128,56,56],f32>, %arg3: !torch.vtensor<[64,64,3,3],f32>, %arg4: !torch.vtensor<[64],f32> , %arg5: !torch.vtensor<[128,64,1,1],f32>, %arg6: !torch.vtensor<[128],f32>) -> !torch.vtensor<[1,128,56,56],f32>   attributes {torch.onnx_meta.ir_version = 8 : si64, torch.onnx_meta.opset_version = 21 : si64, torch.onnx_meta.opset_versions = {ai.onnx.contrib = 1 : si64, ai.onnx.ml = 4 : si64, ai.onnx.preview.training = 1 : si64, ai.onnx.training = 1 : si64, com.microsoft = 1 : si64, com.microsoft.experimental = 1 : si64, com.microsoft.nchwc = 1 : si64, org.pytorch.aten = 1 : si64}, torch.onnx_meta.producer_name = "vai_q_onnx", torch.onnx_meta.producer_version = "1.17.0+43059a7"} {
    %1 = torch.operator "onnx.Constant"() {torch.onnx.value = dense<0> : tensor<si8>} : () -> !torch.vtensor<[],si8> 
    %2 = torch.operator "onnx.Constant"() {torch.onnx.value = dense<1.562500e-02> : tensor<f32>} : () -> !torch.vtensor<[],f32> 
    %3 = torch.operator "onnx.QuantizeLinear"(%arg2, %2, %1) : (!torch.vtensor<[1,128,56,56],f32>, !torch.vtensor<[],f32>, !torch.vtensor<[],si8>) -> !torch.vtensor<[1,128,56,56],si8> 
    %4 = torch.operator "onnx.Conv"(%arg1, %arg3, %arg4) {torch.onnx.dilations = [1 : si64, 1 : si64], torch.onnx.group = 1 : si64, torch.onnx.kernel_shape = [3 : si64, 3 : si64], torch.onnx.pads = [1 : si64, 1 : si64, 1 : si64, 1 : si64], torch.onnx.strides = [2 : si64, 2 : si64]} : (!torch.vtensor<[1,64,112,112],f32>, !torch.vtensor<[64,64,3,3],f32>, !torch.vtensor<[64],f32>) -> !torch.vtensor<[1,64,56,56],f32> 
    %5 = torch.operator "onnx.DequantizeLinear"(%3, %2, %1) : (!torch.vtensor<[1,128,56,56],si8>, !torch.vtensor<[],f32>, !torch.vtensor<[],si8>) -> !torch.vtensor<[1,128,56,56],f32> 
    %6 = torch.operator "onnx.Relu"(%4) : (!torch.vtensor<[1,64,56,56],f32>) -> !torch.vtensor<[1,64,56,56],f32> 
    %7 = torch.operator "onnx.Conv"(%6, %arg5, %arg6) {torch.onnx.dilations = [1 : si64, 1 : si64], torch.onnx.group = 1 : si64, torch.onnx.kernel_shape = [1 : si64, 1 : si64], torch.onnx.pads = [0 : si64, 0 : si64, 0 : si64, 0 : si64], torch.onnx.strides = [1 : si64, 1 : si64]} : (!torch.vtensor<[1,64,56,56],f32>, !torch.vtensor<[128,64,1,1],f32>, !torch.vtensor<[128],f32>) -> !torch.vtensor<[1,128,56,56],f32> 
    %8 = torch.operator "onnx.Constant"() {torch.onnx.value = dense<0> : tensor<si8>} : () -> !torch.vtensor<[],si8> 
    %9 = torch.operator "onnx.Constant"() {torch.onnx.value = dense<1.562500e-02> : tensor<f32>} : () -> !torch.vtensor<[],f32> 
    %10 = torch.operator "onnx.QuantizeLinear"(%7, %9, %8) : (!torch.vtensor<[1,128,56,56],f32>, !torch.vtensor<[],f32>, !torch.vtensor<[],si8>) -> !torch.vtensor<[1,128,56,56],si8> 
    %11 = torch.operator "onnx.DequantizeLinear"(%10, %9, %8) : (!torch.vtensor<[1,128,56,56],si8>, !torch.vtensor<[],f32>, !torch.vtensor<[],si8>) -> !torch.vtensor<[1,128,56,56],f32> 
    %12 = torch.operator "onnx.Add"(%11, %5) : (!torch.vtensor<[1,128,56,56],f32>, !torch.vtensor<[1,128,56,56],f32>) -> !torch.vtensor<[1,128,56,56],f32> 
    %13 = torch.operator "onnx.Relu"(%12) : (!torch.vtensor<[1,128,56,56],f32>) -> !torch.vtensor<[1,128,56,56],f32> 
    return %13 : !torch.vtensor<[1,128,56,56],f32>
  }
}

cc @MaheshRavishankar @IanWood1


IanWood1 commented Nov 4, 2024

The matmul is coming from %7 = torch.operator "onnx.Conv" which gets converted to (compile-to=input):

%13 = linalg.conv_2d_nchw_fchw {dilations = dense<1> : vector<2xi64>, strides = dense<1> : vector<2xi64>} ins(%12, %4 : tensor<1x64x56x56xf32>, tensor<128x64x1x1xf32>) outs(%broadcasted_3 : tensor<1x128x56x56xf32>) -> tensor<1x128x56x56xf32>

And then generalized to the linalg.generic matmul-like op (from #18736). @nirvedhmeshram I'm not sure how this could be better represented or what the issue is, however.
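
For reference, here is a minimal sketch (not the exact IR from the dump; the leading batch dim of 1 and the fused quantize/dequantize consumers are omitted, and the function/value names are made up) of what the generalized matmul-like linalg.generic could look like shape-wise, with the two 56-sized spatial dims kept separate:

#mapA = affine_map<(d0, d1, d2, d3) -> (d0, d3)>
#mapB = affine_map<(d0, d1, d2, d3) -> (d3, d1, d2)>
#mapC = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>
// d0 = 128 (output channels), d1 x d2 = 56 x 56 (spatial), d3 = 64 (reduction).
func.func @matmul_like(%a: tensor<128x64xf32>, %b: tensor<64x56x56xf32>,
                       %c: tensor<128x56x56xf32>) -> tensor<128x56x56xf32> {
  %0 = linalg.generic {
      indexing_maps = [#mapA, #mapB, #mapC],
      iterator_types = ["parallel", "parallel", "parallel", "reduction"]}
      ins(%a, %b : tensor<128x64xf32>, tensor<64x56x56xf32>)
      outs(%c : tensor<128x56x56xf32>) {
  ^bb0(%in_a: f32, %in_b: f32, %out: f32):
    %m = arith.mulf %in_a, %in_b : f32
    %s = arith.addf %out, %m : f32
    linalg.yield %s : f32
  } -> tensor<128x56x56xf32>
  return %0 : tensor<128x56x56xf32>
}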

nirvedhmeshram (author) replied:

> The matmul is coming from %7 = torch.operator "onnx.Conv" which gets converted to (compile-to=input):
>
> %13 = linalg.conv_2d_nchw_fchw {dilations = dense<1> : vector<2xi64>, strides = dense<1> : vector<2xi64>} ins(%12, %4 : tensor<1x64x56x56xf32>, tensor<128x64x1x1xf32>) outs(%broadcasted_3 : tensor<1x128x56x56xf32>) -> tensor<1x128x56x56xf32>
>
> And then generalized to the linalg.generic matmul-like op (from #18736). @nirvedhmeshram I'm not sure how this could be better represented or what the issue is, however.

Thanks for taking a look. I don't think there is any reason it can't be supported; it probably just needs new configuration logic (and then we see if anything breaks), so we can take a look at adding that.


jerryyin commented Nov 5, 2024

So far my progress is:

  1. Understood the matmul-like IR. It is a combination of:
    • A matmul-like op, except that the gemmB argument has one additional trailing dimension. If I concat [d1, d2] together, this is exactly a matmul (see the sketch at the end of this comment).
    • Trailing elementwise ops consisting of two independent quantize-dequantize routes, with their results added together.
  2. I can confirm that I reproduced the context of this ticket:
    • The baseline SIMT path hits a failure at GPUCheckResourceUsagePass. It is quite understandable why this lowering exceeds the LDS size limit: the IR attempts to naively load a fair number of tensors from global memory to LDS.
    • --iree-codegen-llvmgpu-test-tile-and-fuse-matmul yields the same kind of failure as baseline SIMT.
    • --iree-codegen-llvmgpu-test-tile-and-fuse-vectorize passes the resource check and compiles successfully.
  3. Per @nirvedhmeshram's pointer, I started looking at why TileAndFuse matmul doesn't succeed here. If I force it through, the failure is that it fails to set a root config.
  4. Per the discussion above, it seems reasonable that we yield a matmul-like op from this particular convolution:
    • The convolution is: filter: 128x64x1x1; input: 1x64x56x56; output: 1x128x56x56
    • The matmul-like op is: gemmA: 128x64; gemmB: 64x56x56; gemmC: 128x56x56
    • The matmul-like snippet seems to map 1:1 to the ONNX code provided by @nirvedhmeshram, except for the first conv + relu. I believe the high-level pipeline must have hoisted the ops below out of the function, as the 3x3 conv + relu may be a common pattern in high-level graph execution.
    %4 = torch.operator "onnx.Conv"(%arg1, %arg3, %arg4) {torch.onnx.dilations = [1 : si64, 1 : si64], torch.onnx.group = 1 : si64, torch.onnx.kernel_shape = [3 : si64, 3 : si64], torch.onnx.pads = [1 : si64, 1 : si64, 1 : si64, 1 : si64], torch.onnx.strides = [2 : si64, 2 : si64]} : (!torch.vtensor<[1,64,112,112],f32>, !torch.vtensor<[64,64,3,3],f32>, !torch.vtensor<[64],f32>) -> !torch.vtensor<[1,64,56,56],f32> 
    %6 = torch.operator "onnx.Relu"(%4) : (!torch.vtensor<[1,64,56,56],f32>) -> !torch.vtensor<[1,64,56,56],f32> 

Next steps:

  • Compare TileAndFuse matmul and TileAndFuse vectorize, looking in particular at how the two options impact LLVMGPUSelectLoweringStrategyPass, and understand why one passes but the other fails.
  • Remove the trailing ops from the matmul-like IR and see whether TileAndFuse matmul is able to handle it.
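
As a rough illustration of the "concat [d1, d2]" observation in item 1 above (a hedged sketch: the fused elementwise ops and the leading batch dim of 1 are omitted, and the function/value names are made up), collapsing the trailing 56x56 dims of gemmB and gemmC turns the op into a plain matmul, and 3136 = 56*56 is divisible by 16:

func.func @collapsed_matmul(%a: tensor<128x64xf32>, %b: tensor<64x56x56xf32>,
                            %c: tensor<128x56x56xf32>) -> tensor<128x3136xf32> {
  // Fold the two 56-sized spatial dims of gemmB and gemmC into one 3136-sized dim.
  %b2 = tensor.collapse_shape %b [[0], [1, 2]]
      : tensor<64x56x56xf32> into tensor<64x3136xf32>
  %c2 = tensor.collapse_shape %c [[0], [1, 2]]
      : tensor<128x56x56xf32> into tensor<128x3136xf32>
  // With the collapsed operands this is an ordinary 128x3136x64 matmul.
  %mm = linalg.matmul ins(%a, %b2 : tensor<128x64xf32>, tensor<64x3136xf32>)
                      outs(%c2 : tensor<128x3136xf32>) -> tensor<128x3136xf32>
  return %mm : tensor<128x3136xf32>
}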


IanWood1 commented Nov 5, 2024

> A matmul-like op, except that the gemmB argument has one additional trailing dimension. If I concat [d1, d2] together, this is exactly a matmul.

Yeah, this is something that should be handled by compiler/src/iree/compiler/DispatchCreation/CollapseDimensions.cpp, but it's currently blocked by some codegen issues that should be resolved shortly (#18822):

// TODO(#17948) GPU codegen fails when we collapse the dimensions of softmax.
if (llvm::any_of(genericOp.getDpsInputOperands(),
                 [&](OpOperand *operand) -> bool {
                   auto genericOperand =
                       operand->get().getDefiningOp<linalg::GenericOp>();
                   if (!genericOperand)
                     return false;
                   if (genericOperand.getNumReductionLoops() == 0)
                     return false;
                   return genericOp.getMatchingIndexingMap(operand)
                       .isProjectedPermutation();
                 })) {
  return false;
}

I quickly commented this check out and was able to get it to compile successfully.


jerryyin commented Nov 5, 2024

@IanWood1 Thanks for putting in the extra effort to get it to compile. It is not immediately clear to me which line you commented out to make it work, but I assume you made it so that 56x56 is collapsed into a single dimension. With that single dimension, TileAndFuse matmul would work correctly (@nirvedhmeshram, by contrast, pointed out that he thinks 56x56 not being divisible by 64 could be the problem, and kindly provided the tracking ticket for making that work, here). Assuming I understand the background correctly, this means that once #18822 gets merged, the symptom in this ticket will go away, as you've demonstrated. That's good news and reassures me that this ticket will be resolved even if I don't do anything about it :-p

I chatted with @nirvedhmeshram a moment ago; a few things still aren't fully explained and are worth following through on:

  • Between TileAndFuse and TileAndFuse matmul, the IR before SelectLoweringStrategy is exactly the same. I pulled that module out and ran the standalone pass through iree-opt, and it finishes without any problem.
    • This is counterintuitive; I need to get to the bottom of why the pass is able to compile when pulled out and run on its own.
  • Per @nirvedhmeshram's suggestion, I should manually test 64x64 and see if full compilation can finish.
  • Finish the next-steps plan above to confirm that the reason the existing compilation fails is indeed what we assume it is.


jerryyin commented Nov 6, 2024

Summary and conclusions:

  1. Why was TileAndFuse matmul failing?
    It fails to deduce a valid MMA schedule (setMatmulLoweringConfig()). In particular, the GEMM dimensions are [m: 56, n: 128, k: 64], and 56 is not divisible by the 16x16x4 MFMA intrinsic's tile (56 % 16 = 8), so per the check below it gives up:

    if (mustBeAligned && (problem.mSizes.back() % intrinsic.mSizes[0] != 0 ||
                          problem.nSizes.back() % intrinsic.nSizes[0] != 0 ||
                          problem.kSizes.back() % intrinsic.kSizes[0] != 0)) {

  2. Why does TileAndFuse vectorize pass?
    It goes down a different routine (setTileAndFuseLoweringConfig()). For the problem size of [128, 56, 56], the picked workgroup tile size is [4, 8, 8] and the per-thread tile size is [1, 1, 4]. (Side topic:) that is (4*8*8)/(1*1*4) = 64 threads, and since we have 64 threads per wave, this configuration uses one wave per workgroup. From a performance point of view, the occupancy of the generated kernel seems quite poor.

  3. Why was I able to get TileAndFuse matmul to pass by pulling out the IR before the SelectLoweringStrategy pass?
    Because I forgot the --iree-codegen-llvmgpu-test-tile-and-fuse-matmul argument when invoking iree-opt, so it fell back to setVectorDistributionConfig(). The part I don't understand is how to make sense of the lowering_config<tile_sizes = [[1, 64, 128, 64]]> for the dimension sizes [128, 56, 56, 64]; I'd like to understand more about how the conventional SIMT pass allocates tiles.

  4. What is the minimal change to matmullike.mlir to make it pass?
    The minimal change is to make the fastest-changing dimension of the gemmA tensor shape 64 instead of 56, so that it is divisible by 16. The new shape is [k: 64, batch: 56, m: 64]. The updated IR that passes is here. I can confirm that f32.16x16x4f32 MFMA instructions are generated in the LLVM dialect. However, if I instead make the non-fastest-changing dimension 64, it still fails to compile. I think this is a bug that should be addressed: any dimension divisible by 16 should be permitted in the existing TileAndFuse matmul pipeline.

For next steps, this ticket is on hold; resolving either item below will make the problem in this ticket go away:

  • The workaround pointed out by @IanWood1: collapse the last two dimensions together, which makes the result divisible by 16.
  • Padding support: ideally we should pick the single leftover dimension that requires the least amount of padding to make it divisible by 16 (a sketch of one possible pad follows below).
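
A minimal sketch of the padding idea (purely illustrative; which dimension to pad, where the pad is introduced, and how the result is sliced back afterwards are all open questions), padding a 56-sized dimension of gemmB up to 64 so it becomes divisible by 16:

func.func @pad_gemm_dim(%b: tensor<64x56x56xf32>) -> tensor<64x56x64xf32> {
  %cst = arith.constant 0.000000e+00 : f32
  // Pad the trailing 56-sized dim by 8 (56 -> 64); the matching dim of gemmC
  // would be padded the same way, and the result sliced back to 56 afterwards.
  %padded = tensor.pad %b low[0, 0, 0] high[0, 0, 8] {
  ^bb0(%i: index, %j: index, %k: index):
    tensor.yield %cst : f32
  } : tensor<64x56x56xf32> to tensor<64x56x64xf32>
  return %padded : tensor<64x56x64xf32>
}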

jerryyin added the codegen (Shared code generation infrastructure and dialects) label on Nov 6, 2024
zjgarvey added this to the Scalability on AMD GPU milestone on Nov 18, 2024