error: One or more operations with large vector sizes (8192 bytes) were found: #18677
There is a tensor.extract_slice created after distribution, which blocks further TileAndFuse, so it generates large vectors. Someone needs to check why the extract_slice is created and fix it. If it is not fixable, we can try adding the

```mlir
// -----// IR Dump After LowerExecutableUsingTransformDialectPass (iree-codegen-lower-executable-using-transform-dialect) //----- //
module {
func.func @torch_jit$async_dispatch_24_unpack_elementwise_1x1000_f32_dispatch_0_unpack_elementwise_1x1000_f32() attributes {translation_info = #iree_codegen.translation_info<CPUDoubleTilingExpert>} {
%cst = arith.constant 0.000000e+00 : f32
%c0 = arith.constant 0 : index
%0 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<1x250x8x4xf32>>
%1 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(1) alignment(64) offset(%c0) flags(Indirect) : !flow.dispatch.tensor<writeonly:tensor<1x1000xf32>>
%2 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0, 0], sizes = [1, 250, 8, 4], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<1x250x8x4xf32>> -> tensor<1x250x8x4xf32>
%3 = tensor.empty() : tensor<1x1000xf32>
%unpack = tensor.unpack %2 outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, 4] into %3 {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[0, 1000], [8, 4], [0, 0], [0, 0]]>} : tensor<1x250x8x4xf32> -> tensor<1x1000xf32>
%4 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%unpack : tensor<1x1000xf32>) outs(%3 : tensor<1x1000xf32>) attrs = {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[0, 1000], [8, 4], [0, 0], [0, 0]]>} {
^bb0(%in: f32, %out: f32):
%5 = arith.addf %in, %cst : f32
linalg.yield %5 : f32
} -> tensor<1x1000xf32>
flow.dispatch.tensor.store %4, %1, offsets = [0, 0], sizes = [1, 1000], strides = [1, 1] : tensor<1x1000xf32> -> !flow.dispatch.tensor<writeonly:tensor<1x1000xf32>>
return
}
}
```
```mlir
// -----// IR Dump After TileAndDistributeToWorkgroupsPass (iree-codegen-tile-and-distribute-to-workgroups) //----- //
func.func @torch_jit$async_dispatch_24_unpack_elementwise_1x1000_f32_dispatch_0_unpack_elementwise_1x1000_f32() attributes {translation_info = #iree_codegen.translation_info<CPUDoubleTilingExpert>} {
%c250 = arith.constant 250 : index
%c1000 = arith.constant 1000 : index
%cst = arith.constant 0.000000e+00 : f32
%c0 = arith.constant 0 : index
%0 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<1x250x8x4xf32>>
%1 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(1) alignment(64) offset(%c0) flags(Indirect) : !flow.dispatch.tensor<writeonly:tensor<1x1000xf32>>
%2 = flow.dispatch.tensor.load %0, offsets = [0, %c0, 0, 0], sizes = [1, %c250, 8, 4], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<1x250x8x4xf32>> -> tensor<1x?x8x4xf32>
%3 = tensor.empty() : tensor<8x1000xf32>
%unpack = tensor.unpack %2 outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, 4] into %3 {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[0, 1000], [8, 4], [0, 0], [0, 0]]>} : tensor<1x?x8x4xf32> -> tensor<8x1000xf32>
%4 = tensor.empty() : tensor<1x1000xf32>
%extracted_slice = tensor.extract_slice %unpack[0, 0] [1, 1000] [1, 1] : tensor<8x1000xf32> to tensor<1x1000xf32>
%5 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%extracted_slice : tensor<1x1000xf32>) outs(%4 : tensor<1x1000xf32>) attrs = {lowering_config = #iree_codegen.lowering_config<tile_sizes = [[0, 1000], [8, 4], [0, 0], [0, 0]]>} {
^bb0(%in: f32, %out: f32):
%6 = arith.addf %in, %cst : f32
linalg.yield %6 : f32
} -> tensor<1x1000xf32>
%cast = tensor.cast %5 : tensor<1x1000xf32> to tensor<1x?xf32>
flow.dispatch.tensor.store %cast, %1, offsets = [0, %c0], sizes = [1, %c1000], strides = [1, 1] : tensor<1x?xf32> -> !flow.dispatch.tensor<writeonly:tensor<1x1000xf32>>
return
}
```
Good point, which made me notice: isn't this unpack wrong? It should be
This issue seems to be a friend of #18603 that we have.
cc @zjgarvey
This unpack is valid because there are extract_slice semantics in unpack ops. You can think of it as the inverse operation of the pack op: the pack op has padding semantics, and the unpack op has extract_slice semantics. It is valid to fold the extract_slice into the unpack:

```mlir
%unpack = tensor.unpack %0
    outer_dims_perm = [0, 1]
    inner_dims_pos = [0, 1]
    inner_tiles = [8, 4] into %1 : tensor<1x250x8x4xf32> -> tensor<1x1000xf32>
```
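To make the padding/extract_slice pairing concrete, here is a hand-written sketch (not taken from this dispatch; `%src`, `%pad`, and `%dest` are placeholder values) of the pack that would produce the 1x250x8x4 layout from a 1x1000 tensor:

```mlir
// Illustrative only: %src is tensor<1x1000xf32>, %dest is tensor<1x250x8x4xf32>,
// and %pad is an f32 padding value. Dim 0 (size 1) is padded up to the inner
// tile of 8, and dim 1 (size 1000) is split into 250 tiles of 4.
%packed = tensor.pack %src padding_value(%pad : f32)
    outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, 4]
    into %dest : tensor<1x1000xf32> -> tensor<1x250x8x4xf32>
```

The matching unpack only reads the original 1x1000 region back out of the padded tiles, which is why unpacking into a destination smaller than a whole number of inner tiles is legal.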
Thanks, that makes sense. However, I'm not sure whether we intended to reach this unpack with the extract_slice semantics, or whether it's a bug caused by having shape computation encoded in tensor math.
@nirvedhmeshram I'll focus on getting the where.self op to return scalar arithmetic when possible.
Sounds good. I will check whether we want to support unpack-with-extract_slice fusion as well, and either add that support or disable this fusion in such cases, depending on where that discussion goes.
Yes, it is intended; it is not a bug. I guess the behavior is triggered by the tiling implementation: https://github.com/llvm/llvm-project/blob/fc4b1a303b296d02f6243a083510c4ee7f290ab0/mlir/lib/Dialect/Tensor/IR/TensorTilingInterfaceImpl.cpp#L561-L588

Looking at the implementation, I think the issue is that it is not treated as a perfect tiling case. We can try to enhance the logic. My guess is that the value of

One way to enhance the logic is passing the size of the destination tensor to the getUnpackTileDimInfo function. If the sizes match, it returns the perfect tiling config.
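To make the perfect-tiling distinction concrete, here is a hand-written sketch (not compiler output; `%src` and the `%empty_*` values are placeholders, and shapes are simplified to be static) contrasting the two outcomes for the [1, 1000] workgroup tile of this unpack:

```mlir
// Non-perfect-tiling path (what the dump above shows): the tile is rounded up
// to whole 8x4 inner tiles, so an oversized 8x1000 buffer is materialized and
// the real 1x1000 result is carved out afterwards.
%rounded = tensor.unpack %src outer_dims_perm = [0, 1] inner_dims_pos = [0, 1]
    inner_tiles = [8, 4] into %empty_8x1000
    : tensor<1x250x8x4xf32> -> tensor<8x1000xf32>
%slice = tensor.extract_slice %rounded[0, 0] [1, 1000] [1, 1]
    : tensor<8x1000xf32> to tensor<1x1000xf32>

// Perfect-tiling path (if the destination size were consulted): the tile covers
// the entire 1x1000 destination, so the unpack can write into it directly and
// no extract_slice is needed.
%direct = tensor.unpack %src outer_dims_perm = [0, 1] inner_dims_pos = [0, 1]
    inner_tiles = [8, 4] into %empty_1x1000
    : tensor<1x250x8x4xf32> -> tensor<1x1000xf32>
```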
… generic ops (#3762)

This is motivated by the fact that shapes are stored as tensors in ONNX, and IREE tries to perform tensor arithmetic on the device. This causes unnecessary dispatches and makes it harder for the compiler to reason about shapes. Here is a small snippet of torch-IR that is typically seen coming from ONNX models:

```mlir
module {
  func.func @main_graph(%arg0: !torch.vtensor<[?,?,768],f32>, %arg1: !torch.vtensor<[?,?,768],f32>) -> !torch.vtensor<[],si64> {
    %int0 = torch.constant.int 0
    %0 = torch.vtensor.literal(dense<0> : tensor<1xsi64>) : !torch.vtensor<[1],si64>
    %1 = torch.aten._shape_as_tensor %arg1 : !torch.vtensor<[?,?,768],f32> -> !torch.vtensor<[3],si64>
    %2 = torch.aten.index_select %1, %int0, %0 : !torch.vtensor<[3],si64>, !torch.int, !torch.vtensor<[1],si64> -> !torch.vtensor<[1],si64>
    %3 = torch.aten.squeeze.dim %2, %int0 : !torch.vtensor<[1],si64>, !torch.int -> !torch.vtensor<[],si64>
    %4 = torch.aten.item %3 : !torch.vtensor<[],si64> -> !torch.int
    %5 = torch.aten.eq.int %4, %int0 : !torch.int, !torch.int -> !torch.bool
    %6 = torch.aten.Int.bool %5 : !torch.bool -> !torch.int
    %7 = torch.aten.size.int %arg0, %int0 : !torch.vtensor<[?,?,768],f32>, !torch.int -> !torch.int
    %8 = torch.prim.NumToTensor.Scalar %6 : !torch.int -> !torch.vtensor<[],i1>
    %9 = torch.prim.NumToTensor.Scalar %7 : !torch.int -> !torch.vtensor<[],si64>
    %10 = torch.prim.NumToTensor.Scalar %4 : !torch.int -> !torch.vtensor<[],si64>
    %11 = torch.aten.where.self %8, %9, %10 : !torch.vtensor<[],i1>, !torch.vtensor<[],si64>, !torch.vtensor<[],si64> -> !torch.vtensor<[],si64>
    return %11 : !torch.vtensor<[],si64>
  }
}
```

Without the change in this PR, the result would be:

```mlir
#map = affine_map<() -> ()>
module {
  ml_program.global private mutable @global_seed(dense<0> : tensor<i64>) : tensor<i64>
  func.func @main_graph(%arg0: tensor<?x?x768xf32>, %arg1: tensor<?x?x768xf32>) -> tensor<i64> {
    %c0_i64 = arith.constant 0 : i64
    %c0 = arith.constant 0 : index
    %dim = tensor.dim %arg1, %c0 : tensor<?x?x768xf32>
    %0 = arith.index_cast %dim : index to i64
    %1 = tensor.empty() : tensor<1xi64>
    %collapsed = tensor.collapse_shape %1 [] : tensor<1xi64> into tensor<i64>
    %2 = linalg.fill ins(%0 : i64) outs(%collapsed : tensor<i64>) -> tensor<i64>
    %extracted = tensor.extract %2[] : tensor<i64>
    %3 = arith.cmpi eq, %extracted, %c0_i64 : i64
    %dim_0 = tensor.dim %arg0, %c0 : tensor<?x?x768xf32>
    %4 = arith.index_cast %dim_0 : index to i64
    %5 = tensor.empty() : tensor<i1>
    %6 = linalg.fill ins(%3 : i1) outs(%5 : tensor<i1>) -> tensor<i1>
    %7 = tensor.empty() : tensor<i64>
    %8 = linalg.fill ins(%4 : i64) outs(%7 : tensor<i64>) -> tensor<i64>
    %9 = linalg.fill ins(%extracted : i64) outs(%7 : tensor<i64>) -> tensor<i64>
    %10 = linalg.generic {indexing_maps = [#map, #map, #map, #map], iterator_types = []} ins(%6, %8, %9 : tensor<i1>, tensor<i64>, tensor<i64>) outs(%7 : tensor<i64>) {
    ^bb0(%in: i1, %in_1: i64, %in_2: i64, %out: i64):
      %11 = arith.select %in, %in_1, %in_2 : i64
      linalg.yield %11 : i64
    } -> tensor<i64>
    return %10 : tensor<i64>
  }
}
```

With the change in this PR, we would instead get:

```mlir
module {
  ml_program.global private mutable @global_seed(dense<0> : tensor<i64>) : tensor<i64>
  func.func @main_graph(%arg0: tensor<?x?x768xf32>, %arg1: tensor<?x?x768xf32>) -> tensor<i64> {
    %c0_i64 = arith.constant 0 : i64
    %c0 = arith.constant 0 : index
    %dim = tensor.dim %arg1, %c0 : tensor<?x?x768xf32>
    %0 = arith.index_cast %dim : index to i64
    %1 = tensor.empty() : tensor<1xi64>
    %collapsed = tensor.collapse_shape %1 [] : tensor<1xi64> into tensor<i64>
    %2 = linalg.fill ins(%0 : i64) outs(%collapsed : tensor<i64>) -> tensor<i64>
    %extracted = tensor.extract %2[] : tensor<i64>
    %3 = arith.cmpi eq, %extracted, %c0_i64 : i64
    %dim_0 = tensor.dim %arg0, %c0 : tensor<?x?x768xf32>
    %4 = arith.index_cast %dim_0 : index to i64
    %5 = arith.select %3, %4, %extracted : i64
    %6 = tensor.empty() : tensor<i64>
    %7 = linalg.fill ins(%5 : i64) outs(%6 : tensor<i64>) -> tensor<i64>
    return %7 : tensor<i64>
  }
}
```

Some related issues for context:
1. <iree-org/iree#18677>
2. <iree-org/iree#18631>
This is merely based on observation of this IR. Since we are adding zeros, can't we DCE the linalg.generic?
Good question. I honestly don't know where it should happen. It is not easy to identify these cases (e.g., transpose, etc.) at the Linalg level, so we typically rely on ConstEval. It's easier if we can do it at a higher level (like arith, or the input dialects).
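For reference, this is roughly what the question envisions: a hand-written sketch (not compiler output) of the dispatch body with the add-zero linalg.generic folded away, assuming %cst is provably 0.0; the SSA names reuse those from the first dump above:

```mlir
// Sketch only: valid only if %cst is known at compile time to be 0.0, so the
// elementwise add is a no-op and the unpack result can be stored directly.
%unpack = tensor.unpack %2 outer_dims_perm = [0, 1] inner_dims_pos = [0, 1]
    inner_tiles = [8, 4] into %3 : tensor<1x250x8x4xf32> -> tensor<1x1000xf32>
flow.dispatch.tensor.store %unpack, %1, offsets = [0, 0], sizes = [1, 1000],
    strides = [1, 1] : tensor<1x1000xf32> -> !flow.dispatch.tensor<writeonly:tensor<1x1000xf32>>
```

As the next comment notes, the constant is zero here only because of fake weights, so this fold would not apply to real inputs.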
Actually, this came up in a voice meeting we had last week, and we still want to support this dispatch because the constant is often zero due to fake weights but might not be zero in real use cases.
This is fixed with the latest pull. PTAL.
This issue is not seen anymore with the latest build, so closing this.
What happened?
For the attached IR, seeing the error `One or more operations with large vector sizes (8192 bytes) were found`.
command:
and this linalg.mlir was generated with the following command:
linalg.mlir.txt
model.torch_onnx.mlir.txt
Steps to reproduce your issue
Mentioned above
What component(s) does this issue relate to?
Compiler
Version information
No response
Additional context
No response