[BYOC][ACL] Fix list is not supported as an input node #10801

Merged: 14 commits, Apr 14, 2022
25 changes: 24 additions & 1 deletion python/tvm/relay/op/contrib/arm_compute_lib.py
@@ -28,6 +28,10 @@
from .register import register_pattern_table


# Global variable controlling whether to offload concatenate
offload_concat_ = False


def is_arm_compute_runtime_enabled():
"""Check if the ACL graph executor is present.

@@ -42,7 +46,7 @@ def is_arm_compute_runtime_enabled():
return False


def partition_for_arm_compute_lib(mod, params=None, **opts):
def partition_for_arm_compute_lib(mod, params=None, offload_concat=False, **opts):
"""Partition the graph greedily offloading supported
operators to Arm Compute Library.

@@ -52,11 +56,15 @@ def partition_for_arm_compute_lib(mod, params=None, **opts):
The module to run passes on.
params : Optional[Dict[str, NDArray]]
Constant input parameters.
offload_concat : Optional[bool]
Whether to offload the concatenate operator.

Returns
-------
ret : annotated and partitioned module.
"""
global offload_concat_
masahi (Member) commented on Apr 13, 2022:

Rather than using a global var, can we make use of arm_compute_lib_pattern_table?

def arm_compute_lib_pattern_table():

i.e. register the concat pattern only if offload_concat == True.
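
For illustration, a minimal sketch of this suggestion (the offload_concat parameter, the pattern-table entries, and the check function below are assumptions for this sketch, not code from the PR; the tuple-input pattern follows the is_tuple(None) suggestion later in this thread):

    from tvm.relay.dataflow_pattern import is_op, is_tuple

    def arm_compute_lib_pattern_table(offload_concat=False):
        """Sketch: build the ACL pattern table, registering the
        concatenate pattern only when offload_concat is True."""

        def concatenate_pattern():
            # is_tuple(None) matches a tuple with any number of fields,
            # which is the form concatenate takes as its input.
            return is_op("concatenate")(is_tuple(None))

        def check_concatenate(extract):
            # Only offload dtypes ACL supports for concatenation
            # (mirrors the check added in this PR).
            return all(
                field.dtype in ("float32", "uint8")
                for field in extract.type_args[0].fields
            )

        patterns = [
            # ... existing conv2d / dense / pooling patterns ...
        ]
        if offload_concat:
            patterns.append(
                ("arm_compute_lib.concatenate", concatenate_pattern(), check_concatenate)
            )
        return patterns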

lhutton1 (Contributor) commented:

Thanks for taking a look @masahi. Currently ACL registers operations through a mixture of composite functions and _register_external_op_helper (which adds the target.arm_compute_lib attribute), which is not ideal. I think if we wanted to do this for concat, it would need to be registered using the pattern table rather than the other mechanism.

In the future we should probably use the pattern table for all operations.
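
For context, a rough sketch of the two registration mechanisms being contrasted here (the helper below is a simplified assumption modeled on other BYOC integrations, not the exact ACL code):

    import tvm

    # Mechanism 1: attach a target.arm_compute_lib attribute directly to an
    # op. Operators registered this way are annotated individually, outside
    # the pattern table.
    def _register_external_op_helper(op_name, supported=True):
        @tvm.ir.register_op_attr(op_name, "target.arm_compute_lib")
        def _func_wrapper(expr):
            return supported
        return _func_wrapper

    _register_external_op_helper("reshape")

    # Mechanism 2 (preferred): a composite pattern plus a check function in
    # the pattern table, as sketched above.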

masahi (Member) commented on Apr 13, 2022:

Yeah, if we are adding support for a new op, why not add it via a pattern?

Then it is easy to enable/disable a pattern based on some flags. @DzAvril

DzAvril (Contributor, Author) commented:

Updated, please review. I'm not sure this is what you want; it feels a little odd, since arm_compute_lib_pattern_table is meant to register fusing patterns for ops, and registering an op attribute falls outside that.
BTW, this commit triggered a Sanity Check error:

python/tvm/relay/op/contrib/arm_compute_lib.py:274: [W0612(unused-variable), arm_compute_lib_pattern_table.concatenate] Unused variable 'concatenate'

masahi (Member) commented on Apr 13, 2022:

Actually I don't know what the @tvm.ir.register_op_attr("nn.conv2d", "target.arm_compute_lib") thing is for... Why do you need both the pattern + check function at


def check_conv(extract):

and the register_op_attr thing at

@tvm.ir.register_op_attr("nn.conv2d", "target.arm_compute_lib")
def conv2d(expr):

?

I never needed the latter in the BYOCs I worked on.

DzAvril (Contributor, Author) commented:

Actually I don't know what @tvm.ir.register_op_attr("nn.conv2d", "target.arm_compute_lib") thing is for...

Gentle ping @lhutton1, can you help explain this to @masahi?

masahi (Member) commented on Apr 13, 2022:

@DzAvril, @lhutton1 said "I think if we wanted to do this for concat it would need to be registered using pattern table rather than the other mechanism". This is also what I'm suggesting above. Can you do that? You shouldn't need the @tvm.ir.register_op_attr thing for concat. You just need to add is_op("concatenate") etc. plus the check function, which you already have under @tvm.ir.register_op_attr(concatenate).

lhutton1 (Contributor) commented:

Actually I don't know what @tvm.ir.register_op_attr("nn.conv2d", "target.arm_compute_lib")

Yeah, I don't think we need to register conv2d like this either, as it should already be picked up by the pattern table. This way of registering operators predates the pattern table, so I guess it was just overlooked. We can follow up in a separate PR to move all operators over to the pattern table, which I believe is the direction other BYOCs are going. Apologies for the confusion @DzAvril; like @masahi said, it would be best to register concatenate (and other operators added in the future) using the pattern table, since this is the preferred mechanism.

DzAvril (Contributor, Author) commented:

Sorry for missing some comments above. I have a question about concatenate_pattern; see the code below.

    def concatenate_pattern():
        """Create an concatenate pattern from equivalent relay operators.

        Returns
        -------
        pattern : dataflow_pattern.AltPattern
            Denotes the concatenate pattern.
        """
        pattern = is_op("concatenate")(wildcard())
        return pattern

Since the input of concatenate is a tuple of tensors, what should the pattern for the parameters look like? wildcard() in the code block above does not work.

masahi (Member) commented:

can you try is_tuple(None)? See #7754
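
For reference, a minimal sketch of the suggested pattern plus a quick match check (assumes the is_tuple helper from tvm.relay.dataflow_pattern, per #7754):

    from tvm import relay
    from tvm.relay.dataflow_pattern import is_op, is_tuple

    # is_tuple(None) accepts a tuple with any number of fields, which is
    # exactly what concatenate receives as its single input.
    pattern = is_op("concatenate")(is_tuple(None))

    x = relay.var("x", shape=(1, 4), dtype="float32")
    y = relay.var("y", shape=(1, 4), dtype="float32")
    assert pattern.match(relay.concatenate([x, y], axis=1))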

offload_concat_ = offload_concat
if params:
mod["main"] = bind_params_by_name(mod["main"], params)

@@ -490,6 +498,21 @@ def qnn_add(expr):
return True


@tvm.ir.register_op_attr("concatenate", "target.arm_compute_lib")
def concatenate(expr):
"""Check if the external ACL codegen for concatenate should be used."""
if not offload_concat_:
return False
attrs, type_args = expr.attrs, expr.type_args
for idx in range(len(type_args[0].fields)):
if type_args[0].fields[idx].dtype not in ["float32", "uint8"]:
return False
# ACL concatenate supports input tensors with at most 4 dimensions
if attrs.axis not in [-4, -3, -2, -1, 0, 1, 2, 3]:
return False
return True
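
As a usage sketch (assuming an existing module mod and params dict), the new flag is threaded through the partitioning entry point:

    from tvm.relay.op.contrib.arm_compute_lib import partition_for_arm_compute_lib

    # Enable concatenate offloading when partitioning for ACL.
    mod = partition_for_arm_compute_lib(mod, params, offload_concat=True)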


class OpAttrContext(object):
"""Temporarily changes the attr of an op."""

82 changes: 70 additions & 12 deletions src/runtime/contrib/arm_compute_lib/acl_runtime.cc
@@ -31,6 +31,7 @@
#ifdef TVM_GRAPH_EXECUTOR_ARM_COMPUTE_LIB
#include <arm_compute/core/Types.h>
#include <arm_compute/runtime/NEON/functions/NEArithmeticAddition.h>
#include <arm_compute/runtime/NEON/functions/NEConcatenateLayer.h>
#include <arm_compute/runtime/NEON/functions/NEConvolutionLayer.h>
#include <arm_compute/runtime/NEON/functions/NEDepthwiseConvolutionLayer.h>
#include <arm_compute/runtime/NEON/functions/NEElementwiseOperations.h>
@@ -91,12 +92,21 @@ class ACLRuntime : public JSONRuntimeBase {
* \return Status of inference.
*/
void Run() override {
for (size_t i = 0; i < input_nodes_.size(); ++i) {
auto nid = input_nodes_[i];
uint32_t eid = EntryID(nid, 0);
for (size_t nid_idx = 0; nid_idx < input_nodes_.size(); ++nid_idx) {
auto nid = input_nodes_[nid_idx];
if (nodes_[nid].GetOpType() == "input") {
void* data = data_entry_[eid]->data;
CheckACLError(layer_.inputs[i].allocator()->import_memory(data));
for (uint32_t eid_idx = 0; eid_idx < nodes_[nid].GetNumOutput(); eid_idx++) {
uint32_t eid = EntryID(nid, eid_idx);
void* data = data_entry_[eid]->data;
auto key = std::pair<uint32_t, uint32_t>(nid, eid_idx);
if (layer_.json_inputid_to_layer_inputid.count(key) > 0) {
CheckACLError(
layer_.inputs[layer_.json_inputid_to_layer_inputid[key]].allocator()->import_memory(
data));
} else {
CheckACLError(layer_.inputs[nid_idx].allocator()->import_memory(data));
}
}
}
}

@@ -149,6 +159,8 @@ class ACLRuntime : public JSONRuntimeBase {
CreateMaximumLayer(&layer_, node);
} else if ("add" == op_name || "qnn.add" == op_name) {
CreateAddLayer(&layer_, node);
} else if ("concatenate" == op_name) {
CreateConcatenateLayer(&layer_, node);
} else {
LOG(FATAL) << "Unsupported op: " << op_name;
}
@@ -166,6 +178,9 @@ class ACLRuntime : public JSONRuntimeBase {
std::shared_ptr<arm_compute::IFunction> function;
std::vector<arm_compute::Tensor> inputs;
std::vector<arm_compute::Tensor> outputs;
// Maps the input index of the JSON node to the index of the ACL layer's inputs.
// This is optional (i.e. only used when an operator uses the eid index).
std::map<std::pair<uint32_t, uint32_t>, uint32_t> json_inputid_to_layer_inputid;
};

/*!
@@ -175,17 +190,25 @@ class ACLRuntime : public JSONRuntimeBase {
* \param tensor The tensor to represent.
* \param scale (optional) The scale of the tensor as an input.
* \param offset (optional) The offset of the tensor as an input.
* \param apply_dim_correction (Optional) Flag stating whether to apply dimension correction after
* setting one dimension. E.g. when permuting NCHW -> NHWC, 1x1x2 would become 2x1x1, but
* _num_dimensions should be 3 rather than 1.
* \param increase_dim_unit (Optional) Set to true if new unit dimensions increase the number of
* dimensions of the shape.
* \return ACL Tensor.
*/
arm_compute::Tensor MakeACLTensorFromJSONEntry(const JSONGraphNodeEntry& tensor,
JSONGraphNodeEntry* scale = nullptr,
JSONGraphNodeEntry* offset = nullptr) {
JSONGraphNodeEntry* offset = nullptr,
bool apply_dim_correction = true,
bool increase_dim_unit = true) {
JSONGraphNode node = nodes_[tensor.id_];
void* node_data = nullptr;
if (node.GetOpType() == "const") {
node_data = data_entry_[EntryID(tensor)]->data;
}
return MakeACLTensorFromJSONNode(node, scale, offset, node_data);
return MakeACLTensorFromJSONNode(node, scale, offset, node_data, apply_dim_correction,
increase_dim_unit, tensor.index_);
}

/*!
@@ -196,19 +219,26 @@ class ACLRuntime : public JSONRuntimeBase {
* \param scale (optional) The scale of the tensor as an input.
* \param offset (optional) The offset of the tensor as an input.
* \param data (optional) Constant data of input node.
* \param apply_dim_correction (Optional) Flag stating whether to apply dimension correction after
* setting one dimension. E.g. when permuting NCHW -> NHWC, 1x1x2 would become 2x1x1, but
* _num_dimensions should be 3 rather than 1.
* \param increase_dim_unit (Optional) Set to true if new unit dimensions increase the number of
* dimensions of the shape.
* \param entry_index The entry index.
* \return ACL Tensor.
*/
arm_compute::Tensor MakeACLTensorFromJSONNode(const JSONGraphNode& node,
JSONGraphNodeEntry* scale = nullptr,
JSONGraphNodeEntry* offset = nullptr,
void* data = nullptr) {
arm_compute::Tensor MakeACLTensorFromJSONNode(
const JSONGraphNode& node, JSONGraphNodeEntry* scale = nullptr,
JSONGraphNodeEntry* offset = nullptr, void* data = nullptr, bool apply_dim_correction = true,
bool increase_dim_unit = true, uint32_t entry_index = 0) {
const DLTensor* scale_data = nullptr;
const DLTensor* offset_data = nullptr;
if (scale && offset) {
scale_data = data_entry_[EntryID(*scale)];
offset_data = data_entry_[EntryID(*offset)];
}
return MakeACLTensor(node, data, scale_data, offset_data);
return MakeACLTensor(node, data, scale_data, offset_data, apply_dim_correction,
increase_dim_unit, entry_index);
}

/*!
@@ -510,6 +540,34 @@ class ACLRuntime : public JSONRuntimeBase {
layer->function = f;
}

/*!
* \brief Create a Concatenate layer.
*
* \param layer The ACL layer to build. Contains inputs, outputs and the ACL function.
* \param node The JSON representation of the operator.
*/
void CreateConcatenateLayer(CachedLayer* layer, const JSONGraphNode& node) {
std::vector<std::string> axis = node.GetAttr<std::vector<std::string>>("axis");
std::vector<const arm_compute::ITensor*> inputs;
for (auto input : node.GetInputs()) {
layer->inputs.push_back(MakeACLTensorFromJSONEntry(input, nullptr, nullptr, false));
layer->json_inputid_to_layer_inputid[std::pair<uint32_t, uint32_t>(input.id_, input.index_)] =
layer->inputs.size() - 1;
}
for (size_t i = 0; i < layer->inputs.size(); i++) {
inputs.push_back(&layer->inputs[i]);
}
layer->outputs.push_back(MakeACLTensorFromJSONNode(node));
int dimNum = layer->inputs[0].info()->num_dimensions();
auto function = std::make_shared<arm_compute::NEConcatenateLayer>();
// The shape of an input tensor is reversed when passed to ACL: for example, a
// tensor with shape [1, 2, 3, 4] becomes [4, 3, 2, 1] on the ACL side, so the
// Relay axis must be remapped before configuring the layer.
auto a = std::stoi(axis[0]);
function->configure(inputs, &layer->outputs[0], a < 0 ? -a - 1 : dimNum - a - 1);
layer->function = function;
}
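
To make the axis arithmetic above concrete, here is the same remapping as a small Python sketch (illustrative only):

    def relay_axis_to_acl_axis(axis, dim_num):
        """Map a Relay concatenate axis to the equivalent ACL axis.

        ACL stores dimensions in reverse order, so on a rank-dim_num tensor
        Relay axis 0 is the highest ACL dimension and Relay axis -1 is ACL
        dimension 0.
        """
        return -axis - 1 if axis < 0 else dim_num - axis - 1

    # For a rank-4 tensor: Relay axes 3 and -1 both map to ACL axis 0.
    assert relay_axis_to_acl_axis(3, 4) == 0
    assert relay_axis_to_acl_axis(-1, 4) == 0
    assert relay_axis_to_acl_axis(0, 4) == 3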

/*! \brief Allow ACL functions to request auxiliary memory from TVM. */
ACLAllocator allocator_;
/*!
16 changes: 10 additions & 6 deletions src/runtime/contrib/arm_compute_lib/acl_utils.cc
@@ -40,11 +40,14 @@ void CheckACLError(const arm_compute::Status& status) {
}

arm_compute::Tensor MakeACLTensor(const JSONGraphNode& tensor_rep, void* data,
const DLTensor* scale, const DLTensor* offset) {
const DLTensor* scale, const DLTensor* offset,
bool apply_dim_correction, bool increase_dim_unit,
uint32_t entry_index) {
arm_compute::Tensor tensor;
std::vector<int64_t> shape = tensor_rep.GetOpShape()[0];
DLDataType dtype = tensor_rep.GetOpDataType()[0];
arm_compute::TensorInfo info = MakeACLTensorInfo(shape, dtype, scale, offset);
std::vector<int64_t> shape = tensor_rep.GetOpShape()[entry_index];
DLDataType dtype = tensor_rep.GetOpDataType()[entry_index];
arm_compute::TensorInfo info =
MakeACLTensorInfo(shape, dtype, scale, offset, apply_dim_correction, increase_dim_unit);
info.set_is_resizable(false);
tensor.allocator()->init(info);
if (data != nullptr) {
Expand All @@ -55,10 +58,11 @@ arm_compute::Tensor MakeACLTensor(const JSONGraphNode& tensor_rep, void* data,

arm_compute::TensorInfo MakeACLTensorInfo(const std::vector<int64_t>& shape,
const DLDataType& dtype, const DLTensor* scale,
const DLTensor* offset) {
const DLTensor* offset, bool apply_dim_correction,
bool increase_dim_unit) {
arm_compute::TensorShape acl_shape;
for (unsigned int i = shape.size(); i > 0; --i) {
acl_shape.set(shape.size() - i, shape[i - 1]);
acl_shape.set(shape.size() - i, shape[i - 1], apply_dim_correction, increase_dim_unit);
}
arm_compute::DataType acl_dtype = MakeACLDataType(dtype);
arm_compute::TensorInfo info(acl_shape, 1, acl_dtype, arm_compute::DataLayout::NHWC);
9 changes: 6 additions & 3 deletions src/runtime/contrib/arm_compute_lib/acl_utils.h
@@ -63,8 +63,9 @@ void CheckACLError(const arm_compute::Status& status);
* \return arm_compute::Tensor.
*/
arm_compute::Tensor MakeACLTensor(const JSONGraphNode& tensor_rep, void* data = nullptr,
const DLTensor* scale = nullptr,
const DLTensor* offset = nullptr);
const DLTensor* scale = nullptr, const DLTensor* offset = nullptr,
bool apply_dim_correction = true, bool increase_dim_unit = true,
uint32_t entry_index = 0);

/*!
* \brief Make an acl tensor info object from JSON tensor
@@ -78,7 +79,9 @@ arm_compute::Tensor MakeACLTensor(const JSONGraphNode& tensor_rep, void* data =
*/
arm_compute::TensorInfo MakeACLTensorInfo(const std::vector<int64_t>& shape,
const DLDataType& dtype, const DLTensor* scale = nullptr,
const DLTensor* offset = nullptr);
const DLTensor* offset = nullptr,
bool apply_dim_correction = true,
bool increase_dim_unit = true);

/*!
* \brief Create a memory manager for use with a layer that
1 change: 1 addition & 0 deletions src/runtime/contrib/json/json_runtime.h
@@ -186,6 +186,7 @@ class JSONRuntimeBase : public ModuleNode {
for (size_t j = 0; j < nodes_[nid].GetOpShape().size(); ++j) {
input_var_eid_.push_back(EntryID(nid, j));
}
nodes_[nid].SetNumOutput(nodes_[nid].GetOpShape().size());
} else {
ICHECK_EQ(nodes_[nid].op_type_, "const");
auto pos = std::find(std::begin(const_names_), std::end(const_names_), name);
22 changes: 18 additions & 4 deletions tests/python/contrib/test_arm_compute_lib/infrastructure.py
@@ -163,13 +163,17 @@ def skip_codegen_test():
return True


def build_module(mod, target, params=None, enable_acl=True, tvm_ops=0, acl_partitions=1):
def build_module(
mod, target, params=None, enable_acl=True, tvm_ops=0, acl_partitions=1, offload_concat=False
):
"""Build module with option to build for ACL."""
if isinstance(mod, tvm.relay.expr.Call):
mod = tvm.IRModule.from_expr(mod)
with tvm.transform.PassContext(opt_level=3, disabled_pass=["AlterOpLayout"]):
if enable_acl:
mod = arm_compute_lib.partition_for_arm_compute_lib(mod, params)
mod = arm_compute_lib.partition_for_arm_compute_lib(
mod, params, offload_concat=offload_concat
)
tvm_op_count = get_cpu_op_count(mod)
assert tvm_op_count == tvm_ops, "Got {} TVM operators, expected {}".format(
tvm_op_count, tvm_ops
@@ -199,13 +203,16 @@ def build_and_run(
tvm_ops=0,
acl_partitions=1,
config=None,
offload_concat=False,
):
"""Build and run the relay module."""
if config is None:
config = {}

try:
lib = build_module(mod, device.target, params, enable_acl, tvm_ops, acl_partitions)
lib = build_module(
mod, device.target, params, enable_acl, tvm_ops, acl_partitions, offload_concat
)
except Exception as e:
err_msg = "The module could not be built.\n"
if config:
@@ -276,9 +283,16 @@ def verify_codegen(
num_acl_modules=1,
tvm_ops=0,
target="llvm -mtriple=aarch64-linux-gnu -mattr=+neon",
offload_concat=False,
):
"""Check acl codegen against a known good output."""
module = build_module(module, target, tvm_ops=tvm_ops, acl_partitions=num_acl_modules)
module = build_module(
module,
target,
tvm_ops=tvm_ops,
acl_partitions=num_acl_modules,
offload_concat=offload_concat,
)
acl_modules = extract_acl_modules(module)

assert len(acl_modules) == num_acl_modules, (