[AUTOTVM] TOPI integration for ARM CPU #1487
Conversation
apps/benchmark/README.md (Outdated)
@@ -0,0 +1,123 @@
# Performance Benchmark

## ARM CPU
Consider putting the performance benchmark results in the wiki for now; later we can have a hosted website for the results, because they change over time.
How can I edit the wiki?
apps/benchmark/README.md (Outdated)
Note: If a board has big.LITTLE architecture, we will use all big cores.
Otherwise, we will use all cores.

- **Firefly-RK3399: 2 x Cortex-A73 1.8GHz + 4 x Cortex-A53 1.5GHz**
Only mark the cores being used (in this case, the big cores).
apps/benchmark/README.md (Outdated)
parameters in [this repo](https://github.com/uwsaml/tvm-distro).
During compilation, TVM will download these operator parameters automatically.

But we don't tune for other devices, so you can only run benchmark for these devices.
Remove this line, and after it, add a quick section on how to do tuning for a new device.
cc those who may be interested in this PR.
python/tvm/autotvm/record.py (Outdated)
@@ -239,7 +245,8 @@ def load(self, records):
                best_by_model[key] = (inp, res)
                break

-        logging.info("Finish loading %d records", counter)
+        if verbose:
+            logging.info("Finish loading %d records", counter)
Consider just using logging.debug?
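A minimal sketch of the suggestion, assuming the standard logging module (the record-loading loop here is a stand-in, not the actual implementation):

```python
import logging

def load(records):
    """Load tuning records, reporting progress at DEBUG level."""
    counter = 0
    for _ in records:
        counter += 1
    # DEBUG instead of INFO: callers opt in through the standard logging
    # configuration rather than a custom `verbose` flag.
    logging.debug("Finish loading %d records", counter)

# Caller side: enable the messages only when wanted.
logging.basicConfig(level=logging.DEBUG)
load([])
```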
def tune_tasks(tasks,
               rpc_device_key,

no empty line between arguments
               early_stopping=200,
               log_filename='tuning.log',

               mea_number=5,
mea-> measure
               mea_number=5,
               mea_parallel_num=1,
               mea_timeout=20,
               mea_use_ndk=False,
Is it possible to pass in a MeasureOption here? These options seem to duplicate MeasureOption.
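A sketch of the consolidation under discussion; the option fields are assumed from the measure_option signature shown later in this diff, and this tune_tasks body is a simplified stand-in:

```python
# Group the mea_* knobs into one option object instead of duplicating
# number/parallel_num/timeout as separate tune_tasks arguments.
def measure_option(mode, number=5, parallel_num=1, timeout=20):
    return {'mode': mode, 'number': number,
            'parallel_num': parallel_num, 'timeout': timeout}

def tune_tasks(tasks, rpc_device_key, measure_option):
    for task in tasks:
        # each tuning job reads its measurement settings from one place
        print(task, rpc_device_key, measure_option)

tune_tasks([], 'rk3399', measure_option('rpc'))
```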
                      len(xs) - np.sum(valid_index),
                      self.feature_cache.size(self.fea_type))
        if self.verbose:
            logging.info("train: %.2f\tobs: %d\terror: %d\tn_cache: %d",
Consider using logging.debug and allowing the user to set the logging level.
@@ -40,16 +40,21 @@ class XGBTuner(ModelBasedTuner):
    If it is not None, the tuner will first select
    top-(plan_size * diversity_filter_ratio) candidates according to the cost model
    and then pick batch_size of them according to the diversity metric.
verbose: int
Consider relying directly on the logging level for verbosity.
This is an int, not a bool, so we leave this.
`verbose` usually does not have this meaning, so the argument is confusing. A better name is `log_interval`.
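A sketch of the rename with illustrative names; `log_interval` reads as "emit a progress line every N trials" rather than a boolean switch:

```python
import logging

def tune(n_trial, log_interval=50):
    """Hypothetical tuner loop showing the renamed argument."""
    for i in range(n_trial):
        # an interval, not a flag: 0 disables periodic progress logging
        if log_interval and i % log_interval == 0:
            logging.info("trial %d/%d", i, n_trial)

tune(200)
```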
python/tvm/target.py (Outdated)
}
pre_defined_opt = opt_table.get(model, [])

if not os.path.isfile(os.path.join(AUTOTVM_PRETUNED_PARAM_ROOT_PATH, "arm_cpu.log")):
Consolidate all the logic for file-system manipulation and the autotvm cache into one file, say autotvm.tophub.
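A sketch of what such a consolidation might look like; the root path and the arm_cpu.log name come from this diff, while `check_package` is a hypothetical helper:

```python
# autotvm/tophub.py: single owner of the on-disk cache layout.
import os

AUTOTVM_TOPHUB_ROOT_PATH = os.path.join(
    os.path.expanduser('~'), ".tvm", "tophub")

def check_package(name, rootpath=AUTOTVM_TOPHUB_ROOT_PATH):
    """Return True if the pre-tuned log for `name` is already cached."""
    return os.path.isfile(os.path.join(rootpath, name + ".log"))

print(check_package("arm_cpu"))
```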
topi/python/topi/nn/conv2d.py (Outdated)
    The raw kernel tensor
tile_size: int
    Tile size of winograd transform. e.g. 2 for F(2x2, 3x3) and 4 for F(4x4, 3x3)
"""
Need to add the return arguments.
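A sketch of the requested addition, in the numpydoc style topi already uses; the output shape is an assumption based on typical winograd weight transforms, not taken from this diff:

```python
def conv2d_winograd_weight_transform(kernel, tile_size):
    """Weight transformation for winograd.

    Parameters
    ----------
    kernel: tvm.Tensor
        The raw kernel tensor
    tile_size: int
        Tile size of winograd transform. e.g. 2 for F(2x2, 3x3) and 4 for F(4x4, 3x3)

    Returns
    -------
    output : tvm.Tensor
        The transformed kernel (assumed 4-D, [alpha, alpha, CO, CI])
    """
```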
tutorials/autotvm/tune_nnvm_arm.py (Outdated)
these operators, it will query this log file to get the best knob values.

We also released pre-tuned parameters for some arm devices. You can go to
`ARM CPU Benchmark <https://github.com/merrymercy/tvm/blob/arm_cpu/apps/benchmark/README.md#arm-cpu>`_
link to the master version
Please also confirm the VTA CPU test cases, since these depend on the availability of the old rasp schedules, which are removed here.
apps/benchmark/README.md (Outdated)
E.g. For my RK3399, I use `python3 -m tvm.exec.rpc_sever --tracker=10.77.1.123:9190 --key=rk3399`

* For Andoird device
nit: Android
apps/benchmark/README.md (Outdated)
```

If you do not do tuning and run the benchmark for other devices directly,
the performance is not gauranteed (This is still doable, you can pick a most
nit: guaranteed
src/pass/vectorize_loop.cc (Outdated)
@@ -300,7 +300,6 @@ class Vectorizer : public IRMutator {
    CHECK(!op->condition.type().is_vector());
    Expr condition = this->Mutate(op->condition);
    if (condition.type().is_vector()) {
-      LOG(WARNING) << "Detect vector condition in Vectorized Loop, scalarizing...";
Why remove this?
Currently it seems to pollute the logged data; ideally we would just print this once?
This message is important, because it tells us when vectorization isn't working due to a mismatch between the vectorize axis length and the input shape.
I'd imagine this message would mess up the log during auto-tuning, though.
OK, I reverted this change.
It does not pollute logging data. It occurs when I use "llvm" as the target to build resnet-18.
https://github.com/dmlc/tvm/blob/f33fd5c03d8a2b3972e3b69a79a89d0c9754cd9e/topi/python/topi/x86/conv2d.py#L214-L218
@masahi Can I fix this by checking the length of `w` and only vectorizing it when the length of `w` is a multiple of 16?
@merrymercy Sure, I am aware of this issue. Probably 8 is a better default split factor than 16 for imagenet models.
I am planning to remove this old schedule completely and adapt the AVX schedules for the SSE target.
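A sketch of the proposed guard, assuming a TVM schedule `s` and an axis of known constant length; the names are illustrative:

```python
def split_and_vectorize(s, C, w_axis, w_length, factor=8):
    # Only vectorize when the axis length divides the factor evenly,
    # so the vectorizer never has to scalarize a vector condition.
    # (factor 8 rather than 16, per the suggestion above.)
    if w_length % factor == 0:
        wo, wi = s[C].split(w_axis, factor=factor)
        s[C].vectorize(wi)
```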
nnvm/src/top/nn/convolution.cc (Outdated)
//                          param.kernel_size[1]});
// wshape = ConvertLayout(wshape, kOIHW, kernel_layout);
// wshape[kernel_layout.indexof('O')] *= param.groups;
// NNVM_ASSIGN_INPUT_SHAPE(attrs, *in_shape, Conv2DParam::kWeight, wshape);
Instead of commenting these out, I'd suggest removing them and leaving a more informative comment on why you don't do weight shape inference here.
topi/python/topi/arm_cpu/conv2d.py (Outdated)
    pre_packed = False
    CO, _, KH, KW = get_const_tuple(kernel.shape)
else:
    pre_packed = True
I'd suggest pre_packed -> pre_computed, as this is not simply pre-packing.
topi/python/topi/arm_cpu/conv2d.py (Outdated)
    copy_inputs[1] = weight
    new_attrs['tile_size'] = tile_size
    return sym.contrib.conv2d_winograd_without_weight_transform(*copy_inputs, **new_attrs)
else:
No need for the else: block here. I think lint should catch this.
topi/python/topi/arm_cpu/conv2d.py (Outdated)
    return sym.contrib.conv2d_winograd_without_weight_transform(*copy_inputs, **new_attrs)
else:
    # do nothing for depthwise convolution
    return sym.conv2d(*copy_inputs, **new_attrs)
Better to return None here. When I was doing the CUDA winograd, returning a new conv2d symbol here caused a strange issue during InferShape. Returning None solved the issue for me.
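A sketch of the suggested control flow, with the surrounding function reduced to its shape; `sym`, `copy_inputs`, and `new_attrs` are as in the diff above:

```python
def alter_conv2d(copy_inputs, new_attrs, groups, sym):
    if groups == 1:
        # rewrite ordinary conv2d to the winograd variant
        return sym.contrib.conv2d_winograd_without_weight_transform(
            *copy_inputs, **new_attrs)
    # depthwise convolution: return None so the original operator is
    # kept unchanged (avoids the InferShape issue described above)
    return None
```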
@merrymercy For the winograd input/output transform, I was able to achieve a minimal amount of math, like this for F(2x2, 3x3), for example.
For F(2x2, 3x3), this reduces the number of add/sub operations for each 4x4 input tile from 64 to 32. A similar reduction exists for F(4x4, 3x3), and there it is even more effective. It also allows completely removing the matmul from the compute definition of the input/output transform. Check out here for a simple test case and here for how I integrated this reduction into my x86 F(4x4, 3x3) implementation.
topi/python/topi/arm_cpu/conv2d.py (Outdated)
s[V].unroll(r_nu)
s[V].parallel(b)
s[DD].compute_at(s[V], bb)
Can you add vectorize here somehow? I'm using a different layout from yours, but I can do vectorized input/output transforms. My implementation is here.
Functionally, would we expect vectorization coverage from this template already? E.g., if a configuration produces an easy-to-vectorize pattern here, would we expect llvm to vectorize it already?
I don't think llvm can auto-vectorize this.
topi/python/topi/arm_cpu/conv2d.py (Outdated)
    co, vc = cfg.define_split('tile_co', co, num_outputs=2)
    oh, vh = cfg.define_split('tile_oh', oh, num_outputs=2)
    ow, vw = cfg.define_split('tile_ow', ow, num_outputs=2)
elif num_tile == 3:  # for gpu
Seems irrelevant for arm cpu.
Yes, it is for the ARM Mali GPU. They can share this function, but I didn't include the Mali code in this PR.
topi/python/topi/generic/nn.py (Outdated)
Parameters
----------
outs: Array of Tensor
    The computation graph description of conv2d_nchw
conv2d_nchw -> conv2d_winograd_weight_transform
for network in networks:
    net, params, shape, out_shape = get_network(network, batch_size=1)

    with nnvm.compiler.build_config(opt_level=2, add_pass=['AlterOpLayout']):
Doesn't your case (the AlterOpLayout optimization) enter into conv2d_NCHWc and the x86 schedule, which is only suitable for x86 now? At least for me, it reports an error.
return s


@conv2d_alter_layout.register(["arm_cpu", "mali"])
@FrozenGene I registered alter_layout for arm_cpu here. I didn't get any error.
Got it.
@FrozenGene I added pre-tuned parameters for the pynq board, which has a Cortex-A9 CPU.
@masahi Did you test the performance difference between the non-minimal-math and minimal-math versions? I tried your compute declaration but cannot get a speedup. Has llvm handled this case already?
@merrymercy Yes, I have scripts to compare the minimal version vs the non-minimal version. You can run them yourself to see the difference. The scripts dump the total execution time as well as the time taken for the input transform, batched gemm, and output transform separately. Obviously, if your winograd kernel is completely bottlenecked by gemm, there should be no performance difference; I observed this with my GPU version and my x86 AVX2 version. For the x86 SSE target, my minimal version is consistently faster than the non-minimal one. The above two scripts benchmark with the SSE target. I have tested on recent CPUs (Coffee Lake) and old high-core-count Xeons (12-16 cores, Sandy Bridge and Nehalem). On recent CPUs the difference is small. On old Xeons, I don't think LLVM can do this non-trivial common subexpression elimination. Even if LLVM could detect the common subexpressions, it is not supposed to eliminate them, I believe, because these are float ops.
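For concreteness, a standalone NumPy sketch (not from the linked scripts) of the factored F(2x2, 3x3) input transform described above: because B contains only 0 and ±1, B^T d B can be computed as two 1-D passes of adds/subs (16 each, 32 total per 4x4 tile) instead of two dense matmuls:

```python
import numpy as np

def transform_1d(d):
    # One pass of B^T: every output row is a 2-term add/sub.
    return np.stack([d[0] - d[2],
                     d[1] + d[2],
                     d[2] - d[1],
                     d[1] - d[3]])

def input_transform(tile):
    # B^T @ tile @ B via two 1-D passes: 32 add/subs per 4x4 tile.
    return transform_1d(transform_1d(tile).T).T

# Cross-check against the dense definition.
B = np.array([[ 1, 0,  0,  0],
              [ 0, 1, -1,  1],
              [-1, 1,  1,  0],
              [ 0, 0,  0, -1]], dtype=float)
tile = np.random.randn(4, 4)
assert np.allclose(input_transform(tile), B.T @ tile @ B)
```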
Thanks for the explanation! We can keep the non-minimal version for ARM CPU in this PR, since it is more readable.
Yes, you can follow up with another PR if you find a way to improve performance later. Let's merge this first.
'ndk': use Android NDK to create shared library. Use this for android target.

callable: customized build function for other backends (e.g. VTA)
see measure/measure_methods.py default_build_func for example
'local-nofork': use local device for measure but does not use multiprocessing.
This mode is suitable for debug, but does not support timeout and parallel.
callable: It is a customized function for measurement.
see measure/measure_methods.py measure_rpc for example
apps/benchmark/README.md (Outdated)
If your device has a same SoC of the above device, you can reuse these parameters
(e.g. use `llvm -device=arm_cpu -mode=rk3399 -target=aarch64-linux-gnu` as target).
Otherwise, you need to tune for your own device, please follow this [tutorial](please_fix_this_later.html).
fix this or remove for now?
python/tvm/autotvm/tophub.py (Outdated)
AUTOTVM_TOPHUB_ROOT_PATH = os.path.join(os.path.expanduser('~'), ".tvm", "tophub")


def load_context(target, rootpath=AUTOTVM_TOPHUB_ROOT_PATH):
It is still better to allow the with block:

```python
with tophub.context(target):
    # my code
```
Is it possible to also allow the user to specify a customized location for their tuning logs?
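Something like the following could do it; `extra_files` is a hypothetical parameter name, not part of this PR, and the body is a toy stand-in:

```python
import contextlib

@contextlib.contextmanager
def context(target, extra_files=None):
    """Hypothetical API sketch: besides the downloaded tophub log for
    `target`, also apply user-specified tuning logs."""
    logs = ["%s.log" % target] + list(extra_files or [])
    # a real implementation would load these records and apply the
    # best configuration per task while the block is active
    yield logs

with context("arm_cpu", extra_files=["my_device_tuning.log"]) as logs:
    print(logs)
```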
python/tvm/autotvm/tophub.py (Outdated)
""" | ||
TopHub: Tensor Operator Hub | ||
To get the best performance, we typically need auto-tuning for the specific devices. | ||
TVM releases pre-tuned parameters in TopHub (https://github.com/uwsaml/tvm-distro) |
Since the tvm-distro location can change, do not use the URL for now.
python/tvm/autotvm/tophub.py (Outdated)
""" | ||
path = tempdir() | ||
filename = path.relpath("info.json") | ||
print("Download meta info for pre-tuned parameters") |
use logging instead of print
nnvm/src/top/nn/convolution.cc (Outdated)
@@ -130,11 +130,110 @@ inline bool Conv2DInferShape(const nnvm::NodeAttrs& attrs,
  return true;
}

inline bool WinogradConv2DInferShape(const nnvm::NodeAttrs& attrs,
                            std::vector<TShape>* in_shape,
argument alignment
@@ -101,20 +101,29 @@ def measure_option(mode,
        The number of measurement task that can run in parallel.
        Set this according to the number of cpu cores (for compilation) and
        the number of devices you have (for measuring generate code).
    do_fork: bool, optional
A principle of interface design is to simplify and hide options the user does not use; in this case, do_fork is only used in local mode. I think we should remove it and allow the user to pass in `measure_func = autotvm.measure.local_nofork(measure args)`.
Similarly, if pack_size, rpc_device_key, etc. are only arguments to rpc, I think we should have good defaults and allow the user to do `measure_func = autotvm.measure.rpc_(rpc_key=xxxx)`.
`do_fork` is used in the local_executor, not in measure_func. It can be used in any mode.
                   build_func='default',

                   replay_db=None,
                   save_to_replay_db=True):
Can save_to_replay_db become an optional callback function?
Need to rebase against the master.
I am doing some refactoring. Do not merge.
Is it worth preserving the function
@merrymercy can you add tvm.target.rasp() as per the comment by @ajtulloch?
I noticed that there are some things deleted; are we removing
@ajtulloch
@merrymercy Have we updated the related docs? I pulled your PR code into tvm/master and followed the tutorial https://docs.tvm.ai/tutorials/autotvm/tune_nnvm_arm.html, but I find that I cannot train and get this information: BTW, I registered one remote device named
Thanks @merrymercy @masahi @ajtulloch @eqy @FrozenGene. This is merged.
@FrozenGene can you open a discussion thread on https://discuss.tvm.ai/ so we can follow up there?
new_attrs = {k: attrs[k] for k in attrs.keys()}

assert attrs.get_int_tuple("dilation") == (1, 1), "Does not support dilation " \
I know we have merged it, but when I ran a model today I found we could have a better mechanism, @merrymercy. Move this line after line 491:

```python
assert attrs.get_int_tuple("dilation") == (1, 1), "Does not support dilation " \
    "when alter_op_layout is enabled"
```

(if groups == 1), because we will not change the kernel layout for depthwise conv2d.
Or we can support it in the compute_conv2d function using `topi.nn.dilate(inputs[1], (1, 1, dilate_h, dilate_w, 1))`.
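A sketch of that second option, assuming the 5-D packed kernel layout implied by the stride tuple above; the helper name is illustrative, while `topi.nn.dilate` is the existing topi op:

```python
import topi

def maybe_dilate_kernel(kernel, dilation_h, dilation_w):
    # Materialize dilation by inserting zeros into the kernel, so the
    # dense conv2d compute can ignore the dilation attribute.
    if (dilation_h, dilation_w) != (1, 1):
        kernel = topi.nn.dilate(kernel, (1, 1, dilation_h, dilation_w, 1))
    return kernel
```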
This PR includes:
benchmark results