
[AutoScheduler] Add a tutorial on auto-scheduling a network for x86 CPU #7019

Merged · 11 commits · Dec 3, 2020

Conversation

merrymercy
Member

@merrymercy merrymercy commented Dec 3, 2020

Add a tutorial on auto-scheduling a network for CPU.

With #6987 and #6903, we can now get good performance and fast tuning speed for CNNs on CPU.
I will upstream more optimizations for Winograd conv2d, conv3d, and matmul in follow-up PRs.

@merrymercy merrymercy changed the title [AutoScheduler] Add tutorial on auto-scheduling a network for CPU [AutoScheduler] Add a tutorial on auto-scheduling a network for CPU Dec 3, 2020
@comaniac comaniac left a comment


LGTM. Just nits.

tutorials/auto_scheduler/tune_network_x86.py
# -------------------------------------------------
# | 0 | 0.010 | 0.40 | 64 |
# | 1 | 0.087 | 47.19 | 64 |
# | 2 | 0.008 | -0.00 | 64 |
Contributor

As @masahi pointed out in the forum, it would be better to explain why we got -0.00 for this task.

@merrymercy merrymercy changed the title [AutoScheduler] Add a tutorial on auto-scheduling a network for CPU [AutoScheduler] Add a tutorial on auto-scheduling a network for x96 CPU Dec 3, 2020
@merrymercy merrymercy changed the title [AutoScheduler] Add a tutorial on auto-scheduling a network for x96 CPU [AutoScheduler] Add a tutorial on auto-scheduling a network for x86 CPU Dec 3, 2020
We then use the auto-scheduler to construct a search space of this DAG and search
for good schedules (low-level optimizations).

Different from the template-based :ref:`autotvm <tutorials-autotvm-sec>` which relies on
Contributor

Sentence a little difficult to read; perhaps go with : "The autoscheduler does not require any schedule templates. Therefore it greatly improves upon the template-based autoTVM..."

# correctly with any layout, we found the best performance is typically
# achieved with NHWC layout. We also implemented more optimizations for
# NHWC layouts with the auto-scheduler.
# So it is recommended to convert your models to NHWC layout to use
Contributor

Note: is it recommended or mandatory?

Member Author

It is recommended. The auto-scheduler can work correctly with any layout, but performance for NCHW is simply not guaranteed.

# the auto-scheduler.


def get_network(name, batch_size, layout="NHWC", dtype="float32"):
Contributor

Note on the restriction: while the relay.testing library is really convenient, not all models offer the choice to change the layout (VGG). In addition, many importers use a fixed layout. It would greatly benefit this tutorial if we showed how to transform the layout of a whole NCHW graph, since many folks will hit this limitation coming from MXNet, PyTorch, ONNX, etc.

Member Author

Yeah, I will add a link to the ConvertLayout pass.

# Extract tasks from the network
print("Extract tasks...")
mod, params, input_shape, output_shape = get_network(network, batch_size, layout, dtype=dtype)
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
Contributor

Another quality-of-life improvement here would be to error out if the tasks are for NCHW layout, in which case no tasks would get extracted.

Member Author

Auto-scheduler can work with any layout. For NCHW, it can correctly extract tasks and tune them.

Contributor

@tmoreau89 tmoreau89 left a comment

Thank you @merrymercy for the tutorial, this is excellent! I left some comments / questions to address.

Member

@FrozenGene FrozenGene left a comment

Overall LGTM. A few nitty comments.

@@ -170,11 +172,11 @@ def get_network(name, batch_size, layout="NHWC", dtype="float32"):
# Typically, we recommend a value >= 300 ms.
# * :code:`num_measure_trials` is the number of measurement trials we can use during the tuning.
# You can set it to a small number (e.g., 200) for a fast demonstrative run.
# In practice, we recommend setting it around :code:`1000 * len(tasks)`,
# In practice, we recommend setting it around :code:`900 * len(tasks)`,
Member

What is the reason for changing 1000 to 900? Any experiment or principle behind it?

#
# * :code:`num_measure_trials` is the number of measurement trials we can use during the tuning.
# You can set it to a small number (e.g., 200) for a fast demonstrative run.
# In practice, we recommend setting it around :code:`800 * len(tasks)`,
Member

Here it is 800 now, while the GPU tutorial uses 900. I think that is a bit confusing. If there is no special reason, should we unify them into 1000?

Member Author

GPU has a larger search space so it should use a larger value.
1000 is typically too much.
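The rule of thumb discussed here is easy to make concrete. A hypothetical helper (the function name is mine; the per-task numbers mirror the values in this thread — roughly 800 per task on CPU, 900 on GPU where the search space is larger, and 200 total for a quick demo):

```python
def trial_budget(num_tasks, target_kind="cpu", demo=False):
    """Total num_measure_trials for a network, per the thread's rule of thumb."""
    if demo:
        return 200  # a small fixed budget for a fast demonstrative run
    # GPU schedules have a larger search space per task, so each task
    # gets a somewhat larger budget than on CPU.
    per_task = {"cpu": 800, "gpu": 900}[target_kind]
    return per_task * num_tasks
```

For example, a network with 29 extracted conv/matmul tasks would get a CPU budget of 29 × 800 = 23200 trials.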

@merrymercy merrymercy merged commit 3afde62 into apache:main Dec 3, 2020
@merrymercy merrymercy deleted the pr-cpu-tutorial branch December 3, 2020 14:17
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 3, 2020
…PU (apache#7019)

* [AutoScheduler] Add tutorial on auto-scheduling a network for CPU

* update

* update

* update

* improve

* improve

* address comments

* add help on layout conversion

* add help for layout conversion

* update target string

* update cuda logs
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 4, 2020
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Dec 4, 2020
4 participants