
[AutoScheduler] Add a tutorial on auto-scheduling a network for x86 CPU #7019

Merged · 11 commits · Dec 3, 2020

Conversation

merrymercy
Member

@merrymercy merrymercy commented Dec 3, 2020

Add a tutorial on auto-scheduling a network for CPU.

With #6987 and #6903, we can now get good performance and fast tuning speed for CNNs on CPU.
I will upstream more optimizations for Winograd conv2d, conv3d, and matmul in follow-up PRs.

@merrymercy merrymercy changed the title [AutoScheduler] Add tutorial on auto-scheduling a network for CPU [AutoScheduler] Add a tutorial on auto-scheduling a network for CPU Dec 3, 2020
@comaniac comaniac left a comment


LGTM. Just nits.

tutorials/auto_scheduler/tune_network_x86.py
# -------------------------------------------------
# | 0 | 0.010 | 0.40 | 64 |
# | 1 | 0.087 | 47.19 | 64 |
# | 2 | 0.008 | -0.00 | 64 |
Contributor

As @masahi pointed out in the forum, it would be better to explain why we got -0.00 for this task.

@merrymercy merrymercy changed the title [AutoScheduler] Add a tutorial on auto-scheduling a network for CPU [AutoScheduler] Add a tutorial on auto-scheduling a network for x96 CPU Dec 3, 2020
@merrymercy merrymercy changed the title [AutoScheduler] Add a tutorial on auto-scheduling a network for x96 CPU [AutoScheduler] Add a tutorial on auto-scheduling a network for x86 CPU Dec 3, 2020
We then use the auto-scheduler to construct a search space of this DAG and search
for good schedules (low-level optimizations).

Different from the template-based :ref:`autotvm <tutorials-autotvm-sec>` which relies on
Contributor

Sentence a little difficult to read; perhaps go with : "The autoscheduler does not require any schedule templates. Therefore it greatly improves upon the template-based autoTVM..."

# correctly with any layout, we found the best performance is typically
# achieved with NHWC layout. We also implemented more optimizations for
# NHWC layouts with the auto-scheduler.
# So it is recommended to convert your models to NHWC layout to use
Contributor

Note: is it recommended or mandatory?

Member Author

It is recommended. The auto-scheduler can work correctly with any layout, but performance for NCHW is simply not guaranteed.

# the auto-scheduler.


def get_network(name, batch_size, layout="NHWC", dtype="float32"):
Contributor

Note on the restriction: while the relay.testing library is really convenient, not all models offer the choice to change the layout (VGG). In addition, many importers use a fixed layout. It would greatly benefit this tutorial if we showed how to transform the layout of a whole NCHW graph, since many folks will hit this limitation coming from MXNet, PyTorch, ONNX, etc.

Member Author

Yeah, I will add a link to the ConvertLayout pass.

# Extract tasks from the network
print("Extract tasks...")
mod, params, input_shape, output_shape = get_network(network, batch_size, layout, dtype=dtype)
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
Contributor

Another quality-of-life improvement here would be to error out if the tasks are for NCHW layout, in which case no tasks would get extracted.

Member Author

Auto-scheduler can work with any layout. For NCHW, it can correctly extract tasks and tune them.

Contributor

@tmoreau89 tmoreau89 left a comment

Thank you @merrymercy for the tutorial, this is excellent! I left some comments / questions to address.

Member

@FrozenGene FrozenGene left a comment

Overall LGTM. A few nitty comments.

@@ -170,11 +172,11 @@ def get_network(name, batch_size, layout="NHWC", dtype="float32"):
# Typically, we recommend a value >= 300 ms.
# * :code:`num_measure_trials` is the number of measurement trials we can use during the tuning.
# You can set it to a small number (e.g., 200) for a fast demonstrative run.
# In practice, we recommend setting it around :code:`1000 * len(tasks)`,
# In practice, we recommend setting it around :code:`900 * len(tasks)`,
Member

What is the reason for changing 1000 to 900? Any experiment or principle behind it?

#
# * :code:`num_measure_trials` is the number of measurement trials we can use during the tuning.
# You can set it to a small number (e.g., 200) for a fast demonstrative run.
# In practice, we recommend setting it around :code:`800 * len(tasks)`,
Member

Here it is 800 now, while the GPU tutorial uses 900. I think that is a bit confusing. If there is no special reason, should we unify them into 1000?

Member Author

GPU has a larger search space so it should use a larger value.
1000 is typically too much.
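The rule of thumb discussed here is easy to make concrete. A hypothetical helper (the function name is mine; the per-task numbers mirror the values in this thread — roughly 800 per task on CPU, 900 on GPU where the search space is larger, and 200 total for a quick demo):

```python
def trial_budget(num_tasks, target_kind="cpu", demo=False):
    """Total num_measure_trials for a network, per the thread's rule of thumb."""
    if demo:
        return 200  # a small fixed budget for a fast demonstrative run
    # GPU schedules have a larger search space per task, so each task
    # gets a somewhat larger budget than on CPU.
    per_task = {"cpu": 800, "gpu": 900}[target_kind]
    return per_task * num_tasks
```

For example, a network with 29 extracted conv/matmul tasks would get a CPU budget of 29 × 800 = 23200 trials.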

@merrymercy merrymercy merged commit 3afde62 into apache:main Dec 3, 2020
@merrymercy merrymercy deleted the pr-cpu-tutorial branch December 3, 2020 14:17
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 3, 2020
…PU (apache#7019)

* [AutoScheduler] Add tutorial on auto-scheduling a network for CPU

* update

* update

* update

* improve

* improve

* address comments

* add help on layout conversion

* add help for layout conversion

* update target string

* update cuda logs
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 4, 2020
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Dec 4, 2020
4 participants