
[Tensorize][runtime] Add support for AMX (Advanced Matrix Extensions) through Tensor intrinsics #13642

Merged — 22 commits, merged on Jan 5, 2023

Conversation

@Qianshui-Jiang (Contributor) commented Dec 19, 2022

In this PR, we add support for AMX (Advanced Matrix Extensions) through Tensor intrinsics.

We chose to enable AMX support with the INT8 intrinsics and the dense op first, which includes:

  1. Add building options and LLVM AMX intrinsic target detection;
  2. Register a global AMX init and config function;
  3. Integrate LLVM AMX intrinsics as Tensor intrinsics (one for the microkernel computation and one for the accumulation tile store); a usage sketch follows this list;
  4. Add a test case for u8s8s32 matmul using the AMX Tensor intrinsics;
  5. Integrate the int8 dense kernel, with a test case added.

BF16 and more ops will be considered in the future.
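
Below is a minimal sketch (Python, TVM TE) of how tensorize-style intrinsics like these are applied to a u8s8s32 matmul. The factory names amx_u8s8s32_matmul_intrin and amx_acc_store_intrin are placeholders, not this PR's actual identifiers; only the runtime.amx_init global function name comes from the discussion below.

    import tvm
    from tvm import te

    # u8s8s32 matmul: uint8 activations x int8 weights accumulated into int32.
    M, N, K = 1024, 1024, 1024
    X = te.placeholder((M, K), dtype="uint8", name="X")
    W = te.placeholder((N, K), dtype="int8", name="W")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute(
        (M, N),
        lambda i, j: te.sum(X[i, k].astype("int32") * W[j, k].astype("int32"), axis=k),
        name="C",
    )
    s = te.create_schedule(C.op)

    # On real hardware the process must request AMX first; this PR registers a
    # global init function for that (name taken from the discussion below):
    amx_init = tvm.get_global_func("runtime.amx_init", allow_missing=True)
    if amx_init is not None:
        amx_init()

    # After tiling to the AMX tile shape, the inner matmul body is replaced by
    # the microkernel intrinsic and the accumulator write-back by the tile-store
    # intrinsic (factory names below are placeholders):
    #   s[C].tensorize(inner_k_axis, amx_u8s8s32_matmul_intrin())
    #   s[C].tensorize(store_axis, amx_acc_store_intrin())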

@tvm-bot (Collaborator) commented Dec 19, 2022

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

@Qianshui-Jiang changed the title from "[Tensorize][TOPI] Add support for Intel® AMX(Advanced Matrix Extensions) through Tensor intrinsics" to "[Tensorize][runtime] Add support for AMX(Advanced Matrix Extensions) through Tensor intrinsics" on Dec 19, 2022
@masahi self-assigned this on Dec 19, 2022
@cbalint13 (Contributor) left a comment

Thank you @Qianshui-Jiang for this latest x86 SIMD feature addition!

python/tvm/topi/x86/dense_alter_op.py (review thread, resolved)
int rows = args[0];
int cols = args[1];
LOG(INFO) << "rows: " << rows << ", cols:" << cols;
// -----------Config for AMX tile resgister----------------------
A review comment on the snippet above:

s|resgister|register|

@cbalint13 (Contributor) left a comment
Looks good to me.

@Qianshui-Jiang, just out of curiosity (off-topic):
Do you have a TVM number for AMX performance (e.g. vs. VNNI)?

@Qianshui-Jiang (Contributor, Author) commented Dec 26, 2022

@cbalint13 Thanks for your curiosity!
In our tests, compared to the TVM test case test_fc_int8_acc32(), AMX int8 dense performs about 8X faster on a single physical core.
(Since AMX machines are not officially launched yet, I can only give a little perf data, but more is coming soon~)

@masahi (Member) commented Dec 27, 2022

@Qianshui-Jiang Can we test TVM-generated AMX code using the emulator (SDE)?

@Qianshui-Jiang (Contributor, Author) replied:
@masahi Sure, you can use SDE by assigning platform info like -spr.
But if you are not building on a 5.16 kernel, you have to comment out all the system-related code in runtime.amx_init,
and you don't need to init AMX before the test case when using SDE.

@Qianshui-Jiang (Contributor, Author) added:
@masahi, BTW, SDE would be very slow, so the repeat count for time_evaluator should be kept small.
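
As a hedged illustration of that advice, here is a self-contained timing sketch with a stand-in kernel; the point is only the evaluator settings (number=1, repeat=1), which keep the workload minimal under emulation. The file name in the SDE launch line is illustrative.

    import numpy as np
    import tvm
    from tvm import te

    # Tiny stand-in kernel; any TVM-built module is timed the same way.
    n = 1024
    A = te.placeholder((n,), name="A")
    B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
    s = te.create_schedule(B.op)
    lib = tvm.build(s, [A, B], target="llvm")

    dev = tvm.cpu(0)
    a = tvm.nd.array(np.zeros(n, dtype="float32"), dev)
    b = tvm.nd.array(np.zeros(n, dtype="float32"), dev)

    # Keep number/repeat at 1 when the binary runs under SDE emulation.
    timer = lib.time_evaluator(lib.entry_name, dev, number=1, repeat=1)
    print("mean time (s):", timer(a, b).mean)

    # Launch under the emulator (shell, illustrative file name):
    #   sde64 -spr -- python test_amx_dense.py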

(resolved review threads: python/tvm/topi/x86/dense.py, python/tvm/topi/x86/tensor_intrin.py)
@masahi (Member) left a comment

@@ -293,16 +324,13 @@ def dense_vnni_compute(cfg, X, packed_w, bias=None):
),
axis=ak,
),
tag="dense_vnni",
attrs={"schedule_rule": "dense_vnni"},
@masahi (Member) commented on the diff above:

This shouldn't be removed; it is used here:

register_func("meta_schedule.x86.dense_vnni", schedule_rule_dense_vnni)

Since this only affects MetaSchedule, you don't have to provide this value for AMX. You need to provide this attribute only when dense_int8_compute is called for VNNI.

@Qianshui-Jiang (Contributor, Author) replied:

@masahi Does this case only use the x86 int8 compute method and inject a particular TIR schedule? Can we just change the attribute dense_vnni to dense_int8, which is used here?

@masahi (Member) commented Jan 1, 2023:

Yes, but it is important that we specify that we use this compute for VNNI. If the schedule rule annotation only says "dense_int8", we don't know which intrinsic to tensorize this compute with.

@Qianshui-Jiang (Contributor, Author) replied, quoting the above:

This shouldn't be removed, it is used here
register_func("meta_schedule.x86.dense_vnni", schedule_rule_dense_vnni).
Since this only affects MetaSchedule, you don't have to provide this value for AMX. So only when dense_int8_compute is called for VNNI, you need to provide this attribute.

@masahi Sorry, I may have caused a misunderstanding. I meant: can we use dense_int8 in this test case, since it injects the VNNI intrinsic explicitly?
By default the relay build flow will check whether VNNI or AMX is available and choose a different scheduling.
I've verified that this test case is still functional after the small modification in the commit below.

@masahi (Member) commented Jan 1, 2023:

Of course the test still works, because it was already written for VNNI. The point is that the name dense_vnni tells us that the particular TE expression is meant for VNNI tensorization; in particular, the weight is pre-packed appropriately. But a generic name like dense_int8 doesn't provide such information. Just because AMX can use the same layout doesn't mean we can use dense_int8: if MetaSchedule finds a compute annotated with dense_int8, it cannot tell whether it should apply VNNI or AMX tensorization (if the latter is supported by MS in the future).

So please revert that commit, then restore and pass attrs={"schedule_rule": "meta_schedule.x86.dense_vnni"} when you create a compute expression for VNNI. schedule_rule is not relevant for AMX for now.
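
A minimal sketch of the arrangement described above; every name except the registered rule string "meta_schedule.x86.dense_vnni" is illustrative. The VNNI path passes the schedule_rule annotation to te.compute, while the AMX path omits it:

    from tvm import te

    def make_dense_int8(out_shape, fcompute, for_vnni):
        # Only the VNNI variant advertises a MetaSchedule rule; MS resolves the
        # function registered under this exact string via register_func.
        attrs = {"schedule_rule": "meta_schedule.x86.dense_vnni"} if for_vnni else None
        return te.compute(
            out_shape,
            fcompute,
            tag="dense_vnni" if for_vnni else "dense_amx_int8",  # tags illustrative
            attrs=attrs,
        )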

@Qianshui-Jiang (Contributor, Author) replied:

@masahi Yep, got it; the schedule rule for MS dense_vnni is restored.
It stays, but AMX does not use it for now.
(A small question: how do we know, inside this compute expression, whether it was created for VNNI and not AMX?)

@masahi (Member) replied:

I suggest checking the target string in the op strategy and creating separate computes for VNNI and AMX (rather than using the same function, dense_int8, for both).
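
A hedged sketch of that suggestion; the mcpu lists and returned implementation names are illustrative (TVM keeps similar capability helpers under python/tvm/topi/x86/utils.py):

    import tvm

    def pick_dense_int8_impl():
        # Inspect the current target's mcpu in the op strategy and dispatch to a
        # distinct compute/schedule pair per ISA, instead of one shared dense_int8.
        target = tvm.target.Target.current()
        mcpu = target.mcpu if target else ""
        if mcpu == "sapphirerapids":                   # AMX-capable (illustrative)
            return "dense_amx_int8"
        if mcpu in ("cascadelake", "icelake-server"):  # VNNI-capable (illustrative)
            return "dense_vnni"
        return "dense_generic"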

@masahi (Member) left a comment

Are the changes in the app/ directory intended?

@@ -1 +0,0 @@
qemu-system-i386
@masahi (Member) commented on the diff above:
Is this change relevant? If not, remove it.

@Qianshui-Jiang (Contributor, Author) replied:

@masahi That was a misoperation; it has already been removed.

python/tvm/relay/op/strategy/x86.py (review thread, resolved)
@masahi merged commit 07a5a9e into apache:main on Jan 5, 2023
fzi-peccia pushed a commit to fzi-peccia/tvm that referenced this pull request Mar 27, 2023
…through Tensor intrinsics (apache#13642)

* add AMX config functions and building option.

* amx tensor intrinsics and u8s8s32 matmul testcase

* add int8 dense kernel use amx tensorize

* add int8 dense kernel use amx tensorize

* add amx init() and config() for dense test case

* correct the amx config

* fix lint.

* fix dense schedule

* remove operation of signal stack

* fix nit

* unified amx and vnni compute, remove dup one

* fix lint

* adopt to x86 int8 dense compute method;

* Revert "adopt to x86 int8 dense compute method;"

This reverts commit 5718a05.

* restore schedule ruls specially for ms dense_vnni

* add vnni ms target attributes

* remove the misoperations

* Revert "restore schedule ruls specially for ms dense_vnni"

This reverts commit 2bda03e.

* add vnni ms target attributes and remove misops

* Revert "add vnni ms target attributes"

This reverts commit c2e9f26.

* remove the misops