-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ARM32 kernel implementation #432
Conversation
Thank you for the great PR!
It looks like linting fails on CI. We use
should make linting on CI happy. We run our ARM tests on CI using
After which you should be able to run the ARM32 unittests in the emulator:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution, this is brilliant :)
From taking a first look, I have a few suggestions below.
Thanks all for your suggestions. I will review and make a new PR |
👍 Thanks for taking a look. |
@honglh Thanks for addressing the comments. It looks like some of the kernel tests for the optimized code are still failing on CI, could you take a look at them? |
@lgeiger I have reproduced the ARM test failure on my local setup. I am looking into this issue |
With change in bconv2d_test.cc the test execution
passed in my local setup. The earlier CI failure seems caused by
This mismatch (64-bit bitpacking and 32-bit bgemm kernel) caused CI failures. Another failure case is from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's great to see the tests passing now!
My only concern is that I think we still want to run the "BConv2D32OPT"
, "BConv2D64OPT"
, and "BConv2D32REF"
tests on all architectures. For example, when the tests are running on Arm32, we still want the "BConv2D64OPT"
tests to successfully run (falling back to unoptimised C++ kernels) and the tests to pass.
I did some local debugging and was able to get this working with the two small changes I've listed below. It turns out that the reason for the previous test failures on uint64
inputs was that the bgemm_runtime_path
was getting set to ruy::Path::kNeon
no matter what, and on Arm32 there's no Neon kernel for uint64
inputs. This meant that no bgemm kernel was run at all, and for some reason no error was thrown.
I also have a few whitespace-related suggested changes for a couple of comments, but those are very minor and I'll list them in a separate review.
After making these changes this looks good to merge from my perspective 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whitespace suggestions I mentioned above:
Co-authored-by: Adam Hillier <7688302+AdamHillier@users.noreply.github.com>
Co-authored-by: Adam Hillier <7688302+AdamHillier@users.noreply.github.com>
Co-authored-by: Adam Hillier <7688302+AdamHillier@users.noreply.github.com>
Co-authored-by: Adam Hillier <7688302+AdamHillier@users.noreply.github.com>
Co-authored-by: Adam Hillier <7688302+AdamHillier@users.noreply.github.com>
conform to clang format Co-authored-by: Adam Hillier <7688302+AdamHillier@users.noreply.github.com>
conform to clang format Co-authored-by: Adam Hillier <7688302+AdamHillier@users.noreply.github.com>
conform to clang format Co-authored-by: Adam Hillier <7688302+AdamHillier@users.noreply.github.com>
conform to clang format Co-authored-by: Adam Hillier <7688302+AdamHillier@users.noreply.github.com>
Co-authored-by: Adam Hillier <7688302+AdamHillier@users.noreply.github.com>
Co-authored-by: Adam Hillier <7688302+AdamHillier@users.noreply.github.com>
@AdamHillier Thanks. The changes look good on my side as well |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, this is fantastic 🔥
Thank you so much for doing this @honglh.
What do these changes do?
The BgemmKernel<path=kNeon, LhsScalar=uint32_t, RhsScalar=uint32_t, AccumScalar=int32_t, DstScalar=float> implementation for ARM32 device.
How Has This Been Tested?
The same code change is tested on v0.3.1 tag on RPI2 and RPI3 (32-bit). But I cannot test on the master branch for open issue #408 . Though I tested on v0.3.1 tag I cannot create PR request for tag.
Benchmark Results
On RPI2 inference time is about 100 to 200ms and RPI3 it is 50 to 100ms. (The inference time is 500ms+ without using the kernel).
Related issue number
#431