speedup-aarch64-cpu

A computing-kernel implementation for an ML inference framework, aiming at the theoretical performance limit of ARMv8 CPUs in a single thread. It mainly targets Cortex-A55 and Cortex-A76.

The baseline data type is float32; float16 and int8 will be supported as well. Cortex-A55 and Cortex-A76 support the vector instruction fmla on float16 and sdot/udot on int8, so these types are accelerated not only in storage but also in computation. The theoretical compute peak of float16 and int8 is therefore 2x and 4x that of float32 on these cores.
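To make the 2x and 4x figures concrete, count the useful operations per 128-bit vector instruction:

fp32 fmla:      4 lanes x (1 mul + 1 add)     =  8 ops/instruction
fp16 fmla:      8 lanes x (1 mul + 1 add)     = 16 ops/instruction (2x fp32)
int8 sdot/udot: 4 lanes x 4 x (1 mul + 1 add) = 32 ops/instruction (4x fp32)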

The data layout throughout the project is packed by vector-register length: NCHWc4 for float32, NCHWc8 for float16, and NCHWc16 for int8.
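For readers unfamiliar with the c4 layout, here is a minimal sketch of what NCHWc4 packing means for float32. This is my own illustration, not the project's actual packing code; pack_nchw_to_nchwc4 is a hypothetical name, and it assumes C is a multiple of 4 for brevity.

#include <cstddef>

// Repack float32 data from NCHW to NCHWc4: channels are grouped by 4,
// and the 4 channels of a group become the innermost contiguous
// dimension, so a single q-register load fetches 4 channels at once.
void pack_nchw_to_nchwc4(const float* src, float* dst,
                         size_t N, size_t C, size_t H, size_t W) {
    const size_t HW = H * W;
    for (size_t n = 0; n < N; ++n)
        for (size_t c = 0; c < C; ++c)
            for (size_t hw = 0; hw < HW; ++hw)
                // dst layout: [n][c / 4][h][w][c % 4]
                dst[((n * (C / 4) + c / 4) * HW + hw) * 4 + c % 4] =
                    src[(n * C + c) * HW + hw];
}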

The project implements ConvOp first, because convolution occupies most of the inference time (50% to 80%). ConvOp is implemented with multiple algorithms so that different sizes can each be served by the fastest variant.

There are already many open-source inference frameworks, such as NCNN, MNN, and TVM. According to my profiling, they reach only about 50% of peak on Cortex-A55 and about 70% of peak on Cortex-A76 for single-threaded ConvOp. They also do not support float16.

The aim of this project is to exceed them: 70% to 80% of peak on Cortex-A55 and 80% to 90% on Cortex-A76, with float16 support.

The test platform is a Redmi Note 7 Pro with a Qualcomm Snapdragon 675, currently about the cheapest SoC with both Cortex-A55 and Cortex-A76.

The measured frequency is 1.66 GHz for the A55 and 2.00 GHz for the A76, obtained from long runs of independent FMLA instructions in test_blk.cpp.
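test_blk.cpp itself is in the repository; the sketch below is my own reconstruction of the idea, assuming the documented throughput of one q-form fmla per cycle on the A55 (two per cycle on the A76). Pin the process to the core under test (e.g. with taskset) before trusting the numbers.

#include <arm_neon.h>
#include <chrono>
#include <cstdio>

int main() {
    // Eight independent accumulators: the loop is then limited by fmla
    // throughput, not by the instruction's multi-cycle latency.
    float32x4_t acc0 = vdupq_n_f32(0.f), acc1 = acc0, acc2 = acc0, acc3 = acc0,
                acc4 = acc0, acc5 = acc0, acc6 = acc0, acc7 = acc0;
    const float32x4_t a = vdupq_n_f32(1.000001f);
    const float32x4_t b = vdupq_n_f32(0.999999f);

    const long iters = 100000000;  // 8 fmla per iteration
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i) {
        acc0 = vfmaq_f32(acc0, a, b);
        acc1 = vfmaq_f32(acc1, a, b);
        acc2 = vfmaq_f32(acc2, a, b);
        acc3 = vfmaq_f32(acc3, a, b);
        acc4 = vfmaq_f32(acc4, a, b);
        acc5 = vfmaq_f32(acc5, a, b);
        acc6 = vfmaq_f32(acc6, a, b);
        acc7 = vfmaq_f32(acc7, a, b);
    }
    double sec = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - t0).count();

    // On the A55, fmla/s approximates cycles/s (the frequency);
    // each fp32 q-form fmla is 8 flops.
    double fmla_per_sec = 8.0 * iters / sec;
    std::printf("%.2f G fmla/s, %.2f GFLOPS fp32 peak\n",
                fmla_per_sec * 1e-9, fmla_per_sec * 8e-9);

    // Consume all accumulators so the chains cannot be optimized away.
    float sink = vaddvq_f32(acc0) + vaddvq_f32(acc1) + vaddvq_f32(acc2) +
                 vaddvq_f32(acc3) + vaddvq_f32(acc4) + vaddvq_f32(acc5) +
                 vaddvq_f32(acc6) + vaddvq_f32(acc7);
    return sink > 1e30f;
}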

Benchmark

Benchmarks will be updated in the benchmark folder.

It contains a table comparing ConvOp time between this project and other open-source inference frameworks. The ConvOp sizes are currently taken from ResNet-50; more networks such as SqueezeNet, GoogLeNet, and VGG will be tested later.

After depthwise ConvOp is implemented, MobileNetV2 will be added to the benchmark.

A summary of the fp32 benchmark is shown below (all times in ms):

| conv type   | ih | iw | ic   | oc   | theoretical time on A55 | my time on A55 | percent of peak (%) | NCNN time on A55 | MNN time on A55 |
|-------------|----|----|------|------|-------------------------|----------------|---------------------|------------------|-----------------|
| conv1x1s1p0 | 56 | 56 | 64   | 256  | 7.517                   | 10.841         | 69                  | 14.70            | 19.826          |
|             | 56 | 56 | 256  | 64   | 7.517                   | 11.279         | 67                  | 17.87            | 19.85           |
|             | 28 | 28 | 128  | 512  | 7.517                   | 10.44          | 72                  | 12.74            | 16.232          |
|             | 28 | 28 | 512  | 128  | 7.517                   | 12.26          | 61                  | 15.37            | 18.617          |
|             | 14 | 14 | 256  | 1024 | 7.517                   | 9.887          | 76                  | 12.97            | 16.304          |
|             | 14 | 14 | 1024 | 256  | 7.517                   | 11.98          | 63                  | 15.81            | 22.696          |
|             | 7  | 7  | 512  | 2048 | 7.517                   | 11.54          | 65                  | 13.33            | 19.045          |
|             | 7  | 7  | 2048 | 512  | 7.517                   | 12.262         | 61                  | 13.84            | 22.407          |
| conv3x3s1p1 | 56 | 56 | 64   | 64   | 16.913                  | 15.17          | 111                 | 23.50            | 18.736          |
|             | 28 | 28 | 128  | 128  | 16.913                  | 11.509         | 147                 | 12.89            | 23.006          |
|             | 14 | 14 | 256  | 256  | 16.913                  | 13.816         | 122                 | 20.67            | 22.703          |
|             | 7  | 7  | 512  | 512  | 16.913                  | 15.25          | 111                 | 40.13            | 30.391          |

I compare my kernel with the popular, fast inference frameworks on ARM: NCNN and MNN. Mine is faster at every ConvOp size.

Comparisons with TensorFlow Lite and PyTorch Mobile will be added later; by common experience, both are much slower than NCNN and MNN.

Paddle-Lite is another very fast inference framework, and a comparison with it will also be added later.

Note that conv1x1s1p0 uses conv_im2col (although no im2col is actually needed here; it is just a GEMM) and conv3x3s1p1 uses conv_wino (the Winograd algorithm, which reduces the amount of computation), both built on the same SGEMM kernel. Because the theoretical time is computed from the direct-convolution FLOP count while Winograd performs fewer multiplications, percentages above 100% are possible. The ideal percentage is 70-80% for conv_im2col and 120-150% for conv_wino on the little core.
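For example, with the standard F(2x2, 3x3) Winograd tile:

direct 3x3 conv on a 2x2 output tile: 2*2 * 3*3 = 36 multiplications
Winograd F(2x2, 3x3):                 (2+3-1)^2 = 16 multiplications
reduction: 36 / 16 = 2.25x fewer multiplications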

So far I have reached 60-70% of peak on the little core (A55), close to my goal.

The next optimization is to implement an sgemm_12x8 kernel that fully uses all 32 vector registers, as sketched below. I predict 65-75% of peak after that.
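My own arithmetic for why a 12x8 fp32 tile fills the register file and why it should help:

accumulators: 12*8 = 96 fp32 outputs / 4 lanes = 24 q-registers
A column (12 floats): 3 q-registers
B row   (8 floats):   2 q-registers
total: 29 of 32 q-registers, leaving 3 spare

Per k step this issues 24 q-form fmla against 5 q-form loads, raising the compute-to-load ratio from 4 to 24 / 5 = 4.8.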

TODO

Urgent:

  • Optimize for the A76 and add it to the comparison.
  • Implement sgemm_12x8 using all 32 vector registers.

Further:

  • Add more detailed descriptions of how to write a high-performance kernel.
  • Compare with Paddle-Lite, TensorFlow Lite, PyTorch Mobile, and even TVM.
  • Port the whole project to C (it currently only uses iostream from C++).
  • Support other hardware such as Vulkan, x86, and CUDA: build the speedup world!
  • Add float16 and int8.
  • Support multi-threading.

How to compile

The project requires CMake 3.0 or newer. The simplest way to compile it:

mkdir build
cd build/
cmake ..
make -j8

For cross-compiling to aarch64, the simplest way is to follow NCNN's approach: download a cross-compiler, write a crosscompiler.toolchain.cmake like NCNN's, and run:

cmake -DCMAKE_TOOLCHAIN_FILE=${PATH}/crosscompiler.toolchain.cmake ..
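A minimal toolchain file might look like the following. This is a sketch assuming an aarch64-linux-gnu GCC on the host; for Android targets, the NDK ships its own android.toolchain.cmake, which is what NCNN's docs actually use.

set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR aarch64)

set(CMAKE_C_COMPILER aarch64-linux-gnu-gcc)
set(CMAKE_CXX_COMPILER aarch64-linux-gnu-g++)

# Search headers and libraries only in the target sysroot,
# but keep host programs usable during configuration.
set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)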

The float16 kernels need a recent gcc or clang: the compiler must support the ARMv8.2-A half-precision (fp16) extension.

General Optimization Ideas

  • SIMD
  • Packing
  • Blocking
  • Instruction Reorder

TODO: describe each of these in more detail.

Convolution Algorithm

  • im2col
  • winograd
  • direct (not implemented yet)

TODO: describe each algorithm in more detail.

The header conv.hpp also documents some of the details.

Trick

On Cortex-A55, a q-form (128-bit) fmla cannot dual-issue with a q-form ldr, so instruction reordering alone cannot fully hide the loads behind the fmla stream.
However, there is a trick: split each q-form ldr into three smaller instructions, each of which can dual-issue with fmla.
The computing kernel then needs a compute-to-load ratio greater than 3, so that all three pieces can hide under the fmla stream.
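The split looks like this (a sketch of the standard A55 technique; the register names are arbitrary):

ldr  q0, [x0]          // one q-form load: cannot dual-issue with fmla

becomes

ldr  d0, [x0]          // load the low 64 bits
ldr  x9, [x0, #8]      // load the high 64 bits into a general register
ins  v0.d[1], x9       // insert them into the upper half of v0

Each of the three replacement instructions can dual-issue with a q-form fmla, so the loads disappear into the fmla stream as long as there are at least three fmla per original load.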

The kernel in this project is sgemm_8x8: an 8xK matrix multiplied by a Kx8 matrix.
Per k step it issues 16 q-form fmla (2 weight registers, each multiplied against all 8 input lanes) against 4 q-form loads (2 for inputs + 2 for weights):

compute / load = (2 * 8) / (2 + 2) = 16 / 4 = 4 > 3
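For illustration, here is a simplified one-k-step version of such a micro-kernel in intrinsics. This is my own sketch, not the project's kernel (which is hand-written assembly using the ldr split above); a compiler may not keep all 16 accumulators pinned in registers the way the assembly version does.

#include <arm_neon.h>

// One k step of an 8x8 SGEMM micro-kernel: 4 q-form loads feed
// 16 q-form fmla, giving the compute/load ratio of 16 / 4 = 4.
// c_lo[j] holds output rows 0..3 of column j, c_hi[j] rows 4..7.
static inline void sgemm_8x8_kstep(const float* a, const float* b,
                                   float32x4_t c_lo[8], float32x4_t c_hi[8]) {
    // 4 q-form loads: one 8-float column of A, one 8-float row of B.
    float32x4_t a_lo = vld1q_f32(a);      // A rows 0..3
    float32x4_t a_hi = vld1q_f32(a + 4);  // A rows 4..7
    float32x4_t b_lo = vld1q_f32(b);      // B columns 0..3
    float32x4_t b_hi = vld1q_f32(b + 4);  // B columns 4..7

    // 16 q-form fmla: one rank-1 update of the 8x8 accumulator tile.
    c_lo[0] = vfmaq_laneq_f32(c_lo[0], a_lo, b_lo, 0);
    c_hi[0] = vfmaq_laneq_f32(c_hi[0], a_hi, b_lo, 0);
    c_lo[1] = vfmaq_laneq_f32(c_lo[1], a_lo, b_lo, 1);
    c_hi[1] = vfmaq_laneq_f32(c_hi[1], a_hi, b_lo, 1);
    c_lo[2] = vfmaq_laneq_f32(c_lo[2], a_lo, b_lo, 2);
    c_hi[2] = vfmaq_laneq_f32(c_hi[2], a_hi, b_lo, 2);
    c_lo[3] = vfmaq_laneq_f32(c_lo[3], a_lo, b_lo, 3);
    c_hi[3] = vfmaq_laneq_f32(c_hi[3], a_hi, b_lo, 3);
    c_lo[4] = vfmaq_laneq_f32(c_lo[4], a_lo, b_hi, 0);
    c_hi[4] = vfmaq_laneq_f32(c_hi[4], a_hi, b_hi, 0);
    c_lo[5] = vfmaq_laneq_f32(c_lo[5], a_lo, b_hi, 1);
    c_hi[5] = vfmaq_laneq_f32(c_hi[5], a_hi, b_hi, 1);
    c_lo[6] = vfmaq_laneq_f32(c_lo[6], a_lo, b_hi, 2);
    c_hi[6] = vfmaq_laneq_f32(c_hi[6], a_hi, b_hi, 2);
    c_lo[7] = vfmaq_laneq_f32(c_lo[7], a_lo, b_hi, 3);
    c_hi[7] = vfmaq_laneq_f32(c_hi[7], a_hi, b_hi, 3);
}

With 16 accumulators plus 4 operand registers, this uses 20 of the 32 q-registers, which is why the planned sgemm_12x8 still has room to grow the tile.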
