Poor Android performance and GPU not used at all when built with OpenCL #2052

Closed
JianbangZ opened this issue Jun 29, 2023 · 20 comments

@JianbangZ

I have a phone with a Snapdragon 8 Gen 2 (the best Snapdragon chip) and have been trying to make llama.cpp work through Termux.
I followed the compiling instructions exactly. Here is what I found:
(1) Method 1: Normal
$ mkdir build-android
$ cd build-android
$ export NDK=<your_ndk_directory>
$ cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod ..
$ make

The inference speed is extremely slow, around 4 seconds/token.

(2) Method 2: OpenCL
I followed the steps and used "make LLAMA_CLBLAST=1".
Inference speed is much faster, but still only 250 ms/token.
When I check the CPU usage via "adb shell -> top", I found that only the CPU is used (400% usage).

(3) I tried another Android project, https://mlc.ai/mlc-llm/#android, using their prebuilt APK with a Vicuna 7B 4-bit model. The inference speed is 16 tokens/s (i.e. 60 ms/token) and CPU usage is only 70%, so it seems the GPU is used correctly in the mlc-llm project.

@JianbangZ JianbangZ changed the title Android GPU not used at all even built with OpenCL Poor Android performance and GPU not used at all when built with OpenCL Jun 29, 2023
@ghost

ghost commented Jun 29, 2023

(2) Method 2: OpenCL I follow the steps and use "make LLAMA_CLBLAST=1" Inference speed is much faster, but still only 250 ms/token.

To enable GPU acceleration, use the -ngl # parameter. Adjust CPU usage by modifying the --threads # parameter in ./main.
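
For example (the model path and the layer/thread counts below are illustrative placeholders, not values from this thread):

# offload 20 layers to the GPU and run on 4 CPU threads (example values only)
./main -m models/7B/ggml-model-q4_0.bin -ngl 20 --threads 4 -p "Hello"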

Changing these parameters isn't going to produce 60 ms/token, though - I'd love it if llama.cpp fully utilised the Android GPU, but offloading to the GPU decreases performance for me.

It would be great if whatever they're doing is converted for llama.cpp to utilise Android GPU.

@JianbangZ
Author

(2) Method 2: OpenCL I follow the steps and use "make LLAMA_CLBLAST=1" Inference speed is much faster, but still only 250 ms/token.

To enable GPU acceleration then use the -ngl # parameter. Adjust CPU usage by modifing the --threads # parameter in ./main

Changing these parameters isn't gonna produce 60ms/token though - I'd love if llama.cpp fully utilised Android GPU, but Offloading to GPU decreases performance for me.

It would be great if whatever they're doing is converted for llama.cpp to fully utilise the GPU.

I tried -ngl with different numbers; it makes performance worse.

@FNsi
Contributor

FNsi commented Jun 29, 2023

Is it because of Termux itself?
I ask because I can't play videos well in Firefox installed inside Termux's Ubuntu environment.

Try

pkg i virglrenderer-android
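
Roughly, the virgl setup looks like this (a sketch only; virgl_test_server_android and GALLIUM_DRIVER=virpipe are the usual names for this package, and note this accelerates OpenGL rather than OpenCL):

# in Termux: install and start the virgl server
pkg install virglrenderer-android
virgl_test_server_android &
# inside the proot distro: tell Mesa to use the virgl pipe
export GALLIUM_DRIVER=virpipe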

@ghost

ghost commented Jul 5, 2023

Is it because termux itself? I asked because I can't play videos well by Firefox which installed in termux Ubuntu.

Try

pkg i virglrenderer-android

To clarify, you mean I use Termux to install Firefox? No, I don't use Termux like that. I would install the Firefox APK on Android the same way Termux is installed.

The point is that @JianbangZ installed the MLC APK and runs 7B models at 60 ms/token, whereas llama.cpp is slower.

@FNsi
Contributor

FNsi commented Jul 5, 2023

pkg i virglrenderer-android

To clarify, you use Termux to install Firefox? No, I don't use Termux like that. I would install the Firefox apk on Android in the same way Termux is installed.

The point is @JianbangZ installs mlc.apk, running 7B models at 60 ms/token whereas llama.cpp is slower.

And I was saying he needs to install GPU drivers in Termux.

Btw, I use Termux with Firefox because I'm using VR devices and want a good environment to do anything.

@JianbangZ
Author

I should correct myself: 60 ms/t is the speed for prefill; the subsequent decode speed is roughly 130 ms/t. Still, that's 2x faster than the current llama.cpp built with OpenCL. I think it's because MLC fully uses Vulkan.

@JianbangZ
Author

pkg i virglrenderer-android

To clarify, you use Termux to install Firefox? No, I don't use Termux like that. I would install the Firefox apk on Android in the same way Termux is installed.
The point is @JianbangZ installs mlc.apk, running 7B models at 60 ms/token whereas llama.cpp is slower.

And I was saying he need install gpu drivers in termux.

Btw, I use termux with Firefox because I am using vr devices and I want a good environment to do anything.

I didn't know you have to (or can) install a GPU driver in Termux. Can you elaborate? The GPU is just an Adreno, which is pretty standard across the Snapdragon chip line.

@ghost

ghost commented Jul 5, 2023

I think it's because mlc fully uses Vulkan.

#2059 is @0cc4m's work toward a Vulkan implementation.

Supposedly, mlc.apk source is available - I wonder if there's anything in the source for llama.cpp to improve GPU acceleration on Android.

@FNsi
Contributor

FNsi commented Jul 5, 2023

the GPU is just Adreno, which is pretty standard in Snapdragon chip line.

I don't think so. Termux is only a simulator, and you have to install the specific driver to use the physical GPU.

guide-running-linux-on-android-with-3d-acceleration-opengl

@ghost

ghost commented Jul 5, 2023

the GPU is just Adreno, which is pretty standard in Snapdragon chip line.

I don't think so. Termux is only a simulator, and you have to install the specific driver to using the physical GPU.

The issue is GPU acceleration with OpenCL; OpenGL is not currently available in llama.cpp.

@FNsi
Contributor

FNsi commented Jul 5, 2023

The issue is GPU acceleration with OpenCL. OpenGL is not currently available in llama.cpp

I have to agree, I went too far off-topic discussing Termux. 😂

@AlphaAtlas

AlphaAtlas commented Jul 5, 2023

I think it's because mlc fully uses Vulkan.

#2059 is @0cc4m's work toward vulkan implementation.

Supposedly, mlc.apk source is available - I wonder if there's anything in the source for llama.cpp to improve GPU acceleration on Android.

The source is pretty simple, and lots of it is the "front end" for the chat.

Much of the performance comes from TVM's profiling/autotuning... it more than doubles the Vulkan performance in my tests.

A TVM backend for llama.cpp might not even be that crazy? Especially if y'all start exporting the graphs in a generic way. TVM supports really esoteric hardware/backends as targets (though MLC only had performance profiles for LLaMA 7B and Vulkan/Metal last time I checked).

@JianbangZ
Author

@JackJollimore Just a quick update
I ended up compiling on the phone itself (Snapdragon 8 Gen 2), which is faster. If I just do cmake .., a q4_0 model gives 4.2 t/s; if I add -DCMAKE_C_FLAGS=-march=armv8.4a (dotprod is enabled by default for 8.4a), it gives 5.5 t/s; if I use a q4_k_s model, it's close to 7 t/s. These are all on CPU. But I still haven't figured out how to enable v8.4+ (v8.5a, v8.7a, v9a) or SVE; they all give illegal-instruction errors.
In general I think the Gen 1/2 Cortex-X2/X3 core is causing some trouble.
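
For reference, the on-device builds described above roughly correspond to the following (a sketch; the build directory is assumed, and the armv9/SVE variants are the ones reported to fail with illegal instructions):

# baseline CPU-only build (~4.2 t/s with q4_0 as reported above)
cmake .. && make
# dotprod enabled via -march (~5.5 t/s reported above)
cmake .. -DCMAKE_C_FLAGS=-march=armv8.4a && make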

@ghost

ghost commented Jul 10, 2023

If I just do cmake .., q4_0 model gives 4.2 t/s, if I add DCMAKE_C_FLAGS=-march=armv8.4a (dotprod is enabled be default for 8.4a), it gives 5.5 t/s, if I use q4_k_s model, it's close to 7 t/s. These are all on CPU. But I still haven't figured out how to enable v8.4+(v8.5a, v8.7a , v9a) or SVE, all gives illegal instruction. In general I think the gen 1/2 Cortex-X2/3 core is giving some troubles

Thanks for the update. I always compile on my device, but I've only used CMake for OpenCL builds. I'll do some regular CMake builds tonight with some of these flags and see how it goes.

On another note: the OpenCL build produces excessively high perplexity, which is unfortunate, but at least it's a reported issue: #2133 (comment)

@AlphaAtlas

@JianbangZ 🤔 You mean march=native isn't enabling everything that's supported?

What version of llvm/clang is termux using?
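
One generic way to compare what the CPU reports against what the compiler actually enables (commands are illustrative, not from this thread):

# kernel-reported CPU features; look for asimddp (dotprod), sve, i8mm
grep Features /proc/cpuinfo | head -n 1
# macros clang defines for the chosen -march; look for __ARM_FEATURE_DOTPROD, __ARM_FEATURE_SVE
clang -march=native -dM -E - </dev/null | grep __ARM_FEATURE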

@ghost

ghost commented Jul 10, 2023

@AlphaAtlas
Here's the llvm & clang info in Termux:

clang version 16.0.6
Target: aarch64-unknown-linux-android24
Thread model: posix
InstalledDir: /data/data/com.termux/files/usr/bin

llvm/stable,now 16.0.6-1 aarch64

I tested a couple different CMake flags. It's interesting that the second build was slightly faster.

Test #1: cmake -B build -DCMAKE_C_FLAGS=-march=native

./main -m ~/wizardlm-7b-v1.0-uncensored.ggmlv3.q4_0.bin -b 7 -i -ins
main: build = 0 (unknown)
main: seed  = 1689024887
...
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 7, n_predict = -1, n_keep = 2


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.


> hi what's your name?
Hi, I am John.
> Hi john. please tell me a story about a mutant llama that took over Earth.
 Certainly! Once upon a time, in a far-off land called Bolivia, there lived a rare and unusual creature known as the llama.
 Llamas were typically docile creatures, but one day, a mutation occurred and a group of llamas developed a strange gene that gave them superpowers beyond their wildest dreams. 
Their first power was the ability to shoot fire out of their eyes!
... The llamas were eventually defeated and returned to their docile selves, leaving the world a safer place for all.

The end.
>

llama_print_timings:        load time = 10581.41 ms
llama_print_timings:      sample time =   683.01 ms /   290 runs   (    2.36 ms per token,   424.59 tokens per second)
llama_print_timings: prompt eval time = 23051.96 ms /    67 tokens (  344.06 ms per token,     2.91 tokens per second)
llama_print_timings:        eval time = 119262.20 ms /   290 runs   (  411.25 ms per token,     2.43 tokens per second)
llama_print_timings:       total time = 183115.07 ms

Test #2: cmake -B build -DCMAKE_C_FLAGS=-march=armv8.4a

./main -m ~/wizardlm-7b-v1.0-uncensored.ggmlv3.q4_0.bin -b 7 -i -ins
main: build = 0 (unknown)
main: seed  = 1689025429
...
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 7, n_predict = -1, n_keep = 2


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.


> Hi, whats your name?
Hello, my name is ______________.
> Hi ____. please tell me a story about a mutant llama that took over Earth.
 Once upon a time in a distant galaxy far, far away, there was a group of aliens who had a pet llama named Luna
... 
Luna still continues to protect the world as a mutant llama superhero, inspiring generations of humans to embrace diversity and acceptance.
>

llama_print_timings:        load time =  3894.19 ms
llama_print_timings:      sample time =   709.40 ms /   269 runs   (    2.64 ms per token,   379.19 tokens per second)
llama_print_timings: prompt eval time = 14990.36 ms /    67 tokens (  223.74 ms per token,     4.47 tokens per second)
llama_print_timings:        eval time = 86043.03 ms /   269 runs   (  319.86 ms per token,     3.13 tokens per second)
llama_print_timings:       total time = 148406.83 ms

@JianbangZ
Also, it's notable that compiling is a different issue from GPU acceleration, so it may be worth opening an issue where someone who knows more than me can chime in on compiling.

@JianbangZ
Author

I think the goal should be enabling the SVE instruction set introduced with the newer-generation Arm cores; I'm not an expert though. Also, I found that setting the thread count exactly equal to the number of prime + performance cores is optimal; for my 8 Gen 2 I set t=5 (1 prime + 4 performance cores), as in the example below.
Regarding GPU acceleration, the current OpenCL setup is very inefficient for mobile. MLC-LLM fully uses Vulkan and can reach 5-6 t/s without a single bit of CPU usage. Hopefully ggml can fully embrace Vulkan; I know there are 2 PRs focusing on this.
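
As a concrete illustration of the thread setting (the model file name here is a placeholder):

# 5 threads = 1 prime + 4 performance cores on a Snapdragon 8 Gen 2
./main -m ggml-model-q4_k_s.bin -t 5 -p "Hello"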

@AlphaAtlas

AlphaAtlas commented Jul 11, 2023

^

And it does seem that llvm is failing to identify the ARM architecture... Maybe it's just not implemented yet? Or maybe it's a Qualcomm thing.

If you really want to dive into this

@Nick-infinity

Nick-infinity commented Oct 24, 2023

Vulkan

Hello @JianbangZ, MLC LLM states on their GitHub repo that they use OpenCL on Adreno and Mali GPUs. That said, Android is moving towards Vulkan; Android and Google in general have introduced a shim layer that converts OpenCL and OpenGL calls to Vulkan under the hood.

@github-actions github-actions bot added the stale label Mar 25, 2024
Contributor

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 9, 2024