Poor Android performance and GPU not used at all when built with OpenCL #2052

Closed
JianbangZ opened this issue Jun 29, 2023 · 20 comments

@JianbangZ

I have a phone with a Snapdragon 8 Gen 2 (the best Snapdragon chip) and have been trying to make llama.cpp work through Termux.
I followed the compiling instructions exactly. Here is what I found:
(1) Method 1: Normal
$ mkdir build-android
$ cd build-android
$ export NDK=<your_ndk_directory>
$ cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod ..
$ make

The inference speed is extremely slow, around 4 seconds/token.

(2) Method 2: OpenCL
I followed the steps and used "make LLAMA_CLBLAST=1".
Inference speed is much faster, but still only 250 ms/token.
When I check the CPU usage via "adb shell -> top", I found that only the CPU is used (400% usage).

(3) I tried another Android project, https://mlc.ai/mlc-llm/#android, using their prebuilt APK with a Vicuna 7B 4-bit model. The inference speed is 16 tokens/s (i.e. 60 ms/token) and CPU usage is only 70%, so it seems the GPU is used correctly in the mlc-llm project.

@JianbangZ JianbangZ changed the title Android GPU not used at all even built with OpenCL Poor Android performance and GPU not used at all when built with OpenCL Jun 29, 2023
@ghost

ghost commented Jun 29, 2023

(2) Method 2: OpenCL I follow the steps and use "make LLAMA_CLBLAST=1" Inference speed is much faster, but still only 250 ms/token.

To enable GPU acceleration, use the -ngl # parameter. Adjust CPU usage by modifying the --threads # parameter in ./main.
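
For example (the model path and the layer/thread counts below are illustrative placeholders, not values from this thread):

# offload 20 layers to the GPU and run on 4 CPU threads (example values only)
./main -m models/7B/ggml-model-q4_0.bin -ngl 20 --threads 4 -p "Hello"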

Changing these parameters isn't going to produce 60 ms/token, though - I'd love it if llama.cpp fully utilised the Android GPU, but offloading to the GPU decreases performance for me.

It would be great if whatever they're doing is converted for llama.cpp to utilise Android GPU.

@JianbangZ
Author

(2) Method 2: OpenCL I follow the steps and use "make LLAMA_CLBLAST=1" Inference speed is much faster, but still only 250 ms/token.

To enable GPU acceleration then use the -ngl # parameter. Adjust CPU usage by modifing the --threads # parameter in ./main

Changing these parameters isn't gonna produce 60ms/token though - I'd love if llama.cpp fully utilised Android GPU, but Offloading to GPU decreases performance for me.

It would be great if whatever they're doing is converted for llama.cpp to fully utilise the GPU.

I tried -ngl with different numbers; it makes performance worse.

@FNsi
Contributor

FNsi commented Jun 29, 2023

Is it because of Termux itself?
I ask because I can't play videos well in Firefox installed inside Termux's Ubuntu environment.

Try

pkg i virglrenderer-android
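
Roughly, the virgl setup looks like this (a sketch only; virgl_test_server_android and GALLIUM_DRIVER=virpipe are the usual names for this package, and note this accelerates OpenGL rather than OpenCL):

# in Termux: install and start the virgl server
pkg install virglrenderer-android
virgl_test_server_android &
# inside the proot distro: tell Mesa to use the virgl pipe
export GALLIUM_DRIVER=virpipe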

@ghost

ghost commented Jul 5, 2023

Is it because termux itself? I asked because I can't play videos well by Firefox which installed in termux Ubuntu.

Try

pkg i virglrenderer-android

To clarify, you mean I use Termux to install Firefox? No, I don't use Termux like that. I would install the Firefox APK on Android the same way Termux is installed.

The point is that @JianbangZ installed the MLC APK and runs 7B models at 60 ms/token, whereas llama.cpp is slower.

@FNsi
Contributor

FNsi commented Jul 5, 2023

pkg i virglrenderer-android

To clarify, you use Termux to install Firefox? No, I don't use Termux like that. I would install the Firefox apk on Android in the same way Termux is installed.

The point is @JianbangZ installs mlc.apk, running 7B models at 60 ms/token whereas llama.cpp is slower.

And I was saying he needs to install GPU drivers in Termux.

Btw, I use Termux with Firefox because I'm using VR devices and want a good environment to do anything.

@JianbangZ
Author

I should correct myself: 60 ms/t is the speed for prefill; the subsequent decode speed is roughly 130 ms/t. Still, that's 2x faster than the current llama.cpp built with OpenCL. I think it's because MLC fully uses Vulkan.

@JianbangZ
Author

pkg i virglrenderer-android

To clarify, you use Termux to install Firefox? No, I don't use Termux like that. I would install the Firefox apk on Android in the same way Termux is installed.
The point is @JianbangZ installs mlc.apk, running 7B models at 60 ms/token whereas llama.cpp is slower.

And I was saying he need install gpu drivers in termux.

Btw, I use termux with Firefox because I am using vr devices and I want a good environment to do anything.

I didn't know you have to (or can) install a GPU driver in Termux. Can you elaborate? The GPU is just an Adreno, which is pretty standard across the Snapdragon chip line.

@ghost

ghost commented Jul 5, 2023

I think it's because mlc fully uses Vulkan.

#2059 is @0cc4m's work toward a Vulkan implementation.

Supposedly, mlc.apk source is available - I wonder if there's anything in the source for llama.cpp to improve GPU acceleration on Android.

@FNsi
Contributor

FNsi commented Jul 5, 2023

the GPU is just Adreno, which is pretty standard in Snapdragon chip line.

I don't think so. Termux is only a simulator, and you have to install the specific driver to use the physical GPU.

guide-running-linux-on-android-with-3d-acceleration-opengl

@ghost

ghost commented Jul 5, 2023

the GPU is just Adreno, which is pretty standard in Snapdragon chip line.

I don't think so. Termux is only a simulator, and you have to install the specific driver to using the physical GPU.

The issue is GPU acceleration with OpenCL; OpenGL is not currently available in llama.cpp.

@FNsi
Contributor

FNsi commented Jul 5, 2023

The issue is GPU acceleration with OpenCL. OpenGL is not currently available in llama.cpp

I have to agree, I went too far off-topic discussing Termux. 😂

@AlphaAtlas

AlphaAtlas commented Jul 5, 2023

I think it's because mlc fully uses Vulkan.

#2059 is @0cc4m's work toward vulkan implementation.

Supposedly, mlc.apk source is available - I wonder if there's anything in the source for llama.cpp to improve GPU acceleration on Android.

The source is pretty simple, and lots of it is the "front end" for the chat.

Much of the performance comes from TVM's profiling/autotuning... it more than doubles the Vulkan performance in my tests.

A TVM backend for llama.cpp might not even be that crazy? Especially if y'all start exporting the graphs in a generic way. TVM supports really esoteric hardware/backends as targets (though MLC only had performance profiles for LLaMA 7B and Vulkan/Metal last time I checked).

@JianbangZ
Author

@JackJollimore Just a quick update
I ended up compiling on the phone itself (Snapdragon 8 Gen 2), which is faster. If I just do cmake .., a q4_0 model gives 4.2 t/s; if I add -DCMAKE_C_FLAGS=-march=armv8.4a (dotprod is enabled by default for 8.4a), it gives 5.5 t/s; if I use a q4_k_s model, it's close to 7 t/s. These are all on CPU. But I still haven't figured out how to enable v8.4+ (v8.5a, v8.7a, v9a) or SVE; they all give illegal-instruction errors.
In general I think the Gen 1/2 Cortex-X2/X3 core is causing some trouble.
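
For reference, the on-device builds described above roughly correspond to the following (a sketch; the build directory is assumed, and the armv9/SVE variants are the ones reported to fail with illegal instructions):

# baseline CPU-only build (~4.2 t/s with q4_0 as reported above)
cmake .. && make
# dotprod enabled via -march (~5.5 t/s reported above)
cmake .. -DCMAKE_C_FLAGS=-march=armv8.4a && make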

@ghost

ghost commented Jul 10, 2023

If I just do cmake .., q4_0 model gives 4.2 t/s, if I add DCMAKE_C_FLAGS=-march=armv8.4a (dotprod is enabled be default for 8.4a), it gives 5.5 t/s, if I use q4_k_s model, it's close to 7 t/s. These are all on CPU. But I still haven't figured out how to enable v8.4+(v8.5a, v8.7a , v9a) or SVE, all gives illegal instruction. In general I think the gen 1/2 Cortex-X2/3 core is giving some troubles

Thanks for the update. I always compile on my device, but I've only used CMake for OpenCL builds. I'll do some regular CMake builds tonight with some of these flags and see how it goes.

On another note: the OpenCL build produces excessively high perplexity, which is unfortunate, but at least it's a reported issue: #2133 (comment)

@AlphaAtlas

@JianbangZ 🤔 You mean march=native isn't enabling everything that's supported?

What version of llvm/clang is termux using?
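
One generic way to compare what the CPU reports against what the compiler actually enables (commands are illustrative, not from this thread):

# kernel-reported CPU features; look for asimddp (dotprod), sve, i8mm
grep Features /proc/cpuinfo | head -n 1
# macros clang defines for the chosen -march; look for __ARM_FEATURE_DOTPROD, __ARM_FEATURE_SVE
clang -march=native -dM -E - </dev/null | grep __ARM_FEATURE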

@ghost

ghost commented Jul 10, 2023

@AlphaAtlas
Here's the llvm & clang info in Termux:

clang version 16.0.6
Target: aarch64-unknown-linux-android24
Thread model: posix
InstalledDir: /data/data/com.termux/files/usr/bin

llvm/stable,now 16.0.6-1 aarch64

I tested a couple different CMake flags. It's interesting that the second build was slightly faster.

Test #1: cmake -B build -DCMAKE_C_FLAGS=-march=native

./main -m ~/wizardlm-7b-v1.0-uncensored.ggmlv3.q4_0.bin -b 7 -i -ins
main: build = 0 (unknown)
main: seed  = 1689024887
...
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 7, n_predict = -1, n_keep = 2


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.


> hi what's your name?
Hi, I am John.
> Hi john. please tell me a story about a mutant llama that took over Earth.
 Certainly! Once upon a time, in a far-off land called Bolivia, there lived a rare and unusual creature known as the llama.
 Llamas were typically docile creatures, but one day, a mutation occurred and a group of llamas developed a strange gene that gave them superpowers beyond their wildest dreams. 
Their first power was the ability to shoot fire out of their eyes!
... The llamas were eventually defeated and returned to their docile selves, leaving the world a safer place for all.

The end.
>

llama_print_timings:        load time = 10581.41 ms
llama_print_timings:      sample time =   683.01 ms /   290 runs   (    2.36 ms per token,   424.59 tokens per second)
llama_print_timings: prompt eval time = 23051.96 ms /    67 tokens (  344.06 ms per token,     2.91 tokens per second)
llama_print_timings:        eval time = 119262.20 ms /   290 runs   (  411.25 ms per token,     2.43 tokens per second)
llama_print_timings:       total time = 183115.07 ms

Test #2: cmake -B build -DCMAKE_C_FLAGS=-march=armv8.4a

./main -m ~/wizardlm-7b-v1.0-uncensored.ggmlv3.q4_0.bin -b 7 -i -ins
main: build = 0 (unknown)
main: seed  = 1689025429
...
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 7, n_predict = -1, n_keep = 2


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.


> Hi, whats your name?
Hello, my name is ______________.
> Hi ____. please tell me a story about a mutant llama that took over Earth.
 Once upon a time in a distant galaxy far, far away, there was a group of aliens who had a pet llama named Luna
... 
Luna still continues to protect the world as a mutant llama superhero, inspiring generations of humans to embrace diversity and acceptance.
>

llama_print_timings:        load time =  3894.19 ms
llama_print_timings:      sample time =   709.40 ms /   269 runs   (    2.64 ms per token,   379.19 tokens per second)
llama_print_timings: prompt eval time = 14990.36 ms /    67 tokens (  223.74 ms per token,     4.47 tokens per second)
llama_print_timings:        eval time = 86043.03 ms /   269 runs   (  319.86 ms per token,     3.13 tokens per second)
llama_print_timings:       total time = 148406.83 ms

@JianbangZ
Also, it's notable that compiling is a different issue from GPU acceleration, so it may be worth opening an issue where someone who knows more than me can chime in on compiling.

@JianbangZ
Author

I think the goal should be enabling the SVE instruction set introduced with the newer-generation Arm cores; I'm not an expert though. Also, I found that setting the thread count exactly equal to the number of prime + performance cores is optimal; for my 8 Gen 2 I set t=5 (1 prime + 4 performance cores), as in the example below.
Regarding GPU acceleration, the current OpenCL setup is very inefficient for mobile. MLC-LLM fully uses Vulkan and can reach 5-6 t/s without a single bit of CPU usage. Hopefully ggml can fully embrace Vulkan; I know there are 2 PRs focusing on this.
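
As a concrete illustration of the thread setting (the model file name here is a placeholder):

# 5 threads = 1 prime + 4 performance cores on a Snapdragon 8 Gen 2
./main -m ggml-model-q4_k_s.bin -t 5 -p "Hello"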

@AlphaAtlas

AlphaAtlas commented Jul 11, 2023

^

And it does seem that llvm is failing to identify the ARM architecture... Maybe it's just not implemented yet? Or maybe it's a Qualcomm thing.

If you really want to dive into this

@Nick-infinity

Nick-infinity commented Oct 24, 2023

Vulkan

Hello @JianbangZ, MLC LLM states on their GitHub repo that they use OpenCL on Adreno and Mali GPUs. That said, Android is moving towards Vulkan; Android and Google in general have introduced a shim layer that converts OpenCL and OpenGL calls to Vulkan under the hood.

@github-actions github-actions bot added the stale label Mar 25, 2024
Contributor

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 9, 2024