Poor Android performance and GPU not used at all when built with OpenCL #2052
To enable GPU acceleration, use the -ngl # parameter. Adjust CPU usage by modifying the --threads # parameter in ./main. Changing these parameters isn't going to produce 60 ms/token though. I'd love it if llama.cpp fully utilised the Android GPU, but offloading to the GPU decreases performance for me. It would be great if whatever they're doing were adapted so llama.cpp could utilise the Android GPU. |
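For illustration, a minimal sketch of how those two flags are combined on one command line (the model path and the specific numbers are assumptions, not values from this thread):
# Hypothetical invocation: offload 32 layers to the GPU and use 4 CPU threads.
$ ./main -m ./models/ggml-model-q4_0.bin -p "Hello" --threads 4 -ngl 32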
I tried -ngl with different numbers; it makes performance worse. |
Is it because of Termux itself? Try
|
To clarify, you use Termux to install Firefox? No, I don't use Termux like that. I would install the Firefox apk on Android in the same way Termux is installed. The point is that @JianbangZ installs the mlc apk and runs 7B models at 60 ms/token, whereas llama.cpp is slower. |
And I was saying he needs to install GPU drivers in Termux. Btw, I use Termux with Firefox because I am using VR devices and I want a good environment for doing anything. |
I should correct myself: 60 ms/t is the speed for prefill; the later decode speed is roughly 130 ms/t. But still, it's 2x faster than the current llama.cpp when built with OpenCL. I think it's because mlc fully uses Vulkan. |
I didn't know you have to, or even can, install a GPU driver in Termux? Can you elaborate? The GPU is just an Adreno, which is pretty standard in the Snapdragon chip line. |
I don't think so. Termux is only a simulator, and you have to install the specific driver to use the physical GPU. |
The issue is GPU acceleration with OpenCL; OpenGL is not currently available in llama.cpp. |
I have to agree, I went too far off topic discussing Termux. 😂 |
The source is pretty simple, and a lot of it is the "front end" for the chat. Much of the performance comes from TVM's profiling/autotuning... it more than doubles the Vulkan performance in my tests. A TVM backend for llama.cpp might not even be that crazy? Especially if y'all start exporting the graphs in a generic way. TVM supports really esoteric hardware/backends as targets (though MLC only had performance profiles for LLaMA 7B and Vulkan/Metal last time I checked). |
@JackJollimore Just a quick update |
Thanks for the update. I always compile on my device, but only CMake'd for OpenCL builds. I'll CMake some regular builds tonight, along with some of these flags, and see how it goes. On another note: the OpenCL build produces excessively high perplexity, which is unfortunate, but at least it's a reported issue: #2133 (comment) |
@JianbangZ 🤔 You mean: what version of llvm/clang is Termux using? |
@AlphaAtlas
I tested a couple different CMake flags. It's interesting that the second build was slightly faster. Test #1: cmake -B build -DCMAKE_C_FLAGS=-march=native
Test #2: cmake -B build -DCMAKE_C_FLAGS=-march=armv8.4a
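As a hedged aside, two standard commands to check what the SoC and Termux's clang actually report (output varies by device and clang version):
# Kernel-reported CPU feature flags (look for asimddp/dotprod, sve, etc.).
$ grep -m1 Features /proc/cpuinfo
# CPU models known to this clang build.
$ clang --print-supported-cpus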
@JianbangZ |
I think the goal should be enabling the SVE instruction set introduced for newer-generation Arm cores, though I'm not an expert. Also, I found that using a thread count exactly equal to the number of A + B cores is optimal; for my 8 Gen 2, I set t=5 (1 A, 4 B). |
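A hedged way to see that core split without guessing, assuming that a higher maximum frequency marks a bigger core:
# Print each core's maximum frequency; the cores with the highest values are the prime/performance cores.
$ cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq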
^ And it does seem that LLVM is failing to identify the ARM architecture... Maybe it's just not implemented yet? Or maybe it's a Qualcomm thing. If you really want to dive into this
|
Hello @JianbangZ, MLC LLM's GitHub repo states that they use OpenCL on Adreno and Mali GPUs. That said, Android is moving towards Vulkan; Android and Google in general have introduced a shim layer that converts OpenCL and OpenGL calls to Vulkan under the hood. |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
I have a phone with a Snapdragon 8 Gen 2 (the best Snapdragon chip), and I have been trying to make llama.cpp work through Termux.
I followed the compiling instructions exactly. What I found is below.
(1) Method 1: Normal
$ mkdir build-android
$ cd build-android
$ export NDK=<your_ndk_directory>
$ cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-23 -DCMAKE_C_FLAGS=-march=armv8.4a+dotprod ..
$ make
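If the binary was cross-compiled with the NDK rather than built inside Termux, here is a hedged sketch of getting it onto the device and running it (paths and the model file name are assumptions):
$ adb push build-android/bin/main /data/local/tmp/
$ adb push models/ggml-model-q4_0.bin /data/local/tmp/
$ adb shell /data/local/tmp/main -m /data/local/tmp/ggml-model-q4_0.bin -p "Hello" -t 4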
The inference speed is extremely slow, like 4 seconds/token.
(2) Method 2: OpenCL
I followed the steps and used "make LLAMA_CLBLAST=1".
Inference speed is much faster, but still only 250 ms/token.
When I checked the CPU usage using "adb shell -> top", I found that only the CPU is used, at 400% usage.
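A hedged way to check whether the OpenCL ICD in Termux sees the Adreno GPU at all (assumes the clinfo tool is installed; output varies by device):
# If no platform/device is listed here, the CLBlast build cannot offload to the GPU.
$ clinfo | grep -iE "platform name|device name"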
(3) I tried another Android project, https://mlc.ai/mlc-llm/#android, using their prebuilt APK with a Vicuna 7B 4-bit model. The inference speed is 16 tokens/s, i.e. 60 ms/token, and CPU usage is only 70%, so it seems the GPU is correctly used by the mlc-llm project.