Please compile also clblast version! #7768
Yes, please! I have a similar notebook (GTX 980M)... the only versions that work are the AVX2 and the CLBlast builds! So now I am stuck with the latest version compiled with CLBlast... |
The GTX 900 series should have both CUDA and Vulkan support, both of which should be faster and better supported than OpenCL. In any case, unless someone volunteers to maintain the OpenCL backend, it will not be added back. |
Vulkan crashes at this time, while CLBlast worked, see #7769 |
With MSYS2 I just did:
And the compilation worked. |
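The command itself is not shown above; a plausible reconstruction, assuming the MSYS2 CLANG64 environment mentioned in the next comment, the MSYS2 package names, and the LLAMA_CLBLAST CMake option that existed before the back-end was removed:

```sh
# Inside an MSYS2 CLANG64 shell; package names are assumptions
pacman -S mingw-w64-clang-x86_64-toolchain \
          mingw-w64-clang-x86_64-cmake \
          mingw-w64-clang-x86_64-clblast

# Configure and build llama.cpp with the (since removed) CLBlast back-end
cmake -B build -DLLAMA_CLBLAST=ON
cmake --build build --config Release
```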
I would, but I don't have the skill to do that... what I know is that using MSYS2 and CLANG64, llama.cpp compiled perfectly (until OpenCL support was removed). |
I'm interested in working on the OpenCL back-end. But I would rewrite it rather than maintain the CLBlast back-end that was recently removed. |
If you want an explanation of the code and want to talk, feel free to contact me via email. |
I worked on the CLBlast back-end in the past, and I'm familiar with the code. My reason to go for a rewrite would be different targets for optimization, mostly devices where the CPU and GPU use the same physical RAM. Unfortunately, I currently don't have any suitable device, so I'm not working on it. |
That is also something that could be done on the Vulkan backend. I've implemented some rudimentary support for Unified Memory Architecture, but haven't had the time to look further into it. |
Even an old laptop/old GPU could have some value with this enabled. CUDA libraries are HUGE, and in the end I did not notice any improvement in speed... with CLBlast, at least the GPU is used. |
I thought they were removing CLBlast support. |
They are unless they find a maintainer. |
It was already removed in 554c247, so to resolve this issue positively, that back-end would need to be added back. |
Keep in mind that the Vulkan backend still has memory allocation issues that affect RAM usage. This can prevent you from using 7B models in higher quants, and larger models, on 16 GB of RAM, for example. Simply removing CLBlast before this issue is fixed was a bad idea. Considering the small footprint of the CLBlast implementation and the relative unpopularity of MoE, it's better to revert the change. |
Just want to add that if CLBlast returns, I'd be happy to volunteer my time as much as I can. |
What you can do right now is make a fork with the CLBlast back-end restored and maintain it. One way to do it is to revert the commit 554c247 and, in the future, rebase your branch on upstream. For example, I did that here: https://github.com/shibe2/llama.cpp/tree/clblast I restored only files that contain functional code, to avoid dealing with additional conflicts when the documentation changes. By the way, anyone who wants to keep using the CLBlast back-end can use the patch from my example fork while it applies cleanly. Another way is to fork at the last commit before the removal of CLBlast, 0cd6bd3, and cherry-pick commits from upstream, excluding the one that removes CLBlast. |
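A sketch of that revert-and-rebase workflow in plain git (the upstream URL and branch name are assumptions; the commit hashes are the ones cited above):

```sh
# Option 1: revert the removal commit, then keep rebasing on upstream
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout -b clblast
git revert 554c247                            # restores the CLBlast back-end
git fetch origin && git rebase origin/master  # repeat as upstream moves on

# Option 2: branch off the last commit before the removal
git checkout -b clblast 0cd6bd3
# ...then cherry-pick upstream commits, skipping the one that removes CLBlast
```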
@shibe2 Thank you! I was wondering if the CLBlast removal was somehow necessary for the RoPE refactoring, but it looks like it wasn't. Now I really don't see a reason for 554c247. I also want to point out that all this time, while being unmaintained, CLBlast just worked, apart from MoE, of course. However, most new models are not MoE. |
There are issues even with non-MoE models. |
I haven't used MoE models, so I haven't looked into it. What other models had problems with the CLBlast back-end? |
There is this issue #7661 regarding LLaMA 2. Even disregarding concrete issues with incorrect results, an issue (for developers) is that, unlike CUDA/Vulkan/Metal, OpenCL does not fit the "backend" architecture that was developed at some point. At the time of removal it was in essence just crudely grafted onto the CPU backend, which causes additional work (it was like that because that is how GPU backends started, and no one has been maintaining OpenCL). |
This looks more like an ARM-specific corner case. At the same time, CLBlast works correctly with LLaMA 2 and later models on x86-64. I still don't see a reason to just remove something that works in most cases and doesn't require maintenance to keep doing so. Additionally, if the main reason is that nobody is maintaining CLBlast, please consider that Vulkan is maintained by essentially one person only. |
I don't either, but I do see a reason to remove something that does require maintenance effort with no dev willing to put it in.
That's the state for essentially the entire project. If any of the core devs were to be run over by a bus there would be no one to replace them. |
Regarding it producing incorrect results, it may be a local issue, not necessarily a problem in llama.cpp code. We don't know, because the back-end was nuked instead of the problem being investigated. Regarding CLBlast being slower than the CPU, it is expected in some cases where RAM bandwidth is the performance bottleneck. CLBlast needs unquantized data, and so model parameters are dequantized and stored in RAM before multiplication. Unless everything fully fits in cache, this adds memory writes and reads compared to doing the same computation on the CPU. That's why I would write an OpenCL back-end in a different way rather than keep relying on CLBlast. |
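To put rough numbers on that extra traffic (a back-of-the-envelope estimate, assuming Q4_0 weights dequantized to fp32 for the matrix multiplication): a Q4_0 block packs 32 weights into 18 bytes (16 bytes of 4-bit values plus a 2-byte scale), about 0.56 bytes per weight, while dequantizing first adds a 4-byte write and a 4-byte read per weight:

$$
\frac{0.56 + 4 + 4}{0.56} \approx 15\times
$$

the per-weight memory traffic of a kernel that reads the quantized data directly, which is why the dequantize-then-multiply path can lose to the CPU when bandwidth is the bottleneck.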
Unfortunately, AMD pulled the old bait-and-switch and nerfed this functionality for a large portion of their GPU lineup. Last I checked, Steam was reporting that the vast majority of systems with AMD GPUs fell into this camp.
While CLBlast wasn't the most performant, it was very robust when it came to squeezing every last byte into VRAM. It handled overallocation seamlessly. |
Interesting. I was not using the hipBLAS back-end because at that time it was wasting VRAM. But now it seems less memory-hungry. As for power cycling being required, it may be because of a hardware or firmware problem. I used to have something like that with an AMD GPU. |
I think the main feature of llama.cpp is its efficiency. CLBlast just added to it. |
I completely agree, @Zibri |
This is an open source project. No one here is opposed to having the backend, but someone has to put in the work, keep it up to date, and fix bugs. No one has worked on the OpenCL backend for a long time now; that's why it was removed. If you want to take over maintaining it, you're welcome to. |
If it were completely broken by this point, sure, but it's not. In fact, with CLBlast removed there's no other "universal" backend to test Vulkan against. AFAIK all other backends are platform-specific. |
You can add it back for yourself while it is not broken. When it breaks, someone will need to update it. |
Which will create questions about implementation and doubts like "it's not a part of llama.cpp anymore." All I'm saying is, removing CLBlast gained llama.cpp nothing. This backend, while partially broken, wasn't preventing anything else from working and was sitting in its own place; and yes, it's easy to add back for that exact reason. |
Removing the backend resulted in a reduction in maintenance effort for developers. And it also removed a footgun for users, which in turn reduces the number of bug reports and the amount of work developers have in terms of sifting through GitHub issues. |
...but it wasn't maintained?
A warning during compilation would be enough, to my mind, and issues could be automatically ignored in that case. |
You still need to invest some effort in order to not completely break it. As I previously said, the OpenCL code was crudely grafted to the CPU code. |
OK, so it was preventing something from working (even if in the future)? That changes the perspective. Still, it was removed before Vulkan was on par with it memory-wise. Like I mentioned before, this will cause issues in the future. |
Don't you still have Kompute to test Vulkan against? |
I'm not even sure it's still maintained; I barely see any mention of it in PRs and commits since it was added. I remember it lacking partial offloading, and if that's still the case, it's pointless. |
All I know is that it was working, and now it's not :( |
For anyone still looking for OpenCL, I will try to maintain a fork at https://github.com/Abhishek8394/llama.cpp under a dedicated branch. Anyone who wants to help maintain this is welcome to. Also, if anyone can contribute automated builds of just OpenCL, that would be awesome too. |
@Abhishek8394 please note that |
I sincerely hope he will reconsider and add it back. |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
@ggerganov kobold.cpp still supports it. Why can't llama.cpp? |
Will you maintain it? |
kobold.cpp people are using it... just get it from them, no? |
You can use koboldcpp if you want to use that. The thing is that @LostRuins is putting in the minimal amount of work to keep the CLBlast version running, but he's not actually maintaining the backend. If a feature doesn't work with CLBlast, it just gets disabled in that version of koboldcpp, instead of implementing it in the CLBlast backend. That is not how llama.cpp operates. If you want to reintroduce it to llama.cpp, find someone to bring it up to date with the backend system in llama.cpp and to improve and maintain it over time. |
I have a laptop with an NVIDIA GTX 970M...
The llama-b3078-bin-win-clblast-x64 build seemed to do some offloading and speed things up, using 3 GB of the onboard memory.
The other versions I tested don't do that.