Please compile also clblast version! #7768

Closed
Zibri opened this issue Jun 5, 2024 · 46 comments

@Zibri commented Jun 5, 2024

I have a laptop with an NVIDIA GTX 970M.
The llama-b3078-bin-win-clblast-x64 build seemed to do some offloading and sped things up, using 3 GB of the card's onboard memory.
The other versions I tested don't do that.

@Zibri changed the title from "Please compile also opencl!" to "Please compile also clblast version!" on Jun 5, 2024
@0wwafa commented Jun 5, 2024

Yes, please! I have a similar notebook (GTX 980M)... the only versions that work are the AVX2 and CLBlast ones! So for now I am stuck on the latest version that was compiled with CLBlast.

@JohannesGaessler (Collaborator)

The GTX 900 series should have both CUDA and Vulkan support, both of which should be faster and better supported than OpenCL. In any case, unless someone volunteers to maintain the OpenCL backend, it will not be added back.
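
For anyone wanting to try those backends, a minimal build sketch, assuming the CMake option names from around this point in 2024 (they were later renamed with a GGML_ prefix), with the Vulkan SDK or CUDA toolkit installed:

# Vulkan build
cmake -B build -DLLAMA_VULKAN=ON
cmake --build build -j $(nproc)

# CUDA build
cmake -B build -DLLAMA_CUDA=ON
cmake --build build -j $(nproc)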

@metal3d (Contributor) commented Jun 5, 2024

Vulkan crashes at the moment, while CLBlast worked; see #7769.

@0wwafa commented Jun 5, 2024

With MSYS2 I just did:

git checkout 0cd6bd3483fa66124b76a8a8ac794d9ee18c70c1
pacman -Su mingw-w64-clang-x86_64-clblast
pacman -Su mingw-w64-clang-x86_64-opencl-clhpp
cmake -B build -DLLAMA_NATIVE=ON -DLLAMA_BUILD_SERVER=ON -DLLAMA_CLBLAST=ON -DBUILD_SHARED_LIBS=OFF
cmake --build build -j $(nproc)

And the compilation worked.

llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 8 repeating layers to GPU
llm_load_tensors: offloaded 8/33 layers to GPU
llm_load_tensors:        CPU buffer size =  5671.02 MiB
llm_load_tensors:     OpenCL buffer size =  1365.25 MiB
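
To run the resulting build with partial offloading, a rough sketch, assuming the old binary names from that commit (main and server under build/bin) and a placeholder model path:

# offload 8 layers to the GPU via the OpenCL/CLBlast backend
./build/bin/main -m /path/to/model.gguf -ngl 8 -p "Hello"
# or run the HTTP server with the same offload setting
./build/bin/server -m /path/to/model.gguf -ngl 8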

@0wwafa commented Jun 5, 2024

> The GTX 900 series should have both CUDA and Vulkan support, both of which should be faster and better supported than OpenCL. In any case, unless someone volunteers to maintain the OpenCL backend, it will not be added back.

I would, but I don't have the skills to do that... What I know is that with MSYS2 and CLANG64, llama.cpp compiled perfectly (until OpenCL support was removed).
And I can't make any other version use my GTX 980M's small memory to offload at least a few tensors.

@shibe2 (Contributor) commented Jun 5, 2024

I'm interested in working on an OpenCL back-end, but I would rewrite it rather than maintain the CLBlast back-end that was recently removed.

@JohannesGaessler (Collaborator)

If you want an explanation of the code and want to talk, feel free to contact me via email.

@shibe2 (Contributor) commented Jun 5, 2024

I worked on the CLBlast back-end in the past, and I'm familiar with the code. My reason to go for a rewrite would be different optimization targets, mostly devices where the CPU and GPU use the same physical RAM. Unfortunately, I currently don't have any suitable device, so I'm not working on it.

@0cc4m (Collaborator) commented Jun 5, 2024

> I worked on the CLBlast back-end in the past, and I'm familiar with the code. My reason to go for a rewrite would be different optimization targets, mostly devices where the CPU and GPU use the same physical RAM. Unfortunately, I currently don't have any suitable device, so I'm not working on it.

That is also something that could be done in the Vulkan backend. I've added some rudimentary support for Unified Memory Architecture, but haven't had the time to look further into it.

@0wwafa commented Jun 6, 2024

Even an old laptop or old GPU could have some value with this enabled. The CUDA libraries are HUGE, and in the end I did not notice any improvement in speed. With CLBlast, at least the GPU is used.

@sorasoras

I thought they were removing CLBlast support.

@Zibri (Author) commented Jun 6, 2024

> I thought they were removing CLBlast support.

They are, unless they find a maintainer.

@shibe2 (Contributor) commented Jun 6, 2024

It was already removed in 554c247, so to resolve this issue positively, that back-end would need to be added back.

@MaggotHATE (Contributor) commented Jun 8, 2024

Keep in mind that the Vulkan backend still has memory allocation issues that affect RAM usage. This can prevent you from using 7B models at higher quants, and larger models, on 16 GB of RAM, for example. Simply removing CLBlast before this issue is fixed was a bad idea.

Considering the small footprint of the CLBlast implementation and the relative unpopularity of MoE, it would be better to revert the change.

@Abhishek8394

Just want to add that if CLBlast returns, I'd be happy to volunteer as much of my time as I can.

@shibe2 (Contributor) commented Jun 8, 2024

What you can do right now is make a fork with the CLBlast back-end restored and maintain it.

One way to do it is to revert commit 554c247 and, in the future, rebase your branch on upstream. For example, I did that here: https://github.com/shibe2/llama.cpp/tree/clblast. I restored only the files that contain functional code, to avoid dealing with additional conflicts when the documentation changes. By the way, anyone who wants to keep using the CLBlast back-end can use the patch from my example fork as long as it applies cleanly.

Another way is to fork at 0cd6bd3, the last commit before the removal of CLBlast, and cherry-pick commits from upstream, excluding the one that removes CLBlast.
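
For anyone who wants to try this, a rough sketch of the two approaches; the fork URL, remote name, and commit list are placeholders:

# Approach 1: revert the removal commit, then rebase on upstream as it moves
git clone <your-fork-url> llama.cpp && cd llama.cpp
git remote add upstream <upstream llama.cpp URL>
git fetch upstream
git revert 554c247              # restores the CLBlast back-end files
git rebase upstream/master      # repeat periodically

# Approach 2: branch off the last commit before the removal and cherry-pick
git checkout -b clblast 0cd6bd3
git cherry-pick <newer upstream commits, skipping 554c247>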

@MaggotHATE (Contributor)

@shibe2 Thank you! I was wondering if the CLBlast removal was somehow necessary for the RoPE refactoring, but it looks like it wasn't. Now I really don't see a reason for 554c247.

I also want to point out that all this time, while unmaintained, CLBlast just worked, apart from MoE, of course. However, most new models are not MoE.

@JohannesGaessler (Collaborator)

> I also want to point out that all this time, while unmaintained, CLBlast just worked, apart from MoE, of course. However, most new models are not MoE.

There are issues even with non-MoE models.

@shibe2 (Contributor) commented Jun 8, 2024

I haven't used MoE models, so I haven't looked into it. What other models had problems with the CLBlast back-end?

@JohannesGaessler (Collaborator) commented Jun 8, 2024

There is this issue #7661 regarding LLaMA 2. Even disregarding concrete issues with incorrect results, a problem (for developers) is that, unlike CUDA/Vulkan/Metal, OpenCL does not fit the "backend" architecture that was developed at some point. At the time of removal it was in essence just crudely grafted onto the CPU backend, which causes additional work (it was like that because that is how GPU backends started, and no one has been maintaining OpenCL).

@MaggotHATE (Contributor)

> There is this issue #7661 regarding LLaMA 2. Even disregarding concrete issues with incorrect results, a problem (for developers) is that, unlike CUDA/Vulkan/Metal, OpenCL does not fit the "backend" architecture that was developed at some point. At the time of removal it was in essence just crudely grafted onto the CPU backend, which causes additional work (it was like that because that is how GPU backends started, and no one has been maintaining OpenCL).

This looks more like an ARM-specific corner case. At the same time, CLBlast works correctly with Llama 2 and later models on x86-64. I still don't see a reason to just remove something that works in most cases and doesn't require maintenance to do so.

Additionally, if the main reason is that nobody is maintaining CLBlast, please consider that Vulkan is maintained by essentially one person.

@JohannesGaessler (Collaborator)

> I still don't see a reason to just remove something that works in most cases and doesn't require maintenance to do so.

I don't either, but I do see a reason to remove something that does require maintenance effort when no dev is willing to put it in.

> Additionally, if the main reason is that nobody is maintaining CLBlast, please consider that Vulkan is maintained by essentially one person.

That's the state for essentially the entire project. If any of the core devs were to be run over by a bus, there would be no one to replace them.

@shibe2 (Contributor) commented Jun 8, 2024

> There is this issue #7661 regarding LLaMA 2.

Regarding it producing incorrect results, it may be a local issue, not necessarily a problem in the llama.cpp code. We don't know, because the back-end was nuked instead of the problem being investigated.

Regarding CLBlast being slower than the CPU, that is expected in some cases where RAM bandwidth is the performance bottleneck. CLBlast needs unquantized data, so model parameters are dequantized and stored in RAM before multiplication. Unless that data fully fits in cache, this adds memory writes and reads compared to doing the same computation on the CPU. That's why I would write an OpenCL back-end in a different way rather than keep relying on CLBlast.
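
As a rough back-of-the-envelope illustration (assuming Q4_0 weights at about 4.5 bits, i.e. ~0.56 bytes, per weight and dequantization to fp32; the exact numbers depend on the quant type): a fused quantized kernel streams roughly 0.56 bytes per weight from RAM, whereas the dequantize-then-GEMM path reads those 0.56 bytes, writes about 4 bytes of fp32 per weight, and reads them back for the multiplication, roughly 8 to 9 bytes per weight in total, so on the order of 15x more memory traffic whenever the dequantized matrix doesn't fit in cache.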

@Haus1 commented Jun 8, 2024

> That is also something that could be done in the Vulkan backend. I've added some rudimentary support for Unified Memory Architecture, but haven't had the time to look further into it.

Unfortunately, AMD pulled the old bait-and-switch and nerfed this functionality for a large portion of their GPU lineup. Last I checked, the Steam survey was reporting that the vast majority of systems with AMD GPUs fell into this camp.

> I found an up to 4000% decrease in performance just by switching from sycl::malloc_device() to sycl::malloc_shared() - even if all I do is repeatedly resubmit the same SYCL kernel
>
> most AMD discrete GPUs either disable the XNACK feature by default, or don't support it outright, even though the silicon theoretically appears to have had this capability since GFX8

https://stackoverflow.com/questions/76700305/4000-performance-decrease-in-sycl-when-using-unified-shared-memory-instead-of-d

While CLBlast wasn't the most performant, it was very robust when it came to squeezing every last byte into VRAM; it handled overallocation seamlessly.
Vulkan is much more unforgiving and will crash my X server, requiring a forceful reload of the amdgpu driver or a reboot.
HIP is even worse: overallocating will often require a hard reboot, as in fully powering down the system and letting the caps discharge.

@shibe2 (Contributor) commented Jun 8, 2024

Interesting. I was not using the hipBLAS back-end because at that time it was wasting VRAM, but now it seems less memory-hungry.

As for power cycling being required, it may be because of a hardware or firmware problem. I used to have something like that with an AMD GPU.

@Zibri (Author) commented Jun 10, 2024

I think the main feature of llama.cpp is its efficiency, and CLBlast just added to it.
I tried other solutions like ollama, but none worked well. llama.cpp is still the best there is.
Please re-add CLBlast.

@0wwafa commented Jun 12, 2024

> I think the main feature of llama.cpp is its efficiency, and CLBlast just added to it. I tried other solutions like ollama, but none worked well. llama.cpp is still the best there is. Please re-add CLBlast.

I completely agree, @Zibri

@0cc4m (Collaborator) commented Jun 12, 2024

> I think the main feature of llama.cpp is its efficiency, and CLBlast just added to it. I tried other solutions like ollama, but none worked well. llama.cpp is still the best there is. Please re-add CLBlast.

This is an open source project. No one here is opposed to having the backend, but someone has to put in the work, keep it up to date, and fix bugs. No one has worked on the OpenCL backend for a long time now; that's why it was removed. If you want to take over maintaining it, you're welcome to.

@MaggotHATE (Contributor)

> No one has worked on the OpenCL backend for a long time now; that's why it was removed.

If it were completely broken by this point, sure, but it's not. In fact, with CLBlast removed there's no other "universal" backend to test Vulkan against. AFAIK all the other backends are platform-specific.

@shibe2 (Contributor) commented Jun 12, 2024

You can add it back for yourself while it is not broken. When it breaks, someone will need to update it.

@MaggotHATE (Contributor)

> You can add it back for yourself while it is not broken.

Which will create questions about the implementation and doubts like "it's not a part of llama.cpp anymore."

All I'm saying is, removing CLBlast gained llama.cpp nothing. This backend, while partially broken, wasn't preventing anything else from working and was sitting in its own place; and yes, it's easy to add back for that exact reason.

@JohannesGaessler (Collaborator)

> All I'm saying is, removing CLBlast gained llama.cpp nothing.

Removing the backend resulted in a reduction in maintenance effort for developers. It also removed a footgun for users, which in turn reduces the number of bug reports and the amount of work developers have to put into sifting through GitHub issues.

@MaggotHATE (Contributor)

> in maintenance effort for developers

...but it wasn't maintained?

> removed a footgun for users

A warning during compilation would be enough, to my mind, and issues could be automatically ignored in that case.

@JohannesGaessler (Collaborator)

> ...but it wasn't maintained?

You still need to invest some effort in order to not completely break it. As I previously said, the OpenCL code was crudely grafted to the CPU code.

@MaggotHATE (Contributor)

> You still need to invest some effort in order to not completely break it.

OK, so it was preventing something from working (even if only in the future)? That changes the perspective.

Still, it was removed before Vulkan was on par with it memory-wise. As I mentioned before, this will cause issues in the future.

@ccbadd commented Jun 12, 2024

Don't you still have Kompute to test Vulkan against?

@MaggotHATE (Contributor)

> Kompute

I'm not even sure it's still maintained; I barely see any mention of it in PRs and commits since it was added. I remember it lacking partial offloading, and if that's still the case, it's pointless.

@0wwafa commented Jun 13, 2024

All I know is that it was working, and now it's not :(

@Abhishek8394

For anyone still looking for OpenCL, I will try to maintain a fork at https://github.com/Abhishek8394/llama.cpp under the branch opencl. It is up to date with llama.cpp as of now, i.e. commit 7b2f4a7.

Anyone who wants to help maintain this is welcome to. Also, if anyone can contribute automated builds of just the OpenCL version, that would be awesome.
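
For anyone trying it, a rough sketch of checking out and building that branch, reusing the CLBlast flags from the MSYS2 example earlier in this thread; treat the exact options as assumptions, since the fork may change them:

git clone -b opencl https://github.com/Abhishek8394/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_CLBLAST=ON -DLLAMA_NATIVE=ON -DBUILD_SHARED_LIBS=OFF
cmake --build build -j $(nproc)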

@MaggotHATE (Contributor)

@Abhishek8394 please note that the ggml_backend_buffer_type_i struct in ggml-backend-impl.h was changed (it no longer needs to check whether a backend is fully supported or not), and CLBlast will not compile unmodified. I've made a simple change for my own program, and it seems to work so far.

@0wwafa commented Jun 18, 2024

> @Abhishek8394 please note that the ggml_backend_buffer_type_i struct in ggml-backend-impl.h was changed (it no longer needs to check whether a backend is fully supported or not), and CLBlast will not compile unmodified. I've made a simple change for my own program, and it seems to work so far.

I sincerely hope he will reconsider and add it back.

github-actions bot added the stale label on Jul 19, 2024

github-actions bot commented Aug 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Aug 2, 2024
@Zibri (Author) commented Aug 2, 2024

@ggerganov kobold.cpp still supports it. Why can't llama.cpp?

@0cc4m (Collaborator) commented Aug 3, 2024

> kobold.cpp still supports it. Why can't llama.cpp?

Will you maintain it?

@0wwafa commented Aug 3, 2024

> > kobold.cpp still supports it. Why can't llama.cpp?
>
> Will you maintain it?

The kobold.cpp people are using it... just get it from them, no?

@0cc4m (Collaborator) commented Aug 4, 2024

> > > kobold.cpp still supports it. Why can't llama.cpp?
> >
> > Will you maintain it?
>
> The kobold.cpp people are using it... just get it from them, no?

You can use koboldcpp if you want to use that. The thing is that @LostRuins is putting in the minimal amount of work to keep the CLBlast version running, but he's not actually maintaining the backend. If a feature doesn't work with CLBlast, it just gets disabled in that version of koboldcpp instead of being implemented in the CLBlast backend. That is not how llama.cpp operates.

If you want to reintroduce it to llama.cpp, find someone to bring it up to date with the backend system in llama.cpp and to improve and maintain it over time.
