Please compile also clblast version! #7768

Closed
Zibri opened this issue Jun 5, 2024 · 46 comments

@Zibri commented Jun 5, 2024

I have a laptop with an NVIDIA GTX 970M.
The llama-b3078-bin-win-clblast-x64 build seemed to do some offloading and sped things up, using 3 GB of the card's onboard memory.
The other versions I tested don't do that.

@Zibri changed the title from "Please compile also opencl!" to "Please compile also clblast version!" on Jun 5, 2024
@0wwafa commented Jun 5, 2024

Yes, please! I have a similar notebook (GTX 980M)... the only versions that work are the AVX2 and CLBlast ones! So for now I am stuck on the latest version that was compiled with CLBlast.

@JohannesGaessler (Collaborator)

The GTX 900 series should have both CUDA and Vulkan support, both of which should be faster and better supported than OpenCL. In any case, unless someone volunteers to maintain the OpenCL backend, it will not be added back.
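
For anyone wanting to try those backends, a minimal build sketch, assuming the CMake option names from around this point in 2024 (they were later renamed with a GGML_ prefix), with the Vulkan SDK or CUDA toolkit installed:

# Vulkan build
cmake -B build -DLLAMA_VULKAN=ON
cmake --build build -j $(nproc)

# CUDA build
cmake -B build -DLLAMA_CUDA=ON
cmake --build build -j $(nproc)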

@metal3d (Contributor) commented Jun 5, 2024

Vulkan crashes at the moment, while CLBlast worked; see #7769.

@0wwafa commented Jun 5, 2024

With MSYS2 I just did:

git checkout 0cd6bd3483fa66124b76a8a8ac794d9ee18c70c1
pacman -Su mingw-w64-clang-x86_64-clblast
pacman -Su mingw-w64-clang-x86_64-opencl-clhpp
cmake -B build -DLLAMA_NATIVE=ON -DLLAMA_BUILD_SERVER=ON -DLLAMA_CLBLAST=ON -DBUILD_SHARED_LIBS=OFF
cmake --build build -j $(nproc)

And the compilation worked.

llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 8 repeating layers to GPU
llm_load_tensors: offloaded 8/33 layers to GPU
llm_load_tensors:        CPU buffer size =  5671.02 MiB
llm_load_tensors:     OpenCL buffer size =  1365.25 MiB
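
To run the resulting build with partial offloading, a rough sketch, assuming the old binary names from that commit (main and server under build/bin) and a placeholder model path:

# offload 8 layers to the GPU via the OpenCL/CLBlast backend
./build/bin/main -m /path/to/model.gguf -ngl 8 -p "Hello"
# or run the HTTP server with the same offload setting
./build/bin/server -m /path/to/model.gguf -ngl 8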

@0wwafa commented Jun 5, 2024

> The GTX 900 series should have both CUDA and Vulkan support, both of which should be faster and better supported than OpenCL. In any case, unless someone volunteers to maintain the OpenCL backend, it will not be added back.

I would, but I don't have the skills to do that... What I know is that with MSYS2 and CLANG64, llama.cpp compiled perfectly (until OpenCL support was removed).
And I can't make any other version use my GTX 980M's small memory to offload at least a few tensors.

@shibe2 (Contributor) commented Jun 5, 2024

I'm interested in working on an OpenCL back-end, but I would rewrite it rather than maintain the CLBlast back-end that was recently removed.

@JohannesGaessler (Collaborator)

If you want an explanation of the code and want to talk, feel free to contact me via email.

@shibe2 (Contributor) commented Jun 5, 2024

I worked on the CLBlast back-end in the past, and I'm familiar with the code. My reason to go for a rewrite would be different optimization targets, mostly devices where the CPU and GPU use the same physical RAM. Unfortunately, I currently don't have any suitable device, so I'm not working on it.

@0cc4m (Collaborator) commented Jun 5, 2024

> I worked on the CLBlast back-end in the past, and I'm familiar with the code. My reason to go for a rewrite would be different optimization targets, mostly devices where the CPU and GPU use the same physical RAM. Unfortunately, I currently don't have any suitable device, so I'm not working on it.

That is also something that could be done in the Vulkan backend. I've added some rudimentary support for Unified Memory Architecture, but haven't had the time to look further into it.

@0wwafa commented Jun 6, 2024

Even an old laptop or old GPU could have some value with this enabled. The CUDA libraries are HUGE, and in the end I did not notice any improvement in speed. With CLBlast, at least the GPU is used.

@sorasoras

I thought they were removing CLBlast support.

@Zibri (Author) commented Jun 6, 2024

> I thought they were removing CLBlast support.

They are, unless they find a maintainer.

@shibe2 (Contributor) commented Jun 6, 2024

It was already removed in 554c247, so to resolve this issue positively, that back-end would need to be added back.

@MaggotHATE (Contributor) commented Jun 8, 2024

Keep in mind that the Vulkan backend still has memory allocation issues that affect RAM usage. This can prevent you from using 7B models at higher quants, and larger models, on 16 GB of RAM, for example. Simply removing CLBlast before this issue is fixed was a bad idea.

Considering the small footprint of the CLBlast implementation and the relative unpopularity of MoE, it would be better to revert the change.

@Abhishek8394

Just want to add that if CLBlast returns, I'd be happy to volunteer as much of my time as I can.

@shibe2 (Contributor) commented Jun 8, 2024

What you can do right now is make a fork with the CLBlast back-end restored and maintain it.

One way to do it is to revert commit 554c247 and, in the future, rebase your branch on upstream. For example, I did that here: https://github.com/shibe2/llama.cpp/tree/clblast. I restored only the files that contain functional code, to avoid dealing with additional conflicts when the documentation changes. By the way, anyone who wants to keep using the CLBlast back-end can use the patch from my example fork as long as it applies cleanly.

Another way is to fork at 0cd6bd3, the last commit before the removal of CLBlast, and cherry-pick commits from upstream, excluding the one that removes CLBlast.
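
For anyone who wants to try this, a rough sketch of the two approaches; the fork URL, remote name, and commit list are placeholders:

# Approach 1: revert the removal commit, then rebase on upstream as it moves
git clone <your-fork-url> llama.cpp && cd llama.cpp
git remote add upstream <upstream llama.cpp URL>
git fetch upstream
git revert 554c247              # restores the CLBlast back-end files
git rebase upstream/master      # repeat periodically

# Approach 2: branch off the last commit before the removal and cherry-pick
git checkout -b clblast 0cd6bd3
git cherry-pick <newer upstream commits, skipping 554c247>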

@MaggotHATE (Contributor)

@shibe2 Thank you! I was wondering if the CLBlast removal was somehow necessary for the RoPE refactoring, but it looks like it wasn't. Now I really don't see a reason for 554c247.

I also want to point out that all this time, while unmaintained, CLBlast just worked, apart from MoE, of course. However, most new models are not MoE.

@JohannesGaessler (Collaborator)

> I also want to point out that all this time, while unmaintained, CLBlast just worked, apart from MoE, of course. However, most new models are not MoE.

There are issues even with non-MoE models.

@shibe2 (Contributor) commented Jun 8, 2024

I haven't used MoE models, so I haven't looked into it. What other models had problems with the CLBlast back-end?

@JohannesGaessler (Collaborator) commented Jun 8, 2024

There is this issue #7661 regarding LLaMA 2. Even disregarding concrete issues with incorrect results, a problem (for developers) is that, unlike CUDA/Vulkan/Metal, OpenCL does not fit the "backend" architecture that was developed at some point. At the time of removal it was in essence just crudely grafted onto the CPU backend, which causes additional work (it was like that because that is how GPU backends started, and no one has been maintaining OpenCL).

@MaggotHATE (Contributor)

> There is this issue #7661 regarding LLaMA 2. Even disregarding concrete issues with incorrect results, a problem (for developers) is that, unlike CUDA/Vulkan/Metal, OpenCL does not fit the "backend" architecture that was developed at some point. At the time of removal it was in essence just crudely grafted onto the CPU backend, which causes additional work (it was like that because that is how GPU backends started, and no one has been maintaining OpenCL).

This looks more like an ARM-specific corner case. At the same time, CLBlast works correctly with Llama 2 and later models on x86-64. I still don't see a reason to just remove something that works in most cases and doesn't require maintenance to do so.

Additionally, if the main reason is that nobody is maintaining CLBlast, please consider that Vulkan is maintained by essentially one person.

@JohannesGaessler (Collaborator)

> I still don't see a reason to just remove something that works in most cases and doesn't require maintenance to do so.

I don't either, but I do see a reason to remove something that does require maintenance effort when no dev is willing to put it in.

> Additionally, if the main reason is that nobody is maintaining CLBlast, please consider that Vulkan is maintained by essentially one person.

That's the state for essentially the entire project. If any of the core devs were to be run over by a bus, there would be no one to replace them.

@shibe2 (Contributor) commented Jun 8, 2024

> There is this issue #7661 regarding LLaMA 2.

Regarding it producing incorrect results, it may be a local issue, not necessarily a problem in the llama.cpp code. We don't know, because the back-end was nuked instead of the problem being investigated.

Regarding CLBlast being slower than the CPU, that is expected in some cases where RAM bandwidth is the performance bottleneck. CLBlast needs unquantized data, so model parameters are dequantized and stored in RAM before multiplication. Unless that data fully fits in cache, this adds memory writes and reads compared to doing the same computation on the CPU. That's why I would write an OpenCL back-end in a different way rather than keep relying on CLBlast.
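
As a rough back-of-the-envelope illustration (assuming Q4_0 weights at about 4.5 bits, i.e. ~0.56 bytes, per weight and dequantization to fp32; the exact numbers depend on the quant type): a fused quantized kernel streams roughly 0.56 bytes per weight from RAM, whereas the dequantize-then-GEMM path reads those 0.56 bytes, writes about 4 bytes of fp32 per weight, and reads them back for the multiplication, roughly 8 to 9 bytes per weight in total, so on the order of 15x more memory traffic whenever the dequantized matrix doesn't fit in cache.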

@Haus1 commented Jun 8, 2024

> That is also something that could be done in the Vulkan backend. I've added some rudimentary support for Unified Memory Architecture, but haven't had the time to look further into it.

Unfortunately, AMD pulled the old bait-and-switch and nerfed this functionality for a large portion of their GPU lineup. Last I checked, the Steam survey was reporting that the vast majority of systems with AMD GPUs fell into this camp.

> I found an up to 4000% decrease in performance just by switching from sycl::malloc_device() to sycl::malloc_shared() - even if all I do is repeatedly resubmit the same SYCL kernel
>
> most AMD discrete GPUs either disable the XNACK feature by default, or don't support it outright, even though the silicon theoretically appears to have had this capability since GFX8

https://stackoverflow.com/questions/76700305/4000-performance-decrease-in-sycl-when-using-unified-shared-memory-instead-of-d

While CLBlast wasn't the most performant, it was very robust when it came to squeezing every last byte into VRAM; it handled overallocation seamlessly.
Vulkan is much more unforgiving and will crash my X server, requiring a forceful reload of the amdgpu driver or a reboot.
HIP is even worse: overallocating will often require a hard reboot, as in fully powering down the system and letting the caps discharge.

@shibe2 (Contributor) commented Jun 8, 2024

Interesting. I was not using the hipBLAS back-end because at that time it was wasting VRAM, but now it seems less memory-hungry.

As for power cycling being required, it may be because of a hardware or firmware problem. I used to have something like that with an AMD GPU.

@Zibri (Author) commented Jun 10, 2024

I think the main feature of llama.cpp is its efficiency, and CLBlast just added to it.
I tried other solutions like ollama, but none worked well. llama.cpp is still the best there is.
Please re-add CLBlast.

@0wwafa commented Jun 12, 2024

> I think the main feature of llama.cpp is its efficiency, and CLBlast just added to it. I tried other solutions like ollama, but none worked well. llama.cpp is still the best there is. Please re-add CLBlast.

I completely agree, @Zibri

@0cc4m (Collaborator) commented Jun 12, 2024

> I think the main feature of llama.cpp is its efficiency, and CLBlast just added to it. I tried other solutions like ollama, but none worked well. llama.cpp is still the best there is. Please re-add CLBlast.

This is an open source project. No one here is opposed to having the backend, but someone has to put in the work, keep it up to date, and fix bugs. No one has worked on the OpenCL backend for a long time now; that's why it was removed. If you want to take over maintaining it, you're welcome to.

@MaggotHATE (Contributor)

> No one has worked on the OpenCL backend for a long time now; that's why it was removed.

If it were completely broken by this point, sure, but it's not. In fact, with CLBlast removed there's no other "universal" backend to test Vulkan against. AFAIK all the other backends are platform-specific.

@shibe2 (Contributor) commented Jun 12, 2024

You can add it back for yourself while it is not broken. When it breaks, someone will need to update it.

@MaggotHATE (Contributor)

> You can add it back for yourself while it is not broken.

Which will create questions about the implementation and doubts like "it's not a part of llama.cpp anymore."

All I'm saying is, removing CLBlast gained llama.cpp nothing. This backend, while partially broken, wasn't preventing anything else from working and was sitting in its own place; and yes, it's easy to add back for that exact reason.

@JohannesGaessler (Collaborator)

> All I'm saying is, removing CLBlast gained llama.cpp nothing.

Removing the backend resulted in a reduction in maintenance effort for developers. It also removed a footgun for users, which in turn reduces the number of bug reports and the amount of work developers have to put into sifting through GitHub issues.

@MaggotHATE (Contributor)

> in maintenance effort for developers

...but it wasn't maintained?

> removed a footgun for users

A warning during compilation would be enough, to my mind, and issues could be automatically ignored in that case.

@JohannesGaessler (Collaborator)

> ...but it wasn't maintained?

You still need to invest some effort in order to not completely break it. As I previously said, the OpenCL code was crudely grafted to the CPU code.

@MaggotHATE (Contributor)

> You still need to invest some effort in order to not completely break it.

OK, so it was preventing something from working (even if only in the future)? That changes the perspective.

Still, it was removed before Vulkan was on par with it memory-wise. As I mentioned before, this will cause issues in the future.

@ccbadd commented Jun 12, 2024

Don't you still have Kompute to test Vulkan against?

@MaggotHATE (Contributor)

> Kompute

I'm not even sure it's still maintained; I barely see any mention of it in PRs and commits since it was added. I remember it lacking partial offloading, and if that's still the case, it's pointless.

@0wwafa commented Jun 13, 2024

All I know is that it was working, and now it's not :(

@Abhishek8394

For anyone still looking for OpenCL, I will try to maintain a fork at https://github.com/Abhishek8394/llama.cpp under the branch opencl. It is up to date with llama.cpp as of now, i.e. commit 7b2f4a7.

Anyone who wants to help maintain this is welcome to. Also, if anyone can contribute automated builds of just the OpenCL version, that would be awesome.
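
For anyone trying it, a rough sketch of checking out and building that branch, reusing the CLBlast flags from the MSYS2 example earlier in this thread; treat the exact options as assumptions, since the fork may change them:

git clone -b opencl https://github.com/Abhishek8394/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_CLBLAST=ON -DLLAMA_NATIVE=ON -DBUILD_SHARED_LIBS=OFF
cmake --build build -j $(nproc)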

@MaggotHATE (Contributor)

@Abhishek8394 please note that the ggml_backend_buffer_type_i struct in ggml-backend-impl.h was changed (it no longer needs to check whether a backend is fully supported or not), and CLBlast will not compile unmodified. I've made a simple change for my own program, and it seems to work so far.

@0wwafa commented Jun 18, 2024

> @Abhishek8394 please note that the ggml_backend_buffer_type_i struct in ggml-backend-impl.h was changed (it no longer needs to check whether a backend is fully supported or not), and CLBlast will not compile unmodified. I've made a simple change for my own program, and it seems to work so far.

I sincerely hope he will reconsider and add it back.

github-actions bot added the stale label on Jul 19, 2024

github-actions bot commented Aug 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Aug 2, 2024
@Zibri (Author) commented Aug 2, 2024

@ggerganov kobold.cpp still supports it. Why can't llama.cpp?

@0cc4m (Collaborator) commented Aug 3, 2024

> kobold.cpp still supports it. Why can't llama.cpp?

Will you maintain it?

@0wwafa commented Aug 3, 2024

> > kobold.cpp still supports it. Why can't llama.cpp?
>
> Will you maintain it?

The kobold.cpp people are using it... just get it from them, no?

@0cc4m (Collaborator) commented Aug 4, 2024

> > > kobold.cpp still supports it. Why can't llama.cpp?
> >
> > Will you maintain it?
>
> The kobold.cpp people are using it... just get it from them, no?

You can use koboldcpp if you want to use that. The thing is that @LostRuins is putting in the minimal amount of work to keep the CLBlast version running, but he's not actually maintaining the backend. If a feature doesn't work with CLBlast, it just gets disabled in that version of koboldcpp instead of being implemented in the CLBlast backend. That is not how llama.cpp operates.

If you want to reintroduce it to llama.cpp, find someone to bring it up to date with the backend system in llama.cpp and to improve and maintain it over time.
