llama : update logic for number of threads when using BLAS
Showing 1 changed file with 6 additions and 1 deletion.
LostRuins commented on 35938ee:
Hi @ggerganov and @slaren, I noticed this commit seems to limit the thread count to 4 when running with OpenBLAS. Before this, when OpenBLAS was active, the thread count was always set to 1 instead.
I just wanted to check whether this change was intentional for OpenBLAS as well, since it seems to be targeted at macOS/Accelerate users. Would I still be better off clamping it to 1 instead of 4 for OpenBLAS?
Thanks!
Reply on 35938ee:
@LostRuins Earlier in the function we do:
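Presumably this refers to the BLAS thread clamp described above. A sketch, assuming it matches that description (the `ggml_cpu_has_*` capability checks are the real ggml API; the exact condition and constant are assumptions, not the verbatim snippet):

```cpp
#include <algorithm> // std::min
#include "ggml.h"    // ggml_cpu_has_blas(), ggml_cpu_has_gpublas()

// inside the eval function: N is the batch size, n_threads the requested count.
// For big prompts with CPU BLAS enabled, cap the thread count - extra threads
// would just spin-lock waiting for the BLAS calls and degrade performance.
if (N >= 32 && ggml_cpu_has_blas() && !ggml_cpu_has_gpublas()) {
    n_threads = std::min(4, n_threads);
}
```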
So if you pass `-t 8`, then during prompt processing it will start just 1 thread, while during text generation it will start 8 threads.
This should give you the best performance with the current implementation of the threads in `ggml`. This is mostly relevant for macOS/Accelerate.
Regarding OpenBLAS, I haven't done tests recently, but I believe with quantized models you are generally better off without enabling OpenBLAS at all.
For F16 models, there might still be some benefit to using OpenBLAS, but probably nothing significant.
So I would recommend running some tests with and without OpenBLAS and seeing whether there is any reason to use it.
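If it helps with those tests, ggml exposes compile-time capability queries, so you can confirm which BLAS backend a given binary was actually built with. A minimal sketch (the wrapper program is illustrative; `ggml_cpu_has_blas()` and `ggml_cpu_has_gpublas()` are the real ggml functions):

```cpp
#include <cstdio>
#include "ggml.h"

int main() {
    // each query returns 1 if the feature was compiled in, 0 otherwise
    std::printf("BLAS:     %d\n", ggml_cpu_has_blas());
    std::printf("GPU BLAS: %d\n", ggml_cpu_has_gpublas());
    return 0;
}
```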