llama : update logic for number of threads when using BLAS
ggerganov committed Sep 5, 2023
1 parent 9217721 commit 35938ee
Showing 1 changed file with 6 additions and 1 deletion.
llama.cpp
@@ -2942,7 +2942,12 @@ static bool llama_eval_internal(
 
     // for big prompts, if BLAS is enabled, it is better to use only one thread
     // otherwise, the threads are spin-lock waiting for the BLAS calls and are degrading the performance
-    n_threads = N >= 32 && ggml_cpu_has_blas() && !ggml_cpu_has_gpublas() ? 1 : n_threads;
+    // TODO: this is mostly important for Apple Silicon where CBLAS is still performing very well
+    //       we still need some threads to process all non-mul_mat ops, but not too much to avoid interfering
+    //       with the BLAS calls. need a better solution
+    if (N >= 32 && ggml_cpu_has_blas() && !ggml_cpu_has_gpublas()) {
+        n_threads = std::min(4, n_threads);
+    }
 
     struct ggml_tensor * res = gf->nodes[gf->n_nodes - 1];
     struct ggml_tensor * embeddings = gf->nodes[gf->n_nodes - 2];
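
To make the change concrete, here is a small standalone sketch (illustrative values only; `has_blas` and `has_gpublas` are stand-ins for the `ggml_cpu_has_blas()` / `ggml_cpu_has_gpublas()` checks). With 8 threads requested and a 512-token prompt on a CPU BLAS build, the old code dropped to 1 thread, while the new code keeps min(4, 8) = 4 threads:

    #include <algorithm>
    #include <cstdio>

    int main() {
        const int  N           = 512;   // prompt (batch) size; >= 32 triggers the clamp
        const bool has_blas    = true;  // stand-in for ggml_cpu_has_blas()
        const bool has_gpublas = false; // stand-in for ggml_cpu_has_gpublas()

        const int n_threads = 8;

        // old behavior: force a single thread while BLAS handles the big mat-muls
        const int old_threads = (N >= 32 && has_blas && !has_gpublas) ? 1 : n_threads;

        // new behavior: keep up to 4 threads for the non-mul_mat ops
        int new_threads = n_threads;
        if (N >= 32 && has_blas && !has_gpublas) {
            new_threads = std::min(4, new_threads);
        }

        printf("old: %d, new: %d\n", old_threads, new_threads); // prints "old: 1, new: 4"
    }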

2 comments on commit 35938ee

@LostRuins (Collaborator) commented on 35938ee, Oct 2, 2023

Hi @ggerganov and @slaren, I noticed this change limits the thread count to 4 when running with OpenBLAS. Before this, when OpenBLAS was active, the thread count was always set to 1 instead:

> for big prompts, if BLAS is enabled, it is better to use only one thread
> otherwise, the threads are spin-lock waiting for the BLAS calls and are degrading the performance

Just wanted to check whether this change was intentional for OpenBLAS, since it seems to be targeted at macOS/Accelerate users. Would I still be better off clamping the thread count to 1 instead of 4 for OpenBLAS?

Thanks!

@ggerganov (Owner, Author) commented:

@LostRuins Earlier in the function we do:

    int n_threads = n_tokens == 1 ? cparams.n_threads : cparams.n_threads_batch;

So if you pass:

n_threads == 8
n_threads_batch == 1

Then during prompt processing it will start just 1 thread, while during text generation it will start 8 threads.
This should give you the best performance with the current thread implementation in ggml.
This is mostly relevant for macOS/Accelerate.
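
For reference, a minimal sketch of wiring this up from the caller side (hedged: function and field names follow llama.h from around the time of this commit, and "model.gguf" is a placeholder path, so adjust to your setup):

    #include "llama.h"

    int main() {
        llama_model_params mparams = llama_model_default_params();
        llama_model * model = llama_load_model_from_file("model.gguf", mparams);

        llama_context_params cparams = llama_context_default_params();
        cparams.n_threads       = 8; // used when n_tokens == 1 (text generation)
        cparams.n_threads_batch = 1; // used for batches (prompt processing);
                                     // avoids threads spin-waiting on BLAS
        llama_context * ctx = llama_new_context_with_model(model, cparams);

        // ... tokenize, decode, and sample as usual ...

        llama_free(ctx);
        llama_free_model(model);
        return 0;
    }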

Regarding OpenBLAS: I haven't run tests recently, but I believe that with quantized models you are generally better off not enabling OpenBLAS at all.
For F16 models there may still be some benefit from OpenBLAS, though probably not a significant one.
I would recommend running some tests with and without OpenBLAS to see whether there is any reason to use it.
