Adjust prompt processing thread count separately from n_threads #2534

Closed
wants to merge 19 commits

Conversation

@netrunnereve (Collaborator) commented Aug 6, 2023

There are advantages to having a different thread count for prompt processing vs. inference, as the former is limited by CPU speed and thread scaling while the latter is limited by memory bandwidth. If you use GPU BLAS for prompt processing, a lower thread count may also be desired. Here's one of my examples where 8-thread prompt processing is faster than 4-thread prompt processing, while inference speed remains the same regardless of thread count. I can't find those posts at the moment, but I've heard from other people running 20+ core servers that prompt processing scales decently well.

Code-wise this is a simple change, but it is a breaking change in that it modifies llama_eval so that we can pass in pp_threads. I'm leaving this as a draft for now to get feedback, especially from people with big CPU-only servers that may benefit from this.

Note that for now I only added support for main; perplexity and friends currently have pp_threads set to n_threads.

usage: ./main [options]
...
  -t N, --threads N     number of threads to use during computation (default: 4)
  -ppt N, --pp-threads N
                        number of threads to use during prompt processing (default: 4)

Resolves #2498.
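
For reference, here's a rough sketch of the kind of llama_eval change this implies (illustrative only - the parameter name follows this PR, but this is not the exact diff):

    // Sketch: llama_eval gains a separate prompt-processing thread count.
    // n_threads keeps controlling single-token generation, while pp_threads
    // controls batch (prompt) evaluation.
    int llama_eval(
            struct llama_context * ctx,
            const llama_token    * tokens,
            int                    n_tokens,
            int                    n_past,
            int                    n_threads,
            int                    pp_threads);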

@netrunnereve (Collaborator, Author)

I ran this on an 8vCPU instance (on a barely loaded 12 core/24 thread host) and inference speed plateaus at 5 threads, while prompt processing keeps improving as I increment the thread count all the way up to 8 threads. Unfortunately I don't have anything better to try this out on 😢

@ggerganov (Owner)

On 16-core Ryzen 9 5950X the prompt processing speed improves up to 16 threads.

netrunnereve marked this pull request as ready for review August 9, 2023 02:30
@netrunnereve (Collaborator, Author) commented Aug 9, 2023

Ok, I think this is ready for release (just be mindful that this is an API-breaking change as it modifies llama_eval). I don't plan on having the simple example support pp_threads, for simplicity's sake. embd-input doesn't support it either and relies only on n_threads, as I'm not set up to run these vision models and can't test the code.

@slaren (Collaborator) commented Aug 12, 2023

With a 13900k (8+16 cores), prompt processing scales all the way up to 32 threads, while for generation 8 threads is best.

ggerganov added the performance (Speed related topics) and high priority (Very important issue) labels Aug 13, 2023
ggerganov self-requested a review August 13, 2023 09:03
@klosax (Contributor) commented Aug 15, 2023

Maybe add a fix to the prompt processing thread count when using OpenBLAS?
ref: ggerganov/ggml#452

@netrunnereve (Collaborator, Author) commented Aug 15, 2023

@klosax I don't really see a point in specifically adding a control for OpenBLAS when the default implementation already matches it performance-wise. IMO we should focus our efforts on our homegrown matmul code that works directly on the quantized weights.

Also, OpenBLAS isn't the only supported multithreaded BLAS implementation; pp_threads would also need to control BLIS etc. if we want to go that route.

@klosax (Contributor) commented Aug 15, 2023

@klosax I don't really see a point in specifically adding a control for OpenBLAS when the default implementation already matches it performance-wise.

I didn't know. In that case we could drop support for OpenBLAS, I guess.

@slaren (Collaborator) commented Aug 18, 2023

A few more details with the 13900k. During generation with 8 threads the CPU pulls ~170W, and with 32 threads ~250W, despite being slower. So there is a significant advantage, both in performance and power usage, in being able to use a different number of threads for generation and prompt processing.

However, I am not sure that changing the interface of llama_eval is a good idea. The same can be achieved by the application by passing a different number of threads to llama_eval depending on n_tokens, without a breaking API change. Maybe we could update main and other examples to do this instead?

In the long term, I think that it would be better to move the number of threads to llama_context_params or similar, maybe with a function to change the value after initialization if needed. But that should be done at a later time, ideally together with other breaking API changes.
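
To make that concrete, something like this could work (just a sketch - the field and function names here are illustrative, not a committed design):

    // Illustrative sketch only - not the current llama.cpp API.
    struct llama_context_params {
        // ... existing fields ...
        int n_threads;        // threads used for single-token generation
        int n_threads_batch;  // threads used for prompt / batch processing
    };

    // optionally, a way to adjust the values after the context is created:
    void llama_set_n_threads(struct llama_context * ctx, int n_threads, int n_threads_batch);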

./llama-bench -t 4,8,16,24,32 -r 2

| model                | backend | n_threads | test   |          t/s |
| -------------------- | ------- | --------: | ------ | -----------: |
| LLaMA 7B mostly Q4_0 | CPU     |         4 | pp 512 | 22.14 ± 1.31 |
| LLaMA 7B mostly Q4_0 | CPU     |         8 | pp 512 | 41.02 ± 0.30 |
| LLaMA 7B mostly Q4_0 | CPU     |        16 | pp 512 | 32.85 ± 0.48 |
| LLaMA 7B mostly Q4_0 | CPU     |        24 | pp 512 | 47.05 ± 0.14 |
| LLaMA 7B mostly Q4_0 | CPU     |        32 | pp 512 | 60.16 ± 0.34 |
| LLaMA 7B mostly Q4_0 | CPU     |         4 | tg 128 | 12.85 ± 0.15 |
| LLaMA 7B mostly Q4_0 | CPU     |         8 | tg 128 | 18.43 ± 0.40 |
| LLaMA 7B mostly Q4_0 | CPU     |        16 | tg 128 | 15.48 ± 0.07 |
| LLaMA 7B mostly Q4_0 | CPU     |        24 | tg 128 | 16.91 ± 0.03 |
| LLaMA 7B mostly Q4_0 | CPU     |        32 | tg 128 | 17.02 ± 0.20 |

build: 1f0bccb (1007)

@netrunnereve (Collaborator, Author)

However, I am not sure that changing the interface of llama_eval is a good idea. The same can be achieved by the application by passing a different number of threads to llama_eval depending on n_tokens, without a breaking API change. Maybe we could update main and other examples to do this instead?

In the long term, I think that it would be better to move the number of threads to llama_context_params or similar, maybe with a function to change the value after initialization if needed. But that should be done at a later time, ideally together with other breaking API changes.

Yeah that's definitely an option here. My goal isn't necessarily to break the API 😁, I just want to expose the pp_threads parameter to the user.

./llama-bench -t 4,8,16,24,32 -r 2

Wow, that's a pretty significant 50% improvement with 32-thread prompt processing compared to 8 threads, which gives the fastest inference speed.

@kiratp commented Aug 19, 2023

One more data point in support of separate thread counts.

This is on a GCP t2d-standard-32 instance: 32 Milan cores with SMT/HT turned off, so 1 vCPU = 1 physical core - https://cloud.google.com/compute/docs/general-purpose-machines#t2d_machines

root@ml-perf-testing:/usr/src/app/llama.cpp# ./llama-bench --model /usr/src/models/<llama2 7B merged lora>.ggml.q4_k_m.bin -t 2,8,16,22,28,30,31,32
| model                          | backend    |  n_threads | test       |             t/s |
| ------------------------------ | ---------- | ---------: | ---------- | --------------: |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |          2 | pp 512     |    34.75 ± 1.77 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |          8 | pp 512     |    34.44 ± 1.71 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         16 | pp 512     |    34.57 ± 1.11 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         22 | pp 512     |    33.10 ± 0.98 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         28 | pp 512     |    29.34 ± 5.15 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         30 | pp 512     |    23.68 ± 1.64 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         31 | pp 512     |    25.66 ± 2.93 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         32 | pp 512     |    32.49 ± 2.29 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |          2 | tg 128     |     4.43 ± 1.70 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |          8 | tg 128     |    17.72 ± 1.97 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         16 | tg 128     |    21.51 ± 0.52 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         22 | tg 128     |    21.18 ± 1.47 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         28 | tg 128     |    19.85 ± 0.79 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         30 | tg 128     |    17.71 ± 3.40 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         31 | tg 128     |    21.59 ± 1.88 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         32 | tg 128     |    17.96 ± 4.88 |

build: 1f0bccb (1007)

This line from llama.cpp seems to explain the pp speed curve:

    // for big prompts, if BLAS is enabled, it is better to use only one thread
    // otherwise, the threads are spin-lock waiting for the BLAS calls and are degrading the performance
    n_threads = N >= 32 && ggml_cpu_has_blas() && !ggml_cpu_has_gpublas() ? 1 : n_threads;

@ggerganov (Owner)

However, I am not sure that changing the interface of llama_eval is a good idea. The same can be achieved by the application by passing a different number of threads to llama_eval depending on n_tokens, without a breaking API change. Maybe we could update main and other examples to do this instead?

I agree that it would be better to move the logic for the number of threads to the application.

ggerganov removed the high priority (Very important issue) label Aug 22, 2023
@netrunnereve (Collaborator, Author)

@ggerganov I created an example with main only (8209b5d) where I basically reverted the API change and adjusted the thread count using something like this:

int eval_thr = n_eval > 1 ? params.pp_threads : params.n_threads;
if (llama_eval(ctx_guidance, input_buf + i, n_eval, n_past_guidance, eval_thr)) {

Alternatively, we could have another function in llama.cpp like int llama_eval_ppt(struct llama_context * ctx, const llama_token * tokens, int n_tokens, int n_past, int n_threads, int pp_threads), and apps could choose whether to call that or llama_eval. Let me know what you prefer and I'll update the rest and bring in GGUF.
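
For illustration, the non-breaking variant could be a thin wrapper that picks the thread count and delegates to the existing llama_eval (sketch only, not merged code):

    // Sketch: treat multi-token batches as prompt processing and reuse llama_eval,
    // so the current API stays untouched.
    int llama_eval_ppt(
            struct llama_context * ctx,
            const llama_token    * tokens,
            int                    n_tokens,
            int                    n_past,
            int                    n_threads,
            int                    pp_threads) {
        const int n_thr = n_tokens > 1 ? pp_threads : n_threads;
        return llama_eval(ctx, tokens, n_tokens, n_past, n_thr);
    }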

@ggerganov (Owner)

Thanks. This is still not great - sorry for the extra work.

We should do what @slaren suggested and move this into llama_context_params.
We should also introduce llama_model_params as suggested here: #2620 (comment)

I'm not worried about the API breaking. If anyone wants to give it a try, they are welcome to; otherwise I'll try to implement this soon.

netrunnereve closed this by deleting the head repository Sep 20, 2023
@netrunnereve (Collaborator, Author)

Welp it looks like I closed this and threw out the fork as well. I'll keep working on this after llama_model_params support is added.

@slaren (Collaborator) commented Sep 21, 2023

I am adding this change in #3301 to avoid having two API changes shortly after each other.
