Adjust prompt processing thread count separately from n_threads #2534

Closed
wants to merge 19 commits

Conversation

@netrunnereve (Collaborator) commented Aug 6, 2023

There are advantages to having a different thread count for prompt processing vs. inference, as the former is limited by CPU speed and thread scaling while the latter is limited by memory bandwidth. If you use GPU BLAS for prompt processing, a lower thread count may also be desired. Here's one of my examples where 8-thread prompt processing is faster than 4-thread prompt processing, while inference speed remains the same regardless of thread count. I can't find those posts at the moment, but I've heard from other people running 20+ core servers that prompt processing scales decently well.

Code-wise this is a simple change, but it is a breaking change in that it modifies llama_eval so that we can pass in pp_threads. I'm leaving this as a draft for now to get feedback, especially from people with big CPU-only servers that may benefit from this.

Note that for now I only added support for main; perplexity and friends currently have pp_threads set to n_threads.

usage: ./main [options]
...
  -t N, --threads N     number of threads to use during computation (default: 4)
  -ppt N, --pp-threads N
                        number of threads to use during prompt processing (default: 4)

Resolves #2498.
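
For reference, here's a rough sketch of the kind of llama_eval change this implies (illustrative only - the parameter name follows this PR, but this is not the exact diff):

    // Sketch: llama_eval gains a separate prompt-processing thread count.
    // n_threads keeps controlling single-token generation, while pp_threads
    // controls batch (prompt) evaluation.
    int llama_eval(
            struct llama_context * ctx,
            const llama_token    * tokens,
            int                    n_tokens,
            int                    n_past,
            int                    n_threads,
            int                    pp_threads);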

@netrunnereve (Collaborator, Author)

I ran this on an 8vCPU instance (on a barely loaded 12 core/24 thread host) and inference speed plateaus at 5 threads, while prompt processing keeps improving as I increment the thread count all the way up to 8 threads. Unfortunately I don't have anything better to try this out on 😢

@ggerganov (Owner)

On 16-core Ryzen 9 5950X the prompt processing speed improves up to 16 threads.

netrunnereve marked this pull request as ready for review August 9, 2023 02:30
@netrunnereve (Collaborator, Author) commented Aug 9, 2023

Ok, I think this is ready for release (just be mindful that this is an API-breaking change as it modifies llama_eval). I don't plan on having the simple example support pp_threads, for simplicity's sake. embd-input doesn't support it either and relies only on n_threads, as I'm not set up to run these vision models and can't test the code.

@slaren (Collaborator) commented Aug 12, 2023

With a 13900k (8+16 cores), prompt processing scales all the way up to 32 threads, while for generation 8 threads is best.

ggerganov added the performance (Speed related topics) and high priority (Very important issue) labels Aug 13, 2023
ggerganov self-requested a review August 13, 2023 09:03
@klosax (Contributor) commented Aug 15, 2023

Maybe add a fix to the prompt processing thread count when using OpenBLAS?
ref: ggerganov/ggml#452

@netrunnereve (Collaborator, Author) commented Aug 15, 2023

@klosax I don't really see a point in specifically adding a control for OpenBLAS when the default implementation already matches it performance-wise. IMO we should focus our efforts on our homegrown matmul code that works directly on the quantized weights.

Also, OpenBLAS isn't the only supported multithreaded BLAS implementation; pp_threads would also need to control BLIS etc. if we want to go that route.

@klosax (Contributor) commented Aug 15, 2023

@klosax I don't really see a point in specifically adding a control for OpenBLAS when the default implementation already matches it performance-wise.

I didn't know. In that case we could drop support for OpenBLAS, I guess.

@slaren (Collaborator) commented Aug 18, 2023

A few more details with the 13900k. During generation with 8 threads the CPU pulls ~170W, and with 32 threads ~250W, despite being slower. So there is a significant advantage, both in performance and power usage, in being able to use a different number of threads for generation and prompt processing.

However, I am not sure that changing the interface of llama_eval is a good idea. The same can be achieved by the application by passing a different number of threads to llama_eval depending on n_tokens, without a breaking API change. Maybe we could update main and other examples to do this instead?

In the long term, I think that it would be better to move the number of threads to llama_context_params or similar, maybe with a function to change the value after initialization if needed. But that should be done at a later time, ideally together with other breaking API changes.
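
To make that concrete, something like this could work (just a sketch - the field and function names here are illustrative, not a committed design):

    // Illustrative sketch only - not the current llama.cpp API.
    struct llama_context_params {
        // ... existing fields ...
        int n_threads;        // threads used for single-token generation
        int n_threads_batch;  // threads used for prompt / batch processing
    };

    // optionally, a way to adjust the values after the context is created:
    void llama_set_n_threads(struct llama_context * ctx, int n_threads, int n_threads_batch);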

./llama-bench -t 4,8,16,24,32 -r 2

| model                | backend | n_threads | test   |          t/s |
| -------------------- | ------- | --------: | ------ | -----------: |
| LLaMA 7B mostly Q4_0 | CPU     |         4 | pp 512 | 22.14 ± 1.31 |
| LLaMA 7B mostly Q4_0 | CPU     |         8 | pp 512 | 41.02 ± 0.30 |
| LLaMA 7B mostly Q4_0 | CPU     |        16 | pp 512 | 32.85 ± 0.48 |
| LLaMA 7B mostly Q4_0 | CPU     |        24 | pp 512 | 47.05 ± 0.14 |
| LLaMA 7B mostly Q4_0 | CPU     |        32 | pp 512 | 60.16 ± 0.34 |
| LLaMA 7B mostly Q4_0 | CPU     |         4 | tg 128 | 12.85 ± 0.15 |
| LLaMA 7B mostly Q4_0 | CPU     |         8 | tg 128 | 18.43 ± 0.40 |
| LLaMA 7B mostly Q4_0 | CPU     |        16 | tg 128 | 15.48 ± 0.07 |
| LLaMA 7B mostly Q4_0 | CPU     |        24 | tg 128 | 16.91 ± 0.03 |
| LLaMA 7B mostly Q4_0 | CPU     |        32 | tg 128 | 17.02 ± 0.20 |

build: 1f0bccb (1007)

@netrunnereve (Collaborator, Author)

However, I am not sure that changing the interface of llama_eval is a good idea. The same can be achieved by the application by passing a different number of threads to llama_eval depending on n_tokens, without a breaking API change. Maybe we could update main and other examples to do this instead?

In the long term, I think that it would be better to move the number of threads to llama_context_params or similar, maybe with a function to change the value after initialization if needed. But that should be done at a later time, ideally together with other breaking API changes.

Yeah that's definitely an option here. My goal isn't necessarily to break the API 😁, I just want to expose the pp_threads parameter to the user.

./llama-bench -t 4,8,16,24,32 -r 2

Wow, that's a pretty significant 50% improvement with 32-thread prompt processing compared to 8 threads, which gives the fastest inference speed.

@kiratp commented Aug 19, 2023

One more data point in support of separate thread counts.

This is on a GCP t2d-standard-32 instance: 32 Milan cores with SMT/HT turned off, so 1 vCPU = 1 physical core - https://cloud.google.com/compute/docs/general-purpose-machines#t2d_machines

root@ml-perf-testing:/usr/src/app/llama.cpp# ./llama-bench --model /usr/src/models/<llama2 7B merged lora>.ggml.q4_k_m.bin -t 2,8,16,22,28,30,31,32
| model                          | backend    |  n_threads | test       |             t/s |
| ------------------------------ | ---------- | ---------: | ---------- | --------------: |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |          2 | pp 512     |    34.75 ± 1.77 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |          8 | pp 512     |    34.44 ± 1.71 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         16 | pp 512     |    34.57 ± 1.11 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         22 | pp 512     |    33.10 ± 0.98 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         28 | pp 512     |    29.34 ± 5.15 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         30 | pp 512     |    23.68 ± 1.64 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         31 | pp 512     |    25.66 ± 2.93 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         32 | pp 512     |    32.49 ± 2.29 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |          2 | tg 128     |     4.43 ± 1.70 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |          8 | tg 128     |    17.72 ± 1.97 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         16 | tg 128     |    21.51 ± 0.52 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         22 | tg 128     |    21.18 ± 1.47 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         28 | tg 128     |    19.85 ± 0.79 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         30 | tg 128     |    17.71 ± 3.40 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         31 | tg 128     |    21.59 ± 1.88 |
| LLaMA 7B mostly Q4_K - Medium  | BLAS       |         32 | tg 128     |    17.96 ± 4.88 |

build: 1f0bccb (1007)

This line from llama.cpp seems to explain the pp speed curve:

    // for big prompts, if BLAS is enabled, it is better to use only one thread
    // otherwise, the threads are spin-lock waiting for the BLAS calls and are degrading the performance
    n_threads = N >= 32 && ggml_cpu_has_blas() && !ggml_cpu_has_gpublas() ? 1 : n_threads;

@ggerganov (Owner)

However, I am not sure that changing the interface of llama_eval is a good idea. The same can be achieved by the application by passing a different number of threads to llama_eval depending on n_tokens, without a breaking API change. Maybe we could update main and other examples to do this instead?

I agree that it would be better to move the logic for the number of threads to the application.

ggerganov removed the high priority (Very important issue) label Aug 22, 2023
@netrunnereve (Collaborator, Author)

@ggerganov I created an example with main only (8209b5d) where I basically reverted the API change and adjusted the thread count using something like this:

int eval_thr = n_eval > 1 ? params.pp_threads : params.n_threads;
if (llama_eval(ctx_guidance, input_buf + i, n_eval, n_past_guidance, eval_thr)) {

Alternatively, we could have another function in llama.cpp like int llama_eval_ppt(struct llama_context * ctx, const llama_token * tokens, int n_tokens, int n_past, int n_threads, int pp_threads), and apps could choose whether to call that or llama_eval. Let me know what you prefer and I'll update the rest and bring in GGUF.
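
For illustration, the non-breaking variant could be a thin wrapper that picks the thread count and delegates to the existing llama_eval (sketch only, not merged code):

    // Sketch: treat multi-token batches as prompt processing and reuse llama_eval,
    // so the current API stays untouched.
    int llama_eval_ppt(
            struct llama_context * ctx,
            const llama_token    * tokens,
            int                    n_tokens,
            int                    n_past,
            int                    n_threads,
            int                    pp_threads) {
        const int n_thr = n_tokens > 1 ? pp_threads : n_threads;
        return llama_eval(ctx, tokens, n_tokens, n_past, n_thr);
    }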

@ggerganov (Owner)

Thanks. This is still not great - sorry for the extra work.

We should do what @slaren suggested and move this into llama_context_params.
We should also introduce llama_model_params as suggested here: #2620 (comment)

I'm not worried about the API breaking. If anyone wants to give it a try, they are welcome to; otherwise I'll try to implement this soon.

netrunnereve closed this by deleting the head repository Sep 20, 2023
@netrunnereve (Collaborator, Author)

Welp it looks like I closed this and threw out the fork as well. I'll keep working on this after llama_model_params support is added.

@slaren (Collaborator) commented Sep 21, 2023

I am adding this change in #3301 to avoid having two API changes shortly after each other.
