Adjust prompt processing thread count separately from n_threads #2534
Conversation
I ran this on an 8-vCPU instance (on a barely loaded 12-core/24-thread host) and inference speed plateaus at 5 threads, while prompt processing keeps improving as I increment the thread count all the way up to 8 threads. Unfortunately I don't have anything better to try this out on 😢
On a 16-core Ryzen 9 5950X, the prompt processing speed improves up to 16 threads.
Ok, I think this is ready for release (just be mindful that this is an API-breaking change, as it modifies llama_eval).
With a 13900k (8+16 cores), prompt processing scales all the way up to 32 threads, while for generation 8 threads is best.
Maybe add a fix to the prompt processing thread count when using OpenBLAS?
@klosax I don't really see a point in specifically adding a control for OpenBLAS when the default implementation already matches it performance-wise. IMO we should focus our efforts on our homegrown matmul code that works directly on the quantized weights. Also, OpenBLAS isn't the only supported multithreaded BLAS implementation, and …
I didn't know. In that case we could drop support for OpenBLAS, I guess.
A few more details with the 13900k. During generation with 8 threads, the CPU pulls ~170W, and with 32 threads ~250W, despite being slower. So there is a significant advantage, both in performance and in power usage, in being able to use a different number of threads for generation and prompt processing. However, I am not sure that changing the interface of llama_eval is the best way to do this. In the long term, I think that it would be better to move the number of threads into the context parameters.
[llama-bench results table omitted; build: 1f0bccb (1007)]
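To make that long-term suggestion concrete, here is a minimal sketch of what context-level thread counts could look like. This is illustrative only; the field names below are assumptions modelled on the direction the API later took, not part of llama.cpp at the time of this PR.

```c
// Illustrative sketch only: thread counts live in the context parameters
// instead of being passed to every eval call. Field names are assumptions,
// not the llama.cpp API at the time of this PR.
struct example_context_params {
    int n_threads;        // threads used for single-token generation
    int n_threads_batch;  // threads used for batch/prompt processing
    // ... other context parameters ...
};
```

With this shape, each application picks its thread counts once when the context is created instead of threading them through every eval call.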
Yeah, that's definitely an option here. My goal isn't necessarily to break the API 😁, I just want to expose the pp_threads option.
Wow, that's a pretty significant 50% improvement with 32-thread prompt processing compared to 8 threads, which gives the fastest inference speed.
One more data point in support of separate thread counts. This is on a GCP t2d-standard-32 instance: 32 Milan cores with SMT/HT turned off, so 1 vCPU = 1 physical core (https://cloud.google.com/compute/docs/general-purpose-machines#t2d_machines).
This line from llama.cpp seems to explain the pp speed curve: llama.cpp line 1841 at commit 1f0bccb.
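For context, the thread-count heuristic that lived in that region of llama_eval_internal around this time looked roughly like the fragment below. It is reproduced from memory and may not be the exact line referenced above.

```c
// for big prompts, if BLAS is enabled, it is better to use only one thread
// otherwise, the threads are spin-lock waiting for the BLAS calls and are
// degrading the performance
n_threads = N >= 32 && ggml_cpu_has_blas() && !ggml_cpu_has_gpublas() ? 1 : n_threads;
```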
I agree that it would be better to move the logic for the number of threads to the application.
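A minimal sketch of what handling this in the application could look like, assuming the llama_eval / llama_get_logits / llama_n_vocab signatures of this era (greedy sampling only, no error handling): the caller simply passes a larger thread count for the prompt batch than for per-token generation.

```c
#include "llama.h"

// Sketch only (not code from this PR): evaluate the prompt with more threads,
// then generate greedily with fewer, using the llama_eval() API of this era.
static void generate(struct llama_context * ctx,
                     const llama_token * prompt, int n_prompt, int n_predict,
                     int pp_threads, int n_threads) {
    // prompt processing scales with core count, so use more threads here
    llama_eval(ctx, prompt, n_prompt, 0, pp_threads);

    int n_past = n_prompt;
    for (int i = 0; i < n_predict; ++i) {
        // greedy pick of the next token from the last logits
        const float * logits  = llama_get_logits(ctx);
        const int     n_vocab = llama_n_vocab(ctx);
        llama_token tok = 0;
        for (int t = 1; t < n_vocab; ++t) {
            if (logits[t] > logits[tok]) {
                tok = t;
            }
        }
        // generation is memory-bandwidth bound; fewer threads are often faster
        llama_eval(ctx, &tok, 1, n_past, n_threads);
        n_past += 1;
    }
}
```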
@ggerganov I created an example with …
Alternatively, we could have another function in …
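If the alternative being floated here is a second entry point rather than a change to llama_eval itself (a guess, since the comment is cut off), it could look something like this purely hypothetical declaration:

```c
#include "llama.h"

// Purely hypothetical, non-breaking alternative (llama_eval_pp is NOT a real
// llama.cpp function): leave llama_eval() untouched and add a variant that
// also accepts a prompt-processing thread count.
int llama_eval_pp(
        struct llama_context * ctx,
           const llama_token * tokens,
                         int   n_tokens,
                         int   n_past,
                         int   n_threads,
                         int   pp_threads);
```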
Thanks. This is still not great - sorry for the extra work. We should do what @slaren suggested and move this into the context parameters instead. I'm not worried about the API breaking. If anyone wants to give it a try they are welcome, else I'll try to implement this soon.
Welp, it looks like I closed this and threw out the fork as well. I'll keep working on this after …
I am adding this change in #3301 to avoid having two API changes shortly after each other.
There are advantages to having a different thread count for prompt processing vs. inference, as the former is limited by CPU speed/thread scaling while the latter is limited by memory bandwidth. If you use GPU BLAS for prompt processing, a lower thread count may also be desired. Here's one of my examples where 8-thread prompt processing is faster than 4-thread prompt processing, while inference speed remains the same regardless of thread count. I can't find those posts at the moment, but I've heard other people running on 20+ core servers say that prompt processing scales decently well.
Codewise this is a simple change, but it is a breaking change in that it modifies llama_eval so that we can pass in pp_threads. I'm leaving this as a draft for now to get feedback, especially from people with big CPU-only servers which may benefit from this. Note that for now I only added support for main; perplexity and friends currently have pp_threads set to n_threads.

Resolves #2498.
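For readers who just want the shape of the breaking change described above: assuming the llama_eval signature of this era, it amounts to roughly the following (the exact form in the PR's diff may differ).

```c
// Sketch of the API change described above; not the literal diff from this PR.
typedef int llama_token;    // stand-in for the typedef in llama.h
struct llama_context;       // opaque, as in llama.h

// before: a single thread count for everything
// int llama_eval(struct llama_context * ctx, const llama_token * tokens,
//                int n_tokens, int n_past, int n_threads);

// after: prompt processing gets its own thread count
int llama_eval(struct llama_context * ctx, const llama_token * tokens,
               int n_tokens, int n_past, int n_threads, int pp_threads);
```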