Introduce ggml_syncthreads() #7455
base: master
Conversation
ggerganov#6915)" This reverts commit e1b40ac.
Using an atomic to delegate slices of the matrix to separate threads is slow, because all of the threads contend for the same memory location. The right thing to do here is use a `chore` variable, where every thread performs the same computation independently. This change introduces the ggml_once() and ggml_syncthreads() functions, which work the same way as their CUDA counterparts. This is nice, because if BLAS or LLAMAFILE doesn't need `B` requantized, it can skip paying for the synchronization barrier between the INIT and COMPUTE phases. We can refactor further along this path and remove all of the INIT/COMPUTE/FINALIZE code too; each op should be in charge of its own synchronization.
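Below is a minimal sketch of how CUDA-style barrier primitives like these can be built on the CPU with C11 atomics. Only the names ggml_once() and ggml_syncthreads() and the INIT/COMPUTE pairing come from this change; the `ggml_barrier` struct, the signatures, and the spin-wait logic are illustrative assumptions, not the actual patch.

```c
// Sketch only: the struct, signatures, and spin-wait below are assumptions.
#include <stdatomic.h>

struct ggml_barrier {
    atomic_int  remaining; // threads that have not yet arrived this round
    atomic_uint phase;     // bumped each time the barrier releases
    int         nth;       // total number of threads
};

// Block until all nth threads have arrived (analogous to __syncthreads()).
static void ggml_syncthreads(struct ggml_barrier *b) {
    unsigned phase = atomic_load_explicit(&b->phase, memory_order_relaxed);
    if (atomic_fetch_sub_explicit(&b->remaining, 1, memory_order_acq_rel) == 1) {
        // last thread to arrive: reset the count and release everyone
        atomic_store_explicit(&b->remaining, b->nth, memory_order_relaxed);
        atomic_fetch_add_explicit(&b->phase, 1, memory_order_release);
    } else {
        while (atomic_load_explicit(&b->phase, memory_order_acquire) == phase) {
            // spin; a real implementation would pause/yield here
        }
    }
}

// Run fn exactly once across all threads (flag must be reset per graph eval).
static void ggml_once(atomic_flag *flag, void (*fn)(void)) {
    if (!atomic_flag_test_and_set_explicit(flag, memory_order_acq_rel)) {
        fn();
    }
}
```

The payoff is at the call site: an op that needs `B` requantized can call `ggml_once(&flag, requantize_B)` followed by `ggml_syncthreads(&barrier)` so every thread observes the converted data, while a BLAS or LLAMAFILE path that consumes `B` as-is simply skips both calls and pays for no barrier at all.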
This change makes inference go ~5% faster for me.
On M2 Ultra I get these results:

```sh
./scripts/compare-commits.sh master pr/7455 \
    -m models/mistral-7b-v0.2/ggml-model-fp16.gguf \
    -m models/mistral-7b-v0.2/ggml-model-q8_0.gguf \
    -m models/mistral-7b-v0.2/ggml-model-q4_0.gguf -t 16 -ngl 0
```
Which model did you use to benchmark the performance?
This change doesn't move the needle on ARM. I'm only seeing speedups on x86. I notice the biggest gains with really tiny models, where synchronization actually is a noticeable bottleneck in matrix multiplication. For example, prompt processing with a 265 MB embedding model:
In this case the gain can be up to 35%.
The point of #6915 is that not all tasks will finish at the same speed, especially on larger models. If there's a situation that it doesn't handle well, find the comment that starts with … I do like what you did with …
This change is an alternative proposal to #1507 and #6915. This pull request contains three commits, each detailing one of the three main synchronization tricks I used to make inference go ~6% faster in llamafile, while speeding up prompt processing too.
This change only dips a toe in the water, so to speak, by introducing CUDA-style primitives for CPU mode. You can go much further in this technical direction. If you do, I'll be very grateful, since it'll mean GGML is more aligned with llamafile and therefore much easier for me to import changes from.