metal : support MTLGPUFamily < Apple7, formatting, style #3524
Conversation
On an M2 (10c GPU), it looks like the speed slows down in some cases:
build: 99ed03a (1343)
Yes, I've also realized that the break-even point is not as trivial as currently proposed. It is a function of the matrix sizes.
Bummer. I can't figure out a universal way to determine which kernel to use when. The break-even point for the number of batches at which the matrix-matrix kernel becomes more performant than the matrix-vector kernel depends both on the specifics of the hardware, which are not queryable (number of cores, memory bandwidth, FLOPs), and on the model / matrix sizes. Based on the tests here, there is a significant performance gain available for quantized low-batch (< 16) decoding, which would be quite important for speculative approaches. But I can't figure out a way to choose the optimal kernel. Any suggestions?
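To make the trade-off concrete, here is a minimal sketch of the kind of batch-size cutoff being discussed, assuming a single empirically tuned threshold; the function name, the `n_batch` parameter, and the default value of 16 are illustrative stand-ins, not code from this PR:

```cpp
#include <cstdint>

enum class mm_kernel { mat_vec, mat_mat };

// Pick a kernel from the batch size alone. The default break-even value is a
// placeholder: as discussed above, the real crossover depends on the chip
// (cores, memory bandwidth, FLOPs) and on the matrix sizes.
static mm_kernel choose_mul_mat_kernel(int64_t n_batch, int64_t break_even = 16) {
    return n_batch < break_even ? mm_kernel::mat_vec : mm_kernel::mat_mat;
}
```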
Are there actually already two kernels for CUDA? I wanted to mess with this and see if it helped with my issue where parallel generation gets slower and slower, but it didn't look like there was that kind of logic in
You didn't say "good suggestions". The simplest thing that comes to mind is to just do a mini benchmark that runs a few operations and stores the result into the context. Maybe that could even be part of the warmup stuff that already exists. There might be other stuff that could benefit from that sort of thing as well.
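A rough sketch of what such a warmup micro-benchmark could look like, assuming the two kernel launches are wrapped in callables and the winner is cached in the context; every name here is made up for illustration (a real version would also have to synchronize with the GPU before stopping the clock):

```cpp
#include <chrono>
#include <functional>

struct kernel_choice {
    bool use_mat_mat = false; // cached result, consulted at dispatch time
};

// Time a callable over a few iterations.
static double time_run(const std::function<void()> & run, int iters = 8) {
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        run();
    }
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Run once during context warmup on representative shapes and remember
// which kernel was faster.
static kernel_choice benchmark_mul_mat_kernels(const std::function<void()> & run_mat_vec,
                                               const std::function<void()> & run_mat_mat) {
    kernel_choice choice;
    choice.use_mat_mat = time_run(run_mat_mat) < time_run(run_mat_vec);
    return choice;
}
```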
CUDA has different kernels for matrix-vector and matrix-matrix multiplication. To do this, first the matrix-vector kernels would need to be updated to support matrix-matrix multiplication.
Force-pushed from fe6ef1c to 6b9554a
Optionally, we can ask the user to pass in their hardware specs. In the docs, there's already a hint to set
Edit:
ref #3129
The scope of this PR changed - it is now mostly a formatting change. Improved batched decoding will be investigated in a future PR.
Obsolete info below
ref #3479
In Metal, we have 2 matrix multiplication kernels: a matrix-vector kernel and a matrix-matrix kernel.
Depending on the batch size, one of the 2 kernels is faster.
This PR adds logic for choosing which kernel to use depending on the batch size. The numbers are determined empirically on M2 Ultra. I'm not sure if these translate to the optimal numbers for other chips, but they would certainly not affect the performance tests that we have been doing so far, since we have been testing with either a batch size of 1 or a batch size of 512.
This change improves batched decoding performance for non-F16 types. For F16 there is no difference, although a similar analysis should be performed on the CUDA kernels to see where the break-even point between the 2 kernels lies.
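As an illustration of the dispatch described above, a per-type threshold table could look roughly like the sketch below; the type names and numbers are placeholders, not the values tuned on M2 Ultra in this PR:

```cpp
#include <cstdint>

enum class weight_type { f16, q4_0, q4_1, q8_0 };

// Hypothetical per-type break-even batch sizes. For F16 the two kernels
// performed about the same in the tests above, so the threshold barely
// matters there; the quantized values stand in for the empirically tuned numbers.
static int64_t mul_mat_break_even(weight_type t) {
    switch (t) {
        case weight_type::f16:  return 1;
        case weight_type::q4_0: return 16;
        case weight_type::q4_1: return 16;
        case weight_type::q8_0: return 16;
    }
    return 16;
}

static bool use_mat_mat_kernel(weight_type t, int64_t n_batch) {
    return n_batch >= mul_mat_break_even(t);
}
```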
build: 99ed03a (1343)
Sample results for the parallel example: generating 64 sequences using a system prompt, serving 4 requests in parallel