Add some minimal optimizations for CDNA #10498
Conversation
Thank you. See my comments about directly writing back 32-bit floats (which I assume would be possible). This could be done in a separate PR though, so if you want we can just merge this as-is.
(I assume you've already tried alternative values for MMQ y size and max x size.)
ggml/src/ggml-cuda/ggml-cuda.cu (Outdated)
```cpp
cublasComputeType_t cu_compute_type = CUBLAS_COMPUTE_16F;
if(ggml_cuda_info().devices[ctx.device].cc == CC_CDNA)
    cu_compute_type = CUBLAS_COMPUTE_32F;
```
If the computation is done as 32-bit floats anyways, you should be able to get a bit more performance by writing back the results as 32-bit directly instead of writing back as 16-bit and then converting to 32-bit.
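Concretely, the suggestion amounts to something like the following sketch (hypothetical handle, buffer, and dimension names; the real call sites are ggml_cuda_op_mul_mat_cublas and ggml_cuda_mul_mat_batched_cublas):

```cpp
// Sketch: keep FP16 inputs, accumulate in FP32, and write the result straight
// to an FP32 buffer instead of writing FP16 and converting afterwards.
const float alpha = 1.0f;
const float beta  = 0.0f;
CUBLAS_CHECK(
    cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                 m, n, k,
                 &alpha, src0_f16, CUDA_R_16F, lda,  // FP16 input A
                         src1_f16, CUDA_R_16F, ldb,  // FP16 input B
                 &beta,  dst_f32,  CUDA_R_32F, ldc,  // FP32 output, no conversion pass
                 CUBLAS_COMPUTE_32F,                 // same compute type as the CDNA path
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP));
```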
Suggested change:

```diff
 cublasComputeType_t cu_compute_type = CUBLAS_COMPUTE_16F;
-if(ggml_cuda_info().devices[ctx.device].cc == CC_CDNA)
-    cu_compute_type = CUBLAS_COMPUTE_32F;
+if (ggml_cuda_info().devices[ctx.device].cc == CC_CDNA) {
+    cu_compute_type = CUBLAS_COMPUTE_32F;
+}
```
> If the computation is done as 32-bit floats anyways, you should be able to get a bit more performance by writing back the results as 32-bit directly instead of writing back as 16-bit and then converting to 32-bit.

I hacked this in and found no measurable difference. This is no surprise, as the time spent in the kernels of ggml_cuda_op_mul_mat_cublas and ggml_cuda_mul_mat_batched_cublas is minuscule compared to the time spent in other places. So it's not worth the effort for now.
I fixed the nits and also completed the defines for the various ROCm arches, so that we have values for all major generations supported by rocBLAS in case someone wants to do some optimization work in the future.
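For reference, a sketch of what the completed defines look like (values follow AMD's gfx architecture numbering on top of the existing CC_OFFSET_AMD pattern; the exact list and comments are in the PR diff):

```cpp
// Sketch: per-generation compute-capability defines for AMD, offset so they
// cannot collide with NVIDIA compute capabilities.
#define CC_OFFSET_AMD 1000000
#define CC_GCN4   (CC_OFFSET_AMD + 803)  // Tonga, Fiji, Polaris
#define CC_VEGA   (CC_OFFSET_AMD + 900)  // Vega 56/64
#define CC_VEGA20 (CC_OFFSET_AMD + 906)  // MI50, Radeon VII
#define CC_CDNA   (CC_OFFSET_AMD + 908)  // MI100
#define CC_CDNA2  (CC_OFFSET_AMD + 910)  // MI2xx
#define CC_CDNA3  (CC_OFFSET_AMD + 942)  // MI3xx
#define CC_RDNA1  (CC_OFFSET_AMD + 1010) // RX 5000
#define CC_RDNA2  (CC_OFFSET_AMD + 1030) // RX 6000
#define CC_RDNA3  (CC_OFFSET_AMD + 1100) // RX 7000
```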
Yes, I was not able to gain any major performance by doing that, or to come anywhere close to rocBLAS. Of course, GCN's and CDNA's warp size is 64; I'm kind of surprised it works at all, given that things like layer norm can easily be sensitive to warp size.
Some quick numbers:

This PR: [benchmark table omitted]
Master: [benchmark table omitted]

If I add [omitted]:
Master: [benchmark table omitted]
PR: [benchmark table omitted]
I will merge this soon unless you also want to address the comment by @8XXD8.
Added the equivalent of what @8XXD8 did to also speed up GCN.
* Add some minimal optimizations for CDNA
* ggml_cuda: set launch bounds also for GCN as it helps there too
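The launch-bounds change in the second commit is along these lines (an illustrative kernel, not the actual MMQ code; GCN and CDNA have a wavefront size of 64, so the occupancy hint pays off there too):

```cpp
#define WARP_SIZE 32      // 64 on GCN/CDNA when building with HIP
#define NWARPS_EXAMPLE 8  // hypothetical warps per block

// Sketch: __launch_bounds__(max threads per block, min blocks per SM/CU)
// caps per-thread register usage so the compiler avoids VGPR spills.
static __global__ void __launch_bounds__(WARP_SIZE*NWARPS_EXAMPLE, 2)
copy_example(const float * __restrict__ x, float * __restrict__ dst, const int n) {
    const int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = x[i]; // placeholder body
    }
}
```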
This PR adds some minimal optimizations for CDNA.

Mainly, the MMQ kernels perform extremely poorly on CDNA. One reason is that the compiler runs out of architectural VGPRs and spills into Acc VGPRs, which is terrible for performance. MMQ also doesn't make use of MFMA, while rocBLAS can.

This PR therefore makes ggml_cuda_should_use_mmq return false more often for CDNA (and Vega 20, as rocBLAS is faster there too).
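A sketch of the shape of that change (hypothetical condition; CC_VEGA20 and the MMQ_DP4A_MAX_BATCH_SIZE cutoff are assumptions here, the exact logic is in the PR diff):

```cpp
// Sketch: on CDNA and Vega 20 the MMQ kernels spill into Acc VGPRs and cannot
// use MFMA, so prefer the rocBLAS path except for small batches.
static bool ggml_cuda_should_use_mmq(enum ggml_type type, int cc, int64_t ne11) {
    // ... existing checks on type and compute capability ...
    if (cc == CC_CDNA || cc == CC_VEGA20) {
        return ne11 < MMQ_DP4A_MAX_BATCH_SIZE; // assumed small-batch cutoff
    }
    // ... remaining checks ...
    return true;
}
```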
To allow rocBLAS to use MFMA, this PR also sets the compute type to 32-bit. CDNA cannot do 16-bit accumulation with MFMA, and rocBLAS does NOT give you higher precision than you asked for, even if that would result in better performance; thus we need to set CUBLAS_COMPUTE_32F on CDNA so that MFMA gets used.
This PR improves prompt processing by about 2x almost uniformly across a variety of batch and model sizes. rocprofv2 still has MMQ's kernels taking >95% of the wall time, and there would still be a huge amount to go for actually decent performance on CDNA/GCN.