GPU utilization rate is very low with WHISPER_CUBLAS=1
#1179
Comments
Could you paste the full text of the log output, startup to shutdown?
It's at least an order of magnitude slower than my M2 Air with CoreML. I'm trying CUDA on a g4dn.xlarge.
Yes, current cuBLAS support is quite rudimentary, as we constantly keep moving data between the CPU and the GPU. This should leave a lot of room for optimization of the whole process and hopefully help speed it up.
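To see why per-call host/device copies dominate, here is a toy cost model (all numbers are hypothetical illustrations, not measurements from whisper.cpp) comparing a pipeline that copies operands to the GPU for every matrix multiply against one that keeps weights resident on the device:

```python
# Toy cost model (hypothetical numbers) for per-call host<->device copies
# versus keeping weights resident on the GPU.

def total_time_ms(n_calls, compute_ms, transfer_ms, resident):
    """Sum compute plus transfer cost over n_calls.
    If weights stay resident, the transfer cost is paid once;
    otherwise it is paid on every call."""
    transfers = transfer_ms if resident else n_calls * transfer_ms
    return n_calls * compute_ms + transfers

calls = 1000      # matrix multiplies per decode step (assumed)
compute = 0.5     # ms of GPU GEMM per call (assumed)
transfer = 2.0    # ms to copy operands over PCIe per call (assumed)

per_call_copies = total_time_ms(calls, compute, transfer, resident=False)
weights_resident = total_time_ms(calls, compute, transfer, resident=True)
print(per_call_copies, weights_resident)  # 2500.0 502.0
```

Under these assumed numbers the transfer-heavy path spends 80% of its time on copies, which is consistent with the symptom reported here: the GPU sits mostly idle while the CPU shuffles data.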
whisper_print_timings: load time = 1177.86 ms
llama_print_timings: load time = 2547.77 ms
Fixed in #1472
It seems that the CPU is working most of the time while the GPU is resting. There's still a lot of room for optimization.
A 27.8-minute audio takes 62.7 minutes to transcribe...
model: ggml-model-largev2.bin
parameters: -bs 5 -bo 5
audio: diffusion2023-07-03.wav (27.8 min)
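The slowdown above can be expressed as a real-time factor (a quick sketch; 27.8 and 62.7 minutes are the figures reported in this comment):

```python
# Real-time factor from the numbers reported in this thread.
audio_min = 27.8        # length of diffusion2023-07-03.wav
transcribe_min = 62.7   # wall-clock transcription time

rtf = transcribe_min / audio_min  # >1 means slower than real time
print(round(rtf, 2))  # 2.26
```

So transcription runs roughly 2.3x slower than real time here, versus well under 1x that users typically see when the GPU is actually kept busy.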