
GPU utilization rate is very low with WHISPER_CUBLAS=1 #1179

Closed
bobqianic opened this issue Aug 14, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

bobqianic (Collaborator) commented Aug 14, 2023

It seems that the CPU is doing most of the work while the GPU sits nearly idle, so there's still a lot of room for optimization.
A 27.8-minute audio file takes 62.7 minutes to transcribe, i.e. about 2.3× slower than real time...

model: ggml-model-largev2.bin
parameters: -bs 5 -bo 5
audio: diffusion2023-07-03.wav 27.8 mins
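
For reference, the -bs and -bo flags map onto whisper.cpp's beam-search sampling parameters. A minimal sketch of the equivalent C API calls (assuming the whisper.h API from around the time of this issue; WAV decoding is omitted):

    #include "whisper.h"
    #include <vector>

    int main() {
        struct whisper_context * ctx =
            whisper_init_from_file("models/ggml-model-largev2.bin");
        if (!ctx) return 1;

        // -bs 5 selects beam search with beam size 5; -bo 5 sets best-of 5
        struct whisper_full_params params =
            whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
        params.beam_search.beam_size = 5;
        params.greedy.best_of        = 5;

        std::vector<float> pcm; // 16 kHz mono float samples; fill from the WAV file
        int ret = whisper_full(ctx, params, pcm.data(), (int) pcm.size());

        whisper_free(ctx);
        return ret;
    }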

[screenshots: task manager views showing high CPU usage and near-idle GPU utilization during transcription]
@MichaelDays

Could you paste the full text of the log output, startup to shutdown?

jbrough (Contributor) commented Aug 15, 2023

It's at least an order of magnitude slower than my M2 Air with CoreML. I'm trying with CUDA on a g4dn.xlarge:

ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4

@ggerganov (Owner)

Yes, the current cuBLAS support is quite rudimentary, as we constantly keep moving data between the CPU and the GPU.
A better implementation with tensor offloading, as in llama.cpp, is possible, but I don't have enough time to implement it yet.
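
Conceptually, the two strategies differ like this (an illustrative C++/CUDA sketch, not whisper.cpp's actual code; all function names here are hypothetical):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // What the rudimentary cuBLAS path does: upload the weights for every
    // single matrix multiplication, then free them again afterwards.
    void gemm_copy_each_call(cublasHandle_t h, const float * w_host,
                             const float * x_dev, float * y_dev,
                             int m, int n, int k) {
        float * w_dev = nullptr;
        cudaMalloc(&w_dev, sizeof(float) * m * k);
        cudaMemcpy(w_dev, w_host, sizeof(float) * m * k, cudaMemcpyHostToDevice);

        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &alpha, w_dev, m, x_dev, k, &beta, y_dev, m);

        cudaFree(w_dev); // the next call pays the PCIe transfer all over again
    }

    // Tensor offloading as in llama.cpp: copy each weight tensor to the GPU
    // once at model load time, then reuse the resident device pointer.
    float * offload_once(const float * w_host, size_t n_elems) {
        float * w_dev = nullptr;
        cudaMalloc(&w_dev, sizeof(float) * n_elems);
        cudaMemcpy(w_dev, w_host, sizeof(float) * n_elems, cudaMemcpyHostToDevice);
        return w_dev; // kept for the lifetime of the model
    }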

@dereklll

> Yes, the current cuBLAS support is quite rudimentary, as we constantly keep moving data between the CPU and the GPU. A better implementation with tensor offloading, as in llama.cpp, is possible, but I don't have enough time to implement it yet.

This should leave a lot of room for optimizing the whole process and will hopefully help speed it up.

@dereklll

whisper_print_timings: load time = 1177.86 ms
whisper_print_timings: fallbacks = 3 p / 1 h
whisper_print_timings: mel time = 75.22 ms
whisper_print_timings: sample time = 127.27 ms / 142 runs ( 0.90 ms per run)
whisper_print_timings: encode time = 4798.87 ms / 7 runs ( 685.55 ms per run)
whisper_print_timings: decode time = 755.72 ms / 125 runs ( 6.05 ms per run)
whisper_print_timings: prompt time = 259.36 ms / 12 runs ( 21.61 ms per run)
whisper_print_timings: total time = 115156.91 ms

llama_print_timings: load time = 2547.77 ms
llama_print_timings: sample time = 1384.57 ms / 207 runs ( 6.69 ms per token, 149.51 tokens per second)
llama_print_timings: prompt eval time = 2709.93 ms / 326 tokens ( 8.31 ms per token, 120.30 tokens per second)
llama_print_timings: eval time = 10688.34 ms / 207 runs ( 51.63 ms per token, 19.37 tokens per second)
llama_print_timings: total time = 113960.37 ms
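
For scale, the whisper stages measured above (mel + sample + encode + decode + prompt) sum to roughly 75.22 + 127.27 + 4798.87 + 755.72 + 259.36 ≈ 6016 ms, only about 5% of the 115156.91 ms total, so most of the wall-clock time in this run falls outside those timed sections.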

@bobqianic (Collaborator, Author)

Fixed in #1472

bobqianic added the enhancement (New feature or request) label on Nov 12, 2023