
GPU utilization rate is very low with WHISPER_CUBLAS=1 #1179

Closed
bobqianic opened this issue Aug 14, 2023 · 6 comments
Labels
enhancement New feature or request

Comments

bobqianic (Collaborator) commented Aug 14, 2023

It seems that the CPU is doing most of the work while the GPU sits nearly idle, so there's still a lot of room for optimization.
A 27.8-minute audio file takes 62.7 minutes to transcribe, i.e. about 2.3× slower than real time...

model: ggml-model-largev2.bin
parameters: -bs 5 -bo 5
audio: diffusion2023-07-03.wav 27.8 mins
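
For reference, the -bs and -bo flags map onto whisper.cpp's beam-search sampling parameters. A minimal sketch of the equivalent C API calls (assuming the whisper.h API from around the time of this issue; WAV decoding is omitted):

    #include "whisper.h"
    #include <vector>

    int main() {
        struct whisper_context * ctx =
            whisper_init_from_file("models/ggml-model-largev2.bin");
        if (!ctx) return 1;

        // -bs 5 selects beam search with beam size 5; -bo 5 sets best-of 5
        struct whisper_full_params params =
            whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
        params.beam_search.beam_size = 5;
        params.greedy.best_of        = 5;

        std::vector<float> pcm; // 16 kHz mono float samples; fill from the WAV file
        int ret = whisper_full(ctx, params, pcm.data(), (int) pcm.size());

        whisper_free(ctx);
        return ret;
    }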

[screenshots: task manager views showing high CPU usage and near-idle GPU utilization during transcription]
@MichaelDays

Could you paste the full text of the log output, startup to shutdown?

jbrough (Contributor) commented Aug 15, 2023

It's at least an order of magnitude slower than my M2 Air with CoreML. I'm trying with CUDA on a g4dn.xlarge:

ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4

@ggerganov (Owner)

Yes, the current cuBLAS support is quite rudimentary, as we constantly keep moving data between the CPU and the GPU.
A better implementation with tensor offloading, as in llama.cpp, is possible, but I don't have enough time to implement it yet.
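
Conceptually, the two strategies differ like this (an illustrative C++/CUDA sketch, not whisper.cpp's actual code; all function names here are hypothetical):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // What the rudimentary cuBLAS path does: upload the weights for every
    // single matrix multiplication, then free them again afterwards.
    void gemm_copy_each_call(cublasHandle_t h, const float * w_host,
                             const float * x_dev, float * y_dev,
                             int m, int n, int k) {
        float * w_dev = nullptr;
        cudaMalloc(&w_dev, sizeof(float) * m * k);
        cudaMemcpy(w_dev, w_host, sizeof(float) * m * k, cudaMemcpyHostToDevice);

        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                    &alpha, w_dev, m, x_dev, k, &beta, y_dev, m);

        cudaFree(w_dev); // the next call pays the PCIe transfer all over again
    }

    // Tensor offloading as in llama.cpp: copy each weight tensor to the GPU
    // once at model load time, then reuse the resident device pointer.
    float * offload_once(const float * w_host, size_t n_elems) {
        float * w_dev = nullptr;
        cudaMalloc(&w_dev, sizeof(float) * n_elems);
        cudaMemcpy(w_dev, w_host, sizeof(float) * n_elems, cudaMemcpyHostToDevice);
        return w_dev; // kept for the lifetime of the model
    }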

@dereklll

> Yes, the current cuBLAS support is quite rudimentary, as we constantly keep moving data between the CPU and the GPU. A better implementation with tensor offloading, as in llama.cpp, is possible, but I don't have enough time to implement it yet.

This should leave a lot of room for optimizing the whole process and will hopefully help speed it up.

@dereklll

whisper_print_timings: load time = 1177.86 ms
whisper_print_timings: fallbacks = 3 p / 1 h
whisper_print_timings: mel time = 75.22 ms
whisper_print_timings: sample time = 127.27 ms / 142 runs ( 0.90 ms per run)
whisper_print_timings: encode time = 4798.87 ms / 7 runs ( 685.55 ms per run)
whisper_print_timings: decode time = 755.72 ms / 125 runs ( 6.05 ms per run)
whisper_print_timings: prompt time = 259.36 ms / 12 runs ( 21.61 ms per run)
whisper_print_timings: total time = 115156.91 ms

llama_print_timings: load time = 2547.77 ms
llama_print_timings: sample time = 1384.57 ms / 207 runs ( 6.69 ms per token, 149.51 tokens per second)
llama_print_timings: prompt eval time = 2709.93 ms / 326 tokens ( 8.31 ms per token, 120.30 tokens per second)
llama_print_timings: eval time = 10688.34 ms / 207 runs ( 51.63 ms per token, 19.37 tokens per second)
llama_print_timings: total time = 113960.37 ms
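
For scale, the whisper stages measured above (mel + sample + encode + decode + prompt) sum to roughly 75.22 + 127.27 + 4798.87 + 755.72 + 259.36 ≈ 6016 ms, only about 5% of the 115156.91 ms total, so most of the wall-clock time in this run falls outside those timed sections.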

@bobqianic (Collaborator, Author)

Fixed in #1472

bobqianic added the enhancement (New feature or request) label on Nov 12, 2023