whisper : add full CUDA and Metal offloading #1472
Conversation
* ggml : add CUDA support for ggml_conv
* whisper : remove ggml_repeat for conv bias + single backend
* cuda : fix im2col kernel
* metal : add im2col support + mul mat-vec f16 x f16
* bench-all : add q4 models
Looking for feedback both with CUDA and Metal - the performance should be significantly improved.
I am not very familiar with whisper.cpp, but these are my results using this PR vs. master: (benchmark tables omitted)
Yup, the mul mat benchmark is not very relevant to this PR because it still copies the data to the GPU, performs the multiplication and copies the data back to the CPU. The changes here should not affect the performance of this test. The models were downloaded with:

./models/download-ggml-model.sh tiny
./models/download-ggml-model.sh base
./models/download-ggml-model.sh small
./models/download-ggml-model.sh medium
./models/download-ggml-model.sh large
Just tried out this PR on my RTX3060 mobile and it's incredibly fast. A 27-minute audio file was transcribed in just 25 seconds. Plus, the transcription quality is not degraded.
Under native Windows I get an out of memory error in ggml-alloc very rarely. This is probably related to some allocation returning an unaligned memory address; I will look more into it tomorrow.

whisper_init_from_file_with_params_no_state: loading model from './models/ggml-tiny-q5_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 8
whisper_model_load: qntvr = 2
whisper_model_load: type = 1 (tiny)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6
whisper_model_load: using CUDA backend
whisper_model_load: CUDA buffer size = 34.59 MB
whisper_model_load: model size = 34.53 MB
whisper_init_state: kv self size = 2.62 MB
whisper_init_state: kv cross size = 8.79 MB
whisper_init_state: compute buffer (conv) = 11.54 MB
whisper_init_state: compute buffer (encode) = 59.65 MB
whisper_init_state: compute buffer (cross) = 3.76 MB
whisper_init_state: compute buffer (decode) = 18.92 MB
system_info: n_threads = 1 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 |
ggml_tallocr_alloc: not enough space in the buffer (needed 54000000, largest block available 51696128)
GGML_ASSERT: C:\CODE\whisper.cpp\ggml-alloc.c:116: !"not enough space in the buffer"
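For context on why the suspected alignment issue can surface as an out-of-space assert: ggml-alloc rounds each allocation up to the backend's tensor alignment, so a buffer sized from a measurement that used a different (or no) alignment can come up a few bytes short. A minimal illustration of that rounding (illustrative only, not the ggml-alloc source):

```c
#include <stddef.h>

// Illustrative only - how an allocator rounds a size up to its alignment.
// If the measured total is computed with a smaller alignment than the backend
// actually uses, the real allocations can overflow the reserved buffer and
// trigger "not enough space in the buffer".
static size_t aligned_size(size_t size, size_t alignment) {
    // alignment is assumed to be a power of two
    return (size + alignment - 1) & ~(alignment - 1);
}
```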
I see a notable improvement in encoder times from this PR - nice work :) I also noticed that with this PR, performance is pretty flat from 4 through 10 threads. With main @ ec7a6f0 there is a bit of improvement for me up through 8 threads, but even at 8 threads it's slower than this PR.

main @ ec7a6f0 vs ggml-backend-no-sched @ 3bfc43e: (benchmark tables omitted)
// TODO: check if other platforms can benefit from this optimization
// TODO: CUDA is currently broken - seems ggml_mul_mat does not handle views correctly
#if defined(GGML_USE_METAL)
#define ggml_mul_mat ggml_mul_mat_pad
#endif
The ggml_mul_mat_pad trick is very useful for the Metal kernels and provides significant improvement for the encoder. Currently, this trick does not work with CUDA because we seem to have issues in some cases when the src are non-contiguous views. At the very least, ggml_cuda_mul_mat_mat_batched_cublas does not handle all cases correctly when src1 is a non-contiguous view, because ggml_get_to_fp16_cuda() assumes data without "holes" (i.e. contiguously-permuted), but there might be other issues as well. We should keep this in mind and fix or assert properly.
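For readers unfamiliar with the trick: the idea is to split the shared (inner) dimension of both operands into a kernel-friendly part that is a multiple of the padding and a small remainder, multiply each part separately, and add the results. A rough sketch of that shape (modeled on the whisper.cpp helper of the same name, using the ggml view/add calls as I understand them - not guaranteed to match the exact upstream code):

```cpp
#include "ggml.h"

// Rough sketch of the padding trick: split dim 0 (the shared dimension of the
// mat-mul) into a multiple of `pad` plus a remainder, then sum the two partial
// products. Not the exact upstream helper.
static struct ggml_tensor * mul_mat_pad_sketch(struct ggml_context * ctx,
        struct ggml_tensor * x, struct ggml_tensor * y, int pad /* e.g. 32 */) {
    if (x->ne[0] % pad == 0 || x->ne[0] / pad < 2) {
        return ggml_mul_mat(ctx, x, y); // nothing to gain from padding
    }

    struct ggml_tensor * x_0 = ggml_view_3d(ctx, x, (x->ne[0]/pad)*pad, x->ne[1], x->ne[2],
                                            x->nb[1], x->nb[2], 0);
    struct ggml_tensor * x_1 = ggml_view_3d(ctx, x,  x->ne[0]%pad,      x->ne[1], x->ne[2],
                                            x->nb[1], x->nb[2], x_0->ne[0]*x_0->nb[0]);

    struct ggml_tensor * y_0 = ggml_view_3d(ctx, y, (y->ne[0]/pad)*pad, y->ne[1], y->ne[2],
                                            y->nb[1], y->nb[2], 0);
    struct ggml_tensor * y_1 = ggml_view_3d(ctx, y,  y->ne[0]%pad,      y->ne[1], y->ne[2],
                                            y->nb[1], y->nb[2], y_0->ne[0]*y_0->nb[0]);

    // the resulting views are non-contiguous - this is exactly what trips up
    // the CUDA batched cuBLAS path mentioned above
    return ggml_add(ctx, ggml_mul_mat(ctx, x_0, y_0),
                         ggml_mul_mat(ctx, x_1, y_1));
}
```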
Figured I'd also include a comparison of this PR to main in benchmarks with 1-4 threads. Encoder times with ggml-backend-no-sched @ 0867e69 are still flat. I won't pretend to understand all the code, but this does feel like "no scheduling" to me :)

ggml-backend-no-sched @ 0867e69 vs main @ ec7a6f0: (benchmark tables omitted)
Nice plot! Yeah, on …
Let me know if I can help debug this somehow. I haven't been able to reproduce it with Linux and macOS yet.
The issue is that the encoder graph uses tensors from a previous graph. During measure, these tensors are allocated in a measure buffer which has already been freed (when the measure allocator was freed), so their addresses are sometimes no longer valid. A workaround would be to keep the same measure allocators alive until all the graphs have been measured, and only then reallocate the buffers and allocators with the correct sizes. I suppose that whisper.cpp is using freed tensors, so it's not unreasonable to consider this "undefined behavior", but practically this is not a good limitation to have, so I want to fix this in ggml-alloc/ggml-backend by allowing the same buffers to be reallocated, but that's not going to be a quick fix.
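A compact sketch of that workaround, using only the ggml-alloc calls that appear in the patches later in this thread (treat it as a pattern sketch with a hypothetical graph_alloc struct, not the actual change):

```cpp
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include <functional>
#include <vector>

// Hypothetical stand-in for whisper.cpp's per-graph allocator state.
struct graph_alloc {
    ggml_allocr                   * alloc  = nullptr;
    ggml_backend_buffer_t           buffer = nullptr;
    std::function<ggml_cgraph *()>  build_graph;   // builds the conv/encode/cross/decode graph
};

// Phase 1: measure every graph while all measure allocators are still alive, so
// tensors reused across graphs keep valid (measure) addresses.
// Phase 2: only then replace each measure allocator with a real backend buffer.
static void measure_then_realloc(std::vector<graph_alloc> & gas, ggml_backend_t backend) {
    for (auto & ga : gas) {
        ga.alloc = ggml_allocr_new_measure_from_backend(backend);
        ggml_allocr_alloc_graph(ga.alloc, ga.build_graph());   // measure only
    }
    for (auto & ga : gas) {
        const size_t size = ggml_allocr_max_size(ga.alloc);    // measured requirement
        ggml_allocr_free(ga.alloc);                            // only now drop the measure allocator
        ga.buffer = ggml_backend_alloc_buffer(backend, size);  // real backend buffer
        ga.alloc  = ggml_allocr_new_from_buffer(ga.buffer);
    }
}
```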
This doesn't fix that issue, but while looking into this I also found other problems:

diff --git a/whisper.cpp b/whisper.cpp
index eb69f96..a786593 100644
--- a/whisper.cpp
+++ b/whisper.cpp
@@ -636,12 +636,11 @@ static void whisper_allocr_graph_init(struct whisper_allocr & allocr, ggml_backe
auto & meta = allocr.meta;
auto & buffer = allocr.buffer;
- const int tensor_alignment = ggml_backend_get_alignment(backend);
- alloc = ggml_allocr_new_measure(tensor_alignment);
+ alloc = ggml_allocr_new_measure_from_backend(backend);
meta.resize(ggml_tensor_overhead()*WHISPER_MAX_NODES + ggml_graph_overhead());
- const size_t alloc_size = ggml_allocr_alloc_graph(alloc, get_graph()) + tensor_alignment;
+ const size_t alloc_size = ggml_allocr_alloc_graph(alloc, get_graph());
ggml_allocr_free(alloc);
@@ -1284,7 +1283,7 @@ static bool whisper_model_load(struct whisper_model_loader * loader, whisper_con
// initialize the backends
#ifdef GGML_USE_CUBLAS
- if (wctx.params.use_gpu > 0) {
+ if (wctx.params.use_gpu) {
WHISPER_LOG_INFO("%s: using CUDA backend\n", __func__);
backend_gpu = ggml_backend_cuda_init();
if (!backend_gpu) {
Ok, I'll try to apply this. If it is a quick fix, feel free to apply it here since I don't have a Windows machine to test with. I also realized another issue - the backend is currently shared, so I plan to create a new backend instance for each new whisper_state.
Yes, that should work. I also realized that this would be an issue in llama.cpp when creating multiple contexts.
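For anyone following along, here is a minimal sketch of that per-state backend idea (hypothetical helper names; the ggml_backend_* calls are the ones visible elsewhere in this thread):

```cpp
#include "ggml-backend.h"
#ifdef GGML_USE_CUBLAS
#include "ggml-cuda.h"
#endif

// Sketch only (not the actual patch): each whisper_state gets its own backend
// instance instead of sharing the context's backend, and frees it on destroy.
static ggml_backend_t state_backend_init(bool use_gpu) {
    ggml_backend_t backend = nullptr;
#ifdef GGML_USE_CUBLAS
    if (use_gpu) {
        backend = ggml_backend_cuda_init();   // one CUDA backend per state
    }
#endif
    if (!backend) {
        backend = ggml_backend_cpu_init();    // CPU fallback
    }
    return backend;
}

static void state_backend_free(ggml_backend_t & backend) {
    if (backend) {
        ggml_backend_free(backend);           // cf. "free backend instances in whisper_state"
        backend = nullptr;
    }
}
```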
This should fix the issue with MSVC:

diff --git a/whisper.cpp b/whisper.cpp
index d16492c..471d9a8 100644
--- a/whisper.cpp
+++ b/whisper.cpp
@@ -642,7 +642,7 @@ struct whisper_allocr {
};
static size_t whisper_allocr_size(struct whisper_allocr & allocr) {
- return allocr.meta.size() + ggml_backend_buffer_get_size(allocr.buffer);
+ return allocr.meta.size() + ggml_allocr_max_size(allocr.alloc);
}
// measure the memory usage of a graph and prepare the allocr's internal data buffer
@@ -655,12 +655,19 @@ static void whisper_allocr_graph_init(struct whisper_allocr & allocr, ggml_backe
meta.resize(ggml_tensor_overhead()*WHISPER_MAX_NODES + ggml_graph_overhead());
- const size_t alloc_size = ggml_allocr_alloc_graph(alloc, get_graph());
+ ggml_allocr_alloc_graph(alloc, get_graph());
+}
+
+static void whisper_allocr_graph_realloc(struct whisper_allocr & allocr, ggml_backend_t backend) {
+ auto & alloc = allocr.alloc;
+ auto & buffer = allocr.buffer;
+
+ size_t size = ggml_allocr_max_size(alloc);
ggml_allocr_free(alloc);
- buffer = ggml_backend_alloc_buffer(backend, alloc_size);
- alloc = ggml_allocr_new_from_buffer(buffer);
+ buffer = ggml_backend_alloc_buffer(backend, size);
+ alloc = ggml_allocr_new_from_buffer(buffer);
}
static void whisper_allocr_free(struct whisper_allocr & allocr) {
@@ -2915,6 +2922,11 @@ struct whisper_state * whisper_init_state(whisper_context * ctx) {
WHISPER_LOG_INFO("%s: compute buffer (decode) = %7.2f MB\n", __func__, whisper_allocr_size(state->alloc_decode) / 1024.0 / 1024.0);
}
+ whisper_allocr_graph_realloc(state->alloc_conv, ctx->backend);
+ whisper_allocr_graph_realloc(state->alloc_encode, ctx->backend);
+ whisper_allocr_graph_realloc(state->alloc_cross, ctx->backend);
+ whisper_allocr_graph_realloc(state->alloc_decode, ctx->backend);
+
state->rng = std::mt19937(0);
return state; Native windows bench:
|
Thanks. The backend fix seems to work for the CPU, but it breaks with Metal because each backend (i.e. …)
* whisper : try to fix the parallel whisper_state functionality
* whisper : fix multi-state Metal
* whisper : free backend instances in whisper_state
* whisper : migrate to ggml-backend
* whisper : fix logit reading
* whisper : fix tensor allocation during load
* whisper : fix beam-search with CUDA
* whisper : free backends + fix compile warning
* whisper : print when CUDA is enabled
* whisper : fix CoreML
* make : clean-up
* talk : fix compile warning
* whisper : support ggml_conv with CUDA and Metal (ggerganov#1473)
* ggml : add CUDA support for ggml_conv
* whisper : remove ggml_repeat for conv bias + single backend
* cuda : fix im2col kernel
* metal : add im2col support + mul mat-vec f16 x f16
* bench-all : add q4 models
* whisper : clean-up
* quantize-all : fix
* ggml : im2col opts
* whisper : avoid whisper_model_data wrapper
* whisper : add note that ggml_mul_mat_pad does not work with CUDA
* whisper : factor out graph compute in common function
* whisper : fixes
* whisper : fix UB with measure buffers
* whisper : try to fix the parallel whisper_state functionality (ggerganov#1479)
* whisper : try to fix the parallel whisper_state functionality
* whisper : fix multi-state Metal
* whisper : free backend instances in whisper_state
In my testing on an M1 Pro it's slower on the GPU compared to 8/10 CPU threads. Does this make any sense?
Build with:
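The exact commands from the PR description are not reproduced here; as a reminder of what a typical build looked like around this release (flag name per the whisper.cpp README of that era, so treat it as an assumption rather than a quote from the PR):

```sh
# CUDA (cuBLAS) build - flag name used by whisper.cpp around this release (assumption)
make clean
WHISPER_CUBLAS=1 make -j

# On Apple Silicon, the default build is assumed to pick up Metal support
make clean
make -j
```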
Also, the convolution ops are now offloaded both with CUDA and Metal, resulting in a speed-up in the Encoder (#1473)
Credits and huge thanks to @FSSRepo: ggerganov/ggml#564
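For readers curious what the im2col work does conceptually: a 1-D convolution is lowered to an im2col transform followed by a plain matrix multiplication, which is exactly the operation the CUDA/Metal backends are already fast at. A tiny illustrative version (plain C, not the actual ggml/CUDA/Metal kernel):

```c
// Illustrative 1-D im2col (not the ggml kernel): unroll each sliding window of
// the input into a column so that the convolution becomes one matrix product.
// src layout: [c_in][len] row-major; dst layout: [c_in*k][out_len] row-major,
// where out_len = (len + 2*pad - k)/stride + 1.
void im2col_1d(const float *src, float *dst,
               int c_in, int len, int k, int stride, int pad, int out_len) {
    for (int c = 0; c < c_in; ++c) {
        for (int kk = 0; kk < k; ++kk) {
            for (int o = 0; o < out_len; ++o) {
                const int idx = o*stride + kk - pad;   // position in the input signal
                const float v = (idx >= 0 && idx < len) ? src[c*len + idx] : 0.0f;
                dst[(c*k + kk)*out_len + o] = v;       // one row per (channel, tap)
            }
        }
    }
}
// With weights W of shape [c_out][c_in*k], the convolution output is then the
// matrix product W x dst, of shape [c_out][out_len].
```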
If you want to have some fun, try this:
Bench on V100 and M2 Ultra