whisper : add full CUDA and Metal offloading #1472
Conversation
* ggml : add CUDA support for ggml_conv
* whisper : remove ggml_repeat for conv bias + single backend
* cuda : fix im2col kernel
* metal : add im2col support + mul mat-vec f16 x f16
* bench-all : add q4 models
Looking for feedback both with CUDA and Metal - the performance should be significantly improved.
I am not very familiar with whisper.cpp, but these are my results using this PR vs. master: (benchmark tables omitted)
Yup, the mul mat benchmark is not very relevant to this PR because it still copies the data to the GPU, performs the multiplication and copies the data back to the CPU. The changes here should not affect the performance of this test. The models were downloaded with:

./models/download-ggml-model.sh tiny
./models/download-ggml-model.sh base
./models/download-ggml-model.sh small
./models/download-ggml-model.sh medium
./models/download-ggml-model.sh large
Just tried out this PR on my RTX3060 mobile and it's incredibly fast. A 27-minute audio file was transcribed in just 25 seconds. Plus, the transcription quality is not degraded.
Under native Windows I get an out of memory error in ggml-alloc very rarely. This is probably related to some allocation returning an unaligned memory address; I will look more into it tomorrow.

whisper_init_from_file_with_params_no_state: loading model from './models/ggml-tiny-q5_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 8
whisper_model_load: qntvr = 2
whisper_model_load: type = 1 (tiny)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6
whisper_model_load: using CUDA backend
whisper_model_load: CUDA buffer size = 34.59 MB
whisper_model_load: model size = 34.53 MB
whisper_init_state: kv self size = 2.62 MB
whisper_init_state: kv cross size = 8.79 MB
whisper_init_state: compute buffer (conv) = 11.54 MB
whisper_init_state: compute buffer (encode) = 59.65 MB
whisper_init_state: compute buffer (cross) = 3.76 MB
whisper_init_state: compute buffer (decode) = 18.92 MB
system_info: n_threads = 1 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 |
ggml_tallocr_alloc: not enough space in the buffer (needed 54000000, largest block available 51696128)
GGML_ASSERT: C:\CODE\whisper.cpp\ggml-alloc.c:116: !"not enough space in the buffer"
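For context on why the suspected alignment issue can surface as an out-of-space assert: ggml-alloc rounds each allocation up to the backend's tensor alignment, so a buffer sized from a measurement that used a different (or no) alignment can come up a few bytes short. A minimal illustration of that rounding (illustrative only, not the ggml-alloc source):

```c
#include <stddef.h>

// Illustrative only - how an allocator rounds a size up to its alignment.
// If the measured total is computed with a smaller alignment than the backend
// actually uses, the real allocations can overflow the reserved buffer and
// trigger "not enough space in the buffer".
static size_t aligned_size(size_t size, size_t alignment) {
    // alignment is assumed to be a power of two
    return (size + alignment - 1) & ~(alignment - 1);
}
```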
I see a notable improvement in encoder times from this PR - nice work :) I also noticed that with this PR, performance is pretty flat from 4 through 10 threads. With main @ ec7a6f0 there is a bit of improvement for me up through 8 threads, but even at 8 threads it's slower than this PR.

main @ ec7a6f0 vs ggml-backend-no-sched @ 3bfc43e: (benchmark tables omitted)
// TODO: check if other platforms can benefit from this optimization
// TODO: CUDA is currently broken - seems ggml_mul_mat does not handle views correctly
#if defined(GGML_USE_METAL)
#define ggml_mul_mat ggml_mul_mat_pad
#endif
The ggml_mul_mat_pad trick is very useful for the Metal kernels and provides significant improvement for the encoder. Currently, this trick does not work with CUDA because we seem to have issues in some cases when the src are non-contiguous views. At the very least, ggml_cuda_mul_mat_mat_batched_cublas does not handle all cases correctly when src1 is a non-contiguous view, because ggml_get_to_fp16_cuda() assumes data without "holes" (i.e. contiguously-permuted), but there might be other issues as well. We should keep this in mind and fix or assert properly.
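For readers unfamiliar with the trick: the idea is to split the shared (inner) dimension of both operands into a kernel-friendly part that is a multiple of the padding and a small remainder, multiply each part separately, and add the results. A rough sketch of that shape (modeled on the whisper.cpp helper of the same name, using the ggml view/add calls as I understand them - not guaranteed to match the exact upstream code):

```cpp
#include "ggml.h"

// Rough sketch of the padding trick: split dim 0 (the shared dimension of the
// mat-mul) into a multiple of `pad` plus a remainder, then sum the two partial
// products. Not the exact upstream helper.
static struct ggml_tensor * mul_mat_pad_sketch(struct ggml_context * ctx,
        struct ggml_tensor * x, struct ggml_tensor * y, int pad /* e.g. 32 */) {
    if (x->ne[0] % pad == 0 || x->ne[0] / pad < 2) {
        return ggml_mul_mat(ctx, x, y); // nothing to gain from padding
    }

    struct ggml_tensor * x_0 = ggml_view_3d(ctx, x, (x->ne[0]/pad)*pad, x->ne[1], x->ne[2],
                                            x->nb[1], x->nb[2], 0);
    struct ggml_tensor * x_1 = ggml_view_3d(ctx, x,  x->ne[0]%pad,      x->ne[1], x->ne[2],
                                            x->nb[1], x->nb[2], x_0->ne[0]*x_0->nb[0]);

    struct ggml_tensor * y_0 = ggml_view_3d(ctx, y, (y->ne[0]/pad)*pad, y->ne[1], y->ne[2],
                                            y->nb[1], y->nb[2], 0);
    struct ggml_tensor * y_1 = ggml_view_3d(ctx, y,  y->ne[0]%pad,      y->ne[1], y->ne[2],
                                            y->nb[1], y->nb[2], y_0->ne[0]*y_0->nb[0]);

    // the resulting views are non-contiguous - this is exactly what trips up
    // the CUDA batched cuBLAS path mentioned above
    return ggml_add(ctx, ggml_mul_mat(ctx, x_0, y_0),
                         ggml_mul_mat(ctx, x_1, y_1));
}
```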
Figured I'd also include a comparison of this PR to main in benchmarks with 1-4 threads. Encoder times with ggml-backend-no-sched @ 0867e69 are still flat. I won't pretend to understand all the code, but this does feel like "no scheduling" to me :)

ggml-backend-no-sched @ 0867e69 vs main @ ec7a6f0: (benchmark tables omitted)
Nice plot! Yeah, on …
Let me know if I can help debug this somehow. I haven't been able to reproduce it with Linux and macOS yet.
The issue is that the encoder graph uses tensors from a previous graph. During measure, these tensors are allocated in a measure buffer which has already been freed (when the measure allocator was freed), so their addresses are sometimes no longer valid. A workaround would be to keep the same measure allocators alive until all the graphs have been measured, and only then reallocate the buffers and allocators with the correct sizes. I suppose that whisper.cpp is using freed tensors, so it's not unreasonable to consider this "undefined behavior", but practically this is not a good limitation to have, so I want to fix this in ggml-alloc/ggml-backend by allowing the same buffers to be reallocated, but that's not going to be a quick fix.
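A compact sketch of that workaround, using only the ggml-alloc calls that appear in the patches later in this thread (treat it as a pattern sketch with a hypothetical graph_alloc struct, not the actual change):

```cpp
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include <functional>
#include <vector>

// Hypothetical stand-in for whisper.cpp's per-graph allocator state.
struct graph_alloc {
    ggml_allocr                   * alloc  = nullptr;
    ggml_backend_buffer_t           buffer = nullptr;
    std::function<ggml_cgraph *()>  build_graph;   // builds the conv/encode/cross/decode graph
};

// Phase 1: measure every graph while all measure allocators are still alive, so
// tensors reused across graphs keep valid (measure) addresses.
// Phase 2: only then replace each measure allocator with a real backend buffer.
static void measure_then_realloc(std::vector<graph_alloc> & gas, ggml_backend_t backend) {
    for (auto & ga : gas) {
        ga.alloc = ggml_allocr_new_measure_from_backend(backend);
        ggml_allocr_alloc_graph(ga.alloc, ga.build_graph());   // measure only
    }
    for (auto & ga : gas) {
        const size_t size = ggml_allocr_max_size(ga.alloc);    // measured requirement
        ggml_allocr_free(ga.alloc);                            // only now drop the measure allocator
        ga.buffer = ggml_backend_alloc_buffer(backend, size);  // real backend buffer
        ga.alloc  = ggml_allocr_new_from_buffer(ga.buffer);
    }
}
```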
This doesn't fix that issue, but while looking into this I also found other problems:

diff --git a/whisper.cpp b/whisper.cpp
index eb69f96..a786593 100644
--- a/whisper.cpp
+++ b/whisper.cpp
@@ -636,12 +636,11 @@ static void whisper_allocr_graph_init(struct whisper_allocr & allocr, ggml_backe
auto & meta = allocr.meta;
auto & buffer = allocr.buffer;
- const int tensor_alignment = ggml_backend_get_alignment(backend);
- alloc = ggml_allocr_new_measure(tensor_alignment);
+ alloc = ggml_allocr_new_measure_from_backend(backend);
meta.resize(ggml_tensor_overhead()*WHISPER_MAX_NODES + ggml_graph_overhead());
- const size_t alloc_size = ggml_allocr_alloc_graph(alloc, get_graph()) + tensor_alignment;
+ const size_t alloc_size = ggml_allocr_alloc_graph(alloc, get_graph());
ggml_allocr_free(alloc);
@@ -1284,7 +1283,7 @@ static bool whisper_model_load(struct whisper_model_loader * loader, whisper_con
// initialize the backends
#ifdef GGML_USE_CUBLAS
- if (wctx.params.use_gpu > 0) {
+ if (wctx.params.use_gpu) {
WHISPER_LOG_INFO("%s: using CUDA backend\n", __func__);
backend_gpu = ggml_backend_cuda_init();
if (!backend_gpu) {
Ok, I'll try to apply this. If it is a quick fix, feel free to apply it here since I don't have a Windows machine to test with. I also realized another issue - the backend is currently shared, so I plan to create a new backend instance for each new whisper_state.
Yes, that should work. I also realized that this would be an issue in llama.cpp when creating multiple contexts.
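For anyone following along, here is a minimal sketch of that per-state backend idea (hypothetical helper names; the ggml_backend_* calls are the ones visible elsewhere in this thread):

```cpp
#include "ggml-backend.h"
#ifdef GGML_USE_CUBLAS
#include "ggml-cuda.h"
#endif

// Sketch only (not the actual patch): each whisper_state gets its own backend
// instance instead of sharing the context's backend, and frees it on destroy.
static ggml_backend_t state_backend_init(bool use_gpu) {
    ggml_backend_t backend = nullptr;
#ifdef GGML_USE_CUBLAS
    if (use_gpu) {
        backend = ggml_backend_cuda_init();   // one CUDA backend per state
    }
#endif
    if (!backend) {
        backend = ggml_backend_cpu_init();    // CPU fallback
    }
    return backend;
}

static void state_backend_free(ggml_backend_t & backend) {
    if (backend) {
        ggml_backend_free(backend);           // cf. "free backend instances in whisper_state"
        backend = nullptr;
    }
}
```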
This should fix the issue with MSVC:

diff --git a/whisper.cpp b/whisper.cpp
index d16492c..471d9a8 100644
--- a/whisper.cpp
+++ b/whisper.cpp
@@ -642,7 +642,7 @@ struct whisper_allocr {
};
static size_t whisper_allocr_size(struct whisper_allocr & allocr) {
- return allocr.meta.size() + ggml_backend_buffer_get_size(allocr.buffer);
+ return allocr.meta.size() + ggml_allocr_max_size(allocr.alloc);
}
// measure the memory usage of a graph and prepare the allocr's internal data buffer
@@ -655,12 +655,19 @@ static void whisper_allocr_graph_init(struct whisper_allocr & allocr, ggml_backe
meta.resize(ggml_tensor_overhead()*WHISPER_MAX_NODES + ggml_graph_overhead());
- const size_t alloc_size = ggml_allocr_alloc_graph(alloc, get_graph());
+ ggml_allocr_alloc_graph(alloc, get_graph());
+}
+
+static void whisper_allocr_graph_realloc(struct whisper_allocr & allocr, ggml_backend_t backend) {
+ auto & alloc = allocr.alloc;
+ auto & buffer = allocr.buffer;
+
+ size_t size = ggml_allocr_max_size(alloc);
ggml_allocr_free(alloc);
- buffer = ggml_backend_alloc_buffer(backend, alloc_size);
- alloc = ggml_allocr_new_from_buffer(buffer);
+ buffer = ggml_backend_alloc_buffer(backend, size);
+ alloc = ggml_allocr_new_from_buffer(buffer);
}
static void whisper_allocr_free(struct whisper_allocr & allocr) {
@@ -2915,6 +2922,11 @@ struct whisper_state * whisper_init_state(whisper_context * ctx) {
WHISPER_LOG_INFO("%s: compute buffer (decode) = %7.2f MB\n", __func__, whisper_allocr_size(state->alloc_decode) / 1024.0 / 1024.0);
}
+ whisper_allocr_graph_realloc(state->alloc_conv, ctx->backend);
+ whisper_allocr_graph_realloc(state->alloc_encode, ctx->backend);
+ whisper_allocr_graph_realloc(state->alloc_cross, ctx->backend);
+ whisper_allocr_graph_realloc(state->alloc_decode, ctx->backend);
+
state->rng = std::mt19937(0);
return state; Native windows bench:
|
Thanks. The backend fix seems to work for the CPU, but it breaks with Metal because each backend (i.e. …)
* whisper : try to fix the parallel whisper_state functionality
* whisper : fix multi-state Metal
* whisper : free backend instances in whisper_state
* whisper : migrate to ggml-backend
* whisper : fix logit reading
* whisper : fix tensor allocation during load
* whisper : fix beam-search with CUDA
* whisper : free backends + fix compile warning
* whisper : print when CUDA is enabled
* whisper : fix CoreML
* make : clean-up
* talk : fix compile warning
* whisper : support ggml_conv with CUDA and Metal (ggerganov#1473)
* ggml : add CUDA support for ggml_conv
* whisper : remove ggml_repeat for conv bias + single backend
* cuda : fix im2col kernel
* metal : add im2col support + mul mat-vec f16 x f16
* bench-all : add q4 models
* whisper : clean-up
* quantize-all : fix
* ggml : im2col opts
* whisper : avoid whisper_model_data wrapper
* whisper : add note that ggml_mul_mat_pad does not work with CUDA
* whisper : factor out graph compute in common function
* whisper : fixes
* whisper : fix UB with measure buffers
* whisper : try to fix the parallel whisper_state functionality (ggerganov#1479)
* whisper : try to fix the parallel whisper_state functionality
* whisper : fix multi-state Metal
* whisper : free backend instances in whisper_state
In my testing on an M1 Pro it's slower on the GPU compared to 8/10 CPU threads. Does this make any sense?
Build with:
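The exact commands from the PR description are not reproduced here; as a reminder of what a typical build looked like around this release (flag name per the whisper.cpp README of that era, so treat it as an assumption rather than a quote from the PR):

```sh
# CUDA (cuBLAS) build - flag name used by whisper.cpp around this release (assumption)
make clean
WHISPER_CUBLAS=1 make -j

# On Apple Silicon, the default build is assumed to pick up Metal support
make clean
make -j
```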
Also, the convolution ops are now offloaded both with CUDA and Metal, resulting in a speed-up in the Encoder (#1473)
Credits and huge thanks to @FSSRepo: ggerganov/ggml#564
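For readers curious what the im2col work does conceptually: a 1-D convolution is lowered to an im2col transform followed by a plain matrix multiplication, which is exactly the operation the CUDA/Metal backends are already fast at. A tiny illustrative version (plain C, not the actual ggml/CUDA/Metal kernel):

```c
// Illustrative 1-D im2col (not the ggml kernel): unroll each sliding window of
// the input into a column so that the convolution becomes one matrix product.
// src layout: [c_in][len] row-major; dst layout: [c_in*k][out_len] row-major,
// where out_len = (len + 2*pad - k)/stride + 1.
void im2col_1d(const float *src, float *dst,
               int c_in, int len, int k, int stride, int pad, int out_len) {
    for (int c = 0; c < c_in; ++c) {
        for (int kk = 0; kk < k; ++kk) {
            for (int o = 0; o < out_len; ++o) {
                const int idx = o*stride + kk - pad;   // position in the input signal
                const float v = (idx >= 0 && idx < len) ? src[c*len + idx] : 0.0f;
                dst[(c*k + kk)*out_len + o] = v;       // one row per (channel, tap)
            }
        }
    }
}
// With weights W of shape [c_out][c_in*k], the convolution output is then the
// matrix product W x dst, of shape [c_out][out_len].
```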
If you want to have some fun, try this:
Bench on V100 and M2 Ultra