whisper : Metal and ggml-alloc support #1270
Conversation
whisper.cpp
state->alloc_encode      = ggml_allocr_new_measure(tensor_alignment);
state->alloc_encode_post = ggml_allocr_new_measure(tensor_alignment);
state->alloc_decode      = ggml_allocr_new_measure(tensor_alignment);
There is a chance that this will not work on some systems with limited virtual memory, such as iOS, because each measure allocator reserves a large amount of virtual memory. It would be safer to allocate only one measure allocator at a time; I think that should be possible here.
It's definitely not ideal that ggml-alloc has this limitation, I expect to improve this and remove the use of virtual memory entirely with the common backends interface implementation.
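A minimal sketch of what that could look like, assuming the existing ggml-alloc API (ggml_allocr_new_measure / ggml_allocr_alloc_graph / ggml_allocr_free / ggml_allocr_new); the graph builder and the state->buf_* members are hypothetical placeholders, not the actual whisper.cpp code:

// sketch: keep only one measure allocator (and thus one large virtual
// memory reservation) alive at any moment
const size_t tensor_alignment = 32;

// 1) measure the encoder graph
struct ggml_allocr * measure = ggml_allocr_new_measure(tensor_alignment);
const size_t size_encode = ggml_allocr_alloc_graph(measure, build_graph_encode());
ggml_allocr_free(measure); // releases the virtual memory reservation right away

// 2) allocate the real buffer and allocator for the encoder
state->buf_encode   = malloc(size_encode);
state->alloc_encode = ggml_allocr_new(state->buf_encode, size_encode, tensor_alignment);

// 3) repeat the same measure -> free -> allocate sequence for
//    alloc_encode_post and alloc_decode, one at a time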
I think I reorganized the allocators as proposed, but it seems some OSes still fail during the second new_measure - see linux/arm64 and linux/ppc64le in the CI: https://github.com/ggerganov/whisper.cpp/actions/runs/6146272809/job/16675319319
I am a bit confused by this - it seems that the call to mmap is crashing the process instead of returning an error, because otherwise we should see the failed assert GGML_ASSERT(!"failed to allocate virtual memory for measure buffer"). I imagine that this is related to QEMU - I'll try to reproduce it locally.
I tried the exact same commands that the CI uses to run the arm64 version with docker and QEMU, and it works on my computer. So whatever the issue is, it only seems to happen in the GitHub CI environment and I cannot reproduce it. Maybe it is hitting some memory usage limit.
$ sudo docker run --platform linux/arm64 --rm \
-v /home/diego/code/whisper.cpp:/workspace \
-w /workspace ubuntu:22.04 /bin/sh -c '
apt update
apt install -y build-essential cmake libsdl2-dev
cmake . -DWHISPER_SUPPORT_SDL2=ON -DCMAKE_BUILD_TYPE=Release
make
ctest -L gh --output-on-failure'
[...]
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Could NOT find Git (missing: GIT_EXECUTABLE)
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- ARM detected
-- Configuring done
-- Generating done
CMake Warning:
Manually-specified variables were not used by the project:
WHISPER_SUPPORT_SDL2
-- Build files have been written to: /workspace
[ 7%] Building C object CMakeFiles/whisper.dir/ggml.c.o
[ 15%] Building C object CMakeFiles/whisper.dir/ggml-alloc.c.o
[ 23%] Building CXX object CMakeFiles/whisper.dir/whisper.cpp.o
[ 30%] Linking CXX shared library libwhisper.so
[ 30%] Built target whisper
[ 38%] Building CXX object examples/CMakeFiles/common.dir/common.cpp.o
[ 46%] Building CXX object examples/CMakeFiles/common.dir/common-ggml.cpp.o
[ 53%] Linking CXX static library libcommon.a
[ 53%] Built target common
[ 61%] Building CXX object examples/main/CMakeFiles/main.dir/main.cpp.o
[ 69%] Linking CXX executable ../../bin/main
[ 69%] Built target main
[ 76%] Building CXX object examples/bench/CMakeFiles/bench.dir/bench.cpp.o
[ 84%] Linking CXX executable ../../bin/bench
[ 84%] Built target bench
[ 92%] Building CXX object examples/quantize/CMakeFiles/quantize.dir/quantize.cpp.o
[100%] Linking CXX executable ../../bin/quantize
[100%] Built target quantize
Test project /workspace
Start 1: test-main-tiny
1/2 Test #1: test-main-tiny ................... Passed 82.79 sec
Start 2: test-main-tiny.en
2/2 Test #2: test-main-tiny.en ................ Passed 83.46 sec
100% tests passed, 0 tests failed out of 2
Label Time Summary:
en = 83.46 sec*proc (1 test)
gh = 166.24 sec*proc (2 tests)
tiny = 166.24 sec*proc (2 tests)
A possible workaround could be reducing the amount of virtual memory allocated here (lines 345 to 346 in d3b2dd4):

// 1TB for 64-bit, 1GB for 32-bit
*size = sizeof(void *) == 4 ? 1ULL<<30 : 1ULL<<40;
Ok, thanks for looking into this. I'll now continue working on this branch and try to find a solution
Reducing the size to 128GB fixes the CI: b19888c
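For reference, the change presumably amounts to lowering the 64-bit constant along these lines (a sketch; the actual diff is in b19888c):

// 128GB for 64-bit, 1GB for 32-bit (1ULL<<37 bytes == 128 GiB)
*size = sizeof(void *) == 4 ? 1ULL<<30 : 1ULL<<37;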
static void * alloc_vmem(size_t size) {
#if defined(_WIN32)
    return VirtualAlloc(NULL, size, MEM_RESERVE, PAGE_NOACCESS);
#elif defined(_POSIX_MAPPED_FILES)
If the emscripten build doesn't work, it can be excluded from using mmap here and in free_vmem by checking if __EMSCRIPTEN__ is defined. I think this should do it:
-#elif defined(_POSIX_MAPPED_FILES)
+#elif defined(_POSIX_MAPPED_FILES) && !defined(__EMSCRIPTEN__)
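For completeness, here is a sketch (an assumption, not code from the PR) of how both functions could look with that exclusion plus a plain-malloc fallback for builds without virtual memory reservation:

#include <stdlib.h>
#if defined(_WIN32)
#include <windows.h>
#elif defined(_POSIX_MAPPED_FILES)
#include <sys/mman.h>
#endif

static void * alloc_vmem(size_t size) {
#if defined(_WIN32)
    return VirtualAlloc(NULL, size, MEM_RESERVE, PAGE_NOACCESS);
#elif defined(_POSIX_MAPPED_FILES) && !defined(__EMSCRIPTEN__)
    void * ptr = mmap(NULL, size, PROT_NONE, MAP_PRIVATE | MAP_ANON, -1, 0);
    return ptr == MAP_FAILED ? NULL : ptr;
#else
    // no cheap address-space reservation available (e.g. Emscripten):
    // fall back to a plain allocation
    return malloc(size);
#endif
}

static void free_vmem(void * base_addr, size_t size) {
#if defined(_WIN32)
    (void) size;
    VirtualFree(base_addr, 0, MEM_RELEASE);
#elif defined(_POSIX_MAPPED_FILES) && !defined(__EMSCRIPTEN__)
    munmap(base_addr, size);
#else
    (void) size;
    free(base_addr);
#endif
}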
Good news - the Emscripten build looks to be working without any adjustments needed
Force-pushed from 1b9b645 to 4d9acc6
I recently ran some performance tests on [...]

The ggml_nbytes implementation (lines 4304 to 4312 in 79a8805) reports the following for a tensor and its transpose:

# cur:
ne0 = 512,  ne1 = 1500, ne2 = 1, ne3 = 1
nb0 = 4,    nb1 = 2048, nb2 = 3072000, nb3 = 3072000
ggml_nbytes = 3072000

# ggml_transpose(ctx, cur):
ne0 = 1500, ne1 = 512,  ne2 = 1, ne3 = 1
nb0 = 2048, nb1 = 4,    nb2 = 3072000, nb3 = 3072000
ggml_nbytes = 3074044

@slaren tagging you to keep in mind
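For scale: the tensor holds 512·1500 f32 values, i.e. 512·1500·4 = 3,072,000 bytes, and the transpose is only a view over the same data. The transposed result is 2,044 bytes too large - exactly (512−1)·4 added on top of ne0·nb0 = 1500·2048 = 3,072,000 - which is presumably what a formula of the form ne0·nb0 + (ne1−1)·nb1 produces once nb0 is no longer the element stride (compare the blck_size > 1 branch further below).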
This one works, but it's kind of stupid to sort the elements each time.
Is there something better?
So, the goal of my implementation was to calculate the offset of the last element plus one. However, the implementation assumes that nb[0] == type_size, so it doesn't work with transposed tensors. This should fix it for blck_size == 1:
size_t nbytes = ggml_type_size(tensor->type);
for (int i = 0; i < GGML_MAX_DIMS; ++i) {
nbytes += (tensor->ne[i] - 1)*tensor->nb[i];
}
However, this will not work with quantized types. A possible solution could be to fall back to the previous implementation for blck_size > 1, but it would be nicer to have a single implementation.
size_t ggml_nbytes(const struct ggml_tensor * tensor) {
size_t nbytes;
size_t blck_size = ggml_blck_size(tensor->type);
if (blck_size == 1) {
nbytes = ggml_type_size(tensor->type);
for (int i = 0; i < GGML_MAX_DIMS; ++i) {
nbytes += (tensor->ne[i] - 1)*tensor->nb[i];
}
}
else {
nbytes = tensor->ne[0]*tensor->nb[0]/blck_size;
for (int i = 1; i < GGML_MAX_DIMS; ++i) {
nbytes += (tensor->ne[i] - 1)*tensor->nb[i];
}
}
return nbytes;
}
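A quick check against the transposed tensor from the earlier comment: the blck_size == 1 branch gives 4 + (1500−1)·2048 + (512−1)·4 = 3,072,000 bytes, matching the contiguous case (512·1500·4), so the over-count for transposed views goes away.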
Ok, so running Core ML on the CPU + GPU is indeed faster than running it on the ANE (see updated times in the OP). Also, there is some strange behavior where the first Core ML Encoder run after starting the process is comparable to the Metal version, but the next runs are about 2x faster. This happens only with CPU + GPU Core ML; it does not happen with ANE Core ML. I've updated the [...]

I suppose that Core ML does some extra optimizations on the first run. I.e. the first Core ML GPU run is similar in performance to my Metal implementation, but then it gets a 2x lead. It would be interesting to see whether whatever optimization occurs can be done manually in the Metal code. That would be a great benefit, with a potentially dramatic improvement in the Decoder, where we cannot use Core ML but we can use Metal.

Edit: Here are some specific numbers with the Medium model:
The Metal version also gets a little boost after the first run - probably some caches are warmed up, etc. - but it is not as significant as for Core ML (GPU). Also note that the ANE version does not get such a speed-up. There is also the explanation that Core ML simply does some initialization work on the first run which inflates the number, and hence there is no "optimization" going on. Anyway, any insight will be appreciated.
The improvement over ANE is insanely impressive. I wonder if it will enable me to actually run the larger models on the M1 Pro with CoreML.
I was very curious how this would perform on iOS, so I did some testing on it:
In mybigday/whisper.rn#123 I use commit f408c64; the difference is that I used ARC in ggml-metal.m and removed some release code. The app archive will be faster than the Xcode build, but currently I use the Xcode build to test easily.

On the iOS devices the number of GPU cores is relatively small, and I think that's why there is a gap on the iPhone. I don't know why CoreML is not working on my M1 iPad - I didn't enable it in Production before, so I hadn't noticed this until now.

What I love about the Metal backend is that it uses less disk space & memory, which means we can use larger models in real-world scenarios. (Maybe we could consider supporting CoreML ops for GGML instead of loading an mlmodelc in the future.)

UPDATE: Load & Full w/ Core ML (CPU+GPU) [ms]
There is also a situation where the second run is faster, but the performance is still not better than ANE on the iPhone.
Force-pushed from b38f8a4 to 2b4160a
@jhen0409 Thank you very much for the results! I've further improved the Metal inference and now it is as fast as Core ML (GPU). I still have one kernel that is not optimized (the convolution at the start of the Encoder), which explains the remaining difference between Metal and Core ML (GPU).

I'm now satisfied with the results, and indeed Core ML does not do any extra optimizations - it is just slower on the first run because it probably initializes some internal things.

The optimization that I did is to pad matrix multiplications whose row dimension is not a multiple of 32. It's a very simple code change that can provide significant benefits for Metal. It should be straightforward to apply to [...] (lines 138 to 171 in 2b4160a).
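The referenced lines are not reproduced here; one way to realize the idea (an illustrative sketch, not the exact code at lines 138 to 171 in 2b4160a) is to split the shared dimension into a multiple-of-32 part and a remainder, so that the bulk of the work hits the aligned Metal kernel:

// sketch: handle a mat-mul whose shared dimension (ne[0]) is not a multiple of `pad`
static struct ggml_tensor * mul_mat_pad(struct ggml_context * ctx,
                                        struct ggml_tensor * x,
                                        struct ggml_tensor * y,
                                        int pad /* e.g. 32 */) {
    if (x->ne[0] % pad == 0) {
        return ggml_mul_mat(ctx, x, y); // already aligned - nothing to do
    }

    const int64_t n0 = (x->ne[0]/pad)*pad; // aligned part of the shared dimension
    const int64_t n1 =  x->ne[0] - n0;     // remainder

    struct ggml_tensor * x_0 = ggml_view_3d(ctx, x, n0, x->ne[1], x->ne[2], x->nb[1], x->nb[2], 0);
    struct ggml_tensor * x_1 = ggml_view_3d(ctx, x, n1, x->ne[1], x->ne[2], x->nb[1], x->nb[2], n0*x->nb[0]);

    struct ggml_tensor * y_0 = ggml_view_3d(ctx, y, n0, y->ne[1], y->ne[2], y->nb[1], y->nb[2], 0);
    struct ggml_tensor * y_1 = ggml_view_3d(ctx, y, n1, y->ne[1], y->ne[2], y->nb[1], y->nb[2], n0*y->nb[0]);

    // the sum over the shared dimension decomposes:
    // mul_mat(x, y) == mul_mat(x_0, y_0) + mul_mat(x_1, y_1)
    return ggml_add(ctx, ggml_mul_mat(ctx, x_0, y_0),
                         ggml_mul_mat(ctx, x_1, y_1));
}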
On an M1 Pro with 32 GB RAM, against a 40s recording: the improvement on medium size and up is staggering. Compared to master we are seeing a 2x speedup; compared to yesterday, ~1 second.
@ggerganov This is amazing! Thank you so, so much! On an M2-Ultra (76-core), I'm now seeing 25x realtime for the medium model (so ~2 minutes to process a ~50 minute long audio, including diarization). And, most importantly, really good transcription results so far (especially vs previous results).
These are the timings I got for ./samples/gb0.wav (127.4s) with what I think are the current master model versions, on an M2 MAX with 32 GB RAM, macOS 13.5.1, 8 threads (all timings in [ms]). Maybe I have not done something right, because the encoding and decoding times seem quite slow compared with those @ggerganov posted for an M2 Ultra on X (formerly Twitter):
* metal : init
* whisper : factor out graph builds
* whisper : allocate encoder and decoder using ggml-alloc
* whisper : ggml-alloc is now supported
* whisper : CoreML support ggml-alloc
* build : fix ggml-alloc
* ios : update submodule
* extra : update sync-ggml.sh script to also sync ggml-alloc
* ci : see if this is causing the crash
* whisper : refactor ggml-alloc init
* whisper.android : try to fix build
* whisper : initial Metal version
* ci : try to debug vmem issue
* metal : decoder works on GPU!
* metal : add multi-decoder support
* ggml : fix ggml_nbytes (probably temp solution)
* metal : run "cross" step on the GPU
* whisper : remove ggml_repeat in the encoder
* whisper : offload the Encoder to Metal
* ggml : use simpler ggml_bytes() implementation
* ggml-alloc : try to make CI happy by reducing vram to 128GB
* whisper : add whisper_allocr to wrap ggml_allocr
* whisper : factor out alloc init in a function
* cmake : update to support Metal build
* whisper : add <functional> header
* objc : fix build (no Metal yet)
* ios : add Metal support
* swiftui : fix build
* metal : speed-up KQ multiplication
* metal : sync latest llama.cpp kernels
* readme : add Metal info
* ios : update submodule
* coreml : add code to toggle Core ML config (CPU, ANE, GPU)
* bench : fix timings by running a pre-heat
* bench : start benching the decoder
* whisper : add ggml_mul_mat_pad
* bench : fix uninitialized vars
* whisper : add comment for disabling mul-mat padding
* whisper : add description of ggml_mul_mat_pad
* whisper : clean-up ggml_mul_mat_pad
* metal : remove the "concurrent" flag
* bench : variable n_past
* ios : update SPM package
This PR adds Metal support for full GPU inference on Apple Silicon.
It also optimizes memory usage.
metal-base-1.mp4
metal-medium-1.mp4
Usage:
make clean
make -j && ./main -m ./models/ggml-base.en.bin -f ./samples/gb0.wav

make clean
WHISPER_COREML=1 make -j && ./main -m ./models/ggml-base.en.bin -f ./samples/gb0.wav
TODOs
- ggml-alloc (when ready)
- ggml_nbytes()
- ggml-metal.metal in the bin folder