whisper : migrating to ggml-backend #1466
Conversation
whisper.cpp
```cpp
}

//if (!ggml_allocr_is_measure(alloc)) {
//    ggml_set_f32(KQscale, 1.0f/sqrt(float(n_state)/n_head));
//}

struct ggml_tensor * cur = ggml_view_tensor(ctx0, wstate.embd_conv);
```
I'm migrating whisper.cpp to use the new scheduler and backend API. Here, if I want to use the result of a previous graph (`wstate.embd_conv`) as an input, what would be the proper way to do it? If I allocated `wstate.embd_conv` in a dedicated input buffer, then do I need to change the following:

Lines 1592 to 1593 in 005b8cc

```cpp
wstate.embd_conv = cur;
```

to `ggml_cpy(ctx0, cur, wstate.embd_conv);`? If yes, is there a way to avoid the copy?
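To make the trade-off concrete, the two options can be sketched roughly as follows (a hedged sketch reusing only the calls mentioned above; `gf` and the elided allocation details are assumptions, not code from the PR):

```cpp
// Option A (current code): keep a pointer/view to the graph's output.
// The tensor lives wherever the graph's compute buffer was allocated.
wstate.embd_conv = cur;

// Option B: wstate.embd_conv is pre-allocated in a dedicated input buffer,
// so the result has to be copied into it as part of the graph
// (gf is assumed to be the graph being built).
ggml_build_forward_expand(gf, ggml_cpy(ctx0, cur, wstate.embd_conv));
// -> one extra tensor copy per evaluation, which is what the question
//    about avoiding the copy refers to.
```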
It is fine to use a pre-allocated tensor, I don't think that you need to make a copy here. Conceptually this should be the same as using a weight pre-allocated in another buffer, right? Maybe I am missing something.
To elaborate more: when using `ggml-backend` directly without `ggml-backend-sched`, all the tensors must be allocated in the same backend. If you wanted to evaluate a graph on one backend and then use its output as an input to a different graph evaluated on a different backend, you would need to use `ggml_backend_tensor_copy` to copy the tensor to the other backend. When using `ggml-backend-sched`, these copies should be done automatically.
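A minimal sketch of the manual path described above, assuming two backends and a tensor already allocated in a buffer of each (only `ggml_backend_tensor_copy` comes from the discussion; the variable names are invented):

```cpp
// Without ggml-backend-sched: graph 1 ran on backend A, graph 2 will run on
// backend B, so graph 1's output must be moved across explicitly.
// conv_out: result of graph 1, allocated in a backend-A buffer
// conv_in : tensor of the same shape/type, allocated in a backend-B buffer
ggml_backend_tensor_copy(conv_out, conv_in);   // explicit cross-backend copy

// With ggml-backend-sched, the scheduler detects that graph splits live on
// different backends and inserts the equivalent copies itself.
```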
> Conceptually this should be the same as using a weight pre-allocated in another buffer, right?

Yes. Currently it errors because `wstate.embd_conv` does not have a backend, but I think I see the problem. Will try to fix it.
Another question: here I am creating a separate scheduler for each separate graph:

Lines 2865 to 2899 in 005b8cc

```cpp
auto & backends = ctx->backends;

// conv allocator
{
    auto & sched = state->sched_conv;
    sched = ggml_backend_sched_new(backends.data(), backends.size());
    ggml_backend_sched_init_measure(sched, whisper_build_graph_conv(*ctx, *state, 0));
    WHISPER_LOG_INFO("%s: compute buffer (conv)   = %7.2f MB\n", __func__, whisper_sched_size(sched, backends) / 1024.0 / 1024.0);
}

// encoder allocator
if (!whisper_encode_external(*state)) {
    auto & sched = state->sched_encode;
    sched = ggml_backend_sched_new(backends.data(), backends.size());
    ggml_backend_sched_init_measure(sched, whisper_build_graph_encoder(*ctx, *state));
    WHISPER_LOG_INFO("%s: compute buffer (encode) = %7.2f MB\n", __func__, whisper_sched_size(sched, backends) / 1024.0 / 1024.0);
}

// cross allocator
{
    auto & sched = state->sched_cross;
    sched = ggml_backend_sched_new(backends.data(), backends.size());
    ggml_backend_sched_init_measure(sched, whisper_build_graph_cross(*ctx, *state));
    WHISPER_LOG_INFO("%s: compute buffer (encode) = %7.2f MB\n", __func__, whisper_sched_size(sched, backends) / 1024.0 / 1024.0);
}
```
Is there a way to do this with a single scheduler?
No, as it is currently, each graph must use a different scheduler. Maybe this is something that we could look into adding in the future, but currently all the allocations happen during the call to `ggml_backend_sched_init_measure`, and there is no way to add other graphs.
The messages such as
There will be no support for partial offloading, but I thought it might be useful to exercise the new functionality. I think the
This version triggers a sanitizer error from time to time during scheduler initialization:

```
$ make -j && ./bin/main -m ../models/ggml-base.en.bin -f ../samples/jfk.wav -ng
[ 43%] Built target whisper
[ 56%] Built target bench
[ 75%] Built target common
[100%] Built target main
[100%] Built target quantize
whisper_init_from_file_with_params_no_state: loading model from '../models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs = 99
whisper_model_load: CPU buffer size = 140.66 MB
whisper_model_load: model size = 140.54 MB
whisper_init_state: kv self size = 5.25 MB
whisper_init_state: kv cross size = 17.58 MB
whisper_init_state: backend_in = CPU (7105540 bytes)
ggml_backend_sched size: 1468 KB
whisper_init_state: compute buffer (conv) = 16.02 MB
ggml_backend_sched size: 1468 KB
whisper_init_state: compute buffer (encode) = 80.38 MB
ggml_backend_sched size: 1468 KB
whisper_init_state: compute buffer (encode) = 2.93 MB
ggml_backend_sched size: 1468 KB
/Users/ggerganov/development/github/whisper.cpp/ggml-backend.c:521:25: runtime error: index 12289 out of bounds for type 'char[12288][128]'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /Users/ggerganov/development/github/whisper.cpp/ggml-backend.c:521:25 in
whisper_init_state: compute buffer (encode) = 23.13 MB
system_info: n_threads = 4 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 |
main: processing '../samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
```

Will leave it for now and try without the scheduler in a separate branch.
The out of bounds access is in the
WIP