
whisper : migrating to ggml-backend #1466

Closed · wants to merge 2 commits
Conversation

ggerganov (Owner):

WIP

whisper.cpp Outdated
}
//if (!ggml_allocr_is_measure(alloc)) {
// ggml_set_f32(KQscale, 1.0f/sqrt(float(n_state)/n_head));
//}

struct ggml_tensor * cur = ggml_view_tensor(ctx0, wstate.embd_conv);
ggerganov (Owner, Author):
@slaren

I'm migrating whisper.cpp to use the new scheduler and backend API.

Here, if I want to use the result of a previous graph (wstate.embd_conv) as an input, what would be the proper way to do it? If I allocate wstate.embd_conv in a dedicated input buffer, do I need to change the following:

whisper.cpp/whisper.cpp

Lines 1592 to 1593 in 005b8cc

wstate.embd_conv = cur;

to ggml_cpy(ctx0, cur, wstate.embd_conv);? If yes, is there a way to avoid the copy?

slaren (Collaborator):
It is fine to use a pre-allocated tensor; I don't think you need to make a copy here. Conceptually this should be the same as using a weight pre-allocated in another buffer, right? Maybe I am missing something.

slaren (Collaborator) — Nov 9, 2023:

To elaborate more: when using ggml-backend directly without ggml-backend-sched, all the tensors must be allocated in the same backend. If you wanted to evaluate a graph in a backend, and then use the output of the graph as an input of a different graph evaluated on a different backend, you would need to use ggml_backend_tensor_copy to copy the tensor to the other backend. When using ggml-backend-sched, these copies should be done automatically.
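The rule above can be sketched with a toy model in plain C. This is NOT the real ggml API — `toy_tensor_copy` is a hypothetical stand-in for ggml_backend_tensor_copy, and the structs are simplified — it only illustrates the ownership rule slaren describes:

```c
#include <assert.h>
#include <string.h>

/* Toy model (not the real ggml API): a tensor's data lives in one
 * backend's buffer. To consume it from a graph evaluated on a
 * different backend, it must first be copied there (explicitly via
 * ggml_backend_tensor_copy when using ggml-backend directly, or
 * automatically when using ggml-backend-sched). */
enum toy_backend { TOY_CPU, TOY_GPU };

struct toy_tensor {
    enum toy_backend backend;  /* where the data is allocated */
    float            data[4];
};

/* stand-in for ggml_backend_tensor_copy: moves data across backends */
static void toy_tensor_copy(const struct toy_tensor * src, struct toy_tensor * dst) {
    memcpy(dst->data, src->data, sizeof(src->data));
}

/* a graph input is usable only if it lives in the graph's backend */
static int toy_input_valid(const struct toy_tensor * t, enum toy_backend graph_backend) {
    return t->backend == graph_backend;
}
```

With a single backend — the wstate.embd_conv situation discussed above — the output already lives in the right buffer and no copy is needed, which matches the earlier answer.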

ggerganov (Owner, Author):

> Conceptually this should be the same as using a weight pre-allocated in another buffer, right?

Yes. Currently it errors that wstate.embd_conv does not have a backend, but I think I see the problem. Will try to fix it.

Another question: here I am creating a separate scheduler for each separate graph:

whisper.cpp/whisper.cpp

Lines 2865 to 2899 in 005b8cc

auto & backends = ctx->backends;

// conv allocator
{
    auto & sched = state->sched_conv;
    sched = ggml_backend_sched_new(backends.data(), backends.size());
    ggml_backend_sched_init_measure(sched, whisper_build_graph_conv(*ctx, *state, 0));
    WHISPER_LOG_INFO("%s: compute buffer (conv)   = %7.2f MB\n", __func__, whisper_sched_size(sched, backends) / 1024.0 / 1024.0);
}

// encoder allocator
if (!whisper_encode_external(*state)) {
    auto & sched = state->sched_encode;
    sched = ggml_backend_sched_new(backends.data(), backends.size());
    ggml_backend_sched_init_measure(sched, whisper_build_graph_encoder(*ctx, *state));
    WHISPER_LOG_INFO("%s: compute buffer (encode) = %7.2f MB\n", __func__, whisper_sched_size(sched, backends) / 1024.0 / 1024.0);
}

// cross allocator
{
    auto & sched = state->sched_cross;
    sched = ggml_backend_sched_new(backends.data(), backends.size());
    ggml_backend_sched_init_measure(sched, whisper_build_graph_cross(*ctx, *state));
    WHISPER_LOG_INFO("%s: compute buffer (cross)  = %7.2f MB\n", __func__, whisper_sched_size(sched, backends) / 1024.0 / 1024.0);
}

Is there a way to do this with a single scheduler?

slaren (Collaborator):

No, as it is currently, each graph must use a different scheduler. Maybe this is something that we could look into adding in the future, but currently all the allocations happen during the call to ggml_backend_sched_init_measure, and there is no way to add other graphs.
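That constraint can be sketched with a toy model in plain C — hypothetical names, not the ggml API: all allocation happens at "measure" time for exactly one graph, so a scheduler measured for one graph cannot later accommodate a larger one:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model (not the real ggml API) of the one-scheduler-per-graph
 * constraint: the compute buffer is sized once, during measurement,
 * for a single graph, and there is no way to add other graphs. */
struct toy_sched {
    size_t buf_size;  /* fixed after measurement */
};

struct toy_graph {
    size_t mem_needed;  /* compute memory this graph requires */
};

static struct toy_sched * toy_sched_new(void) {
    return calloc(1, sizeof(struct toy_sched));
}

/* all allocation happens here, for exactly one graph */
static void toy_sched_init_measure(struct toy_sched * s, const struct toy_graph * g) {
    s->buf_size = g->mem_needed;
}

/* a graph fits only if it needs no more than what was measured */
static int toy_sched_can_run(const struct toy_sched * s, const struct toy_graph * g) {
    return g->mem_needed <= s->buf_size;
}
```

Under this model, a scheduler measured on the 16 MB conv graph could not run the 80 MB encoder graph — hence the separate sched_conv, sched_encode, and sched_cross in the snippet above.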

slaren (Collaborator) — Nov 9, 2023:

The messages such as !!!!!!! node_0 has no backend indicate a failure while splitting the graph. There are some cases that may not be handled correctly by ggml-backend-sched yet. Do you intend to add support for partial offloading? Otherwise, it may be simpler to just use ggml-backend without the scheduler.

ggerganov (Owner, Author):

There will be no support for partial offloading, but I thought it might be useful to exercise the new functionality. I think the "has no backend" error currently appears because I'm not pre-allocating the input mel tensor for the conv graph, but I'm still not sure. Will give this a few more tries, and might go back to plain ggml-backend if I can't make it work.

ggerganov (Owner, Author):

This version occasionally triggers a sanitizer error during scheduler initialization:

$ ▶ make -j && ./bin/main -m ../models/ggml-base.en.bin -f ../samples/jfk.wav -ng
[ 43%] Built target whisper
[ 56%] Built target bench
[ 75%] Built target common
[100%] Built target main
[100%] Built target quantize
whisper_init_from_file_with_params_no_state: loading model from '../models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU buffer size =   140.66 MB
whisper_model_load: model size    =  140.54 MB
whisper_init_state: kv self size  =    5.25 MB
whisper_init_state: kv cross size =   17.58 MB
whisper_init_state: backend_in = CPU (7105540 bytes)
ggml_backend_sched size: 1468 KB
whisper_init_state: compute buffer (conv)   =   16.02 MB
ggml_backend_sched size: 1468 KB
whisper_init_state: compute buffer (encode) =   80.38 MB
ggml_backend_sched size: 1468 KB
whisper_init_state: compute buffer (encode) =    2.93 MB
ggml_backend_sched size: 1468 KB
/Users/ggerganov/development/github/whisper.cpp/ggml-backend.c:521:25: runtime error: index 12289 out of bounds for type 'char[12288][128]'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /Users/ggerganov/development/github/whisper.cpp/ggml-backend.c:521:25 in 
whisper_init_state: compute buffer (encode) =   23.13 MB

system_info: n_threads = 4 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 | 

main: processing '../samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

Will leave it for now and try without the scheduler in a separate branch.

slaren (Collaborator) — Nov 9, 2023:

The out-of-bounds access is in the causes buffer, which is only used for debugging and may be too small for the whisper graph. The size can be increased as much as needed; the buffer should be removed soon anyway. But if you don't need to support partial offloading, there is nothing to gain from using the scheduler.
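The overflow pattern, and one possible guard, can be sketched in plain C. The table dimensions mirror the char[12288][128] array from the sanitizer report, but record_cause and its bounds check are hypothetical illustrations, not ggml code:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy sketch of the overflow: a fixed-size debug table indexed by
 * node id writes out of bounds once a graph has more nodes than the
 * table has rows (index 12289 vs char[12288][128] in the report).
 * The hypothetical guard below drops messages for out-of-range ids
 * instead of writing past the end. */
#define MAX_CAUSE_ROWS 12288
#define MAX_CAUSE_LEN  128

static char causes[MAX_CAUSE_ROWS][MAX_CAUSE_LEN];

/* returns 1 if the message was recorded, 0 if the id is out of range */
static int record_cause(size_t node_id, const char * msg) {
    if (node_id >= MAX_CAUSE_ROWS) {
        return 0;  /* this write would have been the UB the sanitizer caught */
    }
    snprintf(causes[node_id], MAX_CAUSE_LEN, "%s", msg);
    return 1;
}
```

The alternative fix mentioned above — simply enlarging MAX_CAUSE_ROWS — works too, since the table is debug-only and its exact size doesn't matter.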

@ggerganov ggerganov closed this Nov 10, 2023