
whisper : migrating to ggml-backend #1466

Closed · wants to merge 2 commits
Conversation

ggerganov (Owner):

WIP

whisper.cpp Outdated
}
//if (!ggml_allocr_is_measure(alloc)) {
// ggml_set_f32(KQscale, 1.0f/sqrt(float(n_state)/n_head));
//}

struct ggml_tensor * cur = ggml_view_tensor(ctx0, wstate.embd_conv);
ggerganov (Owner, Author):
@slaren

I'm migrating whisper.cpp to use the new scheduler and backend API.

Here, if I want to use the result of a previous graph (wstate.embd_conv) as an input, what would be the proper way to do it? If I allocate wstate.embd_conv in a dedicated input buffer, do I need to change the following:

whisper.cpp/whisper.cpp

Lines 1592 to 1593 in 005b8cc

wstate.embd_conv = cur;

to ggml_cpy(ctx0, cur, wstate.embd_conv);? If yes, is there a way to avoid the copy?

slaren (Collaborator):
It is fine to use a pre-allocated tensor; I don't think you need to make a copy here. Conceptually this should be the same as using a weight pre-allocated in another buffer, right? Maybe I am missing something.

slaren (Collaborator) — Nov 9, 2023:

To elaborate more: when using ggml-backend directly without ggml-backend-sched, all the tensors must be allocated in the same backend. If you wanted to evaluate a graph in a backend, and then use the output of the graph as an input of a different graph evaluated on a different backend, you would need to use ggml_backend_tensor_copy to copy the tensor to the other backend. When using ggml-backend-sched, these copies should be done automatically.
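The rule above can be sketched with a toy model in plain C. This is NOT the real ggml API — `toy_tensor_copy` is a hypothetical stand-in for ggml_backend_tensor_copy, and the structs are simplified — it only illustrates the ownership rule slaren describes:

```c
#include <assert.h>
#include <string.h>

/* Toy model (not the real ggml API): a tensor's data lives in one
 * backend's buffer. To consume it from a graph evaluated on a
 * different backend, it must first be copied there (explicitly via
 * ggml_backend_tensor_copy when using ggml-backend directly, or
 * automatically when using ggml-backend-sched). */
enum toy_backend { TOY_CPU, TOY_GPU };

struct toy_tensor {
    enum toy_backend backend;  /* where the data is allocated */
    float            data[4];
};

/* stand-in for ggml_backend_tensor_copy: moves data across backends */
static void toy_tensor_copy(const struct toy_tensor * src, struct toy_tensor * dst) {
    memcpy(dst->data, src->data, sizeof(src->data));
}

/* a graph input is usable only if it lives in the graph's backend */
static int toy_input_valid(const struct toy_tensor * t, enum toy_backend graph_backend) {
    return t->backend == graph_backend;
}
```

With a single backend — the wstate.embd_conv situation discussed above — the output already lives in the right buffer and no copy is needed, which matches the earlier answer.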

ggerganov (Owner, Author):

> Conceptually this should be the same as using a weight pre-allocated in another buffer, right?

Yes. Currently it errors that wstate.embd_conv does not have a backend, but I think I see the problem. Will try to fix it.

Another question: here I am creating a separate scheduler for each separate graph:

whisper.cpp/whisper.cpp

Lines 2865 to 2899 in 005b8cc

auto & backends = ctx->backends;

// conv allocator
{
    auto & sched = state->sched_conv;
    sched = ggml_backend_sched_new(backends.data(), backends.size());
    ggml_backend_sched_init_measure(sched, whisper_build_graph_conv(*ctx, *state, 0));
    WHISPER_LOG_INFO("%s: compute buffer (conv)   = %7.2f MB\n", __func__, whisper_sched_size(sched, backends) / 1024.0 / 1024.0);
}

// encoder allocator
if (!whisper_encode_external(*state)) {
    auto & sched = state->sched_encode;
    sched = ggml_backend_sched_new(backends.data(), backends.size());
    ggml_backend_sched_init_measure(sched, whisper_build_graph_encoder(*ctx, *state));
    WHISPER_LOG_INFO("%s: compute buffer (encode) = %7.2f MB\n", __func__, whisper_sched_size(sched, backends) / 1024.0 / 1024.0);
}

// cross allocator
{
    auto & sched = state->sched_cross;
    sched = ggml_backend_sched_new(backends.data(), backends.size());
    ggml_backend_sched_init_measure(sched, whisper_build_graph_cross(*ctx, *state));
    WHISPER_LOG_INFO("%s: compute buffer (cross)  = %7.2f MB\n", __func__, whisper_sched_size(sched, backends) / 1024.0 / 1024.0);
}

Is there a way to do this with a single scheduler?

slaren (Collaborator):

No, as it is currently, each graph must use a different scheduler. Maybe this is something that we could look into adding in the future, but currently all the allocations happen during the call to ggml_backend_sched_init_measure, and there is no way to add other graphs.
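That constraint can be sketched with a toy model in plain C — hypothetical names, not the ggml API: all allocation happens at "measure" time for exactly one graph, so a scheduler measured for one graph cannot later accommodate a larger one:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model (not the real ggml API) of the one-scheduler-per-graph
 * constraint: the compute buffer is sized once, during measurement,
 * for a single graph, and there is no way to add other graphs. */
struct toy_sched {
    size_t buf_size;  /* fixed after measurement */
};

struct toy_graph {
    size_t mem_needed;  /* compute memory this graph requires */
};

static struct toy_sched * toy_sched_new(void) {
    return calloc(1, sizeof(struct toy_sched));
}

/* all allocation happens here, for exactly one graph */
static void toy_sched_init_measure(struct toy_sched * s, const struct toy_graph * g) {
    s->buf_size = g->mem_needed;
}

/* a graph fits only if it needs no more than what was measured */
static int toy_sched_can_run(const struct toy_sched * s, const struct toy_graph * g) {
    return g->mem_needed <= s->buf_size;
}
```

Under this model, a scheduler measured on the 16 MB conv graph could not run the 80 MB encoder graph — hence the separate sched_conv, sched_encode, and sched_cross in the snippet above.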

slaren (Collaborator) — Nov 9, 2023:

The messages such as !!!!!!! node_0 has no backend indicate a failure while splitting the graph. There are some cases that may not be handled correctly by ggml-backend-sched yet. Do you intend to add support for partial offloading? Otherwise, it may be simpler to just use ggml-backend without the scheduler.

ggerganov (Owner, Author):

There will be no support for partial offloading, but I thought it might be useful to exercise the new functionality. I think the "has no backend" error currently appears because I'm not pre-allocating the input mel tensor for the conv graph, but I'm still not sure. Will give this a few more tries, and might go back to plain ggml-backend if I can't make it work.

ggerganov (Owner, Author):

This version occasionally triggers a sanitizer error during scheduler initialization:

$ ▶ make -j && ./bin/main -m ../models/ggml-base.en.bin -f ../samples/jfk.wav -ng
[ 43%] Built target whisper
[ 56%] Built target bench
[ 75%] Built target common
[100%] Built target main
[100%] Built target quantize
whisper_init_from_file_with_params_no_state: loading model from '../models/ggml-base.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU buffer size =   140.66 MB
whisper_model_load: model size    =  140.54 MB
whisper_init_state: kv self size  =    5.25 MB
whisper_init_state: kv cross size =   17.58 MB
whisper_init_state: backend_in = CPU (7105540 bytes)
ggml_backend_sched size: 1468 KB
whisper_init_state: compute buffer (conv)   =   16.02 MB
ggml_backend_sched size: 1468 KB
whisper_init_state: compute buffer (encode) =   80.38 MB
ggml_backend_sched size: 1468 KB
whisper_init_state: compute buffer (encode) =    2.93 MB
ggml_backend_sched size: 1468 KB
/Users/ggerganov/development/github/whisper.cpp/ggml-backend.c:521:25: runtime error: index 12289 out of bounds for type 'char[12288][128]'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /Users/ggerganov/development/github/whisper.cpp/ggml-backend.c:521:25 in 
whisper_init_state: compute buffer (encode) =   23.13 MB

system_info: n_threads = 4 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | METAL = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 | 

main: processing '../samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...

Will leave it for now and try without the scheduler in a separate branch.

slaren (Collaborator) — Nov 9, 2023:

The out-of-bounds access is in the causes buffer, which is only used for debugging and may be too small for the whisper graph. The size can be increased as much as needed; the buffer should be removed soon anyway. But if you don't need to support partial offloading, there is nothing to gain from using the scheduler.
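The overflow pattern, and one possible guard, can be sketched in plain C. The table dimensions mirror the char[12288][128] array from the sanitizer report, but record_cause and its bounds check are hypothetical illustrations, not ggml code:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy sketch of the overflow: a fixed-size debug table indexed by
 * node id writes out of bounds once a graph has more nodes than the
 * table has rows (index 12289 vs char[12288][128] in the report).
 * The hypothetical guard below drops messages for out-of-range ids
 * instead of writing past the end. */
#define MAX_CAUSE_ROWS 12288
#define MAX_CAUSE_LEN  128

static char causes[MAX_CAUSE_ROWS][MAX_CAUSE_LEN];

/* returns 1 if the message was recorded, 0 if the id is out of range */
static int record_cause(size_t node_id, const char * msg) {
    if (node_id >= MAX_CAUSE_ROWS) {
        return 0;  /* this write would have been the UB the sanitizer caught */
    }
    snprintf(causes[node_id], MAX_CAUSE_LEN, "%s", msg);
    return 1;
}
```

The alternative fix mentioned above — simply enlarging MAX_CAUSE_ROWS — works too, since the table is debug-only and its exact size doesn't matter.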

@ggerganov ggerganov closed this Nov 10, 2023