
[Severe] Memory leak issue under WebGPU Whisper transcribe pipeline #860

Closed
MatteoFasulo opened this issue Jul 23, 2024 · 43 comments
Labels
bug Something isn't working

Comments

@MatteoFasulo

MatteoFasulo commented Jul 23, 2024

System Info

Using Transformers.js v3 in the latest Chrome release on Windows 10.

GPU: Nvidia GTX 1080 (8GB)

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

Transcribing with a Whisper model under WebGPU does not dispose of the tensors after the pipeline finishes. I verified this with nvidia-smi while transcribing a .wav file to text: GPU memory consumption keeps growing until the GPU either runs out of memory (on smaller GPUs) or the device is lost mid-computation (producing a console error saying 'device is lost').

Reproduction

  1. Use the transcribe pipeline on a .wav file of at least 1 minute.
  2. Check GPU memory consumption during the computation (it should keep increasing while the computation runs).
  3. Verify whether the tensors are disposed of once the computation is done (currently the data on the GPU is not released, which causes the leak).
  4. This is easy to verify with longer audio sequences, which widen the gap between resting GPU memory and memory use during computation.
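
For reference, the setup is essentially the following (a minimal sketch; the model id and the audio file are placeholders):

import { pipeline } from '@huggingface/transformers';

// Whisper ASR pipeline on WebGPU (any Whisper checkpoint shows the same behaviour).
const transcriber = await pipeline('automatic-speech-recognition', 'onnx-community/whisper-tiny.en', {
    device: 'webgpu',
});

// Transcribe a .wav file of at least 1 minute and watch GPU memory in nvidia-smi.
const { text } = await transcriber('audio.wav', {
    chunk_length_s: 30,
    return_timestamps: true,
});
console.log(text);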

Ideas (from Ratchet)

I spotted a great inference architecture at https://github.com/huggingface/ratchet/blob/master/ARCHITECTURE.md, in which memory consumption for encoder-decoder models like Whisper is reduced by supporting both static and dynamic graphs: the encoder is completely static, while the decoder runs as a dynamic graph because of KV caching.

@MatteoFasulo MatteoFasulo added the bug Something isn't working label Jul 23, 2024
@flatsiedatsie
Contributor

I'm seeing an error that only pops up after a while. Perhaps it's related?

[screenshot: 2024-07-29 20:39]

@MatteoFasulo
Author

I'm seeing an error that only pops up after a while. Perhaps it's related?

[screenshot: 2024-07-29 20:39]

Could be related; in my case it is strictly tied to very long audio sequences (up to 10 minutes).

@flatsiedatsie
Contributor

I think I'm seeing it too.

[screenshot: 2024-08-14 14:01]

Mac OS, Brave browser, Transformers.js V3.

@MatteoFasulo
Author

I think I'm seeing it too.

[screenshot: 2024-08-14 14:01]

Mac OS, Brave browser, Transformers.js V3.

Yes. If you want to double-check, run nvidia-smi from a terminal and watch the GPU memory consumption.

@xenova
Collaborator

xenova commented Aug 15, 2024

Thanks to everyone for testing! Perhaps @guschmue and the ORT team can do some additional profiling to see what's going wrong.

@flatsiedatsie
Contributor

flatsiedatsie commented Aug 15, 2024

@MatteoFasulo I'm using a Macbook, so I don't have that option.

Also, my pipeline uses three models at the same time, which could perhaps play a role. My Whisper worker runs segmentation and verification models alongside whisper-timestamped. Though those run via WASM, so... hmm.

// Assumed imports for this snippet (PER_DEVICE_CONFIG and self.device are defined elsewhere in the worker):
import { pipeline, AutoProcessor, AutoModelForAudioFrameClassification, AutoModel } from '@huggingface/transformers';

class PipelineSingleton {
    static asr_model_id = 'onnx-community/whisper-base_timestamped';
    static instance = null;
    static asr_instance = null;

    static segmentation_model_id = 'onnx-community/pyannote-segmentation-3.0';
    static segmentation_instance = null;
    static segmentation_processor = null;

    static verification_model_id = 'Xenova/wavlm-base-plus-sv';
    //static verification_model_id = 'onnx-community/wespeaker-voxceleb-resnet34-LM';
    static verification_instance = null;
    static verification_processor = null;

    static async getInstance(progress_callback = null, model_name = 'onnx-community/whisper-base_timestamped', preferences = {}) {
        //console.log("Whisper_worker: Pipeline: getInstance: model_name, preferences: ", model_name, preferences);
        this.asr_model_id = model_name;

        PER_DEVICE_CONFIG[self.device] = { ...PER_DEVICE_CONFIG[self.device], preferences };

        this.asr_instance ??= pipeline('automatic-speech-recognition', this.asr_model_id, {
            ...PER_DEVICE_CONFIG[self.device],
            progress_callback,
        });

        this.segmentation_processor ??= AutoProcessor.from_pretrained(this.segmentation_model_id, {
            ...preferences,
            progress_callback,
        });
        this.segmentation_instance ??= AutoModelForAudioFrameClassification.from_pretrained(this.segmentation_model_id, {
            // NOTE: WebGPU is not currently supported for this model
            // See https://github.com/microsoft/onnxruntime/issues/21386
            device: 'wasm',
            dtype: 'fp32',
            ...preferences,
            progress_callback,
        });

        this.verification_processor ??= AutoProcessor.from_pretrained(this.verification_model_id, {
            device: 'wasm',
            dtype: 'fp32',
            ...preferences,
            progress_callback,
        });

        this.verification_instance ??= AutoModel.from_pretrained(this.verification_model_id, {
            device: 'wasm',
            dtype: 'fp32',
            ...preferences,
            progress_callback,
        });

        return Promise.all([this.asr_instance, this.segmentation_processor, this.segmentation_instance, this.verification_processor, this.verification_instance]);
    }
}
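
For context, the worker then obtains everything roughly like this (call site simplified; the actual message handling is omitted and the argument values are just examples):

const [transcriber, segmentation_processor, segmentation_model, verification_processor, verification_model] =
    await PipelineSingleton.getInstance(
        x => self.postMessage(x),                  // progress_callback
        'onnx-community/whisper-base_timestamped', // model_name
        {},                                        // preferences
    );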

@flatsiedatsie
Contributor

flatsiedatsie commented Aug 15, 2024

I created this image earlier to ask how I could reduce Whisper memory use after an inference was complete.

[image: whisper_memory_cleaning]

But I wasn't sure if the issue was Whisper, or the Segmentation or voice fingerprinting model. Still, perhaps it's related.

@MatteoFasulo
Author

I created this image earlier to ask how I could reduce Whisper memory use after an inference was complete.

[image: whisper_memory_cleaning]

But I wasn't sure if the issue was Whisper, or the Segmentation or voice fingerprinting model. Still, perhaps it's related.

Right now I don't have all the data, but in my case dispose was not working as intended. I tried calling dispose at the end of each task, but GPU memory consumption stayed the same, as if dispose had not been applied.

@MatteoFasulo
Author

@xenova I tested the default whisper-tiny as well as whisper-small.en_timestamped using the ORT Web Perf tool by @guschmue.

However, the tool does not provide any information about the memory footprint or GPU memory consumption.

Do you have any suggestions for checking whether the error depends only on the Whisper models, or whether something else is failing to clear used memory before starting new tasks?

@deanihansen

@xenova is there anyone we can poke or ways we can support getting this looked into? It looks like v3 is right around the corner, and it would be incredible to have a solution for this before the release.

@gyagp

gyagp commented Sep 6, 2024

Sorry for the late response, and I can take a look at this. @MatteoFasulo Could you please share a simple example that could reproduce this issue? Thanks!

@deanihansen

@gyagp I've repro'ed this locally by:

  1. Pull the example at https://github.com/xenova/transformers.js/tree/v3/examples/webgpu-whisper
  2. Update the example's references to Transformers.js:
  3. Replace the package.json dependency "@xenova/transformers" with the latest "@huggingface/transformers": "^3.0.0-alpha.14"
  4. Replace the @xenova import in worker.js with @huggingface
  5. Run it in your favorite browser for a while (the longer, the better)
  6. Open the tab and hit "Load Model"
  7. Wait a good while with the mic running
  8. Observe, e.g., Chrome GPU memory usage increasing indefinitely

The screenshot below is from Activity Monitor on macOS after 2 minutes of running the model in Chrome:
[screenshot: Activity Monitor GPU memory]

@MatteoFasulo
Author

MatteoFasulo commented Sep 6, 2024

Sorry for the late response, and I can take a look at this. @MatteoFasulo Could you please share a simple example that could reproduce this issue? Thanks!

Hi,
The steps shared by @deanihansen are quite similar to what I was doing and experiencing. The GPU memory usage keeps increasing indefinitely, even after a task has completed its computation.

Indeed, you can follow those steps to reproduce the error and investigate the potential cause.

As a side note, in my specific case I also attempted to call the .dispose() method to release the tensors. According to the ONNX Runtime documentation, this should remove the data by deleting its internal reference if it is on the CPU, or by releasing the data if it is on the GPU.

After calling this function, the tensor is no longer valid and its location is set to 'none'.

Additionally, I noticed there is a .release() method on InferenceSession in ONNX Runtime, which is supposed to release the inference session and its underlying resources.
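
For illustration, the kind of cleanup I was attempting looks roughly like this (a minimal onnxruntime-web sketch, not the actual Transformers.js internals; the model path and input are placeholders):

import * as ort from 'onnxruntime-web/webgpu';

const session = await ort.InferenceSession.create('model.onnx', {
    executionProviders: ['webgpu'],
});

// Placeholder input: a single float tensor named 'input'.
const feeds = { input: new ort.Tensor('float32', new Float32Array(4), [1, 4]) };
const results = await session.run(feeds);

// Dispose of each output tensor; for GPU-located tensors this releases the underlying buffer.
for (const name of Object.keys(results)) {
    results[name].dispose();
}

// Release the inference session and its underlying resources.
await session.release();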

@xenova
Collaborator

xenova commented Sep 6, 2024

@xenova is there anyone we can poke or ways we can support getting this looked into? It looks like v3 is right around the corner, and it would be incredible to have a solution for this before the release.

100% agree! I might have an idea what the problem is and how to fix this. Fingers crossed 🤞

@flatsiedatsie
Contributor

A new record :-)

[screenshot: 2024-09-06 20:00]

@deanihansen

Any thoughts on your end @gyagp?

@gyagp

gyagp commented Sep 8, 2024

@deanihansen, thanks for the instructions; I can already reproduce the issue on my side. Right now I'm a bit stuck on an ONNX Runtime build issue on Windows. I will treat this as high priority. Please stay tuned.

@gyagp

gyagp commented Sep 9, 2024

@xenova It seems you keep the KV cache in gpu-buffer for better performance, but you don't call tensor.dispose() to release those tensors?

@xenova
Collaborator

xenova commented Sep 9, 2024

@xenova It seems you keep the KV cache in gpu-buffer for better performance, but you don't call tensor.dispose() to release those tensors?

It should be handled here: https://github.com/xenova/transformers.js/blob/86d6da468021dfd53d6abf35a67a8f77e7ced7c0/src/models.js#L1556-L1580

@chandeldivyam

@MatteoFasulo Were you able to make it work with any solution or workaround?

@MatteoFasulo
Author

@MatteoFasulo Were you able to make it work with any solution or workaround?

Unfortunately, no, I haven't been able to make it work with any solution or workaround yet.

@chandeldivyam

@MatteoFasulo one thing I was planning to do is run it in a background worker (I'm building an extension, so an offscreen tab), then kill that tab (a worker on the web) and respawn it. It would increase the total time, since the model needs to be reloaded, and we would also have to do proper batching with some VAD before sending the audio.

This is a brute-force approach; what could go wrong here? [P.S.: this would still be much faster than using the CPU.]
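
Roughly what I have in mind (just a sketch; the worker file name and the messages are hypothetical):

let whisperWorker = new Worker('whisper_worker.js', { type: 'module' });

function respawnWhisperWorker() {
    // Terminating the worker tears down its WebGPU resources, reclaiming the leaked memory...
    whisperWorker.terminate();
    // ...at the cost of reloading the model in the fresh worker.
    whisperWorker = new Worker('whisper_worker.js', { type: 'module' });
    whisperWorker.postMessage({ type: 'load_model' });
}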

@MatteoFasulo
Author

@MatteoFasulo one thing I was planning to do is run it in a background worker (I'm building an extension, so an offscreen tab), then kill that tab (a worker on the web) and respawn it. It would increase the total time, since the model needs to be reloaded, and we would also have to do proper batching with some VAD before sending the audio.

This is a brute-force approach; what could go wrong here? [P.S.: this would still be much faster than using the CPU.]

That sounds reasonable, but I would suggest focusing on fixing the issue within the framework itself rather than creating external workarounds (though what you're doing now is fine as a temporary solution). I still believe there might be an error during computation that isn't properly disposing of tensors, leading to gradual memory allocation increases over time.

@gyagp

gyagp commented Sep 13, 2024

Like I mentioned in the comment above, the leak comes from the KV cache. Currently, for performance, Transformers.js keeps the KV cache in gpu-buffer, so onnxruntime no longer owns the related tensors and it is the user's responsibility to release those buffers after use.
More details about onnxruntime GPU tensor lifecycle management can be found at https://onnxruntime.ai/docs/tutorials/web/ep-webgpu.html#gpu-tensor-life-cycle-management.
I think it's not hard for @xenova to fix this, as he already has the related code.
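
In other words, once the session is created along these lines, the present* outputs belong to the caller (a sketch; decoderPath, feeds and isNoLongerNeeded are placeholders):

const session = await ort.InferenceSession.create(decoderPath, {
    executionProviders: ['webgpu'],
    // Keep the KV-cache outputs (present*) on the GPU instead of copying them back to the CPU.
    preferredOutputLocation: 'gpu-buffer',
});

const outputs = await session.run(feeds);

// onnxruntime no longer owns these GPU tensors; once an output will not be fed
// back in as a past_key_values* input, the caller must dispose() it.
for (const [name, tensor] of Object.entries(outputs)) {
    if (name.startsWith('present.') && isNoLongerNeeded(name)) {
        tensor.dispose();
    }
}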

@MatteoFasulo
Author

Like I mentioned in the comment above, the leak comes from the KV cache. Currently, for performance, Transformers.js keeps the KV cache in gpu-buffer, so onnxruntime no longer owns the related tensors and it is the user's responsibility to release those buffers after use. More details about onnxruntime GPU tensor lifecycle management can be found at https://onnxruntime.ai/docs/tutorials/web/ep-webgpu.html#gpu-tensor-life-cycle-management. I think it's not hard for @xenova to fix this, as he already has the related code.

Ok, thanks for the clarification. I'll wait for that to be fixed by @xenova :)

@xenova
Collaborator

xenova commented Sep 15, 2024

Like I mentioned in the comment above, the leak comes from the KV cache. Currently, for performance, Transformers.js keeps the KV cache in gpu-buffer, so onnxruntime no longer owns the related tensors and it is the user's responsibility to release those buffers after use.

More details about onnxruntime GPU tensor lifecycle management can be found at https://onnxruntime.ai/docs/tutorials/web/ep-webgpu.html#gpu-tensor-life-cycle-management.

I think it's not hard for @xenova to fix this, as he already has the related code.

Can you suggest where in the code a call to dispose() is being missed? See #860 (comment); this should already be handled during generation.

@xenova
Collaborator

xenova commented Sep 16, 2024

Okay, I believe I have figured it out. Basically, I was only freeing the decoder PKVs after generation, and not the encoder PKVs (since they are re-used); those should be freed once they are no longer needed (after the last token is generated).

I will push the update soon for testing.

@gyagp

gyagp commented Sep 17, 2024

Sorry @xenova, I misunderstood your comment above and thought you already knew the root cause.
I should have been more explicit: what I meant is replaceTensors() in models.js, which should also dispose of the tensors (KV cache) from previous runs. Hopefully we can have a unified solution for all the related code in Transformers.js.

@xenova
Collaborator

xenova commented Sep 17, 2024

@gyagp Here's my attempted fix: 969d10e

but I don't think it fixes it entirely (cc @flatsiedatsie, maybe you can test?). I also disabled outputting the encoder results on the GPU, since I think this is where the leak comes from.

Install via source:

npm install xenova/transformers.js#v3

@gyagp

gyagp commented Sep 19, 2024

The current ORT WebGPU implementation is not optimal, and we can't reuse an input GPU buffer for an output. This means that for the KV cache, although we can keep it in GPU buffers to save copies, we still need to dispose of the unused ones explicitly to avoid a memory leak.
A typical usage pattern looks like this:
  • When creating the session, set preferredOutputLocation to "gpu-buffer" to keep the outputs (present*) on the GPU.
  • 1st run: inputs are on the CPU, and we feed the initial data.
  • 2nd run: assign the inputs (past_key_values*) from the corresponding outputs of the 1st run. Note that the inputs live on the GPU from now on.
  • From the 3rd run: dispose of the previous run's inputs, then assign this run's inputs from the previous run's outputs, as in the 2nd run.
  • Final run: dispose of the previous inputs and this run's outputs.

We need to further optimize this to reuse the GPU buffer, but developers may need a way to tell whether a buffer is reused or new.
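
A sketch of that run-by-run pattern (buildInitialFeeds, buildNextFeeds and isDone are hypothetical helpers; the session is assumed to have been created with preferredOutputLocation: 'gpu-buffer'):

let feeds = buildInitialFeeds();              // 1st run: all inputs live on the CPU
let previousFeeds = null;

while (true) {
    const outputs = await session.run(feeds);

    // From the 3rd run on, the previous run's past_key_values* inputs are GPU tensors
    // that nothing references any more, so release them.
    if (previousFeeds !== null) {
        for (const [name, tensor] of Object.entries(previousFeeds)) {
            if (name.startsWith('past_key_values.')) tensor.dispose();
        }
    }
    previousFeeds = feeds;

    if (isDone(outputs)) {
        // Final run: also release this run's inputs and its outputs.
        for (const [name, tensor] of Object.entries(previousFeeds)) {
            if (name.startsWith('past_key_values.')) tensor.dispose();
        }
        for (const tensor of Object.values(outputs)) tensor.dispose();
        break;
    }

    // Next run: feed this run's present* outputs back in as past_key_values* inputs.
    feeds = buildNextFeeds(outputs);
}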

@guschmue @fs-eire, please correct me if my understanding is wrong.

@flatsiedatsie
Contributor

flatsiedatsie commented Sep 19, 2024

It seems to have worked!

I'm transcribing a large video file, and memory use remains constant.

[screenshot: 2024-09-19 18:29]

@fs-eire
Contributor

fs-eire commented Sep 19, 2024

Current ORT WebGPU implementation is not optimal, and we can't reuse an input GPU buffer for an output. This means that for the KV cache, although we can keep it in GPU buffers to save copies, we still need to dispose of the unused ones explicitly to avoid a memory leak. A typical usage pattern: when creating the session, set preferredOutputLocation to "gpu-buffer" to keep the outputs (present*) on the GPU. 1st run: inputs are on the CPU, and we feed the initial data. 2nd run: assign the inputs (past_key_values*) from the corresponding outputs of the 1st run; note that the inputs live on the GPU from now on. From the 3rd run: dispose of the previous run's inputs, then assign this run's inputs from the previous run's outputs, as in the 2nd run. Final run: dispose of the previous inputs and this run's outputs.

We need to further optimize this to reuse the GPU buffer, but developers may need a way to tell whether a buffer is reused or new.

@guschmue @fs-eire, please correct me if my understanding is wrong.

Currently, the rule for a tensor's lifecycle is straightforward (from ORT's perspective):

  1. Anything created by the caller (e.g. Tensor.fromGpuBuffer()) should be released by the caller.
  2. If a GPU tensor is created by ORT (as an output), then the user needs to call dispose() on it when finished using it.

ort-web supports specifying a pre-allocated tensor as an output (not via preferredOutputLocation). If you know the type and shape of an output in advance, you can create a GPU tensor and call session.run with fetches so that onnxruntime-web will try to use that tensor. It may work to specify a user-created tensor backed by the same underlying GPU buffer for both the input and the output.

I think ORT explicitly disallows its allocator from reusing an input's buffer as an output's buffer (when preferredOutputLocation === 'gpu-buffer'). There are a few reasons:

  • The user may want to use the input's buffer after inference, or at least may assume the input is unchanged.
  • It's hard to determine whether an input's buffer can be reused, as this depends on the output's type and shape; it's also hard to determine which output reuses which input if the user doesn't specify this explicitly.
  • It would make the lifecycle rule above more complicated, as in (2) the user would need to call dispose() only on the outputs that didn't reuse an input's buffer.
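
For reference, specifying a pre-allocated output looks roughly like this (a sketch; the output name, dims and feeds are placeholders):

// Allocate a GPU buffer ourselves and wrap it in a tensor (rule 1: the caller releases it).
const dims = [1, 8, 64, 64];                           // placeholder shape
const device = ort.env.webgpu.device;                  // GPUDevice used by onnxruntime-web
const buffer = device.createBuffer({
    size: dims.reduce((a, b) => a * b, 1) * 4,         // float32
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST,
});
const preallocated = ort.Tensor.fromGpuBuffer(buffer, { dataType: 'float32', dims });

// Pass it in the fetches; onnxruntime-web will try to write that output into this tensor.
await session.run(feeds, { 'present.0.decoder.key': preallocated });

// We created the buffer, so we destroy it ourselves once we are done with it.
buffer.destroy();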

@gyagp

gyagp commented Sep 20, 2024

@fs-eire Thanks for the clarification, which is very helpful! I agree that reusing the input buffer as the output buffer would add a lot of complexity. However, the KV cache is fundamental for transformer-based models, so reusing the buffer would bring a good perf gain. That could be another topic to discuss further, but it's not related to this issue.
So @xenova, I think you may need to change your current design and treat the decoder and encoder the same way (keep both outputs on the GPU side).

@flatsiedatsie
Contributor

flatsiedatsie commented Sep 20, 2024

I just noticed this while running Moondream 2:

[screenshot: 2024-09-20 09:01]

It seems to have stopped after generating a single word.

It seems to happen when two Transformers.js workers are loaded simultaneously.

  • Using Whisper to transcribe voice (whisper remains loaded after the inference is complete).
  • Feeding the transcription as a prompt to Moondream2, running in a separate worker.

I'll quickly check if this is related to the commit or was always an issue that I just hadn't spotted.

// Nope, with the old version the issue does not occur.

Did another test: having Whisper and the Translation worker running simultaneously is not an issue.

// Just noticed that it has happened to the Whisper worker too:

[screenshot: 2024-09-20 10:11]

@xenova
Collaborator

xenova commented Sep 22, 2024

So @xenova, I think you may need to change your current design and treat the decoder and encoder the same way (keep both outputs on the GPU side).

@gyagp I agree; however, as described above, this is what seems to be causing the memory leak.

@gyagp

gyagp commented Sep 22, 2024

I think that in getPastKeyValues(), the KV caches for the encoder are not disposed of as expected. See the code below:

// Optimization introduced by optimum to reuse past key values. So, we just replace the constant
// outputs with the previous past key values.
// https://github.com/huggingface/optimum/blob/0bf2c05fb7e1182b52d21b703cfc95fd9e4ea3dc/optimum/onnxruntime/base.py#L677-L704
pkvs[newName] = pastKeyValues[newName];

@xenova
Collaborator

xenova commented Sep 22, 2024

This is because, after the first run, the decoder produces an empty tensor for the encoder PKVs (and we reuse the first encoder PKVs, so we should not dispose of them until the end).

@gyagp

gyagp commented Sep 22, 2024

The code above and the code below (a bug in ORT?) look a bit strange to me. I need to dig into the code a bit more next week.

// (otherwise, this causes a memory leak or throws an error "Error: previous buffer is not registered")
if (key.includes('encoder')) continue;

@flatsiedatsie
Contributor

@gyagp Did you by any chance manage to find anything?

@gyagp

gyagp commented Sep 29, 2024

This is because, after the first run, the decoder produces an empty tensor for the encoder PKVs (and we reuse the first encoder PKVs, so we should not dispose of them until the end).

I debugged the code today, and I think the current code has no memory leak.
I also tried with the latest ORT code, and "Error: previous buffer is not registered" should have been fixed by microsoft/onnxruntime#22254 (https://www.npmjs.com/package/onnxruntime-web/v/1.20.0-dev.20240928-1bda91fc57 should already include this fix). So "if (key.includes('encoder')) continue;" in models.js is no longer needed, and we can also keep the encoder KVs on the GPU, which is more performant.

@flatsiedatsie
Contributor

Interesting.

I tried a long transcription yesterday, and memory use was high but didn't seem to slowly grow. I checked because I was worried that switching to alpha 17, away from the version with the test fix that I had been relying on, would cause trouble.

@xenova
Collaborator

xenova commented Sep 30, 2024

This should now be fixed by https://www.npmjs.com/package/@huggingface/transformers/v/3.0.0-alpha.19! 🥳

@xenova xenova closed this as completed Sep 30, 2024
@MatteoFasulo
Author

This should now be fixed by https://www.npmjs.com/package/@huggingface/transformers/v/3.0.0-alpha.19! 🥳

Well done, thank you @xenova!
