
[Severe] Memory leak issue under WebGPU Whisper transcribe pipeline #860

Closed
MatteoFasulo opened this issue Jul 23, 2024 · 43 comments
Labels
bug Something isn't working

Comments

@MatteoFasulo

MatteoFasulo commented Jul 23, 2024

System Info

Using Transformers.js v3 in the latest Chrome release on Windows 10.

GPU: Nvidia GTX 1080 (8GB)

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

Transcribing with a Whisper model under WebGPU does not dispose of the tensors after the pipeline finishes. I verified this with nvidia-smi while transcribing a .wav file to text: GPU memory consumption keeps growing until the GPU either runs out of memory (on smaller GPUs) or the device is lost mid-computation (producing a console error saying 'device is lost').

Reproduction

  1. Use the transcribe pipeline on a .wav file of at least 1 minute.
  2. Check GPU memory consumption during the computation (it should keep increasing while the computation runs).
  3. Verify whether the tensors are disposed of once the computation is done (currently the data on the GPU is not released, which causes the leak).
  4. This is easy to verify with longer audio sequences, which widen the gap between resting GPU memory and memory use during computation.
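
For reference, the setup is essentially the following (a minimal sketch; the model id and the audio file are placeholders):

import { pipeline } from '@huggingface/transformers';

// Whisper ASR pipeline on WebGPU (any Whisper checkpoint shows the same behaviour).
const transcriber = await pipeline('automatic-speech-recognition', 'onnx-community/whisper-tiny.en', {
    device: 'webgpu',
});

// Transcribe a .wav file of at least 1 minute and watch GPU memory in nvidia-smi.
const { text } = await transcriber('audio.wav', {
    chunk_length_s: 30,
    return_timestamps: true,
});
console.log(text);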

Ideas (from Ratchet)

I spotted a great inference architecture at https://github.com/huggingface/ratchet/blob/master/ARCHITECTURE.md, in which memory consumption for encoder-decoder models like Whisper is reduced by supporting both static and dynamic graphs: the encoder is completely static, while the decoder runs as a dynamic graph because of KV caching.

@MatteoFasulo MatteoFasulo added the bug Something isn't working label Jul 23, 2024
@flatsiedatsie
Contributor

I'm seeing an error that only pops up after a while. Perhaps it's related?

[screenshot: 2024-07-29 20:39]

@MatteoFasulo
Author

I'm seeing an error that only pops up after a while. Perhaps it's related?

[screenshot: 2024-07-29 20:39]

Could be related; in my case it is strictly tied to very long audio sequences (up to 10 minutes).

@flatsiedatsie
Contributor

I think I'm seeing it too.

[screenshot: 2024-08-14 14:01]

Mac OS, Brave browser, Transformers.js V3.

@MatteoFasulo
Author

I think I'm seeing it too.

[screenshot: 2024-08-14 14:01]

Mac OS, Brave browser, Transformers.js V3.

Yes. If you want to double-check, run nvidia-smi from a terminal and watch the GPU memory consumption.

@xenova
Collaborator

xenova commented Aug 15, 2024

Thanks to everyone for testing! Perhaps @guschmue and the ORT team can do some additional profiling to see what's going wrong.

@flatsiedatsie
Contributor

flatsiedatsie commented Aug 15, 2024

@MatteoFasulo I'm using a Macbook, so I don't have that option.

Also, my pipeline uses three models at the same time, which could perhaps play a role. My Whisper worker runs segmentation and verification models alongside whisper-timestamped. Though those run via WASM, so... hmm.

// Assumed imports for this snippet (PER_DEVICE_CONFIG and self.device are defined elsewhere in the worker):
import { pipeline, AutoProcessor, AutoModelForAudioFrameClassification, AutoModel } from '@huggingface/transformers';

class PipelineSingleton {
    static asr_model_id = 'onnx-community/whisper-base_timestamped';
    static instance = null;
    static asr_instance = null;

    static segmentation_model_id = 'onnx-community/pyannote-segmentation-3.0';
    static segmentation_instance = null;
    static segmentation_processor = null;

    static verification_model_id = 'Xenova/wavlm-base-plus-sv';
    //static verification_model_id = 'onnx-community/wespeaker-voxceleb-resnet34-LM';
    static verification_instance = null;
    static verification_processor = null;

    static async getInstance(progress_callback = null, model_name = 'onnx-community/whisper-base_timestamped', preferences = {}) {
        //console.log("Whisper_worker: Pipeline: getInstance: model_name, preferences: ", model_name, preferences);
        this.asr_model_id = model_name;

        PER_DEVICE_CONFIG[self.device] = { ...PER_DEVICE_CONFIG[self.device], preferences };

        this.asr_instance ??= pipeline('automatic-speech-recognition', this.asr_model_id, {
            ...PER_DEVICE_CONFIG[self.device],
            progress_callback,
        });

        this.segmentation_processor ??= AutoProcessor.from_pretrained(this.segmentation_model_id, {
            ...preferences,
            progress_callback,
        });
        this.segmentation_instance ??= AutoModelForAudioFrameClassification.from_pretrained(this.segmentation_model_id, {
            // NOTE: WebGPU is not currently supported for this model
            // See https://github.com/microsoft/onnxruntime/issues/21386
            device: 'wasm',
            dtype: 'fp32',
            ...preferences,
            progress_callback,
        });

        this.verification_processor ??= AutoProcessor.from_pretrained(this.verification_model_id, {
            device: 'wasm',
            dtype: 'fp32',
            ...preferences,
            progress_callback,
        });

        this.verification_instance ??= AutoModel.from_pretrained(this.verification_model_id, {
            device: 'wasm',
            dtype: 'fp32',
            ...preferences,
            progress_callback,
        });

        return Promise.all([this.asr_instance, this.segmentation_processor, this.segmentation_instance, this.verification_processor, this.verification_instance]);
    }
}
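
For context, the worker then obtains everything roughly like this (call site simplified; the actual message handling is omitted and the argument values are just examples):

const [transcriber, segmentation_processor, segmentation_model, verification_processor, verification_model] =
    await PipelineSingleton.getInstance(
        x => self.postMessage(x),                  // progress_callback
        'onnx-community/whisper-base_timestamped', // model_name
        {},                                        // preferences
    );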

@flatsiedatsie
Contributor

flatsiedatsie commented Aug 15, 2024

I created this image earlier to ask how I could reduce Whisper memory use after an inference was complete.

[image: whisper_memory_cleaning]

But I wasn't sure if the issue was Whisper, or the Segmentation or voice fingerprinting model. Still, perhaps it's related.

@MatteoFasulo
Author

I created this image earlier to ask how I could reduce Whisper memory use after an inference was complete.

[image: whisper_memory_cleaning]

But I wasn't sure if the issue was Whisper, or the Segmentation or voice fingerprinting model. Still, perhaps it's related.

Right now I don't have all the data, but in my case dispose was not working as intended. I tried calling dispose at the end of each task, but GPU memory consumption stayed the same, as if dispose had not been applied.

@MatteoFasulo
Author

@xenova I tested the default whisper-tiny as well as whisper-small.en_timestamped using the ORT Web Perf tool by @guschmue.

However, the tool does not provide any information about the memory footprint or GPU memory consumption.

Do you have any suggestions for checking whether the error depends only on the Whisper models, or whether something else is failing to clear used memory before starting new tasks?

@deanihansen

@xenova is there anyone we can poke or ways we can support getting this looked into? It looks like v3 is right around the corner, and it would be incredible to have a solution for this before the release.

@gyagp

gyagp commented Sep 6, 2024

Sorry for the late response, and I can take a look at this. @MatteoFasulo Could you please share a simple example that could reproduce this issue? Thanks!

@deanihansen

@gyagp I've repro'ed this locally by:

  1. Pull the example at https://github.com/xenova/transformers.js/tree/v3/examples/webgpu-whisper
  2. Update the example's references to Transformers.js:
  3. Replace the package.json dependency "@xenova/transformers" with the latest "@huggingface/transformers": "^3.0.0-alpha.14"
  4. Replace the @xenova import in worker.js with @huggingface
  5. Run it in your favorite browser for a while (the longer, the better)
  6. Open the tab and hit "Load Model"
  7. Wait a good while with the mic running
  8. Observe, e.g., Chrome GPU memory usage increasing indefinitely

The screenshot below is from Activity Monitor on macOS after 2 minutes of running the model in Chrome:
[screenshot: Activity Monitor GPU memory]

@MatteoFasulo
Author

MatteoFasulo commented Sep 6, 2024

Sorry for the late response, and I can take a look at this. @MatteoFasulo Could you please share a simple example that could reproduce this issue? Thanks!

Hi,
The steps shared by @deanihansen are quite similar to what I was doing and experiencing. The GPU memory usage keeps increasing indefinitely, even after a task has completed its computation.

Indeed, you can follow those steps to reproduce the error and investigate the potential cause.

As a side note, in my specific case I also attempted to call the .dispose() method to release the tensors. According to the ONNX Runtime documentation, this should remove the data by deleting its internal reference if it is on the CPU, or by releasing the data if it is on the GPU.

After calling this function, the tensor is no longer valid and its location is set to 'none'.

Additionally, I noticed there is a .release() method on InferenceSession in ONNX Runtime, which is supposed to release the inference session and its underlying resources.
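
For illustration, the kind of cleanup I was attempting looks roughly like this (a minimal onnxruntime-web sketch, not the actual Transformers.js internals; the model path and input are placeholders):

import * as ort from 'onnxruntime-web/webgpu';

const session = await ort.InferenceSession.create('model.onnx', {
    executionProviders: ['webgpu'],
});

// Placeholder input: a single float tensor named 'input'.
const feeds = { input: new ort.Tensor('float32', new Float32Array(4), [1, 4]) };
const results = await session.run(feeds);

// Dispose of each output tensor; for GPU-located tensors this releases the underlying buffer.
for (const name of Object.keys(results)) {
    results[name].dispose();
}

// Release the inference session and its underlying resources.
await session.release();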

@xenova
Collaborator

xenova commented Sep 6, 2024

@xenova is there anyone we can poke or ways we can support getting this looked into? It looks like v3 is right around the corner, and it would be incredible to have a solution for this before the release.

100% agree! I might have an idea what the problem is and how to fix this. Fingers crossed 🤞

@flatsiedatsie
Contributor

A new record :-)

[screenshot: 2024-09-06 20:00]

@deanihansen

Any thoughts on your end @gyagp?

@gyagp

gyagp commented Sep 8, 2024

@deanihansen, thanks for the instructions; I can already reproduce the issue on my side. Right now I'm a bit stuck on an ONNX Runtime build issue on Windows. I will treat this as high priority. Please stay tuned.

@gyagp

gyagp commented Sep 9, 2024

@xenova It seems you keep the KV cache in gpu-buffer for better performance, but you don't call tensor.dispose() to release those tensors?

@xenova
Collaborator

xenova commented Sep 9, 2024

@xenova It seems you keep the KV cache in gpu-buffer for better performance, but you don't call tensor.dispose() to release those tensors?

It should be handled here: https://github.com/xenova/transformers.js/blob/86d6da468021dfd53d6abf35a67a8f77e7ced7c0/src/models.js#L1556-L1580

@chandeldivyam

@MatteoFasulo Were you able to make it work with any solution or workaround?

@MatteoFasulo
Author

@MatteoFasulo Were you able to make it work with any solution or workaround?

Unfortunately, no, I haven't been able to make it work with any solution or workaround yet.

@chandeldivyam

@MatteoFasulo one thing I was planning to do is run it in a background worker (I'm building an extension, so an offscreen tab), then kill that tab (a worker on the web) and respawn it. It would increase the total time, since the model needs to be reloaded, and we would also have to do proper batching with some VAD before sending the audio.

This is a brute-force approach; what could go wrong here? [P.S.: this would still be much faster than using the CPU.]
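
Roughly what I have in mind (just a sketch; the worker file name and the messages are hypothetical):

let whisperWorker = new Worker('whisper_worker.js', { type: 'module' });

function respawnWhisperWorker() {
    // Terminating the worker tears down its WebGPU resources, reclaiming the leaked memory...
    whisperWorker.terminate();
    // ...at the cost of reloading the model in the fresh worker.
    whisperWorker = new Worker('whisper_worker.js', { type: 'module' });
    whisperWorker.postMessage({ type: 'load_model' });
}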

@MatteoFasulo
Author

@MatteoFasulo one thing I was planning to do is run it in a background worker (I'm building an extension, so an offscreen tab), then kill that tab (a worker on the web) and respawn it. It would increase the total time, since the model needs to be reloaded, and we would also have to do proper batching with some VAD before sending the audio.

This is a brute-force approach; what could go wrong here? [P.S.: this would still be much faster than using the CPU.]

That sounds reasonable, but I would suggest focusing on fixing the issue within the framework itself rather than creating external workarounds (though what you're doing now is fine as a temporary solution). I still believe there might be an error during computation that isn't properly disposing of tensors, leading to gradual memory allocation increases over time.

@gyagp

gyagp commented Sep 13, 2024

Like I mentioned in the comment above, the leak comes from the KV cache. Currently, for performance, Transformers.js keeps the KV cache in gpu-buffer, so onnxruntime no longer owns the related tensors and it is the user's responsibility to release those buffers after use.
More details about onnxruntime GPU tensor lifecycle management can be found at https://onnxruntime.ai/docs/tutorials/web/ep-webgpu.html#gpu-tensor-life-cycle-management.
I think it's not hard for @xenova to fix this, as he already has the related code.
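
In other words, once the session is created along these lines, the present* outputs belong to the caller (a sketch; decoderPath, feeds and isNoLongerNeeded are placeholders):

const session = await ort.InferenceSession.create(decoderPath, {
    executionProviders: ['webgpu'],
    // Keep the KV-cache outputs (present*) on the GPU instead of copying them back to the CPU.
    preferredOutputLocation: 'gpu-buffer',
});

const outputs = await session.run(feeds);

// onnxruntime no longer owns these GPU tensors; once an output will not be fed
// back in as a past_key_values* input, the caller must dispose() it.
for (const [name, tensor] of Object.entries(outputs)) {
    if (name.startsWith('present.') && isNoLongerNeeded(name)) {
        tensor.dispose();
    }
}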

@MatteoFasulo
Author

Like I mentioned in the comment above, the leak comes from the KV cache. Currently, for performance, Transformers.js keeps the KV cache in gpu-buffer, so onnxruntime no longer owns the related tensors and it is the user's responsibility to release those buffers after use. More details about onnxruntime GPU tensor lifecycle management can be found at https://onnxruntime.ai/docs/tutorials/web/ep-webgpu.html#gpu-tensor-life-cycle-management. I think it's not hard for @xenova to fix this, as he already has the related code.

Ok, thanks for the clarification. I'll wait for that to be fixed by @xenova :)

@xenova
Collaborator

xenova commented Sep 15, 2024

Like I mentioned in the comment above, the leak comes from the KV cache. Currently, for performance, Transformers.js keeps the KV cache in gpu-buffer, so onnxruntime no longer owns the related tensors and it is the user's responsibility to release those buffers after use.

More details about onnxruntime GPU tensor lifecycle management can be found at https://onnxruntime.ai/docs/tutorials/web/ep-webgpu.html#gpu-tensor-life-cycle-management.

I think it's not hard for @xenova to fix this, as he already has the related code.

Can you suggest where in the code a call to dispose() is being missed? See #860 (comment); this should already be handled during generation.

@xenova
Collaborator

xenova commented Sep 16, 2024

Okay, I believe I have figured it out. Basically, I was only freeing the decoder PKVs after generation, and not the encoder PKVs (since they are re-used); those should be freed once they are no longer needed (after the last token is generated).

I will push the update soon for testing.

@gyagp

gyagp commented Sep 17, 2024

Sorry @xenova, I misunderstood your comment above and thought you already knew the root cause.
I should have been more explicit: what I meant is replaceTensors() in models.js, which should also dispose of the tensors (KV cache) from previous runs. Hopefully we can have a unified solution for all the related code in Transformers.js.

@xenova
Collaborator

xenova commented Sep 17, 2024

@gyagp Here's my attempted fix: 969d10e

but I don't think it fixes it entirely (cc @flatsiedatsie, maybe you can test?). I also disabled outputting the encoder results on the GPU, since I think this is where the leak comes from.

Install via source:

npm install xenova/transformers.js#v3

@gyagp

gyagp commented Sep 19, 2024

The current ORT WebGPU implementation is not optimal, and we can't reuse an input GPU buffer for an output. This means that for the KV cache, although we can keep it in GPU buffers to save copies, we still need to dispose of the unused ones explicitly to avoid a memory leak.
A typical usage pattern looks like this:
  • When creating the session, set preferredOutputLocation to "gpu-buffer" to keep the outputs (present*) on the GPU.
  • 1st run: inputs are on the CPU, and we feed the initial data.
  • 2nd run: assign the inputs (past_key_values*) from the corresponding outputs of the 1st run. Note that the inputs live on the GPU from now on.
  • From the 3rd run: dispose of the previous run's inputs, then assign this run's inputs from the previous run's outputs, as in the 2nd run.
  • Final run: dispose of the previous inputs and this run's outputs.

We need to further optimize this to reuse the GPU buffer, but developers may need a way to tell whether a buffer is reused or new.
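
A sketch of that run-by-run pattern (buildInitialFeeds, buildNextFeeds and isDone are hypothetical helpers; the session is assumed to have been created with preferredOutputLocation: 'gpu-buffer'):

let feeds = buildInitialFeeds();              // 1st run: all inputs live on the CPU
let previousFeeds = null;

while (true) {
    const outputs = await session.run(feeds);

    // From the 3rd run on, the previous run's past_key_values* inputs are GPU tensors
    // that nothing references any more, so release them.
    if (previousFeeds !== null) {
        for (const [name, tensor] of Object.entries(previousFeeds)) {
            if (name.startsWith('past_key_values.')) tensor.dispose();
        }
    }
    previousFeeds = feeds;

    if (isDone(outputs)) {
        // Final run: also release this run's inputs and its outputs.
        for (const [name, tensor] of Object.entries(previousFeeds)) {
            if (name.startsWith('past_key_values.')) tensor.dispose();
        }
        for (const tensor of Object.values(outputs)) tensor.dispose();
        break;
    }

    // Next run: feed this run's present* outputs back in as past_key_values* inputs.
    feeds = buildNextFeeds(outputs);
}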

@guschmue @fs-eire, please correct me if my understanding is wrong.

@flatsiedatsie
Contributor

flatsiedatsie commented Sep 19, 2024

It seems to have worked!

I'm transcribing a large video file, and memory use remains constant.

[screenshot: 2024-09-19 18:29]

@fs-eire
Contributor

fs-eire commented Sep 19, 2024

Current ORT WebGPU implementation is not optimal, and we can't reuse an input GPU buffer for an output. This means that for the KV cache, although we can keep it in GPU buffers to save copies, we still need to dispose of the unused ones explicitly to avoid a memory leak. A typical usage pattern: when creating the session, set preferredOutputLocation to "gpu-buffer" to keep the outputs (present*) on the GPU. 1st run: inputs are on the CPU, and we feed the initial data. 2nd run: assign the inputs (past_key_values*) from the corresponding outputs of the 1st run; note that the inputs live on the GPU from now on. From the 3rd run: dispose of the previous run's inputs, then assign this run's inputs from the previous run's outputs, as in the 2nd run. Final run: dispose of the previous inputs and this run's outputs.

We need to further optimize this to reuse the GPU buffer, but developers may need a way to tell whether a buffer is reused or new.

@guschmue @fs-eire, please correct me if my understanding is wrong.

Currently, the rule for a tensor's lifecycle is straightforward (from ORT's perspective):

  1. Anything created by the caller (e.g. Tensor.fromGpuBuffer()) should be released by the caller.
  2. If a GPU tensor is created by ORT (as an output), then the user needs to call dispose() on it when finished using it.

ort-web supports specifying a pre-allocated tensor as an output (not via preferredOutputLocation). If you know the type and shape of an output in advance, you can create a GPU tensor and call session.run with fetches so that onnxruntime-web will try to use that tensor. It may work to specify a user-created tensor backed by the same underlying GPU buffer for both the input and the output.

I think ORT explicitly disallows its allocator from reusing an input's buffer as an output's buffer (when preferredOutputLocation === 'gpu-buffer'). There are a few reasons:

  • The user may want to use the input's buffer after inference, or at least may assume the input is unchanged.
  • It's hard to determine whether an input's buffer can be reused, as this depends on the output's type and shape; it's also hard to determine which output reuses which input if the user doesn't specify this explicitly.
  • It would make the lifecycle rule above more complicated, as in (2) the user would need to call dispose() only on the outputs that didn't reuse an input's buffer.
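
For reference, specifying a pre-allocated output looks roughly like this (a sketch; the output name, dims and feeds are placeholders):

// Allocate a GPU buffer ourselves and wrap it in a tensor (rule 1: the caller releases it).
const dims = [1, 8, 64, 64];                           // placeholder shape
const device = ort.env.webgpu.device;                  // GPUDevice used by onnxruntime-web
const buffer = device.createBuffer({
    size: dims.reduce((a, b) => a * b, 1) * 4,         // float32
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST,
});
const preallocated = ort.Tensor.fromGpuBuffer(buffer, { dataType: 'float32', dims });

// Pass it in the fetches; onnxruntime-web will try to write that output into this tensor.
await session.run(feeds, { 'present.0.decoder.key': preallocated });

// We created the buffer, so we destroy it ourselves once we are done with it.
buffer.destroy();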

@gyagp

gyagp commented Sep 20, 2024

@fs-eire Thanks for the clarification, which is very helpful! I agree that reusing the input buffer as the output buffer would add a lot of complexity. However, the KV cache is fundamental for transformer-based models, so reusing the buffer would bring a good perf gain. That could be another topic to discuss further, but it's not related to this issue.
So @xenova, I think you may need to change your current design and treat the decoder and encoder the same way (keep both outputs on the GPU side).

@flatsiedatsie
Contributor

flatsiedatsie commented Sep 20, 2024

I just noticed this while running Moondream 2:

[screenshot: 2024-09-20 09:01]

It seems to have stopped after generating a single word.

It seems to happen when two Transformers.js workers are loaded simultaneously.

  • Using Whisper to transcribe voice (whisper remains loaded after the inference is complete).
  • Feeding the transcription as a prompt to Moondream2, running in a separate worker.

I'll quickly check if this is related to the commit or was always an issue that I just hadn't spotted.

// Nope, with the old version the issue does not occur.

Did another test: having Whisper and the Translation worker running simultaneously is not an issue.

// Just noticed that it has happened to the Whisper worker too:

[screenshot: 2024-09-20 10:11]

@xenova
Collaborator

xenova commented Sep 22, 2024

So @xenova, I think you may need to change your current design and treat the decoder and encoder the same way (keep both outputs on the GPU side).

@gyagp I agree; however, as described above, this is what seems to be causing the memory leak.

@gyagp

gyagp commented Sep 22, 2024

I think that in getPastKeyValues(), the KV caches for the encoder are not disposed of as expected. See the code below:

// Optimization introduced by optimum to reuse past key values. So, we just replace the constant
// outputs with the previous past key values.
// https://github.com/huggingface/optimum/blob/0bf2c05fb7e1182b52d21b703cfc95fd9e4ea3dc/optimum/onnxruntime/base.py#L677-L704
pkvs[newName] = pastKeyValues[newName];

@xenova
Collaborator

xenova commented Sep 22, 2024

This is because, after the first run, the decoder produces an empty tensor for the encoder PKVs (and we reuse the first encoder PKVs, so we should not dispose of them until the end).

@gyagp

gyagp commented Sep 22, 2024

The code above and the code below (a bug in ORT?) look a bit strange to me. I need to dig into the code a bit more next week.

// (otherwise, this causes a memory leak or throws an error "Error: previous buffer is not registered")
if (key.includes('encoder')) continue;

@flatsiedatsie
Contributor

@gyagp Did you by any chance manage to find anything?

@gyagp

gyagp commented Sep 29, 2024

This is because, after the first run, the decoder produces an empty tensor for the encoder PKVs (and we reuse the first encoder PKVs, so we should not dispose of them until the end).

I debugged the code today, and I think the current code has no memory leak.
I also tried with the latest ORT code, and "Error: previous buffer is not registered" should have been fixed by microsoft/onnxruntime#22254 (https://www.npmjs.com/package/onnxruntime-web/v/1.20.0-dev.20240928-1bda91fc57 should already include this fix). So "if (key.includes('encoder')) continue;" in models.js is no longer needed, and we can also keep the encoder KVs on the GPU, which is more performant.

@flatsiedatsie
Contributor

Interesting.

I tried a long transcription yesterday, and memory use was high but didn't seem to slowly grow. I checked because I was worried that switching to alpha 17, away from the version with the test fix that I had been relying on, would cause trouble.

@xenova
Collaborator

xenova commented Sep 30, 2024

This should now be fixed by https://www.npmjs.com/package/@huggingface/transformers/v/3.0.0-alpha.19! 🥳

@xenova xenova closed this as completed Sep 30, 2024
@MatteoFasulo
Author

This should now be fixed by https://www.npmjs.com/package/@huggingface/transformers/v/3.0.0-alpha.19! 🥳

Well done, thank you @xenova!
