[Severe] Memory leak issue under WebGPU Whisper transcribe pipeline #860
Comments
Thanks to everyone for testing! Perhaps @guschmue and the ORT team can do some additional profiling to see what's going wrong.
@MatteoFasulo I'm using a MacBook, so I don't have that option. Also, my pipeline uses three models at the same time, which could perhaps play a role. My whisper worker is also using segmentation and verification models at the same time as whisper timestamped. Though those run via WASM, so... hmm.
Right now I do not have all the data, but in my case dispose was not working as intended. I tried to call dispose at the end of each task, but GPU memory consumption stayed the same as if dispose had not been applied.
@xenova I tested the default whisper-tiny as well as whisper-small.en_timestamped using the ORT Web Perf tool by @guschmue. However, the tool does not provide any information about memory footprint or GPU memory consumption. Do you have any suggestions for checking whether the error depends only on the Whisper models, or whether there is something else that does not clear used memory before starting new tasks?
@xenova is there anyone we can poke or ways we can support getting this looked into? It looks like v3 is right around the corner, and it would be incredible to have a solution for this before the release.
Sorry for the late response; I can take a look at this. @MatteoFasulo Could you please share a simple example that reproduces this issue? Thanks!
@gyagp I've repro'ed this locally by:
The screenshot below is from Activity Monitor on macOS after 2 minutes of running the model in Chrome.
Hi, indeed, you can follow those steps to reproduce the error and investigate the potential cause. As a side note, in my specific case I also attempted to call the dispose() method. After calling this function, the tensor is no longer valid, and its location is set to 'none'. Additionally, I noticed there is a …
100% agree! I might have an idea what the problem is and how to fix this. Fingers crossed 🤞
Any thoughts on your end @gyagp?
@deanihansen, thanks for the instructions, and I can already reproduce the issue on my side. Now I'm a bit stuck on an ONNX Runtime build issue on Windows. I will treat this as high priority. Please stay tuned.
@xenova It seems you keep the kv cache in gpu-buffer for better performance, but you don't call tensor.dispose() to release it?
@MatteoFasulo Were you able to make it work with any solution or workaround?
Unfortunately, no, I haven't been able to make it work with any solution or workaround yet.
@MatteoFasulo one thing I was planning to do: run it in a background worker (I'm building an extension, so an offscreen tab) and kill that tab (a worker on the web) and respawn it, as sketched below. It would increase latency, since the model needs to be reloaded, and we would also have to do proper batching with some VAD before sending audio. This is a brute-force approach; what are the things that could go wrong here? (P.S. this would still be much faster than using the CPU.)
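A rough sketch of the web-worker version of this workaround, under the assumption that a hypothetical transcriber-worker.js loads the Whisper pipeline and replies with a transcript per audio chunk; terminating the worker drops its WebGPU device (and any leaked buffers), at the cost of reloading the model:

```js
// Hypothetical worker file name and message shape, for illustration only.
let worker = new Worker('transcriber-worker.js', { type: 'module' });

function transcribe(audioChunk, { recycle = false } = {}) {
  if (recycle) {
    worker.terminate(); // drops the WebGPU device and any leaked GPU buffers
    worker = new Worker('transcriber-worker.js', { type: 'module' }); // model reloads here
  }
  return new Promise((resolve, reject) => {
    worker.onmessage = (e) => resolve(e.data);
    worker.onerror = (e) => reject(e);
    worker.postMessage(audioChunk);
  });
}
```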
That sounds reasonable, but I would suggest focusing on fixing the issue within the framework itself rather than creating external workarounds (though what you're doing now is fine as a temporary solution). I still believe there might be an error during computation that isn't properly disposing of tensors, leading to gradual memory allocation increases over time.
Like I mentioned in the comment above, the leak comes from the kv cache. Currently, for performance, transformers.js keeps the kv cache in gpu-buffer; as a result, onnxruntime no longer owns the related tensors and it is the user's responsibility to release the buffers after use.
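For readers following along, a minimal sketch of that ownership rule, assuming onnxruntime-web with the WebGPU execution provider; the model path and feeds are placeholders, not the actual transformers.js code:

```js
import * as ort from 'onnxruntime-web/webgpu';

// Placeholder model path, for illustration only.
const session = await ort.InferenceSession.create('decoder_model_merged.onnx', {
  executionProviders: ['webgpu'],
  preferredOutputLocation: 'gpu-buffer', // keep outputs (e.g. kv cache) on the GPU
});
const feeds = { /* input-name → ort.Tensor, prepared elsewhere */ };

const outputs = await session.run(feeds);
// Each output now wraps a GPU buffer that ORT no longer owns.
console.log(outputs.logits.location); // 'gpu-buffer'

// ...use the outputs, e.g. feed present.* back as past_key_values.*...

// The caller must release every GPU-located output it no longer needs,
// otherwise each run leaks one set of buffers.
for (const name of Object.keys(outputs)) {
  outputs[name].dispose(); // afterwards, the tensor's location becomes 'none'
}
```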
Ok, thanks for the clarification. I'll wait for that to be fixed by @xenova :)
Can you suggest where in the code a call to dispose() should be added?
Okay, I believe I have figured it out. Basically, I was only freeing the decoder PKVs after generation, and not the encoder PKVs (since they are re-used), but they should be freed once no longer needed (after the last token is generated). I will push the update soon for testing.
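As a sketch of that dispose policy (a hypothetical helper, not the actual commit): decoder PKVs can be released as soon as they are superseded each step, while the re-used encoder PKVs are released only once generation is finished.

```js
// Hypothetical helper illustrating the dispose policy only.
function disposePastKeyValues(pastKeyValues, isGenerationFinished) {
  for (const [name, tensor] of Object.entries(pastKeyValues)) {
    const isEncoderCache = name.includes('encoder');
    // Decoder caches are superseded every step and can go immediately;
    // encoder caches are re-used across steps, so they must survive
    // until the last token has been generated.
    if (!isEncoderCache || isGenerationFinished) {
      tensor.dispose();
    }
  }
}
```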
Sorry @xenova, I misunderstood your comment above and thought you already knew the root cause.
@gyagp Here's my attempted fix: 969d10e, but I don't think it fixes it entirely (cc @flatsiedatsie, maybe you can test?). I also disabled outputting the encoder results on the GPU, since I think this is where the leak comes from. Install from source: npm install xenova/transformers.js#v3
The current ORT WebGPU implementation is not optimal, and we can't reuse the input GPU buffer for the output. This means that for the kv cache, although we can keep it in GPU buffers to save copies, we still need to dispose of unused tensors explicitly to avoid a memory leak. We need to further optimize this to reuse the GPU buffer, but developers may then need a way to tell whether it's a reused buffer or a new one. @guschmue @fs-eire, please correct me if my understanding is wrong.
Currently, the rule for a tensor's lifecycle is straightforward (from ORT's perspective):
ort-web supports specifying a pre-allocated tensor as output (not via …). I think ORT explicitly disallows its allocator from reusing an input's buffer as an output's buffer (when …).
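For context, this is roughly what the pre-allocated-output mechanism looks like in onnxruntime-web, using the fetches argument of session.run(); the output name, shape, and buffer usage flags below are assumptions for illustration, not taken from this issue:

```js
import * as ort from 'onnxruntime-web/webgpu';

// `session`, `feeds` and `device` (the GPUDevice used by ORT's WebGPU
// backend) are assumed to exist; 'last_hidden_state' and [1, 1500, 384]
// are a made-up output name and shape.
async function runWithPreallocatedOutput(session, feeds, device) {
  const dims = [1, 1500, 384];
  const gpuBuffer = device.createBuffer({
    size: dims.reduce((a, b) => a * b, 1) * Float32Array.BYTES_PER_ELEMENT,
    // Usage flags are an assumption about what the backend requires.
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST,
  });

  // Wrap the pre-allocated buffer and pass it via the `fetches` argument,
  // so ORT writes into it instead of allocating a fresh output buffer.
  const prealloc = ort.Tensor.fromGpuBuffer(gpuBuffer, { dataType: 'float32', dims });
  return session.run(feeds, { last_hidden_state: prealloc });
}
```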
@fs-eire Thanks for the clarification, which is very helpful! I agree that reusing the input buffer as the output buffer would add a lot of complexity. However, the kv cache is fundamental for transformer-based models, so reusing the buffer would bring a good perf gain. This could be another topic we need to discuss further, but it's not related to this issue.
I think in getPastKeyValues(), the kv caches for the encoder are not disposed as expected. See the code below:
This is because after the first run, the decoder produces an empty tensor for encoder PKVs (and we reuse the first encoder PKVs, so we should not dispose them until the end). |
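To illustrate what is being described (a rough paraphrase for readers, not the actual transformers.js source): after the first pass, the decoder's encoder-related present.* outputs come back empty, so the cache-merging step keeps the encoder entries from the first pass instead of disposing of them there.

```js
// Rough illustration only; the names and the `size === 0` check are
// assumptions, not the real getPastKeyValues() code.
function mergePastKeyValues(decoderOutputs, previousPastKeyValues) {
  const past = {};
  for (const name in decoderOutputs) {
    if (!name.startsWith('present')) continue;
    const pastName = name.replace('present', 'past_key_values');
    const reuseEncoderCache =
      previousPastKeyValues && name.includes('encoder') && decoderOutputs[name].size === 0;
    // The encoder cache from the first pass is re-used here, which is why
    // it must not be disposed until generation has completely finished.
    past[pastName] = reuseEncoderCache
      ? previousPastKeyValues[pastName]
      : decoderOutputs[name];
  }
  return past;
}
```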
The code above and the code below (a bug in ORT?) look a bit strange to me. I need to dig into the code a bit more next week.
@gyagp Did you by any chance manage to find anything?
I debugged the code today, and I think the current code has no memory leak.
Interesting. I tried a long transcription yesterday, and memory use was high, but it didn't seem to slowly grow. I checked because I was worried that switching to Alpha 17 (away from the version with the test fix that I had been relying on) would cause trouble.
This should now be fixed by https://www.npmjs.com/package/@huggingface/transformers/v/3.0.0-alpha.19! 🥳
Well done, thank you @xenova!
System Info
Using transformers.js v3 in latest Chrome release on Windows 10.
GPU: Nvidia GTX 1080 (8GB)
Environment/Platform
Description
Transcribing with a Whisper model under WebGPU does not dispose of tensors after the pipeline finishes. Checked this with nvidia-smi while transcribing a .wav file into text. Memory consumption keeps growing until it either goes out of memory (on smaller GPUs) or loses the device mid-computation (producing a console error saying 'device is lost').
Reproduction
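The original reproduction steps are not preserved here. Purely as an illustration, a loop of the following shape, assuming the transformers.js v3 pipeline API with device: 'webgpu' and a placeholder model id and audio path, exercises the pipeline repeatedly in the way described in this issue while GPU memory is watched with nvidia-smi:

```js
import { pipeline } from '@huggingface/transformers';

// Placeholder model id and audio path, for illustration only.
const transcriber = await pipeline(
  'automatic-speech-recognition',
  'Xenova/whisper-tiny.en',
  { device: 'webgpu' },
);

// Transcribe the same clip repeatedly while watching GPU memory;
// per this issue, usage grows run over run until OOM or device loss.
for (let i = 0; i < 100; ++i) {
  await transcriber('test.wav');
}
```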
Ideas (from Ratchet)
I spotted this great inference architecture at https://github.com/huggingface/ratchet/blob/master/ARCHITECTURE.md, in which memory consumption for encoder-decoder models like Whisper is reduced by supporting both static and dynamic graphs, so that the encoder is completely static while the decoder runs under a dynamic graph due to KV caching.