[aot.export] Potential Memory Leak #281
Comments
Oops. Thanks for the analysis. I suspect what is going on is that, internally, the mechanism creates a new class object that is not being immediately collected, and that class object may be inadvertently holding on to the state dict. As further evidence, we may want to insert an explicit call to `gc.collect()`. We're also slowly moving to get rid of that lower-level class mechanism, which was needed to bridge certain programming model issues in the early days. This analysis may raise the priority of that -- but I also expect it's more likely there is an easy hack that breaks the cycle when done with export. Just need to find the cycle.
Thanks for the reply! I think so too, the

As per the
Something nefarious is going on if GC.collect doesn't get it, because that should be handling cycles. This likely means that there is an unintended strong reference, not just a cycle.
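As an illustrative aside (plain Python, not iree-turbine code), here is a minimal sketch of the distinction above: `gc.collect()` can reclaim an unreachable reference cycle, but it cannot reclaim an object that is still reachable through a lingering strong reference.

```python
import gc
import weakref


class Node:
    """Placeholder object standing in for a module/state_dict holder."""


# Case 1: a pure reference cycle -- gc.collect() reclaims it once unreachable.
a, b = Node(), Node()
a.other, b.other = b, a
probe = weakref.ref(a)
del a, b
gc.collect()
print("cycle collected:", probe() is None)          # True

# Case 2: an unintended strong reference -- gc.collect() cannot help.
registry = []                                        # e.g. a module-level cache
c = Node()
registry.append(c)
probe = weakref.ref(c)
del c
gc.collect()
print("still referenced:", probe() is not None)      # True: held by `registry`
```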
looking into it
@stellaraccident
Well, this is a mystery indeed. I'm afraid I don't have another theory without putting hands on it. But if I were fishing, I would look at class garbage collection.
So I've had the time to do a little more digging, and I believe the issue is caused by the

If we were to run the above reproducer, the

Now I'm not entirely sure why this happens, but the

However, if I were to run the reproducer without the above 2 lines that register the finalizer, it passes the tests and the

Here is a memory usage graph of that; the last dip corresponds to a

Now, I don't know why that
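As another illustrative aside: a generic `weakref.finalize` sketch of how a registered finalizer can pin an object and its state. The class and method names here are placeholders, not the actual iree-turbine code path.

```python
import gc
import weakref


class Exporter:
    """Placeholder for the object that owns the big state (hypothetical)."""

    def __init__(self):
        self.state = bytearray(10 ** 6)   # stands in for a large state_dict

    def _cleanup(self):
        print("finalizer ran")


e = Exporter()
probe = weakref.ref(e)

# Pitfall: the callback is a bound method of `e`, so the finalize object keeps
# a strong reference back to `e` (and therefore to its state) until the
# finalizer is actually invoked or detached.
fin = weakref.finalize(e, e._cleanup)

del e
gc.collect()
print("released after del:", probe() is None)        # False: pinned by the finalizer

fin()                                                 # run (or fin.detach()) to release it
gc.collect()
print("released after finalize:", probe() is None)    # True
```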
@stellaraccident Could you maybe help out here? We are not fully sure whether this will still work as you intended with this change. Thanks!
Thanks for identifying the smoking gun. I'm on vacation for the next two weeks but will definitely take the time to follow your analysis and get a solution landed when I am able. I recall there being trickiness with that finalizer but I will need to refresh state to think through it properly.
Hi @stellaraccident, did you have time to look at this? Is there indeed some trickiness with the finalizer? If not, I would raise a PR to remove the finalizer registration.
Thanks for the reminder: I remembered this was outstanding but got buried after vacation. I'll have a look this weekend or first thing next week.
Hi,
I was trying to run a benchmark suite that involves exporting multiple `torch.nn.Module`s and realised that the `aot.export()` function might be causing a memory leak, resulting in the `state_dict` of the `nn.Module` and the `ExportedProgram` not being released even though they shouldn't be referenced anymore.

A concrete and minimal reproducer of the problem:
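A minimal sketch of such a reproducer (the model, input shape, iteration count, and the `iree.turbine.aot` import path are assumptions):

```python
# Export a few models in a loop and check whether earlier iterations' objects
# are actually released once nothing references them anymore.
import gc
import weakref

import torch
import torchvision.models as models

import iree.turbine.aot as aot   # older releases: `import shark_turbine.aot as aot`

probes = []
for i in range(3):
    model = models.resnet50()
    example_input = torch.randn(1, 3, 224, 224)

    exported = aot.export(model, example_input)

    probes.append(weakref.ref(model))
    # Drop every strong reference this iteration holds; ideally the model's
    # state_dict and the ExportedProgram become collectable right here.
    del exported, model
    gc.collect()
    print(f"iteration {i}: released models so far:",
          [p() is None for p in probes])
```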
In the second iteration of the loop, one would expect the first model and the exported program objects to be released from memory, but running the program with `memray` begs to differ:

And here are the functions that allocate memory that hasn't been released within the time frame:
There are 2 copies of the `state_dict` and the `ExportedProgram` being kept in memory above, but to better observe the (de)allocations, the above `memray` graphs can also be reproduced as follows (one has to install the dependencies of `iree-turbine` + `memray`):
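A minimal sketch of one way to capture this with memray's Python tracker API (the capture file name and loop are placeholders; the `memray flamegraph` / `memray table` commands then render the reports):

```python
import memray

import torch
import torchvision.models as models

import iree.turbine.aot as aot   # assumption: same import path as above

# Record every allocation made inside the export loop into a capture file.
with memray.Tracker("aot_export_leak.bin"):
    for _ in range(3):
        model = models.resnet50()
        exported = aot.export(model, torch.randn(1, 3, 224, 224))
        del exported, model

# Afterwards, render the graphs from the capture file:
#   memray flamegraph aot_export_leak.bin
#   memray table aot_export_leak.bin
```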
Now, I would like to take on the issue myself, but before I dive into it I wanted to ask for any pointers that could be useful, or whether I'm missing something. Any help or pointer to where the problem might be is much appreciated.
P.S.: One can use any `nn.Module`; I used resnet-50 because it is big enough to make the leak visible in the memray graph.