This repository has been archived by the owner on Oct 13, 2022. It is now read-only.

Memory blowup #10

Open
danpovey opened this issue Nov 13, 2020 · 8 comments
Comments

@danpovey
Contributor

Right now we are dealing with an issue in train.py where it uses more and more memory. It seems like stuff isn't getting freed that should be getting freed.

@danpovey
Contributor Author

This seems to be some kind of circular reference between the Python Fsa object and the function object for get_tot_scores (or its ctx), whereby an Fsa and the most recent _GetTotScoresFunction used on it are not deleted. Still debugging.
Can be reproduced just in get_tot_scores_test.py.

@qindazhu
Collaborator

It's strange that Fsa._grad_cache does not show up among the leaked objects:

(Pdb) leftover[2].byrcs[6].referrers.byrcs[0].referents
Partition of a set of 4 objects. Total size = 900 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      2  50      816  91       816  91 collections.OrderedDict
     1      1  25       56   6       872  97 _k2.RaggedArc
     2      1  25       28   3       900 100 int
(Pdb) leftover[2].byrcs[6].referrers.byrcs[0].referents.byvia
Partition of a set of 4 objects. Total size = 900 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0      1  25      424  47       424  47 "['_tensor_attr']"
     1      1  25      392  44       816  91 "['_non_tensor_attr']"
     2      1  25       56   6       872  97 "['arcs']"
     3      1  25       28   3       900 100 "['_properties']"

@danpovey
Contributor Author

Maybe it just isn't being printed for some reason. The problem seems to be that the Fsa has an attribute, e.g. 'tot_scores_tropical', whose grad_fn is _GetTotScoresFunctionBackward, which in turn holds a reference to the Fsa in its ctx.
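
To make the shape of that cycle concrete, here is a minimal pure-Python sketch. FakeFsa, FakeTensor and FakeCtx are hypothetical stand-ins, not the real k2 or PyTorch classes. One caveat: in this toy version Python's cycle collector can still reclaim the objects; the assumption is that in the real case part of the chain runs through PyTorch's C++ autograd graph, where the collector cannot follow it.

```python
import gc
import weakref


class FakeCtx:
    """Hypothetical stand-in for the ctx of _GetTotScoresFunction."""

    def __init__(self, fsa):
        self.fsa = fsa  # ctx -> Fsa


class FakeTensor:
    """Hypothetical stand-in for a tensor carrying a grad_fn."""

    def __init__(self, grad_fn):
        self.grad_fn = grad_fn  # tensor -> ctx


class FakeFsa:
    """Hypothetical stand-in for the Python Fsa object."""

    def get_tot_scores(self):
        scores = FakeTensor(grad_fn=FakeCtx(self))
        # Caching the result on the Fsa closes the loop:
        # Fsa -> tensor -> ctx -> Fsa
        self.tot_scores_tropical = scores
        return scores


def cycle_survives_del():
    """Return (alive after del, alive after an explicit gc.collect())."""
    gc.disable()  # rely on reference counting only
    try:
        fsa = FakeFsa()
        fsa.get_tot_scores()
        ref = weakref.ref(fsa)
        del fsa
        alive_after_del = ref() is not None
        gc.collect()  # the cycle collector can break a pure-Python cycle
        alive_after_gc = ref() is not None
        return alive_after_del, alive_after_gc
    finally:
        gc.enable()
```

With reference counting alone the Fsa stays alive after `del fsa`, which matches the symptom of an Fsa and its most recent function object never being deleted.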

@csukuangfj
Collaborator

So what if we invoke del ctx.fsa inside the backward function to break the circular reference chain?

@danpovey
Contributor Author

That's an interesting idea, but I don't like it as a solution because it will still leak if someone never ends up calling backward, e.g. because of a problem that required abandoning the minibatch.
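
A rough sketch of both sides of this trade-off, again with hypothetical stand-in classes rather than the real k2 code: dropping ctx.fsa inside backward does break the cycle, but only on the code path where backward actually runs.

```python
import gc
import weakref


class FakeCtx:
    """Hypothetical stand-in for the autograd ctx."""

    def __init__(self, fsa):
        self.fsa = fsa


class FakeTensor:
    """Hypothetical stand-in for a tensor carrying a grad_fn."""

    def __init__(self, grad_fn):
        self.grad_fn = grad_fn


class FakeFsa:
    """Hypothetical stand-in for an Fsa that caches its total scores."""

    def get_tot_scores(self):
        scores = FakeTensor(grad_fn=FakeCtx(self))
        self.tot_scores_tropical = scores  # cached: creates the cycle
        return scores


def fake_backward(ctx):
    """Stand-in for the backward function with the proposed fix."""
    del ctx.fsa  # break ctx -> Fsa, so the cycle is gone


def fsa_freed_without_gc(call_backward):
    """Return True if the Fsa is freed by reference counting alone."""
    gc.disable()
    try:
        fsa = FakeFsa()
        scores = fsa.get_tot_scores()
        if call_backward:
            fake_backward(scores.grad_fn)
        ref = weakref.ref(fsa)
        del fsa, scores
        return ref() is None
    finally:
        gc.collect()  # clean up the leftover cycle in the leaking case
        gc.enable()
```

Here `fsa_freed_without_gc(True)` frees the Fsa, while `fsa_freed_without_gc(False)` models the abandoned-minibatch case where the reference is never dropped.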

@danpovey
Contributor Author

I am leaning towards, in the short term, just not having the FSA cache the total scores.
Also, what do you guys think about renaming update_xxx to get_xxx? It seems to me that they are not just updating them, they are also returning them, so get is a better name.
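
For the short-term option of not caching the total scores, a sketch with hypothetical stand-in classes (not the real k2 API) shows why it avoids the problem: the returned scores still reference the Fsa through the ctx, but the Fsa no longer points back, so there is no cycle and everything is freed once the caller drops the scores.

```python
import gc
import weakref


class FakeCtx:
    """Hypothetical stand-in for the autograd ctx."""

    def __init__(self, fsa):
        self.fsa = fsa


class FakeTensor:
    """Hypothetical stand-in for a tensor carrying a grad_fn."""

    def __init__(self, grad_fn):
        self.grad_fn = grad_fn


class FakeFsa:
    """Hypothetical stand-in for an Fsa that does NOT cache its scores."""

    def get_tot_scores(self):
        # Returned to the caller only; nothing is stored on self.
        return FakeTensor(grad_fn=FakeCtx(self))


def uncached_fsa_lifetime():
    """Return (alive while scores exist, freed once scores are dropped)."""
    gc.disable()  # rely on reference counting only
    try:
        fsa = FakeFsa()
        scores = fsa.get_tot_scores()
        ref = weakref.ref(fsa)
        del fsa
        alive_while_scores_live = ref() is not None  # ctx keeps it alive
        del scores
        freed_after_scores_dropped = ref() is None
        return alive_while_scores_live, freed_after_scores_dropped
    finally:
        gc.enable()
```

The cost of this approach is that repeated calls recompute the scores instead of reusing a cached value.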

@csukuangfj
Collaborator

Renaming is fine with me.

@danpovey
Contributor Author

I am working on this.
