Memory blowup #10

danpovey · 2020-11-13T13:14:33Z

Right now we are dealing with an issue in train.py where it uses more and more memory. It seems like stuff isn't getting freed that should be getting freed.

danpovey · 2020-11-14T07:20:08Z

This seems to be some kind of circular reference between the Python Fsa object and the function object for get_tot_scores (or its ctx), whereby an Fsa and the most recent _GetTotScoresFunction used on it are not deleted. Still debugging.
Can be reproduced just in get_tot_scores_test.py.

qindazhu · 2020-11-14T07:51:45Z

It's strange that Fsa._grad_cache is not kept in the leaked memory?

(Pdb) leftover[2].byrcs[6].referrers.byrcs[0].referents
Partition of a set of 4 objects. Total size = 900 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0      2  50      816  91       816  91 collections.OrderedDict
     1      1  25       56   6       872  97 _k2.RaggedArc
     2      1  25       28   3       900 100 int
(Pdb) leftover[2].byrcs[6].referrers.byrcs[0].referents.byvia
Partition of a set of 4 objects. Total size = 900 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0      1  25      424  47       424  47 "['_tensor_attr']"
     1      1  25      392  44       816  91 "['_non_tensor_attr']"
     2      1  25       56   6       872  97 "['arcs']"
     3      1  25       28   3       900 100 "['_properties']"

danpovey · 2020-11-14T07:56:00Z

It maybe just isn't being printed for some reason. It seems the problem is that the Fsa has the attribute e.g. 'tot_scores_tropical' which has a grad_fn _GetTotScoresFunctionBackward, which has a reference to the Fsa in its ctx.

csukuangfj · 2020-11-14T08:01:10Z

So what if we invoke del ctx.fsa inside the backward function to break the circular reference chain?

danpovey · 2020-11-14T08:16:34Z

That's an interesting idea but I don't like the solution because it will cause a leak if someone doesn't end up calling backward, e.g. because of a problem that required abandoning the minibatch.

danpovey · 2020-11-14T08:17:36Z

I am leaning towards, in the short term, just not having the FSA cache the total scores.
Also, how do you guys think about renaming update_xxx to get_xxx? It seems to me that they are not just updating
them, they are also returning them, so get is a better name.

csukuangfj · 2020-11-14T08:18:50Z

Renaming is fine with me.

danpovey · 2020-11-14T08:34:06Z

I am working on this.

zhichaowang mentioned this issue Apr 26, 2021

Convergence & decoding problem #170

Open

danpovey mentioned this issue Jun 21, 2021

CUDA out of memory #216

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memory blowup #10

Memory blowup #10

danpovey commented Nov 13, 2020

danpovey commented Nov 14, 2020

qindazhu commented Nov 14, 2020

danpovey commented Nov 14, 2020

csukuangfj commented Nov 14, 2020

danpovey commented Nov 14, 2020

danpovey commented Nov 14, 2020

csukuangfj commented Nov 14, 2020

danpovey commented Nov 14, 2020

Memory blowup #10

Memory blowup #10

Comments

danpovey commented Nov 13, 2020

danpovey commented Nov 14, 2020

qindazhu commented Nov 14, 2020

danpovey commented Nov 14, 2020

csukuangfj commented Nov 14, 2020

danpovey commented Nov 14, 2020

danpovey commented Nov 14, 2020

csukuangfj commented Nov 14, 2020

danpovey commented Nov 14, 2020