
Resuming Training #4

Open
snrazavi opened this issue Feb 26, 2017 · 1 comment

Comments


snrazavi commented Feb 26, 2017

Hello,
I am trying to train a seq-att model for translation, but after 3 or 4 epochs the training process always stops unexpectedly (it gets killed). Moreover, when I try to resume training with the --model_in option I receive an out-of-memory error, regardless of how much GPU memory I allocate with --dynet-mem. I have a GTX 980 GPU with 8 gigabytes of graphics RAM. I should also add that the memory problem is even more severe when I use other optimization methods such as adam.
Many thanks in advance.
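
For reference, the resume invocation I use looks roughly like the sketch below; the executable name, model path, and memory figure are placeholders, and only --model_in and --dynet-mem are the actual options I am referring to:

```sh
# Rough sketch of resuming training from a saved model.
# <train-command> and the file path are placeholders;
# --model_in and --dynet-mem are the options mentioned above.
<train-command> \
    --model_in saved_model.out \
    --dynet-mem 6000    # request ~6 GB of GPU memory; reloading still runs out of memory
```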

neubig (Owner) commented Feb 26, 2017

I think these are two separate problems.

  1. The training process is stopping unexpectedly. I'll need more details about the exact error message you get in order to debug this.
  2. When re-loading models you're running out of memory. Unfortunately this is a known DyNet bug ("reallocating memory in dynet_load_model causes double memory consumption", clab/dynet#110), and it will probably not go away until that bug is fixed.
