Trainer hangs after the last epoch, tester raises an error #259

Zhiyu-Chen · 2019-12-08T10:07:38Z

I am running the matching_esim.py code. After the last epoch, trainer.train() never returns and just hangs.

Using a validation set, or running tester.test(), also raises an error:

Traceback (most recent call last):
File "", line 2, in
File "/home/colozoy/.conda/envs/p3/lib/python3.7/site-packages/fastNLP/core/tester.py", line 165, in test
pred_dict = self._data_forward(self._predict_func, batch_x)
File "/home/colozoy/.conda/envs/p3/lib/python3.7/site-packages/fastNLP/core/tester.py", line 213, in _data_forward
y = self._predict_func_wrapper(**x)
File "/home/colozoy/.conda/envs/p3/lib/python3.7/site-packages/fastNLP/core/_parallel_utils.py", line 91, in wrapper
outputs = parallel_apply(replicas, func_name, inputs, kwargs, device_ids[:len(replicas)])
File "/home/colozoy/.conda/envs/p3/lib/python3.7/site-packages/fastNLP/core/_parallel_utils.py", line 71, in parallel_apply
raise output
File "/home/colozoy/.conda/envs/p3/lib/python3.7/site-packages/fastNLP/core/_parallel_utils.py", line 47, in worker
output = getattr(module, func_name)(*input, **kwargs)
File "/home/colozoy/.conda/envs/p3/lib/python3.7/site-packages/fastNLP/models/snli.py", line 118, in predict
pred = self.forward(**kwargs)[Const.OUTPUT].argmax(-1)
File "/home/colozoy/.conda/envs/p3/lib/python3.7/site-packages/fastNLP/models/snli.py", line 84, in forward
a = self.rnn(a0, mask1.byte()) # a: [B, PL, 2 * H]
File "/home/colozoy/.conda/envs/p3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/colozoy/.conda/envs/p3/lib/python3.7/site-packages/fastNLP/models/snli.py", line 167, in forward
self.rnn.flatten_parameters()
File "/home/colozoy/.conda/envs/p3/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 120, in flatten_parameters
self.batch_first, bool(self.bidirectional))
RuntimeError: set_storage is not allowed on a Tensor created from .data or .detach().
If your intent is to change the metadata of a Tensor (such as sizes / strides / storage / storage_offset)
without autograd tracking the change, remove the .data / .detach() call and wrap the change in a with torch.no_grad(): block.
For example, change:
    x.data.set_(y)
to:
    with torch.no_grad():
        x.set_(y)
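
For reference, the pattern that the error message recommends can be sketched in isolation as below. This is only an illustration of the message's advice on two throwaway tensors (`x` and `y` are made up for the example). It is not a fix for the traceback above, where the failing call sits inside PyTorch's own `flatten_parameters`.

```python
import torch

# Two throwaway tensors, purely for illustration.
x = torch.zeros(3)
y = torch.ones(3)

# Form recommended by the error message: no .data / .detach(),
# with the metadata change wrapped in torch.no_grad().
with torch.no_grad():
    x.set_(y)  # x now uses y's storage, size and strides

# After set_, both tensors point at the same underlying memory.
shares_storage = x.data_ptr() == y.data_ptr()
```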

Zhiyu-Chen · 2019-12-08T23:03:38Z

The bug where training hangs suddenly went away for no apparent reason.
The tester bug is still there, though. It may be a DataParallel problem, i.e. a PyTorch bug: pytorch/pytorch#21108

xuyige · 2019-12-12T08:48:59Z

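Hello, and thank you for the issue.
The code in this file was written against version 1.0.0. Some problems appeared after updating to 1.3.0, so I made a small change to it earlier. For now I recommend running this script with version 1.2.0 or below. I have not used 1.3.1 myself, so I cannot say for sure; 1.3.0 is confirmed to be able to trigger this problem.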
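
A quick way to check an environment against this recommendation might look like the sketch below. It assumes the version numbers above refer to PyTorch, which the comment does not state explicitly. The helper names `parse_version` and `is_known_good` are hypothetical and not part of fastNLP or PyTorch.

```python
import re

# Assumption: "1.2.0 or below" in the comment above refers to PyTorch versions.
LAST_KNOWN_GOOD = (1, 2, 0)


def parse_version(version):
    """Turn '1.3.0', '1.3.0+cu100' or '1.2' into a tuple of three ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def is_known_good(version):
    """True if the version is at or below the last version reported to work."""
    return parse_version(version) <= LAST_KNOWN_GOOD


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        status = "ok" if is_known_good(torch.__version__) else "may hit the tester error"
        print(torch.__version__, status)
```

Versions above 1.3.0 are reported as possibly affected only because they are untested in this thread, not because the problem is confirmed there.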

xuyige closed this as completed Dec 13, 2019
