
ProductionRuleField not compatible with multiple GPUs #2057

Closed
YoPatapon opened this issue Nov 15, 2018 · 10 comments

@YoPatapon
Contributor

System (please complete the following information):

  • OS: Linux
  • Python version: 3.7
  • AllenNLP version: 0.7.1
  • PyTorch version: 0.4.1

Question
I am training the MML parser on the WikiTables dataset. It runs out of CUDA memory if I use a batch size larger than 8 on a single Tesla P100, and even batch size 8 sometimes causes the out-of-memory error. How can I train on multiple GPUs with AllenNLP? I don't think the cuda_device field supports list input in the config file at the moment.

@matt-gardner
Contributor

cuda_device supports a list as input in the config file. Did you try this and it didn't work?
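For reference, a minimal sketch of the relevant trainer section of a config file; every key here other than cuda_device is illustrative and depends on your experiment:

  "trainer": {
    "num_epochs": 20,
    "optimizer": "adam",
    "cuda_device": [0, 1]
  }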

@YoPatapon
Contributor Author

Yes, I tested with cuda_device=[1, 2] and it raised an error:

RuntimeError: Expected object of type torch.cuda.LongTensor but found type torch.LongTensor for argument #3 'index'
2018-11-16 10:47:37,239 - INFO - allennlp.semparse.executors.wikitables_sempre_executor - Stopped SEMPRE server

@matt-gardner
Contributor

Can you give more of the stack trace?

@YoPatapon
Contributor Author

YoPatapon commented Nov 16, 2018

@matt-gardner The entire error trace is:

Traceback (most recent call last):
File "/home/v-dozhao/anaconda3/envs/py3.7torch/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/v-dozhao/anaconda3/envs/py3.7torch/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/v-dozhao/Work/allennlp/allennlp/run.py", line 18, in
main(prog="allennlp")
File "/home/v-dozhao/Work/allennlp/allennlp/commands/init.py", line 72, in main
args.func(args)
File "/home/v-dozhao/Work/allennlp/allennlp/commands/train.py", line 111, in train_model_from_args
args.force)
File "/home/v-dozhao/Work/allennlp/allennlp/commands/train.py", line 142, in train_model_from_file
return train_model(params, serialization_dir, file_friendly_logging, recover, force)
File "/home/v-dozhao/Work/allennlp/allennlp/commands/train.py", line 352, in train_model
metrics = trainer.train()
File "/home/v-dozhao/Work/allennlp/allennlp/training/trainer.py", line 755, in train
train_metrics = self._train_epoch(epoch)
File "/home/v-dozhao/Work/allennlp/allennlp/training/trainer.py", line 495, in _train_epoch
loss = self.batch_loss(batch, for_training=True)
File "/home/v-dozhao/Work/allennlp/allennlp/training/trainer.py", line 427, in batch_loss
output_dict = self._data_parallel(batch)
File "/home/v-dozhao/Work/allennlp/allennlp/training/trainer.py", line 414, in _data_parallel
outputs = parallel_apply(replicas, inputs, module_kwargs, used_device_ids)
File "/home/v-dozhao/anaconda3/envs/py3.7torch/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
raise output
File "/home/v-dozhao/anaconda3/envs/py3.7torch/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 53, in _worker
output = module(*input, **kwargs)
File "/home/v-dozhao/anaconda3/envs/py3.7torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/home/v-dozhao/Work/allennlp/allennlp/models/semantic_parsing/wikitables/wikitables_mml_semantic_parser.py", line 176, in forward
outputs)
File "/home/v-dozhao/Work/allennlp/allennlp/models/semantic_parsing/wikitables/wikitables_semantic_parser.py", line 288, in _get_initial_rnn_and_grammar_state
for i in range(batch_size)]
File "/home/v-dozhao/Work/allennlp/allennlp/models/semantic_parsing/wikitables/wikitables_semantic_parser.py", line 288, in
for i in range(batch_size)]
File "/home/v-dozhao/Work/allennlp/allennlp/models/semantic_parsing/wikitables/wikitables_semantic_parser.py", line 578, in _create_grammar_state
global_input_embeddings = self._action_embedder(global_action_tensor)
File "/home/v-dozhao/anaconda3/envs/py3.7torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/home/v-dozhao/Work/allennlp/allennlp/modules/token_embedders/embedding.py", line 129, in forward
sparse=self.sparse)
File "/home/v-dozhao/anaconda3/envs/py3.7torch/lib/python3.7/site-packages/torch/nn/functional.py", line 1110, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected object of type torch.cuda.LongTensor but found type torch.LongTensor for argument #3 'index'
2018-11-16 10:47:37,239 - INFO - allennlp.semparse.executors.wikitables_sempre_executor - Stopped SEMPRE server
[INFO/MainProcess] process shutting down

Is the format cuda_device=[1, 2] right?

@matt-gardner
Contributor

matt-gardner commented Nov 16, 2018

OK, thanks. This is a data type issue that I thought we had fixed in #1944; it looks like that commit was the first one not included in the 0.7.1 release. Can you try again from master and see if it fixes the issue?
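For anyone hitting this, here is a minimal sketch of the same failure mode (not the actual AllenNLP code path, and assuming a CUDA-capable machine): an embedding whose weights live on the GPU is indexed with a LongTensor that was never moved off the CPU, which is presumably what happens when the ProductionRuleField tensors are not placed on each replica's device.

  import torch
  import torch.nn as nn

  # Hypothetical reproduction of the failure mode, not AllenNLP's own code:
  # the embedding weights live on the GPU...
  embedder = nn.Embedding(num_embeddings=10, embedding_dim=4).cuda()

  # ...but the index tensor stays on the CPU.
  cpu_indices = torch.tensor([1, 2, 3], dtype=torch.long)
  # embedder(cpu_indices)  # raises the RuntimeError quoted above

  # Moving the indices to the weight's device avoids the error.
  gpu_indices = cpu_indices.to(embedder.weight.device)
  embeddings = embedder(gpu_indices)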

@YoPatapon
Contributor Author

YoPatapon commented Nov 17, 2018

@matt-gardner I rebuilt AllenNLP from source pulled from the master branch, but I still get the same error message when using multiple GPUs.

@matt-gardner
Contributor

OK, we'll need to look into this, but unfortunately it probably won't be soon. With holidays coming up and then the NAACL deadline, we don't really have time to dig into it ourselves right now.

@YoPatapon
Contributor Author

OK. Good luck with your NAACL!

@matt-gardner
Contributor

See #2199; not a fix yet, but I've at least diagnosed the problem.

@matt-gardner changed the title from "How to train using multiple gpus?" to "ProductionRuleField not compatible with multiple GPUs" on Dec 18, 2018
@matt-gardner
Contributor

This was fixed by #2200.
