
ProductionRuleField not compatible with multiple GPUs #2057

Closed
YoPatapon opened this issue Nov 15, 2018 · 10 comments

@YoPatapon
Contributor

System (please complete the following information):

  • OS: Linux
  • Python version: 3.7
  • AllenNLP version: 0.7.1
  • PyTorch version: 0.4.1

Question
I am training the MML parser on the WikiTables dataset. It runs out of CUDA memory if I use a batch size larger than 8 on a single Tesla P100, and even batch size 8 sometimes causes the out-of-memory error. How can I train on multiple GPUs with AllenNLP? I don't think the cuda_device field supports list input in the config file at the moment.

@matt-gardner
Contributor

cuda_device supports a list as input in the config file. Did you try this and it didn't work?
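For reference, a minimal sketch of the relevant trainer section of a config file; every key here other than cuda_device is illustrative and depends on your experiment:

  "trainer": {
    "num_epochs": 20,
    "optimizer": "adam",
    "cuda_device": [0, 1]
  }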

@YoPatapon
Contributor Author

Yes, I tested with cuda_device=[1, 2] and it raised an error:

RuntimeError: Expected object of type torch.cuda.LongTensor but found type torch.LongTensor for argument #3 'index'
2018-11-16 10:47:37,239 - INFO - allennlp.semparse.executors.wikitables_sempre_executor - Stopped SEMPRE server

@matt-gardner
Contributor

Can you give more of the stack trace?

@YoPatapon
Contributor Author

YoPatapon commented Nov 16, 2018

@matt-gardner The entire error trace is:

Traceback (most recent call last):
File "/home/v-dozhao/anaconda3/envs/py3.7torch/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/v-dozhao/anaconda3/envs/py3.7torch/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/v-dozhao/Work/allennlp/allennlp/run.py", line 18, in
main(prog="allennlp")
File "/home/v-dozhao/Work/allennlp/allennlp/commands/init.py", line 72, in main
args.func(args)
File "/home/v-dozhao/Work/allennlp/allennlp/commands/train.py", line 111, in train_model_from_args
args.force)
File "/home/v-dozhao/Work/allennlp/allennlp/commands/train.py", line 142, in train_model_from_file
return train_model(params, serialization_dir, file_friendly_logging, recover, force)
File "/home/v-dozhao/Work/allennlp/allennlp/commands/train.py", line 352, in train_model
metrics = trainer.train()
File "/home/v-dozhao/Work/allennlp/allennlp/training/trainer.py", line 755, in train
train_metrics = self._train_epoch(epoch)
File "/home/v-dozhao/Work/allennlp/allennlp/training/trainer.py", line 495, in _train_epoch
loss = self.batch_loss(batch, for_training=True)
File "/home/v-dozhao/Work/allennlp/allennlp/training/trainer.py", line 427, in batch_loss
output_dict = self._data_parallel(batch)
File "/home/v-dozhao/Work/allennlp/allennlp/training/trainer.py", line 414, in _data_parallel
outputs = parallel_apply(replicas, inputs, module_kwargs, used_device_ids)
File "/home/v-dozhao/anaconda3/envs/py3.7torch/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
raise output
File "/home/v-dozhao/anaconda3/envs/py3.7torch/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 53, in _worker
output = module(*input, **kwargs)
File "/home/v-dozhao/anaconda3/envs/py3.7torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/home/v-dozhao/Work/allennlp/allennlp/models/semantic_parsing/wikitables/wikitables_mml_semantic_parser.py", line 176, in forward
outputs)
File "/home/v-dozhao/Work/allennlp/allennlp/models/semantic_parsing/wikitables/wikitables_semantic_parser.py", line 288, in _get_initial_rnn_and_grammar_state
for i in range(batch_size)]
File "/home/v-dozhao/Work/allennlp/allennlp/models/semantic_parsing/wikitables/wikitables_semantic_parser.py", line 288, in
for i in range(batch_size)]
File "/home/v-dozhao/Work/allennlp/allennlp/models/semantic_parsing/wikitables/wikitables_semantic_parser.py", line 578, in _create_grammar_state
global_input_embeddings = self._action_embedder(global_action_tensor)
File "/home/v-dozhao/anaconda3/envs/py3.7torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/home/v-dozhao/Work/allennlp/allennlp/modules/token_embedders/embedding.py", line 129, in forward
sparse=self.sparse)
File "/home/v-dozhao/anaconda3/envs/py3.7torch/lib/python3.7/site-packages/torch/nn/functional.py", line 1110, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected object of type torch.cuda.LongTensor but found type torch.LongTensor for argument #3 'index'
2018-11-16 10:47:37,239 - INFO - allennlp.semparse.executors.wikitables_sempre_executor - Stopped SEMPRE server
[INFO/MainProcess] process shutting down

Is the format cuda_device=[1, 2] right?

@matt-gardner
Contributor

matt-gardner commented Nov 16, 2018

OK, thanks. This is a data type issue that I thought we had fixed in #1944; it looks like that commit was the first one not included in the 0.7.1 release. Can you try again from master and see if it fixes the issue?
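For anyone hitting this, here is a minimal sketch of the same failure mode (not the actual AllenNLP code path, and assuming a CUDA-capable machine): an embedding whose weights live on the GPU is indexed with a LongTensor that was never moved off the CPU, which is presumably what happens when the ProductionRuleField tensors are not placed on each replica's device.

  import torch
  import torch.nn as nn

  # Hypothetical reproduction of the failure mode, not AllenNLP's own code:
  # the embedding weights live on the GPU...
  embedder = nn.Embedding(num_embeddings=10, embedding_dim=4).cuda()

  # ...but the index tensor stays on the CPU.
  cpu_indices = torch.tensor([1, 2, 3], dtype=torch.long)
  # embedder(cpu_indices)  # raises the RuntimeError quoted above

  # Moving the indices to the weight's device avoids the error.
  gpu_indices = cpu_indices.to(embedder.weight.device)
  embeddings = embedder(gpu_indices)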

@YoPatapon
Contributor Author

YoPatapon commented Nov 17, 2018

@matt-gardner I rebuilt AllenNLP from source pulled from the master branch, but I still get the same error message when using multiple GPUs.

@matt-gardner
Contributor

OK, we'll need to look into this, but unfortunately it probably won't be soon. With holidays coming up and then the NAACL deadline, we don't really have time to dig into it ourselves right now.

@YoPatapon
Contributor Author

OK. Good luck with your NAACL!

@matt-gardner
Contributor

See #2199; not a fix yet, but I've at least diagnosed the problem.

@matt-gardner changed the title from "How to train using multiple gpus?" to "ProductionRuleField not compatible with multiple GPUs" on Dec 18, 2018
@matt-gardner
Contributor

This was fixed by #2200.
