[DataParallel] flatten_parameters doesn't work under torch.no_grad #21108
Comments
I met a very similar bug with When applying
You can run the code below on 1.2.0/1.3.0 to reproduce the behavior:

import torch
from torch.nn.parallel import data_parallel

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = torch.nn.LSTM(300, 1024, 1, batch_first=True, bidirectional=True)

    def forward(self, x):
        self.rnn.flatten_parameters()
        return self.rnn(x)  # N * T * hidden_dim

model = Model().to('cuda')
x = torch.rand(4, 52, 300, device='cuda')
with torch.no_grad():
    data_parallel(model, x, range(2))

Environment
PyTorch version: 1.2.0/1.3.0
OS: CentOS 7
Python version: 3.7
GPU models and configuration:
Guys, I think the issue is somehow related to how GRU/LSTM internally deal with the hidden/cell states when they are None. For example, the following code works on 1.2.0 and 1.3.0:

import torch
from torch.nn.parallel import data_parallel

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
num_gpu = torch.cuda.device_count()
print('Number of GPUs Available:', num_gpu)

def initHidden(batch_size, bidirectional, hidden_size, num_layers, device, num_gpu):
    '''
    This function is used to create an init vector for GRU/LSTMs
    '''
    if bidirectional:
        num_directions = 2
    else:
        num_directions = 1
    if num_gpu > 1:
        # DataParallel splits on dim=0 by default, so we create the states
        # batch-first here and transpose them inside the model forward
        hidden = torch.zeros(batch_size, num_layers * num_directions, hidden_size, device=device)
        initial_cell = torch.zeros(batch_size, num_layers * num_directions, hidden_size, device=device)
        return hidden, initial_cell
    else:
        hidden = torch.zeros(num_layers * num_directions, batch_size, hidden_size, device=device)
        initial_cell = torch.zeros(num_layers * num_directions, batch_size, hidden_size, device=device)
        return hidden, initial_cell

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = torch.nn.GRU(300, 1024, 1, batch_first=True, bidirectional=True)

    def forward(self, x, hidden):
        if self.training:
            self.rnn.flatten_parameters()
        return self.rnn(x, hidden.permute(1, 0, 2).contiguous())  # N * T * hidden_dim

model = Model()
if num_gpu > 1:
    model = torch.nn.DataParallel(model)
model = model.to(device)
x = torch.rand(4, 52, 300, device='cuda')
hidden = initHidden(4, True, 1024, 1, device, num_gpu)
with torch.no_grad():
    model(x, hidden[0])
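To make the layout trick in the comment above easier to follow, here is a small shape-only sketch in plain Python (no GPU needed). It assumes that DataParallel scatters inputs along dim 0 in torch.chunk-style pieces; the helper names chunk_sizes and replica_hidden_shapes are hypothetical and are not part of PyTorch.

```python
import math

def chunk_sizes(length, num_chunks):
    """Sizes of the pieces when `length` items are split torch.chunk-style.

    Assumption: each piece has ceil(length / num_chunks) items, except the
    last one, which takes whatever remains.
    """
    step = math.ceil(length / num_chunks)
    sizes = []
    remaining = length
    while remaining > 0:
        sizes.append(min(step, remaining))
        remaining -= step
    return sizes

def replica_hidden_shapes(batch_size, num_layers, bidirectional,
                          hidden_size, num_replicas):
    """Hidden-state shape seen by the RNN on each replica.

    The state is created batch-first, (batch, layers * directions, hidden),
    so the scatter on dim 0 splits the batch axis. Each replica then applies
    permute(1, 0, 2) to get the (layers * directions, batch, hidden) layout
    that GRU/LSTM expect.
    """
    num_directions = 2 if bidirectional else 1
    scattered = [
        (b, num_layers * num_directions, hidden_size)
        for b in chunk_sizes(batch_size, num_replicas)
    ]
    # Emulate hidden.permute(1, 0, 2) on each shape tuple.
    return [(s[1], s[0], s[2]) for s in scattered]

# Same numbers as the example above: batch 4, 1 layer, bidirectional, 2 GPUs.
shapes = replica_hidden_shapes(4, 1, True, 1024, 2)
print(shapes)  # [(2, 2, 1024), (2, 2, 1024)]
```

If the state were created as (layers * directions, batch, hidden) instead, a split on dim 0 would cut the layers * directions axis rather than the batch, which is why the multi-GPU branch builds it batch-first.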
🐛 Bug
When the model is using DataParallel and we call flatten_parameters inside the model under torch.no_grad, it throws an error. It works fine otherwise. This behavior only happens on 1.1.0 and was working fine on 1.0.1.post2.
To Reproduce
Run the code below on 1.1.0 to reproduce the behavior:
Expected behavior
flatten_parameters should work as it does without DataParallel.
Environment
Collecting environment information...
PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176
OS: Ubuntu 18.04.1 LTS
GCC version: (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
CMake version: version 3.9.4
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration:
GPU 0: Quadro GP100
GPU 1: Quadro GP100
Nvidia driver version: 410.79
cuDNN version: Could not collect
Versions of relevant libraries:
[pip] msgpack-numpy==0.4.1
[pip] numpy==1.16.4
[pip] numpydoc==0.7.0
[pip] pytorch-nlp==0.3.5
[pip] pytorch-pretrained-bert==0.3.0
[pip] torch==1.1.0
[pip] torchfile==0.1.0
[pip] torchtext==0.2.3
[pip] torchvision==0.2.0
[conda] cuda90 1.0 h6433d27_0 pytorch
[conda] faiss-cpu 1.2.1 py36_cuda9.0.176_1 pytorch
[conda] faiss-gpu 1.2.1 py36_cuda9.0.176_1 pytorch
[conda] magma-cuda90 2.3.0 1 pytorch
[conda] mkl 2018.0.1 h19d6760_4 anaconda
[conda] mkl-fft 1.0.0
[conda] mkl-include 2018.0.3 1
[conda] mkl-random 1.0.1
[conda] mkl-service 1.1.2 py36h17a0993_4
[conda] mkl_fft 1.0.2 np114py36_intel_0 [intel] intel
[conda] mkl_random 1.0.1 np114py36_intel_0 [intel] intel
[conda] mkldnn 0.14.0 0 mingfeima
[conda] nccl2 1.0 0 pytorch
[conda] pytorch-nlp 0.3.5
[conda] pytorch-pretrained-bert 0.3.0
[conda] torch 1.1.0
[conda] torchfile 0.1.0
[conda] torchtext 0.2.3
[conda] torchvision 0.2.0