[DataParallel] flatten_parameters doesn't work under torch.no_grad #21108

Closed
apsdehal opened this issue May 30, 2019 · 2 comments
Labels: oncall: distributed, triaged

apsdehal commented May 30, 2019

🐛 Bug

When a model is wrapped in DataParallel and calls flatten_parameters in its forward pass under torch.no_grad, it throws this error:

RuntimeError: set_storage is not allowed on Tensor created from .data or .detach()

It works fine otherwise. This behavior only happens on 1.1.0; it worked fine on 1.0.1.post2.
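
For anyone scripting around this, here is a minimal sketch of a version check based only on the two versions named above. `release_tuple` and `report_status` are hypothetical helper names, and the two sets cover just what this report states; they are not a complete compatibility list.

```python
import re

# Versions named in the report above; an assumption-free but incomplete list.
REPORTED_FAILING = {(1, 1, 0)}
REPORTED_WORKING = {(1, 0, 1)}


def release_tuple(version):
    """Extract (major, minor, patch) from strings such as '1.0.1.post2'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def report_status(version):
    """Say what this report states about a given PyTorch version string."""
    release = release_tuple(version)
    if release in REPORTED_FAILING:
        return "reported failing"
    if release in REPORTED_WORKING:
        return "reported working"
    return "not covered by this report"


for v in ("1.1.0", "1.0.1.post2", "1.4.0"):
    print(v, "->", report_status(v))
```

In practice the string passed in would be `torch.__version__`.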

To Reproduce

Run the code below on 1.1.0 to reproduce the behavior:

import torch

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = torch.nn.LSTM(300, 1024, 1, batch_first=True, bidirectional=True)
    def forward(self, x):
        self.rnn.flatten_parameters()
        return self.rnn(x)  # N * T * hidden_dim


model = torch.nn.DataParallel(Model().to('cuda'))

with torch.no_grad():
    x = model(torch.rand(2, 4, 300))

Expected behavior

flatten_parameters should work under DataParallel as it does without it.
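
As a point of comparison, here is a minimal sketch of the same module run without DataParallel, which is the behavior the report expects to carry over. The CPU fallback is an assumption added here so the snippet runs on machines without CUDA; on CPU the flatten_parameters call should simply return without doing anything.

```python
import torch


class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = torch.nn.LSTM(300, 1024, 1, batch_first=True, bidirectional=True)

    def forward(self, x):
        self.rnn.flatten_parameters()
        return self.rnn(x)  # (output, (h_n, c_n))


# Assumption: single device, falling back to CPU when CUDA is unavailable.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = Model().to(device)

with torch.no_grad():
    output, (h_n, c_n) = model(torch.rand(2, 4, 300, device=device))

print(output.shape)  # N * T * (2 * hidden_dim)
```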

Environment

Collecting environment information...
PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 18.04.1 LTS
GCC version: (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
CMake version: version 3.9.4

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 9.0.176
GPU models and configuration:
GPU 0: Quadro GP100
GPU 1: Quadro GP100

Nvidia driver version: 410.79
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] msgpack-numpy==0.4.1
[pip] numpy==1.16.4
[pip] numpydoc==0.7.0
[pip] pytorch-nlp==0.3.5
[pip] pytorch-pretrained-bert==0.3.0
[pip] torch==1.1.0
[pip] torchfile==0.1.0
[pip] torchtext==0.2.3
[pip] torchvision==0.2.0
[conda] cuda90 1.0 h6433d27_0 pytorch
[conda] faiss-cpu 1.2.1 py36_cuda9.0.176_1 pytorch
[conda] faiss-gpu 1.2.1 py36_cuda9.0.176_1 pytorch
[conda] magma-cuda90 2.3.0 1 pytorch
[conda] mkl 2018.0.1 h19d6760_4 anaconda
[conda] mkl-fft 1.0.0
[conda] mkl-include 2018.0.3 1
[conda] mkl-random 1.0.1
[conda] mkl-service 1.1.2 py36h17a0993_4
[conda] mkl_fft 1.0.2 np114py36_intel_0 [intel] intel
[conda] mkl_random 1.0.1 np114py36_intel_0 [intel] intel
[conda] mkldnn 0.14.0 0 mingfeima
[conda] nccl2 1.0 0 pytorch
[conda] pytorch-nlp 0.3.5
[conda] pytorch-pretrained-bert 0.3.0
[conda] torch 1.1.0
[conda] torchfile 0.1.0
[conda] torchtext 0.2.3
[conda] torchvision 0.2.0

Emrys365 commented Nov 12, 2019

I ran into a very similar bug with torch.nn.parallel.data_parallel in PyTorch 1.2.0/1.3.0.

When data_parallel is applied to a model that calls flatten_parameters in its forward pass under torch.no_grad, it throws the same error:

RuntimeError: set_storage is not allowed on a Tensor created from .data or .detach().

You can run the code below on 1.2.0/1.3.0 to reproduce the behavior:

import torch
from torch.nn.parallel import data_parallel

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = torch.nn.LSTM(300, 1024, 1, batch_first=True, bidirectional=True)
    def forward(self, x):
        self.rnn.flatten_parameters()
        return self.rnn(x)  # N * T * hidden_dim


model = Model().to('cuda')
x = torch.rand(4, 52, 300, device='cuda')

with torch.no_grad():
    data_parallel(model, x, range(2))

Environment

PyTorch version: 1.2.0/1.3.0
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: CentOS 7
GCC version: 6.4.0
CMake version: 3.12.0

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.0.130

GPU models and configuration:
GPU 0: Tesla K40m
GPU 1: Tesla K40m
Nvidia driver version: 418.56

leonardoaraujosantos commented

Guys, I think the issue is somehow related to how GRU/LSTM internally handle the hidden/cell states when they are None. For example, the following code works on 1.2.0 and 1.3.0:

import torch
from torch.nn.parallel import data_parallel

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
num_gpu = torch.cuda.device_count()
print('Number of GPUs Available:', num_gpu)

def initHidden(batch_size, bidirectional, hidden_size, num_layers, device, num_gpu):
    '''
    Create zero-initialized hidden and cell state tensors for a GRU/LSTM.
    '''
    if bidirectional:
        num_directions=2
    else:
        num_directions=1
    if num_gpu > 1:
        # DataParallel splits inputs along dim=0 by default, so create the states
        # batch-first here and permute them inside the model's forward
        hidden = torch.zeros(batch_size, num_layers * num_directions, hidden_size, device=device)
        initial_cell = torch.zeros(batch_size, num_layers * num_directions, hidden_size, device=device)
        return hidden, initial_cell
    else:
        hidden = torch.zeros(num_layers * num_directions, batch_size, hidden_size, device=device)
        initial_cell = torch.zeros(num_layers * num_directions, batch_size, hidden_size, device=device)
        return hidden, initial_cell

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = torch.nn.GRU(300, 1024, 1, batch_first=True, bidirectional=True)
    def forward(self, x, hidden):
        if self.training:
            self.rnn.flatten_parameters()
        return self.rnn(x, hidden.permute(1,0,2).contiguous())  # N * T * hidden_dim


model = Model()
if num_gpu > 1:
    model = torch.nn.DataParallel(model)
model = model.to(device)

x = torch.rand(4, 52, 300, device='cuda')
hidden = initHidden(4, True, 1024, 1, device, num_gpu)

with torch.no_grad():
    model(x,hidden[0])
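
A related variation, offered as an untested sketch rather than a confirmed fix: guard the flatten_parameters call on whether autograd is enabled instead of on self.training. This rests on the assumption, suggested by the error message, that the failure comes from flatten_parameters touching detached replica parameters under torch.no_grad. `GuardedModel` is a hypothetical name, and the CPU fallback is added only so the snippet runs without GPUs.

```python
import torch


class GuardedModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = torch.nn.GRU(300, 1024, 1, batch_first=True, bidirectional=True)

    def forward(self, x):
        # Assumption: skipping the call when autograd is off avoids the
        # set_storage error on DataParallel replicas. Not verified on 1.1.0-1.3.0.
        if torch.is_grad_enabled():
            self.rnn.flatten_parameters()
        output, _ = self.rnn(x)
        return output  # N * T * (2 * hidden_dim)


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = GuardedModel().to(device)
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)

x = torch.rand(4, 52, 300, device=device)
with torch.no_grad():
    out = model(x)

print(out.shape)
```

The trade-off is that the weights may not be compacted during inference, which can cost some memory or speed on GPU.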

lijunzh added a commit to yewsg/yews that referenced this issue Aug 1, 2020
* add a note: please use only 1 gpu to run LSTM, pytorch/pytorch#21108
