CUDA OOM when initializing DDP #4705
Comments
Hi @dedeswim, can you try to use
Hey @justusschock, thanks for getting back! When running the script I am getting the following output + exception:
When running in the notebook I get this (also trying a second time as I did with
@dedeswim the pickling issue in ddp_spawn is because the class TestModel is defined within a function, which makes it local, and local classes cannot be pickled. Can you try again, defining it outside the function?
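For example, a minimal sketch of that change, with the class at module scope so ddp_spawn can pickle it (the model body here is an illustrative assumption in the spirit of the BoringModel, not the exact code from the script):

import torch
import pytorch_lightning as pl


class TestModel(pl.LightningModule):
    # Defined at module level (not inside a function), so it can be pickled by ddp_spawn.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        out = self(batch)
        return torch.nn.functional.mse_loss(out, torch.ones_like(out))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def run():
    # The function only builds and fits the model; the class definition stays at module scope.
    trainer = pl.Trainer(gpus=2, accelerator="ddp_spawn", max_epochs=1)
    trainer.fit(TestModel(), torch.utils.data.DataLoader(torch.randn(64, 32), batch_size=8))


if __name__ == "__main__":
    run()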
Oh sorry, didn't think about that. I changed the model to BoringModel directly, and got this output + exception:
Hi @dedeswim, I tried to reproduce your issue using your script, which I only modified to enable pickling as discussed above. I don't have access to the exact GPUs you have, but on my machine with two 2080 Ti GPUs the script ran without any problem with both accelerators. Here is the example output from my workstation:
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Missing logger folder: /home/staff/schock/lightning_logs
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
override
Epoch 0: 0%| | 0/2 [00:00<?, ?it/s]override
Epoch 0: 100%|███████████████| 2/2 [00:00<00:00, 15.18it/s, loss=1.216, v_num=0]
UserWarning: The dataloader, test dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
Testing: 0it [00:00, ?it/s]--------------------------------------------------------------------------------
Testing: 100%|████████████████████████████████| 32/32 [00:00<00:00, 2025.99it/s]

Can you maybe try to reboot your machine? Could this also be related to anything else? Unfortunately, I don't know how to debug this.
Hi @justusschock, thanks for trying to reproduce the issue! Unfortunately, I have no direct control over the cluster, but I could try to ask the admin to reboot it. However, there is a long-running job which I fear I cannot interrupt. Meanwhile, I tried to use
The first thing I noticed in the outputs (also as a difference from your output) is the line with
And, as a matter of fact, GPU 0 is busy and being used (the memory being used is

However, this issue also happens with 0.8.1 <= PL <= 0.10.0, where the printed CUDA_VISIBLE_DEVICES shows just the selected devices. The only idea I have for debugging is to check whether there was some major change in the GPU management between 0.7.x and 0.8.x.

I hope this information can be helpful for debugging! Of course, let me know if I can do anything else.
@dedeswim Usually this should be fine with

For debugging purposes, can you do the following:

That way I want to find out if the devices are set by us (which seems to be okay) or if they are already preset. If at that time

before running any other code (even imports). Let me know if that helps.
Hi @justusschock, thanks for your answer. I tried putting, at the very beginning of the script, this:

import os
print(os.environ.get('CUDA_VISIBLE_DEVICES', []))
import torch
print(torch.cuda.device_count())
print(torch.cuda.get_device_name())

The output I get is
Plus the usual exception. So I guess, as you say, that the devices are set by you. Unfortunately, even using

However, if I launch the script as

CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7,8,9 python boring_model.py

it works, with the following output:
(FYI, I had to change the GPUs to use from [8, 9] to [7, 8], because of course only 9 GPUs are visible now.) I guess as a temporary workaround I can

Is there anything else I can do to help debug?
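The same workaround can also be applied from inside the script, as long as the variable is set before torch is imported (a minimal sketch; the device list just mirrors the command above and is otherwise an assumption):

import os

# Must happen before `import torch`; once CUDA has enumerated the devices,
# changing CUDA_VISIBLE_DEVICES has no effect on the current process.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "1,2,3,4,5,6,7,8,9")

import torch  # noqa: E402

print(torch.cuda.device_count())  # should now report 9 visible devices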
This seems to be related to #958. Can you also try to run the script with CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9, gpus=[7,8] and ddp_spawn?
With

With
@dedeswim this should be fixed by @awaelchli in #4297. Mind trying on master?
Tried with the version installed in a fresh environment via

pip install git+https://github.com/PytorchLightning/pytorch-lightning.git@master --upgrade
OK, it seems this has nothing to do with the selection of the backend. If I take, however, a pure PyTorch script like this

That's all I have in terms of analysis for now. Don't yet have a clue where to look...
@awaelchli I tried with 1.1.0rc1 and the following script:

import torch
import pytorch_lightning as pl


class RandomDataset(torch.utils.data.Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(pl.LightningModule):
    def __init__(self):
        """
        Testing PL Module

        Use as follows:
        - subclass
        - modify the behavior for what you want

        class TestModel(BaseTestModel):
            def training_step(...):
                # do your own thing

        or:

        model = BaseTestModel()
        model.training_epoch_end = None
        """
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        # An arbitrary loss to have a loss that updates the model weights during `Trainer.fit` calls
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def step(self, x):
        x = self(x)
        out = torch.nn.functional.mse_loss(x, torch.ones_like(x))
        return out

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"loss": loss}

    def training_step_end(self, training_step_outputs):
        return training_step_outputs

    def training_epoch_end(self, outputs) -> None:
        torch.stack([x["loss"] for x in outputs]).mean()

    def validation_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"x": loss}

    def validation_epoch_end(self, outputs) -> None:
        torch.stack([x['x'] for x in outputs]).mean()

    def test_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"y": loss}

    def test_epoch_end(self, outputs) -> None:
        torch.stack([x["y"] for x in outputs]).mean()

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]

    def train_dataloader(self):
        return torch.utils.data.DataLoader(RandomDataset(32, 64))

    def val_dataloader(self):
        return torch.utils.data.DataLoader(RandomDataset(32, 64))

    def test_dataloader(self):
        return torch.utils.data.DataLoader(RandomDataset(32, 64))


if __name__ == '__main__':
    pl.Trainer(gpus=None, max_epochs=20).fit(BoringModel(), torch.utils.data.DataLoader(RandomDataset(32, 500)))

And I could not observe any memory leaking on the GPU when training without.
So, on my side, I did a couple of trials.
However, if I set the random dataset dimensions to 64 * 64 * 3 (the same as CelebA, the dataset I am working with), I do get the OOM. This happens whether I set the batch size to 1 or to 128. See this gist for reference. Interestingly, though, if I set

Edit: if I train on GPU 1 only (so without DDP) and with CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9, I can observe on
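Concretely, the dataset change described above amounts to something like this (a minimal sketch based on the RandomDataset from the script earlier in the thread; the 64 * 64 * 3 size is the one mentioned here, while the class name and lengths are assumptions, and the model's input layer would have to be resized to match):

import torch


class RandomImageDataset(torch.utils.data.Dataset):
    # Random data with CelebA-sized flattened samples (64 * 64 * 3 = 12288 features).
    def __init__(self, length=256):
        self.data = torch.randn(length, 64 * 64 * 3)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


# Reportedly OOMs at DDP init with batch size 1 as well as 128.
train_loader = torch.utils.data.DataLoader(RandomImageDataset(), batch_size=128)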
@dedeswim That is probably
I forgot to mention that it also takes memory on GPU 1, so it is taking memory on both GPUs (around 10 GiB on GPU 1 and 500 MiB on GPU 0). In the afternoon (CET timezone) I'll do my best to create a reproducible script.
Are there any updates on this? I believe this is somehow related to the issue here: https://discuss.pytorch.org/t/strange-number-of-processes-per-gpu/116927

It looks to me like there are N additional CUDA context processes for every GPU in use, where N is the number of GPUs used. Is there any solution to this?

EDIT: when starting with "ddp_spawn" I can see a single context process per GPU, but the training is basically stuck at the very beginning. I guess that's due to the fact I'm using
@aleSuglia so far we are clueless about it. I'll try to pick this up again.
It is possible that it gets stuck for a while because of all the num_workers; this is normal for ddp_spawn. An alternative is just regular ddp, maybe try that too (a minimal sketch follows).
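A minimal sketch of switching to regular DDP (the GPU count and epoch count here are placeholders, not values from the thread):

import pytorch_lightning as pl

# Regular DDP launches one process per GPU as subprocesses instead of using
# torch.multiprocessing.spawn, which tends to cooperate better with many DataLoader workers.
trainer = pl.Trainer(gpus=4, accelerator="ddp", max_epochs=1)
# trainer.fit(model)  # `model` being your LightningModule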
Does it have anything to do with the fact that I'm using a custom function for splitting batches for truncated backprop?
|
No, it should not be, unless you have a memory leak in this logic. But you don't observe memory growing; you get OOM at initialization.

That's normal, because in ddp each process sees just one GPU. In nvidia-smi you should see N processes and a few MB per GPU at the very beginning, before training starts.
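For reference, if the custom splitting is an override of LightningModule.tbptt_split_batch (an assumption; the thread doesn't show the actual code), a purely CPU-side version looks roughly like this minimal sketch:

import pytorch_lightning as pl


class TBPTTModel(pl.LightningModule):  # hypothetical name
    def tbptt_split_batch(self, batch, split_size):
        # Slice the time dimension (dim=1) into chunks of `split_size`.
        # No .to(device)/.cuda() calls here, so this cannot allocate GPU memory by itself.
        x, y = batch
        return [
            (x[:, t:t + split_size], y[:, t:t + split_size])
            for t in range(0, x.size(1), split_size)
        ]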
How would I know this?
Shouldn't I have a single CUDA process per GPU instead of 4 per GPU? I have 4 GPUs and there seem to be 4 CUDA processes for each of them!?
Yes, you should have 4 processes, not 16 :(
@awaelchli this is the output of
Would it be because I'm using 4 workers for my data loader?
@aleSuglia No, these shouldn't be listed there, since they only operate on CPU and don't require GPU memory/access. This has to be something related to where we spawn processes.
@justusschock well actually this happens with
Your environment is free of any ddp-related env variables; that's good.
Also, it would be really helpful for us to have a complete minimal example to reproduce this, along with an output of nvidia-smi to see the processes.
Yes, that's correct.
I'm using the default PyTorch data loader. The only things that might be using CUDA are some operations like
But do you load them on CPU, and are all your tensors on CPU as well? Then there should be no CUDA ops involved.
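One quick way to verify that (a minimal sketch; `train_dataloader` stands for whatever DataLoader you pass to the Trainer):

import torch


def assert_all_cpu(obj):
    # Recursively check that nothing coming out of the DataLoader already lives on a CUDA device.
    if torch.is_tensor(obj):
        assert obj.device.type == "cpu", f"found tensor on {obj.device}"
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            assert_all_cpu(item)
    elif isinstance(obj, dict):
        for value in obj.values():
            assert_all_cpu(value)


# assert_all_cpu(next(iter(train_dataloader)))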
Sorry, I missed this. Yes, everything is on CPU for sure. I'll try with the BoringModel as well. In the meantime, I can confirm that

EDIT: I can't reproduce it with the BoringModel. I'm trying to understand whether it depends on my custom Dataset...
@justusschock so I think the reason why it works with
@justusschock ping :)
I'm able to reproduce the same issue with lightning == 1.3.2 with

Edit: This happens only in versions >= 1.3.0. The script works fine in 1.2.9.
@aleSuglia This shouldn't be the case. In the

@aleSuglia @scarecrow1123 Could you try to add

print(os.environ.get("PL_TRAINER_GPUS", None))

to the beginning of your script? This is the environment variable we set when calling the script multiple times. That said, the variable should be empty the first time the script is originally called, but should be filled afterwards.
This error disappeared after I rebooted the machine. However, until I rebooted, this kept happening even when
Hi
Closing this for now. |
Same issue as this:
For future reference, I was able to solve this by rebooting the machine that was having the issue. This may not be the root of the problem, but it could be a solution for some particular cases. |
Run

sudo fuser /dev/nvidia*

and kill the processes it lists via their PIDs. On my computer, the process was nvidia-smi. I don't know why the issue happens, but this really solved it on my computer. Hope it is helpful.
In my case there was a |
🐛 Bug
Hey everyone,
I am trying to train a model on the GPU workstation of our lab (which has 10 GPUs, of which usually only 1 is in use) using Lightning and DDP. I have tried several models (including the BoringModel) without success. In particular, I get a CUDA OOM error when DDP initializes. I tried the BoringModel with the following Trainer configuration:

And the output I get is the following:
The script with the BoringModel I run on our workstation is in this gist.
However, this doesn't happen on Colab using your BoringModel notebook (my version can be found here).
I also tried to run the same notebook locally (as on Colab), and the result at the first attempt is the following:
On the second attempt, though, it works as expected (i.e. the model trains with no errors, even with multiple GPUs)! So in the script, I tried the following to attempt the fit twice, as in the notebook:
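(The exact snippet is not reproduced here; a minimal sketch of what "attempt the fit twice" could look like, with the GPU indices and Trainer arguments assumed rather than taken from the gist:)

import pytorch_lightning as pl

# Hypothetical retry logic; BoringModel is the module from the gist/script above.
trainer = pl.Trainer(gpus=[8, 9], accelerator="ddp", max_epochs=1)
try:
    trainer.fit(BoringModel())
except RuntimeError:
    # The first attempt raises CUDA OOM; retry once with a fresh Trainer.
    trainer = pl.Trainer(gpus=[8, 9], accelerator="ddp", max_epochs=1)
    trainer.fit(BoringModel())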
As a result, I get this stack trace:
Expected behavior
The models should train without issues.
Environment
Additional context
I tried installing torch, torchvision and pytorch-lightning with both conda and pip in fresh environments, and still no solution to this problem.

This also happens if I select (free) GPUs manually by specifying them in the gpus flag as a List[int]. Also interestingly, if I run this tutorial notebook by PyTorch, which uses vanilla PyTorch DDP, I have no issues whatsoever. Final interesting fact: setting accelerator="dp", I have no issues.

Thanks in advance!
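For reference, the kind of setup described in this report would look roughly like the following minimal sketch (the GPU indices and other values are assumptions, since the original configuration block is not reproduced above):

import pytorch_lightning as pl

# Selecting free GPUs explicitly as a List[int]; this fails with CUDA OOM at DDP init,
# while accelerator="dp" reportedly works fine.
trainer = pl.Trainer(gpus=[8, 9], accelerator="ddp", max_epochs=1)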