Process runs on more GPUs than specified #958
Comments
Hey, thanks for your contribution! Great first issue!
Thanks for the comment. I do not think the training is fully running on GPU0, just some memory allocation... Could you also share the GPU utilization during the training process?
I also think that the training is likely not running on GPU0, but I'm not sure why the PID shows up on it.
BTW, I switched the distributed backend from dp (default) to ddp and this went away. No PID is shown on GPU0 and its memory usage is at 11MiB (same as any other inactive GPU).
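For reference, a minimal sketch of that backend switch, using the 0.7.x-era `distributed_backend` argument discussed in this thread; `TinyModel` is only a placeholder module, not from the original report, and an 8-GPU machine is assumed:

```python
# Sketch of the dp -> ddp workaround described above (0.7.x-era Trainer API).
# TinyModel is a hypothetical placeholder, not the reporter's model.
import torch
from torch.utils.data import DataLoader
import pytorch_lightning as pl


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        return {'loss': self(batch).sum()}

    def train_dataloader(self):
        return DataLoader(torch.rand(64, 32), batch_size=8)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    trainer = pl.Trainer(
        gpus=[1, 2, 3, 4, 5, 6, 7],   # skip the faulty GPU 0
        distributed_backend='ddp',    # dp (the default at the time) left a PID on GPU 0
    )
    trainer.fit(TinyModel())
```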
OK, in that case I would consider it resolved, but feel free to reopen it if you need to 🤖
* SA: for #958: set torch cuda device when finding root
* SA: for #958: removing root gpu hack in trainer/evaluation_loop
* SA: setting torch cuda device
* comment line too long
* check if root gpu exists or available
* Incorporating suggestions on #1094
* since root gpu returns none instead of -1 for cpu
* undo changes
* fixed dp memory thing

Co-authored-by: Shubham Agarwal <shubhamagarwal92@gmail.com>
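As a rough, hedged illustration of the idea behind these commits, not the actual patch: setting the CUDA device to the root GPU up front keeps later allocations off `cuda:0`.

```python
# Illustration only, not the merged change: bind the process to the root GPU
# before any CUDA work so allocations do not default to cuda:0.
# Assumes a machine with at least 4 GPUs.
import torch

root_gpu = 3  # e.g. the single entry of Trainer(gpus=[3])
if torch.cuda.is_available() and root_gpu is not None:
    torch.cuda.set_device(root_gpu)

x = torch.rand(8, 8).cuda()  # now lands on cuda:3, not cuda:0
print(x.device)
```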
I cloned the repository yesterday (pytorch-lightning==0.7.4.dev0) and there are some edge cases that are still not fixed by #1349. Below is minimal code for reproduction:

```python
import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
import pytorch_lightning as pl


class Model(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.l1 = torch.nn.Linear(1000, 10)

    def forward(self, x):
        return torch.relu(self.l1(x))

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        return {'loss': F.cross_entropy(y_hat, y)}

    def training_epoch_end(self, outputs):
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()
        return {'avg_loss': avg_loss}

    def train_dataloader(self):
        data = torch.rand(4096, 1000)
        labels = torch.randint(high=10, size=(4096,))
        return DataLoader(list(zip(data, labels)), batch_size=64, pin_memory=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.001)


trainer = pl.Trainer(gpus=[3])
model = Model()
trainer.fit(model)
```

After running the above code, …

I have tested a few scenarios and found that this is caused by two factors: … These factors must happen together for the problem to arise. For example, if …, … Similarly, the problem happens if the validation phase is defined together with …. I am afraid I will not have time to dig deeper here, but hopefully the maintainers will find this useful.
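As a hedged aside, not from the comment above: one way the repro's `pin_memory=True` can produce this symptom is that pinning host memory initializes a CUDA context on the *current* device, which defaults to GPU 0 unless it is set explicitly first. A minimal sketch, assuming a machine with at least 4 GPUs:

```python
# Hedged sketch, not from the original comment: pin_memory=True pins host
# memory through the CUDA runtime, which creates a context on the current
# device (cuda:0 by default). Selecting the intended GPU first keeps GPU 0
# untouched. Assumes at least 4 GPUs.
import torch
from torch.utils.data import DataLoader

torch.cuda.set_device(3)  # comment this out and nvidia-smi may show a process on GPU 0

data = torch.rand(4096, 1000)
labels = torch.randint(high=10, size=(4096,))
loader = DataLoader(list(zip(data, labels)), batch_size=64, pin_memory=True)

for x, y in loader:  # iterating triggers the pinned-memory path
    x = x.cuda(non_blocking=True)
    y = y.cuda(non_blocking=True)
```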
This issue occurred for me during the validation sanity check, even if … I had to use this as a temporary fix: …
However, if I call … .
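The temporary fix itself was not captured above; as a hedged illustration only, not necessarily what this commenter did, a problem that only surfaces during the sanity check can be sidestepped by disabling the check:

```python
# Hedged illustration, not necessarily the commenter's temporary fix:
# skip the validation sanity check via num_sanity_val_steps=0 (a Trainer
# argument present in the 0.7.x releases discussed here).
# Assumes a machine with at least 4 GPUs.
import pytorch_lightning as pl

trainer = pl.Trainer(gpus=[3], num_sanity_val_steps=0)
```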
I have a single 8-GPU machine with a faulty GPU0.
I'm running imagenet_example.py on 7 GPUs on this machine by specifying `gpus=[1,2,3,4,5,6,7]` in the Trainer, i.e. I do not want to use GPU0. However, when I run `nvidia-smi`, I see the Trainer's PID show up on all 8 GPUs, just with lower memory on GPU0 (see output below). I also find it to be slower than non-PL code by about 4x.
I don't see this behavior if I manually set `CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7` followed by `gpus=7` in the Trainer. Similarly, it works fine when using a single GPU with, say, `gpus=[1]`.
I'm not sure if it's relevant, but I also see `gpu=0` in the tqdm progress bar.
nvidia-smi with Trainer(gpus=[1,2,3,4,5,6,7]) and CUDA_VISIBLE_DEVICES unset
nvidia-smi with Trainer(gpus=7) and CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7
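For completeness, the reported workaround in script form; this is a sketch only, and the ImageNet module itself is omitted:

```python
# Sketch of the workaround from the report: hide GPU 0 from the process by
# setting CUDA_VISIBLE_DEVICES before torch/Lightning initialize CUDA, then
# request all 7 visible devices. Physical GPUs 1-7 are remapped to cuda:0-6.
# Assumes the 8-GPU machine described in the report.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3,4,5,6,7"

import pytorch_lightning as pl

trainer = pl.Trainer(gpus=7)
# trainer.fit(model)  # `model` would be the LightningModule from imagenet_example.py
```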
Expected behavior
The process should run on the specified GPUs without manually setting `CUDA_VISIBLE_DEVICES`.
Environment