
worker_init_fn is not set to seed dataloaders correctly when using DDP #7937

Closed
senarvi opened this issue Jun 11, 2021 · 8 comments · Fixed by #7942
Labels: bug (Something isn't working), help wanted (Open to be worked on), priority: 0 (High priority task)

Comments

senarvi (Contributor) commented Jun 11, 2021

🐛 Bug

When seed_everything(workers=True) is called, it sets the environment variable PL_SEED_WORKERS=1. Consequently, the Trainer sets the worker_init_fn of the dataloaders to pl_worker_init_function. It seems to me that worker_init_fn is not set when using DDP. The reason is that DDPPlugin.setup_environment() eventually runs reset_seed(), which reads the value of the PL_GLOBAL_SEED environment variable and calls seed_everything() with the default argument workers=False.
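
For clarity, here is a minimal sketch of that mechanism, using simplified stand-ins for Lightning's seed_everything() and reset_seed() (not the actual implementations):

import os

def seed_everything_sketch(seed, workers=False):
    # Simplified stand-in: record the global seed and the workers flag in env vars.
    os.environ["PL_GLOBAL_SEED"] = str(seed)
    os.environ["PL_SEED_WORKERS"] = f"{int(workers)}"

def reset_seed_sketch():
    # Simplified stand-in for reset_seed(): it only restores PL_GLOBAL_SEED and
    # re-calls seed_everything with the default workers=False, clobbering the flag.
    seed = os.environ.get("PL_GLOBAL_SEED")
    if seed is not None:
        seed_everything_sketch(int(seed))

seed_everything_sketch(1234, workers=True)
print(os.environ["PL_SEED_WORKERS"])  # 1
reset_seed_sketch()                   # what DDPPlugin.setup_environment() ends up doing
print(os.environ["PL_SEED_WORKERS"])  # 0, so worker_init_fn is never attached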

Please reproduce using the BoringModel

It's not possible to reproduce the issue in Colab, since it doesn't support DDP.

To Reproduce

Here's a simple program that demonstrates the issue:

import os

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning import seed_everything


class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    seed_everything(1234, workers=True)
    # Sets PL_SEED_WORKERS=1
    print('PL_SEED_WORKERS=' + os.environ['PL_SEED_WORKERS'])

    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        weights_summary=None,
        gpus=2,
        accelerator='ddp'  # Using accelerator='dp' works
    )
    trainer.fit(model, train_dataloader=train_data, val_dataloaders=val_data)
    # Trainer.accelerator.setup_environment() calls DDPPlugin.setup_environment(),
    # which eventually runs seed_everything(workers=False) that sets PL_SEED_WORKERS=0
    # Consequently dataloader.worker_init_fn is not set.
    print('PL_SEED_WORKERS=' + os.environ['PL_SEED_WORKERS'])


if __name__ == '__main__':
    run()

Expected behavior

I would expect pl_worker_init_function to be called. By printing something from the function, I can see that it's called when I use the dp accelerator, but not when I use ddp. I can also see that the environment variable PL_SEED_WORKERS is reset to 0 during the Trainer.fit() call, whereas I would expect it to still have the value 1 at the end.

I think the correct fix would be to make reset_seed() read the PL_SEED_WORKERS environment variable too and pass the corresponding workers argument to seed_everything(). However, I'm not familiar enough with the code to be sure that this is correct.
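
A minimal sketch of that idea (not necessarily the implementation merged in #7942; the flag is stored as the string "0" or "1", so an int cast is used here because bool("0") would be truthy):

import os
from pytorch_lightning import seed_everything

def reset_seed():
    # Re-seed with the recorded global seed and restore the workers flag.
    seed = os.environ.get("PL_GLOBAL_SEED")
    workers = os.environ.get("PL_SEED_WORKERS", "0")
    if seed is not None:
        seed_everything(int(seed), workers=bool(int(workers)))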

Preferably, pl_worker_init_function would also log a message confirming that the workers were seeded correctly.

Environment

  • CUDA:
    • GPU:
      • NVIDIA Tesla V100-SXM2-16GB
      • NVIDIA Tesla V100-SXM2-16GB
    • available: True
    • version: 11.0
  • Packages:
    • numpy: 1.19.2
    • pyTorch_debug: True
    • pyTorch_version: 1.7.0
    • pytorch-lightning: 1.4.0dev
    • tqdm: 4.51.0
  • System:

Additional context

Recently there was discussion about an issue with data loading, where the same NumPy random seed is used across different workers. This causes the workers to use the same random numbers for data transforms. A fix was quickly introduced in PyTorch Lightning that seeds the dataloader workers correctly by automatically setting the worker_init_fn for dataloaders.
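
For reference, the general technique behind that fix looks roughly like the sketch below: a worker_init_fn that derives a per-worker NumPy seed from PyTorch's per-worker seed. This is a generic illustration of the idea, not Lightning's exact pl_worker_init_function:

import numpy as np
import torch

def seed_numpy_per_worker(worker_id):
    # torch.initial_seed() already differs per dataloader worker; fold it into
    # NumPy's RNG so random data transforms are not duplicated across workers.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)

# e.g. DataLoader(dataset, num_workers=4, worker_init_fn=seed_numpy_per_worker)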

senarvi added the bug and help wanted labels Jun 11, 2021
awaelchli (Contributor) commented Jun 11, 2021

Thanks for testing this feature out!

In seed_everything we have this line:

os.environ["PL_SEED_WORKERS"] = f"{int(workers)}"

should we change it to

os.environ["PL_SEED_WORKERS"] = os.environ.get("PL_SEED_WORKERS", f"{int(workers)}")

? And if so, would you like to test whether this change works for you?

awaelchli added this to the v1.3.x milestone Jun 11, 2021
senarvi (Contributor, Author) commented Jun 11, 2021

should we change it to

os.environ["PL_SEED_WORKERS"] = os.get("PL_SEED_WORKERS", f"{int(workers)}")

? And if so would you like to test if this change works for you?

os.environ["PL_SEED_WORKERS"] = os.environ.get("PL_SEED_WORKERS", f"{int(workers)}") works. It means that seed_everything() will ignore the workers argument if PL_SEED_WORKERS is set, but it works too (I already tested).

awaelchli (Contributor) commented Jun 11, 2021

It does mean that seed_everything() will ignore the workers argument if PL_SEED_WORKERS is already set.

Maybe a better fix would be to actually change reset_seed() to call seed_everything with workers=bool(int(os.environ.get("PL_SEED_WORKERS", 0)))

so that we don't have to ignore the argument when someone does:

seed_everything(123, workers=True)

# .. later on for a second training with different dataloaders maybe??
seed_everything(123, workers=False)  # don't ignore this, actually turn it off

senarvi (Contributor, Author) commented Jun 11, 2021

Maybe a better fix would be to actually change reset_seed() to call seed_everything with workers=bool(int(os.environ.get("PL_SEED_WORKERS", 0)))

Right, that was my initial thought too. Also, what do you think about writing a log message in pl_worker_init_function() to confirm that the dataloaders have been initialized correctly?

awaelchli (Contributor) commented Jun 11, 2021

I think it's a good idea in general, but note that we can't do it for every worker; otherwise we get a large output with N * W messages, where N is the number of GPUs and W is the number of workers per process. Maybe we can do two things (a rough sketch of both follows):

  1. only log it at DEBUG level inside the worker_init_fn (so it is not visible by default unless the user sets the logging level to debug), and
  2. additionally, where we set the worker_init_fn, confirm that it was set on the dataloader, this time at the regular log level.
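
A rough sketch of both points, using hypothetical function names (the real hook lives inside Lightning's dataloader setup):

import logging

import numpy as np
import torch
from torch.utils.data import DataLoader

log = logging.getLogger(__name__)

def worker_init_fn_sketch(worker_id):
    # Point 1: per-worker message only at DEBUG level, so the N * W lines stay
    # hidden unless the user raises the logging verbosity.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    log.debug("Seeded NumPy in dataloader worker %d with seed %d", worker_id, worker_seed)

def attach_worker_init_fn(dataloader: DataLoader):
    # Point 2: a single confirmation at the regular log level, emitted where the
    # worker_init_fn is attached to the dataloader.
    if dataloader.worker_init_fn is None:
        dataloader.worker_init_fn = worker_init_fn_sketch
        log.info("worker_init_fn set on dataloader; its workers will be seeded")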

awaelchli (Contributor) commented

would you be interested in sending a PR for the fix, or the log message, or both? :) happy to help wherever.

senarvi (Contributor, Author) commented Jun 11, 2021

would you be interested in sending a PR for the fix, or the log message, or both? :) happy to help wherever.

I can do that.

senarvi (Contributor, Author) commented Jun 11, 2021

The pull request is created: #7942.

I haven't had a chance to run the test suite yet.
