worker_init_fn is not set to seed dataloaders correctly when using DDP #7937
Comments
Thanks for testing this feature out! In seed_everything we have this line: os.environ["PL_SEED_WORKERS"] = f"{int(workers)}" — should we change it to os.environ["PL_SEED_WORKERS"] = os.environ.get("PL_SEED_WORKERS", f"{int(workers)}")? And if so, would you like to test whether this change works for you?
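For context, a simplified sketch of where that line sits and what the proposed change would do (not the full seed_everything implementation, just the relevant part, with the rest of the body abbreviated):

```python
import os
import random

import numpy as np
import torch


def seed_everything(seed: int, workers: bool = False) -> int:
    """Simplified sketch: seed all RNGs and record the seed/flags in env vars."""
    os.environ["PL_GLOBAL_SEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # Current behaviour: unconditionally overwrite the flag, so a DDP child process
    # that re-runs seed_everything(seed) with the default workers=False resets it to "0".
    # os.environ["PL_SEED_WORKERS"] = f"{int(workers)}"

    # Proposed behaviour: keep a value already set by the parent process.
    os.environ["PL_SEED_WORKERS"] = os.environ.get("PL_SEED_WORKERS", f"{int(workers)}")
    return seed
```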
Maybe a better fix would be to actually change it so that we don't have to ignore the argument when someone does:
seed_everything(123, workers=True)
# ... later on, for a second training run with different dataloaders maybe?
seed_everything(123, workers=False)  # don't ignore this, actually turn it off
Right. That was my initial thought too. Also, what do you think about writing some log message in pl_worker_init_function?
I think it's a good idea in general, but note that I think we can't do it for every worker; otherwise we get a large output with N * W messages, where N is the number of GPUs and W is the number of workers per process. Maybe we can do two things:
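One illustrative sketch along those lines — emitting the message at debug level and only once, from the first worker of the rank-0 process — purely as an assumption about the approach, not the actual implementation:

```python
import logging

from pytorch_lightning.utilities import rank_zero_only

log = logging.getLogger(__name__)


def pl_worker_init_function(worker_id: int, rank=None) -> None:
    global_rank = rank if rank is not None else rank_zero_only.rank
    # ... existing per-worker seeding based on the base seed, worker_id and global_rank ...

    # Log only from the first worker of the rank-0 process, and only at debug level,
    # so N processes * W workers do not produce N * W lines of output.
    if global_rank == 0 and worker_id == 0:
        log.debug("Dataloader worker seeding is enabled (PL_SEED_WORKERS=1)")
```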
Would you be interested in sending a PR for the fix, or the log message, or both? :) Happy to help wherever.
I can do that.
The pull request is created: #7942. I didn't have a chance to run the test suite yet.
🐛 Bug
When seed_everything(workers=True) is called, it will set the environment variable PL_SEED_WORKERS=1. Consequently, the Trainer will set the worker_init_fn for dataloaders to pl_worker_init_function. It seems to me that worker_init_fn is not set when using DDP. The reason is that DDPPlugin.setup_environment() eventually runs reset_seed(), which reads the value of the PL_GLOBAL_SEED environment variable and calls seed_everything() with the default argument workers=False.

Please reproduce using the BoringModel
It's not possible to reproduce the issue in colab, since it doesn't support DDP.
To Reproduce
Here's a simple program that demonstrates the issue:
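The script itself is not included in this extract, but a minimal sketch of the kind of program described — a BoringModel-style module trained with the ddp accelerator, assuming the 1.3-era pytorch_lightning API and a machine with 2 GPUs — would look roughly like this:

```python
import os

import torch
from torch.utils.data import DataLoader, Dataset

import pytorch_lightning as pl


class RandomDataset(Dataset):
    """Tiny dataset of fixed-size random tensors."""

    def __init__(self, size: int = 64, length: int = 256):
        self.data = torch.randn(length, size)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(64, 2)

    def training_step(self, batch, batch_idx):
        # A throwaway scalar loss is enough to exercise the training loop.
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    pl.seed_everything(123, workers=True)  # sets PL_SEED_WORKERS=1

    train_loader = DataLoader(RandomDataset(), batch_size=8, num_workers=2)
    trainer = pl.Trainer(max_epochs=1, gpus=2, accelerator="ddp")
    trainer.fit(BoringModel(), train_loader)

    # With accelerator="ddp" the flag has been reset to "0" by the time fit() returns,
    # so pl_worker_init_function is never attached to the dataloader workers;
    # with accelerator="dp" it stays "1" and the workers are seeded.
    print("PL_SEED_WORKERS =", os.environ.get("PL_SEED_WORKERS"))
```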
Expected behavior
I would expect pl_worker_init_function to be called. By printing something from the function, I can see that it's called if I use the dp accelerator, but not if I use ddp. I can also notice that the environment variable PL_SEED_WORKERS is reset to 0 during the Trainer.fit() call, but I would expect it to have the value 1 in the end.

I think the correct fix would be to make reset_seed() read the PL_SEED_WORKERS environment variable too and pass the corresponding workers argument to seed_everything(). However, I'm not familiar enough with the code to be sure that this is correct.
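That idea, as a rough sketch (assuming reset_seed lives alongside seed_everything in pytorch_lightning.utilities.seed; this is only the shape of the change, not the actual implementation):

```python
import os

from pytorch_lightning.utilities.seed import seed_everything


def reset_seed() -> None:
    """Re-seed a spawned/launched process from the env vars set by the parent process."""
    seed = os.environ.get("PL_GLOBAL_SEED")
    if seed is None:
        return
    # Read PL_SEED_WORKERS as well instead of falling back to the default workers=False,
    # so child processes keep worker seeding enabled.
    workers = os.environ.get("PL_SEED_WORKERS", "0") == "1"
    seed_everything(int(seed), workers=workers)
```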
Preferably, pl_worker_init_function would also display a log message that confirms that the workers are seeded correctly.

Environment
Additional context
Recently there was discussion about an issue with data loading, where the same NumPy random seed is used across different workers. This causes the workers to use the same random numbers for data transforms. A fix was quickly introduced in PyTorch Lightning that seeds the dataloaders correctly by automatically setting the worker_init_fn for dataloaders.
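For reference, a sketch of what that automatic behaviour corresponds to when done by hand (pl_worker_init_function is in pytorch_lightning.utilities.seed and takes the owning process's global rank; the dataset and rank value here are placeholders):

```python
from functools import partial

import torch
from torch.utils.data import DataLoader, TensorDataset

from pytorch_lightning.utilities.seed import pl_worker_init_function

dataset = TensorDataset(torch.randn(256, 64))  # placeholder dataset

# Roughly what the Trainer does when PL_SEED_WORKERS=1 and the user has not set
# their own worker_init_fn: each worker derives a distinct seed from the base seed,
# its worker id, and the global rank of the owning process.
loader = DataLoader(
    dataset,
    batch_size=8,
    num_workers=4,
    worker_init_fn=partial(pl_worker_init_function, rank=0),  # rank 0 used as an example
)
```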