Confusing recommendation to use sync_dist=True even with TorchMetrics #20153
Comments
Yes, that's right: the warning shouldn't occur when logging TorchMetrics. Does it occur only with MetricCollection, or with a regular Metric too?
Thank you for your reply! I will be able to check this tomorrow and will report back. Meanwhile, my second suspicion is that it is related to what I log; I will try to check this hypothesis, too.
If you pass in scalar tensors, then of course not; in that case the warning is normal and expected. For logging TorchMetrics you would just pass the metric object directly into the logging call.
I guess this is exactly my case. I don't exactly remember why I have it set up this way. So I guess it's not a bug then; thank you for clarifying this! Now I have just a couple more questions:
Thank you!
Actually, this still happens when I log all the metrics properly, without using
and the following logging code:
I get
So here, I will test whether this still happens without
This still happens when logged "properly" (without
Can you show it with a runnable example based on https://github.com/Lightning-AI/pytorch-lightning/blob/master/examples/pytorch/bug_report/bug_report_model.py?
I get the same thing when logging with the manual method. I'm going to try logging the metric object directly, but does that support ClasswiseWrapper?
Hello, I can confirm the confusion. I am training on just 2 GPUs and cannot find any documentation on how to use MetricCollection in distributed environments. I'm not using `sync_dist`, so I get the same warning, and I am not sure whether my metrics are computed and logged properly.
Bug description
Hello!
When I train and validate a model in a multi-GPU setting (HPC, an sbatch job that requests multiple GPUs on a single node), I use `self.log(..., sync_dist=True)` when logging PyTorch losses, and don't specify any value for `sync_dist` when logging metrics from the TorchMetrics library. However, I still get warnings recommending `sync_dist=True`. These specific messages correspond to logging
`tmc.MulticlassRecall(len(self.task.class_names), average="macro", ignore_index=self.metric_ignore_index)`
and individual components of
`tmc.MulticlassRecall(len(self.task.class_names), average="none", ignore_index=self.metric_ignore_index)`.
The full code listing for the metric object definitions and logging is provided in the "How to reproduce the bug" section.
As I understand from a note here, and from the discussion here, one doesn't typically need to use `sync_dist` explicitly when using TorchMetrics. I wonder whether I still need to enable `sync_dist=True` as advised in the warnings due to some special case that I am not aware of, or whether I should follow the docs and keep things as they are. In either case, this is probably a bug, either in the documentation or in the warning code.
Thank you!
What version are you seeing the problem on?
2.3.0
How to reproduce the bug
Error messages and logs
Environment
Current environment
More info
No response
cc @carmocca