Warning originating in C10 backend does not get translated to Python warning if run from subprocess #75725
Labels
high priority
oncall: distributed
triage review
🐛 Describe the bug
Hi,
I want to record, in Python, a warning that originates in the C10 part of the code (TORCH_WARN_ONCE) while running in a subprocess because of DDP. However, the warning seems impossible to catch because it does not propagate to Python correctly. Below is a simple demo, mostly taken from this tutorial and adapted to catch warnings.
Code and output with warnings
Output:
However, if I make an intentional mistake to raise an exception in a similar code path (such as changing the sizes of the tensors so that they no longer match), the exception is correctly propagated to Python as a RuntimeError; see the modified code.
Code and output with Exception
Output:
The issue was first reported on PyTorch Slack; it is most likely linked to this issue: #72948
cc @ezyang @gchanan @zou3519 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @albanD @mruberry
Thanks a lot!
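For readers who want to see the recording pattern itself, here is a minimal sketch of what "recording a warning inside a subprocess" looks like. It is not the demo from this report: it uses only the standard library, launches the child with `subprocess` rather than through DDP, and uses `warnings.warn` as a stand-in for a warning that has already reached Python. The names `CHILD_SOURCE` and `record_warnings_in_child` are hypothetical, chosen for illustration.

```python
import json
import subprocess
import sys

# Source executed in the child interpreter. warnings.warn is only a stand-in
# (an assumption for this sketch) for a warning that has already been
# translated into a Python warning.
CHILD_SOURCE = r"""
import json
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # keep the default filters from dropping anything
    warnings.warn("stand-in for a backend warning", UserWarning)

# Report what was recorded back to the parent as JSON on stdout.
print(json.dumps([[w.category.__name__, str(w.message)] for w in caught]))
"""


def record_warnings_in_child():
    """Run CHILD_SOURCE in a fresh interpreter and return the warnings it recorded."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        capture_output=True,
        text=True,
        check=True,
    )
    return [tuple(item) for item in json.loads(result.stdout)]


if __name__ == "__main__":
    print(record_warnings_in_child())
```

With a Python-level warning the recorded list contains one entry. The behavior reported here is that a warning raised through TORCH_WARN_ONCE in the same position does not end up in the recorded list.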
Versions
Collecting environment information...
PyTorch version: 1.11.0+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: Arch Linux (x86_64)
GCC version: (GCC) 11.2.0
Clang version: Could not collect
CMake version: version 3.23.0
Libc version: glibc-2.35
Python version: 3.9.11 (main, Apr 7 2022, 15:33:34) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.17.1-zen1-1-zen-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.6.112
GPU models and configuration: GPU 0: NVIDIA T1200 Laptop GPU
Nvidia driver version: 510.60.02
cuDNN version: Probably one of the following:
/usr/lib/libcudnn.so.8.3.3
/usr/lib/libcudnn_adv_infer.so.8.3.3
/usr/lib/libcudnn_adv_train.so.8.3.3
/usr/lib/libcudnn_cnn_infer.so.8.3.3
/usr/lib/libcudnn_cnn_train.so.8.3.3
/usr/lib/libcudnn_ops_infer.so.8.3.3
/usr/lib/libcudnn_ops_train.so.8.3.3
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] mypy==0.942
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.22.3
[pip3] pytorch-lightning==1.7.0.dev0
[pip3] torch==1.11.0+cu113
[pip3] torchmetrics==0.7.3
[pip3] torchtext==0.12.0
[pip3] torchvision==0.12.0+cu113
[conda] Could not collect