
deadlock using Wandb #8145

Open
linnanwang opened this issue Jun 7, 2022 · 19 comments
Labels: bug (Something isn't working)

Comments

@linnanwang

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. I have read the FAQ documentation but cannot get the expected help.
  3. The bug has not been fixed in the latest version.

Describe the bug
Hello mmdet developers,

We found that the training loop can deadlock in some places when we use multi-GPU training and enable wandb tracking. Single-GPU training works perfectly fine. I have only tested with YOLOX; please see the command below.

Reproduction

  1. What command or script did you run?
./tools/dist_train.sh ./configs/yolox/yolox_s_8x8_300e_coco.py 2
  2. Did you make any modifications to the code or config? Did you understand what you have modified?
    No
  3. What dataset did you use?
    MSCOCO

Environment

  1. Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.
    sys.platform: linux
    Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
    CUDA available: True
    GPU 0,1: Quadro GV100
    CUDA_HOME: /usr/local/cuda
    NVCC: Build cuda_11.3.r11.3/compiler.29745058_0
    GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
    PyTorch: 1.10.0
    PyTorch compiling details: PyTorch built with:
  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.3
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.2
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.0
OpenCV: 4.5.5
MMCV: 1.4.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMDetection: 2.25.0+ca11860

  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source]
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)
      We used the provided docker.

Error traceback
If applicable, paste the error traceback here.

A placeholder for the traceback.

Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

@hhaAndroid
Collaborator

@linnanwang Thanks for your feedback. Could you please help verify this, @ayulockin? My machines are all disconnected from the network, so I have no way to verify it.

hhaAndroid added the bug (Something isn't working) label on Jun 7, 2022
@hhaAndroid
Collaborator

@linnanwang Are you referring to configs/mask_rcnn/mask_rcnn_r50_fpn_1x_wandb_coco.py, which was added with the MMDetWandbHook configuration?

@linnanwang
Author

@hhaAndroid Thanks for the quick response. I'm referring to ./configs/yolox/yolox_s_8x8_300e_coco.py.

@ayulockin
Contributor

Hey @linnanwang, thanks for raising this. As I understand it, MMDetWandbHook is not working properly in a multi-GPU setting.

Did you get any error from W&B or MMDetection that you can share? I will test the same on my machine and let you know.

@Fizzez

Fizzez commented Jun 7, 2022

Did you get any error from W&B or MMDetection that you can share? I will test the same on my machine and let you know.

Hi @ayulockin, I'm facing the same issue here. No error was reported, but I found that GPU utilization went to 100% with 0 "GPU Time Spent Accessing Memory" (which indicates a deadlock, if I understand correctly?). Then the run hung.

It happened at the last iteration of the 1st epoch. I was wondering whether this issue was related to #6486 until I came here.

By the way, does the WandbLoggerHook from mmcv (https://github.com/open-mmlab/mmcv/blob/ea173c9f07f0abf6873d2b7d786fb6411843cf00/mmcv/runner/hooks/logger/wandb.py) work with multi-GPU?

@ayulockin
Contributor

Thanks for the additional info, @Fizzez.

Ideally MMCV's WandbLoggerHook should work in a multi-GPU setting, given that its methods are decorated with @master_only. The same applies to this hook. I am investigating.
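
For context, mmcv's @master_only decorator is roughly the following (a paraphrased sketch under the mmcv 1.x API; the name master_only_sketch is mine): it simply turns the wrapped method into a no-op on every rank except rank 0.

import functools

from mmcv.runner import get_dist_info


def master_only_sketch(func):
    # Paraphrase of mmcv's @master_only: the wrapped method runs only on the
    # master process (rank 0) and silently does nothing on all other ranks.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        rank, _ = get_dist_info()  # rank is 0 when distributed is not initialized
        if rank == 0:
            return func(*args, **kwargs)

    return wrapper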

@Fizzez

Fizzez commented Jun 7, 2022

@ayulockin Thank you for the quick reply.

Ideally MMCV's WandbLoggerHook should work in a multi-GPU setting, given that its methods are decorated with @master_only.

I see. I thought MMDetWandbHook could not run in a multi-GPU environment. My mistake.

If possible, could you please also share any ideas that might point to the cause of this issue? I'm actually working on it and need a quick fix.

@hhaAndroid
Collaborator

@ayulockin Thank you very much for helping to check.

@ayulockin
Contributor

Hey @linnanwang, @Fizzez, I tried training the model on 2 P100 GPUs by doing this: bash tools/dist_train.sh configs/mask_rcnn/mask_rcnn_r50_fpn_1x_wandb_coco.py 2

I couldn't reproduce the deadlock issue. If you check the system metrics on this W&B run page, you will see that memory is allocated on both GPUs and that both are used for training.

@linnanwang
Author

Perhaps it is a problem unique to ./configs/yolox/yolox_s_8x8_300e_coco.py?

@Fizzez

Fizzez commented Jun 7, 2022

In my case I used ./configs/yolo/yolov3_d53_mstrain-608_273e_coco.py.
Also, I have checked that running with MMCV's WandbLoggerHook is fine, so it may not be a problem unique to one config.

@linnanwang
Author

Perhaps this is a problem with this particular model. Could you take a look at yolox_s_8x8_300e_coco?

@MilkClouds

MilkClouds commented Aug 9, 2022

I experience the same phenomenon (a deadlock lasting over 30 minutes) with dyhead/atss_swin-l-p4-w12_fpn_dyhead_mstrain_2x_coco.py, but only in the distributed (more than 1 GPU) setting.
Single-GPU training is okay.

@haruishi43

I faced the same problem with mmseg using MMSegWandbHook (open-mmlab/mmsegmentation#2137). Oddly, this always occurs right after the first output of the TextLoggerHook.

@haruishi43

haruishi43 commented Oct 3, 2022

In my case, I found that putting MMSegWandbHook before the other loggers (i.e., TextLoggerHook) resolved the deadlock. This might solve your issue as well if you put MMDetWandbHook first, @linnanwang @MilkClouds. A config sketch of that ordering is shown below.
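
A minimal sketch of what that reordering could look like in an mmdet config; the interval and the init_kwargs values below are placeholders, not values taken from this thread:

log_config = dict(
    interval=50,
    hooks=[
        # W&B hook first: TextLoggerHook then becomes the last-registered
        # logger hook, gets reset_flag=True, and clears runner.log_buffer
        # on every rank
        dict(type='MMDetWandbHook',
             init_kwargs=dict(project='mmdetection')),  # placeholder kwargs
        dict(type='TextLoggerHook')
    ])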

@Fizzez

Fizzez commented Dec 6, 2022

Finally managed to solve this by setting reset_flag=True in TextLoggerHook (although it's kind of an ugly fix).

i.e., use a config like the following:

log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook', reset_flag=True),
        dict(type='MMDetWandbHook')  # set the MMDetWandbHook options properly here
    ])

As far as I have investigated, the deadlock is caused by the dist.all_reduce() here:
https://github.com/open-mmlab/mmdetection/blob/v2.25.0/mmdet/models/detectors/base.py#L204
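
For context, the code around that line is roughly the following consistency check (paraphrased from mmdet v2.25, not a verbatim copy; the helper name check_log_vars is mine): every rank must report the same number of loss keys before the losses are reduced.

import torch
import torch.distributed as dist


def check_log_vars(log_vars, device):
    # Paraphrase of the check in BaseDetector._parse_losses around the linked
    # line: if the ranks disagree on the number of log variables, they end up
    # blocking each other inside the collective calls.
    if dist.is_available() and dist.is_initialized():
        log_var_length = torch.tensor(len(log_vars), device=device)
        dist.all_reduce(log_var_length)  # the collective that hangs in this issue
        assert log_var_length == len(log_vars) * dist.get_world_size(), \
            'loss log variables are different across GPUs!'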

The reason seems to be:

  • The reset_flag of the logger hook with the lowest priority should be set to True, so that runner.log_buffer is cleared after logging is done.
  • If you put MMDetWandbHook after TextLoggerHook in your config file, MMDetWandbHook has the lowest priority, and thus its reset_flag is set to True in all processes, for all GPUs.
  • While setting reset_flag=True for MMDetWandbHook is fine on GPU 0, the real LAST HOOK on the GPUs other than 0, TextLoggerHook, does not get reset_flag=True set properly,
    • because MMDetWandbHook is basically master_only.
  • Consequently, on GPUs other than 0, TextLoggerHook never clears runner.log_buffer and keeps executing dist.reduce(mem_mb, 0, op=dist.ReduceOp.MAX) to collect the memory size, ignoring the logging interval. That dist.reduce seems to block the dist.all_reduce() I mentioned at the beginning... but I'm not quite sure.

p.s. I used the following environment

  • mmdet==2.25.0
    • ./configs/yolo/yolov3_d53_mstrain-608_273e_coco.py
  • mmcv==1.5.0
  • wandb==0.12.17
  • multi-GPU machine

@MilkClouds

MilkClouds commented Dec 10, 2022

(New) Error report and analysis

Hi, I had a different type of error, but it seems to occur for the same reason.

I ran bash tools/dist_train.sh configs/mask_rcnn/mask_rcnn_r50_fpn_1x_wandb_coco.py 2 and hit an error saying AssertionError: loss log variables are different across GPUs!

The attached mask_rcnn_r50_fpn_1x_wandb_coco.py.txt is the log text.

And this error occurs right after iteration 50 (if you look at the error log, you can see that 2022-12-10 16:27:07,287 - mmdet - INFO - Epoch [1][50/29317] lr: 1.978e-03, eta: 1 day, 12:27:04, time: 0.373, data_time: 0.103, memory: 3889, loss_rpn_cls: 0.5394, loss_rpn_bbox: 0.1028, loss_cls: 1.1100, acc: 86.3379, loss_bbox: 0.0865, loss_mask: 0.7602, loss: 2.5990 is printed right before the error message).

I got the console output below when I inserted print(self, log_dict, runner.log_buffer.output) in the TextLoggerHook.log function.

2022-12-10 17:04:02,551 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
<mmcv.runner.hooks.logger.text.TextLoggerHook object at 0x7f49792cb2e0> OrderedDict([('mode', 'train'), ('epoch', 1), ('iter', 50)]) OrderedDict([('data_time', 0.10089991092681885), ('loss_rpn_cls', 0.5407923412322998), ('loss_rpn_bbox', 0.10289064198732376), ('loss_cls', 1.1095942270755768), ('acc', 86.328125), ('loss_bbox', 0.08622131874784827), ('loss_mask', 0.7600687146186829), ('loss', 2.5995672702789308), ('time', 0.37457459926605224)])
<mmcv.runner.hooks.logger.text.TextLoggerHook object at 0x7fc47c4f93a0> OrderedDict([('mode', 'train'), ('epoch', 1), ('iter', 50)]) OrderedDict([('data_time', 0.11627291202545166), ('loss_rpn_cls', 0.5407923412322998), ('loss_rpn_bbox', 0.10289064198732376), ('loss_cls', 1.1095942270755768), ('acc', 86.328125), ('loss_bbox', 0.08622131874784827), ('loss_mask', 0.7600687146186829), ('loss', 2.5995672702789308), ('time', 0.7914962005615235)])
2022-12-10 17:04:16,786 - mmdet - INFO - Epoch [1][50/29317]    lr: 1.978e-03, eta: 1 day, 12:35:58, time: 0.375, data_time: 0.101, memory: 3901, loss_rpn_cls: 0.5408, loss_rpn_bbox: 0.1029, loss_cls: 1.1096, acc: 86.3281, loss_bbox: 0.0862, loss_mask: 0.7601, loss: 2.5996
<mmcv.runner.hooks.logger.text.TextLoggerHook object at 0x7fc47c4f93a0> OrderedDict([('mode', 'train'), ('epoch', 1), ('iter', 51)]) OrderedDict([('data_time', 0.11627291202545166), ('loss_rpn_cls', 0.5407923412322998), ('loss_rpn_bbox', 0.10289064198732376), ('loss_cls', 1.1095942270755768), ('acc', 86.328125), ('loss_bbox', 0.08622131874784827), ('loss_mask', 0.7600687146186829), ('loss', 2.5995672702789308), ('time', 0.7914962005615235)])

We can see that TextLoggerHook for GPU 1 is called on iteration 51, which is not intended. This is because GPU 1's (or 2, 3, 4, ...) runner.log_buffer is not cleared, so the condition at https://github.com/open-mmlab/mmcv/blob/v1.5.0/mmcv/runner/hooks/logger/base.py#L137 is satisfied and self.log(runner) is called. A paraphrased sketch of that logic follows.
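
Roughly, the tail of mmcv v1.5.0's LoggerHook.after_train_iter behaves like the sketch below (paraphrased, not a verbatim copy; the function name after_train_iter_sketch is mine):

def after_train_iter_sketch(hook, runner):
    # Once runner.log_buffer.ready has been set by log_buffer.average(), it
    # stays True until clear_output() runs, so a rank on which clear_output()
    # never happens calls hook.log() on every subsequent iteration.
    if runner.log_buffer.ready:               # the condition at base.py#L137
        hook.log(runner)                      # TextLoggerHook.log -> dist.reduce(mem_mb, ...)
        if hook.reset_flag:
            runner.log_buffer.clear_output()  # resets ready to False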

My guess is that the reason AssertionError: loss log variables are different across GPUs! occurred instead of a deadlock is the PyTorch version. I used PyTorch 1.8.2 this time, and when I experienced the deadlock I used PyTorch 1.12.0. I suspect a different implementation of dist.all_reduce between PyTorch versions may produce a different result, but I'm not quite sure.

Why the error occurs (summary)

I agree with @Fizzez's opinion.

  1. When we set the order of hooks to TextLoggerHook -> MMDetWandbHook,
  2. MMDetWandbHook's reset_flag is set to True.
  3. On GPU 0, MMDetWandbHook's runner.log_buffer.clear_output() in LoggerHook is called.
  4. Meanwhile, on GPU 1 (2, 3, 4, ...), MMDetWandbHook's runner.log_buffer.clear_output() in LoggerHook is NOT called, since MMDetWandbHook is @master_only.
  5. Since GPU 1's (2, 3, 4, ...) runner.log_buffer is not cleared, self.log of TextLoggerHook keeps being called after the first log (even when it is not time to log).
  6. In TextLoggerHook.log, there is a dist.reduce call.
  7. Since this extra TextLoggerHook.log call happens on every GPU except GPU 0, the dist.reduce causes a deadlock (the other GPUs block on dist.reduce while GPU 0 blocks on dist.all_reduce); a standalone toy repro of this mismatch is sketched right after this list.
    • But depending on the PyTorch version, this problem causes either a global deadlock (on 1.12.0) or a local deadlock (on 1.8.2, which leads to AssertionError: loss log variables are different across GPUs!). The detailed reason why different errors occur is not quite clear to me.
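
A standalone toy sketch of that mismatch, with all mmdet/mmcv code stripped away (2 CPU processes on the gloo backend; the address and port are arbitrary illustration values). Because the two ranks enter different collectives, the behaviour is undefined: depending on backend and PyTorch version it can hang or fail, which is consistent with the deadlock-vs-AssertionError difference described above. Expect it to hang; stop it with Ctrl+C.

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # toy setup for a local 2-process group
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29501'
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

    t = torch.tensor([float(rank)])
    if rank == 0:
        dist.all_reduce(t)                            # like _parse_losses on GPU 0
    else:
        dist.reduce(t, dst=0, op=dist.ReduceOp.MAX)   # like TextLoggerHook's memory reduce
    print(f'rank {rank} finished')                    # likely never reached


if __name__ == '__main__':
    mp.spawn(worker, args=(2,), nprocs=2)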

Solution

The solution is simple: make MMDetWandbHook clear runner.log_buffer in every process, including the processes for GPUs 1, 2, 3, ... A rough sketch of the idea follows.
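
A minimal sketch of that idea (not the actual PR; it assumes the root cause is that MMDetWandbHook.after_train_iter is decorated with @master_only, so the inherited buffer clearing never runs on non-master ranks; the class name and import path below are illustrative only):

from mmcv.runner import HOOKS, get_dist_info
from mmdet.core.hook import MMDetWandbHook  # import path assumed for this sketch


@HOOKS.register_module()
class RankSafeMMDetWandbHook(MMDetWandbHook):

    def after_train_iter(self, runner):  # note: NOT decorated with @master_only
        rank, _ = get_dist_info()
        if rank == 0:
            # rank 0 keeps the original (master-only) W&B logging behaviour
            super().after_train_iter(runner)
        elif self.reset_flag:
            # other ranks only mirror the buffer bookkeeping, so TextLoggerHook
            # on ranks 1..N stays in step with rank 0
            runner.log_buffer.clear_output()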

Since I've figured out why this error occurs, I'm going to make a PR for it, but since @Fizzez figured out the key reason, I'm going to mention you as a co-author. Is that okay with you? I'm also going to make a PR on mmseg (if possible).

@Fizzez

Fizzez commented Dec 11, 2022

Hi @MilkClouds, thank you for the analysis. Glad to see that we have the same opinion on this. Your solution actually makes more sense, by letting MMDetWandbHook's runner.log_buffer.clear_output() get called.
Also, thank you for mentioning me in your PR. I really appreciate it.

@hankyul2

@MilkClouds Thank you for your solution, it saved my day.
