
deadlock using Wandb #8145

Open
linnanwang opened this issue Jun 7, 2022 · 19 comments
Labels: bug (Something isn't working)

Comments

@linnanwang

Thanks for your error report and we appreciate it a lot.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. I have read the FAQ documentation but cannot get the expected help.
  3. The bug has not been fixed in the latest version.

Describe the bug
Hello mmdet developers,

We found that the training loop can deadlock in some places when we use multi-GPU training and enable wandb tracking. Single-GPU training works perfectly fine. I have only tested with YOLOX; please see the command below.

Reproduction

  1. What command or script did you run?
./tools/dist_train.sh ./configs/yolox/yolox_s_8x8_300e_coco.py 2
  2. Did you make any modifications to the code or config? Did you understand what you have modified?
    No
  3. What dataset did you use?
    MSCOCO

Environment

  1. Please run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.
    sys.platform: linux
    Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
    CUDA available: True
    GPU 0,1: Quadro GV100
    CUDA_HOME: /usr/local/cuda
    NVCC: Build cuda_11.3.r11.3/compiler.29745058_0
    GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
    PyTorch: 1.10.0
    PyTorch compiling details: PyTorch built with:
  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.3
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.2
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.0
OpenCV: 4.5.5
MMCV: 1.4.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMDetection: 2.25.0+ca11860

  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source]
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)
      We used the provided docker.

Error traceback
If applicable, paste the error traceback here.

A placeholder for the traceback.

Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

@hhaAndroid
Collaborator

@linnanwang Thanks for your feedback. Could you please help verify this, @ayulockin? My machines are all disconnected from the network, so I have no way to verify it.

hhaAndroid added the bug (Something isn't working) label on Jun 7, 2022
@hhaAndroid
Collaborator

@linnanwang Are you referring to configs/mask_rcnn/mask_rcnn_r50_fpn_1x_wandb_coco.py, which was added with the MMDetWandbHook configuration?

@linnanwang
Author

@hhaAndroid Thanks for the quick response. I'm referring to ./configs/yolox/yolox_s_8x8_300e_coco.py.

@ayulockin
Contributor

Hey @linnanwang, thanks for raising this. As I understand it, MMDetWandbHook is not working properly in a multi-GPU setting.

Did you get any error from W&B or MMDetection that you can share? I will test the same on my machine and let you know.

@Fizzez

Fizzez commented Jun 7, 2022

Did you get any error from W&B or MMDetection that you can share? I will test the same on my machine and let you know.

Hi @ayulockin, I'm facing the same issue here. No error was reported, but I found that GPU utilization went to 100% with 0 "GPU Time Spent Accessing Memory" (which indicates a deadlock, if I understand correctly?). Then the run hung.

It happened at the last iteration of the 1st epoch. I was wondering whether this issue was related to #6486 until I came here.

By the way, does the WandbLoggerHook from mmcv (https://github.com/open-mmlab/mmcv/blob/ea173c9f07f0abf6873d2b7d786fb6411843cf00/mmcv/runner/hooks/logger/wandb.py) work with multi-GPU?

@ayulockin
Contributor

Thanks for the additional info, @Fizzez.

Ideally MMCV's WandbLoggerHook should work in a multi-GPU setting, given that its methods are decorated with @master_only. The same applies to this hook. I am investigating.
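
For context, mmcv's @master_only decorator is roughly the following (a paraphrased sketch under the mmcv 1.x API; the name master_only_sketch is mine): it simply turns the wrapped method into a no-op on every rank except rank 0.

import functools

from mmcv.runner import get_dist_info


def master_only_sketch(func):
    # Paraphrase of mmcv's @master_only: the wrapped method runs only on the
    # master process (rank 0) and silently does nothing on all other ranks.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        rank, _ = get_dist_info()  # rank is 0 when distributed is not initialized
        if rank == 0:
            return func(*args, **kwargs)

    return wrapper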

@Fizzez

Fizzez commented Jun 7, 2022

@ayulockin Thank you for the quick reply.

Ideally MMCV's WandbLoggerHook should work in a multi-GPU setting, given that its methods are decorated with @master_only.

I see. I thought MMDetWandbHook could not run in a multi-GPU environment. My mistake.

If possible, could you please also share any ideas that might point to the cause of this issue? I'm actually working on it and need a quick fix.

@hhaAndroid
Collaborator

@ayulockin Thank you very much for helping to check.

@ayulockin
Contributor

Hey @linnanwang, @Fizzez, I tried training the model on 2 P100 GPUs by doing this: bash tools/dist_train.sh configs/mask_rcnn/mask_rcnn_r50_fpn_1x_wandb_coco.py 2

I couldn't reproduce the deadlock issue. If you check the system metrics on this W&B run page, you will see that memory is allocated on both GPUs and that both are used for training.

@linnanwang
Author

Perhaps it is a problem unique to ./configs/yolox/yolox_s_8x8_300e_coco.py?

@Fizzez

Fizzez commented Jun 7, 2022

In my case I used ./configs/yolo/yolov3_d53_mstrain-608_273e_coco.py.
Also, I have checked that running with MMCV's WandbLoggerHook is fine, so it may not be a problem unique to one config.

@linnanwang
Author

Perhaps this is a problem with this particular model. Could you take a look at yolox_s_8x8_300e_coco?

@MilkClouds

MilkClouds commented Aug 9, 2022

I experience the same phenomenon (a deadlock lasting over 30 minutes) with dyhead/atss_swin-l-p4-w12_fpn_dyhead_mstrain_2x_coco.py, but only in the distributed (more than 1 GPU) setting.
Single-GPU training is okay.

@haruishi43

I faced the same problem with mmseg using MMSegWandbHook (open-mmlab/mmsegmentation#2137). Oddly, this always occurs right after the first output of the TextLoggerHook.

@haruishi43

haruishi43 commented Oct 3, 2022

In my case, I found that putting MMSegWandbHook before the other loggers (i.e., TextLoggerHook) resolved the deadlock. This might solve your issue as well if you put MMDetWandbHook first, @linnanwang @MilkClouds. A config sketch of that ordering is shown below.
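
A minimal sketch of what that reordering could look like in an mmdet config; the interval and the init_kwargs values below are placeholders, not values taken from this thread:

log_config = dict(
    interval=50,
    hooks=[
        # W&B hook first: TextLoggerHook then becomes the last-registered
        # logger hook, gets reset_flag=True, and clears runner.log_buffer
        # on every rank
        dict(type='MMDetWandbHook',
             init_kwargs=dict(project='mmdetection')),  # placeholder kwargs
        dict(type='TextLoggerHook')
    ])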

@Fizzez

Fizzez commented Dec 6, 2022

Finally managed to solve this by setting reset_flag=True in TextLoggerHook (although it's kind of an ugly fix).

i.e., use a config like the following:

log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook', reset_flag=True),
        dict(type='MMDetWandbHook')  # set the MMDetWandbHook options properly here
    ])

As far as I have investigated, the deadlock is caused by the dist.all_reduce() here:
https://github.com/open-mmlab/mmdetection/blob/v2.25.0/mmdet/models/detectors/base.py#L204
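
For context, the code around that line is roughly the following consistency check (paraphrased from mmdet v2.25, not a verbatim copy; the helper name check_log_vars is mine): every rank must report the same number of loss keys before the losses are reduced.

import torch
import torch.distributed as dist


def check_log_vars(log_vars, device):
    # Paraphrase of the check in BaseDetector._parse_losses around the linked
    # line: if the ranks disagree on the number of log variables, they end up
    # blocking each other inside the collective calls.
    if dist.is_available() and dist.is_initialized():
        log_var_length = torch.tensor(len(log_vars), device=device)
        dist.all_reduce(log_var_length)  # the collective that hangs in this issue
        assert log_var_length == len(log_vars) * dist.get_world_size(), \
            'loss log variables are different across GPUs!'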

The reason seems to be:

  • The reset_flag of the logger hook with the lowest priority should be set to True, so that runner.log_buffer is cleared after logging is done.
  • If you put MMDetWandbHook after TextLoggerHook in your config file, MMDetWandbHook has the lowest priority, and thus its reset_flag is set to True in all processes, for all GPUs.
  • While setting reset_flag=True for MMDetWandbHook is fine on GPU 0, the real LAST HOOK on the GPUs other than 0, TextLoggerHook, does not get reset_flag=True set properly,
    • because MMDetWandbHook is basically master_only.
  • Consequently, on GPUs other than 0, TextLoggerHook never clears runner.log_buffer and keeps executing dist.reduce(mem_mb, 0, op=dist.ReduceOp.MAX) to collect the memory size, ignoring the logging interval. That dist.reduce seems to block the dist.all_reduce() I mentioned at the beginning... but I'm not quite sure.

p.s. I used the following environment

  • mmdet==2.25.0
    • ./configs/yolo/yolov3_d53_mstrain-608_273e_coco.py
  • mmcv==1.5.0
  • wandb==0.12.17
  • multi-GPU machine

@MilkClouds

MilkClouds commented Dec 10, 2022

(New) Error report and analysis

Hi, I had a different type of error, but it seems to occur for the same reason.

I ran bash tools/dist_train.sh configs/mask_rcnn/mask_rcnn_r50_fpn_1x_wandb_coco.py 2 and hit an error saying AssertionError: loss log variables are different across GPUs!

The attached mask_rcnn_r50_fpn_1x_wandb_coco.py.txt is the log text.

And this error occurs right after iteration 50 (if you look at the error log, you can see that 2022-12-10 16:27:07,287 - mmdet - INFO - Epoch [1][50/29317] lr: 1.978e-03, eta: 1 day, 12:27:04, time: 0.373, data_time: 0.103, memory: 3889, loss_rpn_cls: 0.5394, loss_rpn_bbox: 0.1028, loss_cls: 1.1100, acc: 86.3379, loss_bbox: 0.0865, loss_mask: 0.7602, loss: 2.5990 is printed right before the error message).

I got the console output below when I inserted print(self, log_dict, runner.log_buffer.output) in the TextLoggerHook.log function.

2022-12-10 17:04:02,551 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
<mmcv.runner.hooks.logger.text.TextLoggerHook object at 0x7f49792cb2e0> OrderedDict([('mode', 'train'), ('epoch', 1), ('iter', 50)]) OrderedDict([('data_time', 0.10089991092681885), ('loss_rpn_cls', 0.5407923412322998), ('loss_rpn_bbox', 0.10289064198732376), ('loss_cls', 1.1095942270755768), ('acc', 86.328125), ('loss_bbox', 0.08622131874784827), ('loss_mask', 0.7600687146186829), ('loss', 2.5995672702789308), ('time', 0.37457459926605224)])
<mmcv.runner.hooks.logger.text.TextLoggerHook object at 0x7fc47c4f93a0> OrderedDict([('mode', 'train'), ('epoch', 1), ('iter', 50)]) OrderedDict([('data_time', 0.11627291202545166), ('loss_rpn_cls', 0.5407923412322998), ('loss_rpn_bbox', 0.10289064198732376), ('loss_cls', 1.1095942270755768), ('acc', 86.328125), ('loss_bbox', 0.08622131874784827), ('loss_mask', 0.7600687146186829), ('loss', 2.5995672702789308), ('time', 0.7914962005615235)])
2022-12-10 17:04:16,786 - mmdet - INFO - Epoch [1][50/29317]    lr: 1.978e-03, eta: 1 day, 12:35:58, time: 0.375, data_time: 0.101, memory: 3901, loss_rpn_cls: 0.5408, loss_rpn_bbox: 0.1029, loss_cls: 1.1096, acc: 86.3281, loss_bbox: 0.0862, loss_mask: 0.7601, loss: 2.5996
<mmcv.runner.hooks.logger.text.TextLoggerHook object at 0x7fc47c4f93a0> OrderedDict([('mode', 'train'), ('epoch', 1), ('iter', 51)]) OrderedDict([('data_time', 0.11627291202545166), ('loss_rpn_cls', 0.5407923412322998), ('loss_rpn_bbox', 0.10289064198732376), ('loss_cls', 1.1095942270755768), ('acc', 86.328125), ('loss_bbox', 0.08622131874784827), ('loss_mask', 0.7600687146186829), ('loss', 2.5995672702789308), ('time', 0.7914962005615235)])

We can see that TextLoggerHook for GPU 1 is called on iteration 51, which is not intended. This is because GPU 1's (or 2, 3, 4, ...) runner.log_buffer is not cleared, so the condition at https://github.com/open-mmlab/mmcv/blob/v1.5.0/mmcv/runner/hooks/logger/base.py#L137 is satisfied and self.log(runner) is called. A paraphrased sketch of that logic follows.
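
Roughly, the tail of mmcv v1.5.0's LoggerHook.after_train_iter behaves like the sketch below (paraphrased, not a verbatim copy; the function name after_train_iter_sketch is mine):

def after_train_iter_sketch(hook, runner):
    # Once runner.log_buffer.ready has been set by log_buffer.average(), it
    # stays True until clear_output() runs, so a rank on which clear_output()
    # never happens calls hook.log() on every subsequent iteration.
    if runner.log_buffer.ready:               # the condition at base.py#L137
        hook.log(runner)                      # TextLoggerHook.log -> dist.reduce(mem_mb, ...)
        if hook.reset_flag:
            runner.log_buffer.clear_output()  # resets ready to False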

My guess is that the reason AssertionError: loss log variables are different across GPUs! occurred instead of a deadlock is the PyTorch version. I used PyTorch 1.8.2 this time, and when I experienced the deadlock I used PyTorch 1.12.0. I suspect a different implementation of dist.all_reduce between PyTorch versions may produce a different result, but I'm not quite sure.

Why the error occurs (summary)

I agree with @Fizzez's opinion.

  1. When we set the order of hooks to TextLoggerHook -> MMDetWandbHook,
  2. MMDetWandbHook's reset_flag is set to True.
  3. On GPU 0, MMDetWandbHook's runner.log_buffer.clear_output() in LoggerHook is called.
  4. Meanwhile, on GPU 1 (2, 3, 4, ...), MMDetWandbHook's runner.log_buffer.clear_output() in LoggerHook is NOT called, since MMDetWandbHook is @master_only.
  5. Since GPU 1's (2, 3, 4, ...) runner.log_buffer is not cleared, self.log of TextLoggerHook keeps being called after the first log (even when it is not time to log).
  6. In TextLoggerHook.log, there is a dist.reduce call.
  7. Since this extra TextLoggerHook.log call happens on every GPU except GPU 0, the dist.reduce causes a deadlock (the other GPUs block on dist.reduce while GPU 0 blocks on dist.all_reduce); a standalone toy repro of this mismatch is sketched right after this list.
    • But depending on the PyTorch version, this problem causes either a global deadlock (on 1.12.0) or a local deadlock (on 1.8.2, which leads to AssertionError: loss log variables are different across GPUs!). The detailed reason why different errors occur is not quite clear to me.
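
A standalone toy sketch of that mismatch, with all mmdet/mmcv code stripped away (2 CPU processes on the gloo backend; the address and port are arbitrary illustration values). Because the two ranks enter different collectives, the behaviour is undefined: depending on backend and PyTorch version it can hang or fail, which is consistent with the deadlock-vs-AssertionError difference described above. Expect it to hang; stop it with Ctrl+C.

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # toy setup for a local 2-process group
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29501'
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

    t = torch.tensor([float(rank)])
    if rank == 0:
        dist.all_reduce(t)                            # like _parse_losses on GPU 0
    else:
        dist.reduce(t, dst=0, op=dist.ReduceOp.MAX)   # like TextLoggerHook's memory reduce
    print(f'rank {rank} finished')                    # likely never reached


if __name__ == '__main__':
    mp.spawn(worker, args=(2,), nprocs=2)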

Solution

The solution is simple: make MMDetWandbHook clear runner.log_buffer in every process, including the processes for GPUs 1, 2, 3, ... A rough sketch of the idea follows.
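
A minimal sketch of that idea (not the actual PR; it assumes the root cause is that MMDetWandbHook.after_train_iter is decorated with @master_only, so the inherited buffer clearing never runs on non-master ranks; the class name and import path below are illustrative only):

from mmcv.runner import HOOKS, get_dist_info
from mmdet.core.hook import MMDetWandbHook  # import path assumed for this sketch


@HOOKS.register_module()
class RankSafeMMDetWandbHook(MMDetWandbHook):

    def after_train_iter(self, runner):  # note: NOT decorated with @master_only
        rank, _ = get_dist_info()
        if rank == 0:
            # rank 0 keeps the original (master-only) W&B logging behaviour
            super().after_train_iter(runner)
        elif self.reset_flag:
            # other ranks only mirror the buffer bookkeeping, so TextLoggerHook
            # on ranks 1..N stays in step with rank 0
            runner.log_buffer.clear_output()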

Since I've figured out why this error occurs, I'm going to make a PR for it, but since @Fizzez figured out the key reason, I'm going to mention you as a co-author. Is that okay with you? I'm also going to make a PR on mmseg (if possible).

@Fizzez

Fizzez commented Dec 11, 2022

Hi @MilkClouds, thank you for the analysis. Glad to see that we have the same opinion on this. Your solution actually makes more sense, by letting MMDetWandbHook's runner.log_buffer.clear_output() get called.
Also, thank you for mentioning me in your PR. I really appreciate it.

@hankyul2

@MilkClouds Thank you for your solution, it saved my day.
