deadlock using Wandb #8145
Comments
@linnanwang Thanks for your feedback. Could you please help me verify it, @ayulockin? My machines are all disconnected from the network, so there is no way for me to verify.
@linnanwang Are you referring to the …?
@hhaAndroid Thanks for the quick response. I'm referring to ./configs/yolox/yolox_s_8x8_300e_coco.py.
Hey @linnanwang, thanks for raising this. As I understand it, MMDetWandbHook is not working properly in a multi-GPU setting. Did you get any error from W&B or MMDetection that you can share? I will test the same on my machine and let you know.
Hi @ayulockin, I'm facing the same issue here. No bug was reported, but I found that GPU utilization went to 100% with 0 "GPU Time Spent Accessing Memory" (which indicates a deadlock, if I understand correctly?). Then the run hung up. It happened at the last iteration of the 1st epoch. I was wondering if this issue was related to #6486 until I came here. By the way, is the one from mmcv (https://github.com/open-mmlab/mmcv/blob/ea173c9f07f0abf6873d2b7d786fb6411843cf00/mmcv/runner/hooks/logger/wandb.py) workable with multi-GPU?
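For reference, MMCV's generic WandbLoggerHook mentioned above is enabled through the same log_config mechanism as the other logger hooks. A minimal sketch, assuming MMCV 1.x; the project name and interval are placeholders, not values from this thread:

```python
# Minimal sketch of enabling MMCV's generic WandbLoggerHook via log_config.
# Assumes MMCV 1.x; the project name and interval are placeholders,
# not values taken from this issue.
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook'),
        dict(
            type='WandbLoggerHook',
            init_kwargs=dict(project='my-project'),  # hypothetical W&B project name
            interval=50),
    ])
```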
Thanks for the additional info @Fizzez. Ideally MMCV's …
@ayulockin Thank you for the quick reply.
I see. I thought the … If it's possible, could you please also share any ideas that may point to this issue? Actually, I am working on it and need a quick fix.
@ayulockin Thank you very much for helping to check.
Hey @linnanwang, @Fizzez, I tried training the model on 2 P100 GPUs by doing this: … I couldn't reproduce the deadlock issue. If you check the system metrics in this W&B run page, you will see that memory is allocated for both GPUs and that both are used for training.
Perhaps it is a unique problem of ./configs/yolox/yolox_s_8x8_300e_coco.py?
In my case I used ./configs/yolo/yolov3_d53_mstrain-608_273e_coco.py.
Perhaps this is a problem with this particular model; could you take a look at yolox_s_8x8_300e_coco?
I experience the same phenomenon (a deadlock lasting over 30 minutes) on …
I faced the same problem with …
In my case, I found out that putting …
Finally managed to solve this by setting reset_flag=True for TextLoggerHook, i.e. use a config like the following:

log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook', reset_flag=True),
        dict(type='MMDetWandbHook')  # set MMDetWandbHook options properly here
    ])

As far as I investigated, the deadlock was caused by … The reason seems to be: …
P.S. I used the following environment: …
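For anyone reproducing this workaround, the placeholder above can be filled in with MMDetWandbHook's documented options. A sketch, assuming MMDetection 2.25; the project name and thresholds are placeholders, not values from this thread:

```python
# Sketch of the workaround config with MMDetWandbHook filled in.
# reset_flag=True tells TextLoggerHook to clear the runner's log buffer after
# it logs (MMCV 1.x behavior). The MMDetWandbHook arguments follow the options
# documented for MMDetection 2.25; project name and thresholds are placeholders.
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook', reset_flag=True),
        dict(
            type='MMDetWandbHook',
            init_kwargs=dict(project='mmdetection'),  # hypothetical W&B project
            interval=50,
            log_checkpoint=True,
            log_checkpoint_metadata=True,
            num_eval_images=100,
            bbox_score_thr=0.3),
    ])
```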
(New) Error report and analysis
Hi, I had a different type of error, but it seems to occur for the same reason. I ran …; mask_rcnn_r50_fpn_1x_wandb_coco.py.txt is a log text. This error occurs right after 50 iterations. (If you see the error log, you can view …) I can check the console output below when I inserted …
We can check that … As far as I can guess, the reason why …
The reason why the error occurs (summary)
I agree with @Fizzez's opinion. …
Solution
The solution is simple: make … Since I've figured out why this error occurs, I'm going to make a PR for it, but since @Fizzez figured out the key reason, I'm going to mention you as a co-author. Is that okay with you? I'm also going to make a PR for mmseg too (if possible).
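A small inspection helper can confirm that a log_config change like the one above actually took effect. A sketch, assuming MMCV 1.x, where logger hooks expose a reset_flag attribute and the runner lists registered hooks via runner.hooks:

```python
# Hypothetical debugging helper (not part of the reported fix): print the
# registered logger hooks and their reset_flag values, so you can verify
# which hook is expected to clear the log buffer. Assumes MMCV 1.x.
from mmcv.runner.hooks import LoggerHook


def report_logger_hooks(runner):
    for hook in runner.hooks:  # BaseRunner exposes registered hooks via .hooks
        if isinstance(hook, LoggerHook):
            print(f'{type(hook).__name__}: reset_flag={hook.reset_flag}')
```

Calling report_logger_hooks(runner) right after the runner is built, before training starts, shows the effective logger-hook setup.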
Hi @MilkClouds, thank you for the analysis. Glad to see that we have the same opinion on this. Your solution actually makes more sense by letting …
@MilkClouds Thank you for your solution, it saved my day.
Thanks for your error report and we appreciate it a lot.
Checklist
Describe the bug
Hello mmdet developers,
We found that the training loop can deadlock in some places if we use multi-GPU training and enable wandb tracking. Single-GPU training works perfectly fine. I only tested with YOLOX. Please see the command below.
Reproduction
Modifications to the code or config: No
Dataset: MSCOCO
Environment
Run python mmdet/utils/collect_env.py to collect necessary environment information and paste it here.
sys.platform: linux
Python: 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0]
CUDA available: True
GPU 0,1: Quadro GV100
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.3.r11.3/compiler.29745058_0
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.10.0
PyTorch compiling details: PyTorch built with:
TorchVision: 0.11.0
OpenCV: 4.5.5
MMCV: 1.4.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMDetection: 2.25.0+ca11860
Other environment variables that may be related ($PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.): We used the provided docker.
Error traceback
If applicable, paste the error traceback here.
Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!