Loss goes to NaN at 150K Iterations #8
Comments
Hmm, that's quite strange; I will look into it. BTW, you could evaluate the model weights saved just before the loss goes to NaN and report the mAP here. That would help me determine what the problem is.
Thank you again for your quick replies! I ran the eval code using the model weights from iteration 145K (just before the loss goes to NaN) and the mAP I got was 56.06. The per-category APs are as follows: AP for aeroplane = 0.6938 Hope this information will be useful!
According to my learning rate decay schedule, the lr at iteration 150k should be 1e-5. That is a small value, so I don't think training should break at this point. Also, in my experience training BiDet, the network should have converged by iteration 150k, so I would expect the mAP to be around 66.0 before the loss goes to NaN.
BTW, I have to say that training binary neural networks, especially binary detectors, is very unstable. In my experiments, I have to watch the loss curve and sometimes manually adjust the learning rate if training "breaks". Training of binary-SSD often breaks, while binary-Faster R-CNN is much more stable. One indicator that binary-SSD training has broken is that the cls loss (termed 'conf' in the saved weight files) suddenly drops by a large amount within a few iterations (e.g. 3.55 --> 3.54 --> 3.52 --> 3.40). In that case we should kill the program, manually decay the lr by 0.1, and then continue training. Besides, the lr decay schedule in config.py is just an empirical one. I tried running the code multiple times, and sometimes you need to decay earlier to prevent training from breaking.
Also, if you use a different PyTorch version, you may get different results. For example, I set up several conda virtual environments on one Ubuntu server and ran the code in each. For BiDet-SSD on Pascal VOC, I got mAP 66.6% using PyTorch 1.5 (2020.5) and mAP 65.4% using PyTorch 1.2 (2020.3), while the mAP 66.0% reported in the paper was obtained with PyTorch 1.0 (2019.11). I really don't know why; maybe it is just because training binary neural networks is so unstable and full of uncertainty. Different lr schedules, or even different weight initializations, lead to different results.
So my suggestion is to try training again while monitoring the loss of BiDet-SSD. Manually decaying the learning rate should give you more stable results, I think.
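The breakage indicator described above (the conf loss dropping much faster than its recent trend) can be sketched as a small check over logged loss values. This is only an illustration: the function name `conf_loss_breaks` and the thresholds `factor` and `min_drop` are assumptions made here, not part of the BiDet code, and would need tuning against a real loss curve.

```python
def conf_loss_breaks(history, factor=3.0, min_drop=0.05):
    """Return True if the newest conf-loss value fell much faster than the recent trend.

    history: recent conf-loss values, oldest first (hypothetical logging format).
    factor, min_drop: illustrative thresholds, not values taken from the repo.
    """
    if len(history) < 3:
        return False  # not enough points to estimate a trend
    *earlier, last = history
    latest_drop = earlier[-1] - last
    # Mean absolute step-to-step change over the earlier values.
    steps = [abs(b - a) for a, b in zip(earlier, earlier[1:])]
    typical = sum(steps) / len(steps)
    return latest_drop > min_drop and latest_drop > factor * typical


# The sequence quoted in the comment above triggers the check;
# a slow, steady decrease does not.
broken = conf_loss_breaks([3.55, 3.54, 3.52, 3.40])
steady = conf_loss_breaks([3.55, 3.54, 3.52, 3.51])
print(broken, steady)
```

A check like this only flags a candidate break; deciding whether to kill the run and decay the lr is still a manual judgment, as the comment describes.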
Ah, I'm sorry, I didn't see the response you posted before my last comment. 56 is much lower than 66, so something must have gone wrong in the training procedure. Perhaps training "broke" earlier, before iteration 145k? Does the conf loss decrease abnormally as I described in my last comment (a large decrease within 5k iterations)? If so, the weights at iteration 145k will certainly be affected and perform badly.
I checked, and the conf_loss is relatively stable but jumps from 0.6536 to 2.2935 between iterations 145K and 150K. I'll try different learning rate schedules as suggested.
Indeed, to get good performance, I'd recommend monitoring the loss and decaying the lr only when training breaks at the current lr (the conf loss decreases rapidly). The best way is to kill the program when training breaks and restart with a decayed lr from the weights saved before the break. I guess this is because binary neural networks easily get stuck in local minima, so the more iterations you train with a large lr, the less likely you are to get stuck in a local minimum and the better the performance you will get (at least in the case of binary detectors).
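The kill-and-restart step can be sketched as picking the newest checkpoint saved before the break and resuming from it with a decayed lr. The helper name `restart_plan`, the checkpoint iterations, and the starting lr below are all hypothetical, and "decay by 0.1" is interpreted here as multiplying the lr by 0.1.

```python
def restart_plan(checkpoint_iters, break_iter, current_lr, decay=0.1):
    """Pick the newest checkpoint saved strictly before the break, plus the decayed lr.

    All argument names are hypothetical; decay=0.1 assumes "decay lr by 0.1"
    means multiplying the current lr by 0.1.
    """
    earlier = [it for it in checkpoint_iters if it < break_iter]
    if not earlier:
        raise ValueError("no checkpoint was saved before the break")
    return max(earlier), current_lr * decay


# Illustrative values only: checkpoints every 5k iterations, break seen at 150k.
resume_iter, new_lr = restart_plan([135000, 140000, 145000, 150000], 150000, 1e-4)
print(resume_iter, new_lr)
```

How the resumed run is actually launched (which flags carry the checkpoint path and the new lr) depends on the training script and is not shown here.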
@killawhale2 were you able to get to ~65% accuracy in the end? It would be great to hear from folks who have managed to replicate it (so we can try to make a robust recipe).
Through this issue, I've fixed the problem with the prior/reg loss weights as per the author's response (add 1e-6 to avoid divide by zero).
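For readers following along, the epsilon fix amounts to something like the sketch below. The function name `safe_ratio` and the place where the epsilon is added are assumptions made for illustration; this is not code copied from the repository.

```python
def safe_ratio(numerator, denominator, eps=1e-6):
    """Divide while guarding against a zero denominator.

    Assumes the denominator is non-negative (e.g. a count or a sum of
    non-negative weights); eps=1e-6 matches the value mentioned above.
    """
    return numerator / (denominator + eps)


# With a zero denominator the result stays finite instead of raising or producing NaN.
zero_case = safe_ratio(0.0, 0.0)
normal_case = safe_ratio(2.0, 4.0)
print(zero_case, normal_case)
```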
However, I noticed that my loc_loss and reg_loss became NaN.
I retried with gradient clipping enabled by setting the --clip_grad option to true.
My loc_loss and reg_loss still became NaN at iteration 150k and the training failed.
The exact command I ran was the following:
python ssd/train_bidet_ssd.py --dataset VOC --data_root ./data/VOCdevkit/ --basenet ./ssd/pretrain/vgg16.pth --clip_grad true
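One way to narrow down where the NaN first appears is to check every loss component at each iteration and stop at the first non-finite value. The sketch below is a generic illustration with made-up numbers; the helper `non_finite_losses` is hypothetical and is not part of train_bidet_ssd.py.

```python
import math


def non_finite_losses(losses):
    """Return the names of loss components that are NaN or infinite."""
    return [name for name, value in losses.items() if not math.isfinite(value)]


# Illustrative values only; in a real training loop these would be plain
# Python floats taken from the loss tensors (e.g. via .item() in PyTorch).
step_losses = {"loc_loss": float("nan"), "conf_loss": 2.2935, "reg_loss": float("nan")}
bad = non_finite_losses(step_losses)
if bad:
    print("non-finite losses:", bad)
```

Stopping as soon as the list is non-empty keeps the last finite checkpoint and the offending iteration easy to identify.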
Any help would be appreciated.