About training problem #7
I think it is possibly due to the calculation in the loss function. The CD loss incorrectly gives a negative value in some rare cases, which then turns into nan because of the sqrt. But the original code is fine. Did you change anything in the loss?
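For reference, a minimal sketch of the clamp-before-sqrt idea, assuming the CD kernel returns squared nearest-neighbor distances (the function and variable names here are illustrative, not the repository's exact code):

```python
import torch

def chamfer_l1(dist1, dist2, eps=0.0):
    # dist1, dist2: squared nearest-neighbor distances from the CD kernel, shape (B, N).
    # Numerical error can make a few entries slightly negative; sqrt of a negative
    # value yields nan, so clamp to a non-negative floor before taking the sqrt.
    dist1 = torch.clamp(dist1, min=eps)
    dist2 = torch.clamp(dist2, min=eps)
    return (torch.sqrt(dist1).mean() + torch.sqrt(dist2).mean()) / 2
```

With eps=0 the clamp only removes the negative values; a small positive eps additionally keeps the sqrt gradient bounded when distances are exactly zero.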
Thank you so much for the response. I am also running the original code without changing the loss. Do you have any suggestions?
I have also observed this behavior during training, in particular when using newer A40 / A100 GPUs (CUDA compute capability 8.6 / 8.0). The training loss goes down for a while, but then may start to diverge, typically seen as a relatively big leap to significantly worse parameters, and may eventually become nan. I have not done extensive experiments, but training seems to be at least more stable on the older GPUs I have tried: no divergence after 150 epochs when training with batch size 48 on 4x T4 GPUs or 2x V100 GPUs. I will try to see if I can gain a better understanding of what is going on, but it is quite hard to know where to attack...

My setup:

One thing I noticed is that the PointNet++ library (pointnet2_ops) is no longer maintained by the author (Erik Wijmans). Since I seem to get issues with newer hardware, I decided to give Adam Fishman's fork a shot. In particular, there is (supposedly) a fix for a potential race condition at this commit (but I don't have any knowledge of its significance).
Answers to your questions first:

q1: Yes. We did use 2x V100 for the training. It was a mistake in the paper.

q2 & q3: There was an old nan problem (solved in the current version, at least in my environment). This is because of negative CD loss values, and I think it exists in other repositories using the same implementation. In some rare cases, the network output is fine, but the output of the loss function becomes nan. If you go deeper, you find the CD function gives a negative value, and since we take a sqrt, it gives you nan. That's the problem. Therefore, we added a clamp to avoid the bug in the CD calculation. I thought @ark1234 had changed this part, which led to the problem.

As you said, it can be quite unusual. If you train for fewer epochs (100 epochs), which is enough to yield a good model, it's very likely you will never hit this problem. I encountered it late in training, when the losses are very close to zero. So, if the clamp is still in its old place, I think you have a new nan problem different from my case. I don't have a clear clue right now, but a possible way to debug is to track the values and find where the nan starts (could be tough); see the sketch below. I still believe it should be somewhere close to the loss function. @lucasbrynte
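As one example of that kind of tracking, here is a rough sketch (not from this repository) that uses standard PyTorch forward hooks to flag the first module whose output contains nan/inf; `install_nan_hooks` is a made-up helper name:

```python
import torch

def install_nan_hooks(model):
    # Register a forward hook on every submodule and raise as soon as a
    # module produces a non-finite output, so the offending layer is named.
    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for t in outs:
                if torch.is_tensor(t) and t.is_floating_point() and not torch.isfinite(t).all():
                    raise RuntimeError(f"non-finite output first seen in module '{name}'")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Alternatively, PyTorch's anomaly detection reports the backward op that produced
# a nan gradient (it is slow, so only enable it while debugging):
# torch.autograd.set_detect_anomaly(True)
```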
Hi, thank you so much for your great work.
However, when I run your code on PCN and ShapeNet-55, the loss goes to nan after several epochs. This happens on A100 and RTX 6000 GPUs. Have you ever tried devices other than the V100? Could you please provide some advice? Thank you so much!