
About training problem #7

Open
ark1234 opened this issue Nov 24, 2022 · 4 comments

Comments

ark1234 commented Nov 24, 2022

Hi, thank you so much for your great work.

However, when I run your code on PCN and ShapeNet-55, the loss goes to nan after several epochs. This happens on both A100 and RTX 6000 GPUs. Have you tried devices other than the V100? Could you please provide some advice? Thank you so much!

hrzhou2 (Owner) commented Nov 27, 2022

I think it is possibly due to the calculation in the loss function. In some rare cases, the CD loss incorrectly gives a negative value, which then turns into nan because of the sqrt.

But the original code is fine. Did you change anything in the loss?

ark1234 (Author) commented Nov 28, 2022

Thank you so much for the response. I am also running the original code, without changing the loss. Do you have any suggestions?

@lucasbrynte

I have also observed this behavior during training, in particular when using newer A40 / A100 GPUs (CUDA compute capability 8.6 / 8.0). The training loss goes down for a while, but then may start to diverge, typically seen as a relatively large jump to significantly worse parameters, and eventually the loss may become nan. I have not done extensive experiments, but training seems to be more stable on the older GPUs I have tried: no divergence after 150 epochs when training with batch size 48 on 4x T4 GPUs or 2x V100 GPUs.

I will try to see if I can gain any better understanding of what is going on, but it is quite hard to know where to attack...

My setup:
I am running an Ubuntu 22.04 container with PyTorch 1.10 installed with conda (mamba), along with cudatoolkit=11.3. I also installed the apt package nvidia-cuda-toolkit in the container, which provides nvcc (CUDA 11.5). When using nvcc to build the CUDA extensions, I do get a mild warning about a minor CUDA version mismatch, since my nvcc installation uses CUDA 11.5 while PyTorch was compiled with CUDA 11.3, but it also states that this will most likely not be an issue.

One thing I noticed is that the PointNet++ library (pointnet2_ops) is no longer maintained by its author (Erik Wijmans). Since I seem to get issues with newer hardware, I decided to give Adam Fishman's fork a shot. In particular, there is (supposedly) a fix for a potential race condition in the following commit, though I have no knowledge of its significance:
fishbotics/pointnet2_ops@8a8858f
Sadly, in my experiments, training with this fork was less stable than with the version shipped with SeedFormer.

@hrzhou2

  • Did you indeed use V100 GPUs for the training, as @ark1234 suggests? The reason I'm asking is that the paper seems to report training on two Titan Xp GPUs, but as these have only 12 GB of memory each, from my own experiments I doubt that training would be possible with batch size 48... If the training was indeed carried out on 2x V100 GPUs (each in the 32 GB configuration), a batch of 48 samples fits just fine.
  • Negative Chamfer distances sound very weird. I am running an experiment now, training a model from scratch, with an assertion that the CD stays non-negative (see the sketch after this list for the kind of check I mean). So far, after 19 epochs, the Chamfer distances have always remained >= 0. Perhaps it is more likely to happen when the distances get closer to 0, e.g. for the pretrained model? Do you recall in what situation this could potentially happen?
  • Even so, I am not convinced that negative Chamfer distances would be the source of the training instability, since the distance is clamped at 1e-9 before the square-root. Btw, why not clamped at 0?
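
For reference, the check is essentially the following minimal sketch; `dist1`/`dist2` stand for the per-point squared distances returned by the CD kernel, and the exact names in this repository may differ:

```python
import torch

def check_cd_nonnegative(dist1, dist2):
    # dist1, dist2: per-point squared distances returned by the CD extension
    # (placeholder names; the repository's own variables may differ).
    min_val = min(dist1.min().item(), dist2.min().item())
    assert min_val >= 0, f"Negative Chamfer distance encountered: {min_val}"
```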

hrzhou2 (Owner) commented Dec 21, 2022

To answer your questions first:

q1: Yes. We did use 2x V100 for the training. It was a mistake in the paper.

q2 & q3: There was an old nan problem (solved in the current version, at least in my environment). It is caused by negative CD loss values, and I think it exists in other repositories using the same implementation. In some rare cases, the network output is fine, but the output of the loss function becomes nan. If you go deeper, you find that the CD function gives a negative value, and since we take the sqrt, it produces nan. That's the problem. Therefore, we added a clamp to avoid this bug in the CD calculation. I thought @ark1234 had changed this part, which would explain the problem.
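
Roughly, the fix amounts to something like the following sketch (placeholder names, not the exact function in this repository):

```python
import torch

def chamfer_sqrt(dist1, dist2, eps=1e-9):
    # dist1, dist2: per-point squared nearest-neighbour distances (placeholder names).
    # Clamping before the sqrt guards against tiny negative values from numerical
    # error in the CD kernel, which would otherwise turn the loss into nan.
    d1 = torch.sqrt(torch.clamp(dist1, min=eps)).mean()
    d2 = torch.sqrt(torch.clamp(dist2, min=eps)).mean()
    return (d1 + d2) / 2
```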

As you said, it can be quite unusual. If you train for fewer epochs (100 epochs), which is enough to yield a good model, it's very likely you will never see this problem. I encountered it late in training, when the losses are very close to zero.

So, if the clamp is still in its old place, I think you have a new nan problem different from my case. I don't have a clear clue right now, but a possible way to debug is to track the values and find where the nan starts (could be tough). I still believe it should be somewhere close to the loss function.
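
For example, one way to track this (a rough sketch, not code from this repository) is to register forward hooks that raise at the first module whose output contains nan or inf:

```python
import torch

def register_nan_hooks(model):
    # Attach forward hooks that report the first module producing non-finite output.
    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, (tuple, list)) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    raise RuntimeError(f"Non-finite values first appeared in: {name}")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Optionally, torch.autograd.set_detect_anomaly(True) flags the op that produced
# nan in the backward pass, at the cost of slower training.
```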

@lucasbrynte
However, for your own nan problem, I think that's another case. If the loss doesn't go down smoothly right from the start of training (it bursts and goes very high until nan), I think the network is simply not training well, perhaps due to too large a learning rate or batch size. You could just start again; the default training should converge smoothly even after 200 epochs.
