
The loss tends to be NaN #5

Open
txiaoyun opened this issue Jan 19, 2019 · 8 comments

@txiaoyun

When I run the code, the training loss is NaN. Could you give me some advice? Thank you.

@xiaomingdaren123

@txiaoyun @d-acharya
I applied the patch suggested in the tensorflow_patch.txt file, but the training loss is still NaN. Could you give me some advice? Thank you.

@txiaoyun
Author

@xiaomingdaren123
The loss tending to NaN is caused by the eigenvalue decomposition.

  1. You can reduce the learning rate;
  2. You can add a clip in covpoolnet.py, as follows:

def _cal_log_cov(features):
    # Eigendecompose the batch of SPD (covariance) matrices.
    [s_f, v_f] = tf.self_adjoint_eig(features)
    # Clip the eigenvalues away from zero so tf.log stays finite.
    s_f = tf.clip_by_value(s_f, 0.0001, 10000)
    s_f = tf.log(s_f)
    s_f = tf.matrix_diag(s_f)
    # Rebuild V * diag(log(s)) * V^T for each matrix in the batch.
    features_t = tf.matmul(tf.matmul(v_f, s_f), tf.transpose(v_f, [0, 2, 1]))
    return features_t

But the loss will repeatedly grow and then fall again, and I did not reproduce the author's results.
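
For what it's worth, here is a quick sanity check of the clipped _cal_log_cov above on a toy batch (this assumes TensorFlow 1.x; the matrices are made up just for illustration):

import numpy as np
import tensorflow as tf  # assumes TensorFlow 1.x

# Toy batch of 3x3 SPD matrices; the second one is rank-deficient, so its
# zero (or numerically slightly negative) eigenvalue would turn tf.log
# into -inf/NaN without the clip.
mats = np.stack([np.diag([1.0, 2.0, 3.0]),
                 np.diag([0.0, 1.0, 2.0])]).astype(np.float32)

features = tf.placeholder(tf.float32, [None, 3, 3])
log_cov = _cal_log_cov(features)  # the clipped version above

with tf.Session() as sess:
    out = sess.run(log_cov, feed_dict={features: mats})
    print(np.isnan(out).any())  # False: the clip keeps the log finite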

@d-acharya
Owner

d-acharya commented May 11, 2019

The gradient computation of TensorFlow's eigendecomposition is most likely what produces the NaNs. The technique proposed in tensorflow_patch.txt worked previously on a different system (with occasional failures). Recently I tried it on another system and it consistently produced NaNs as well (on TensorFlow 1.13 it produces NaN after a few epochs, whereas on TensorFlow 1.2 it produces NaNs after around 600 epochs). I will check whether changing the regularization and learning rate avoids this and update here. Clipping is an alternative solution and was actually used to train model4 and model2 mentioned in the paper. However, when training again, I myself am unable to get the exact same numbers.
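
To make the failure mode concrete, here is a minimal sketch (assuming TensorFlow 1.x) of why the eigendecomposition gradient blows up: the backward pass divides by pairwise eigenvalue differences, so an input with (near-)repeated eigenvalues yields Inf/NaN gradients.

import numpy as np
import tensorflow as tf  # assumes TensorFlow 1.x

x = tf.placeholder(tf.float32, [3, 3])
s, v = tf.self_adjoint_eig(x)
loss = tf.reduce_sum(s) + tf.reduce_sum(v)  # any loss touching the eigenvectors
grad = tf.gradients(loss, x)[0]

with tf.Session() as sess:
    # The identity matrix has a triply repeated eigenvalue, so the
    # 1/(s_i - s_j) terms in the eigenvector gradient are undefined.
    g = sess.run(grad, feed_dict={x: np.eye(3, dtype=np.float32)})
    print(g)  # typically contains Inf/NaN

Note that clipping the eigenvalues in the forward pass does not remove this 1/(s_i - s_j) term, which may be why the loss can still oscillate even with the clip.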

d-acharya reopened this May 11, 2019
@d-acharya
Owner

d-acharya commented May 11, 2019

However, if you cannot get the numbers in the paper using the pretrained models, I would try the following data: https://drive.google.com/open?id=1eh93I0ndg6X-liUJDYpWveIShLd0ao_x
and make sure the following versions are used:
scikit-learn==0.18.1
tensorflow==1.2.0
numpy==1.14.4
Pillow==4.3.0
python 2.7

A different version of pickle or of the classifier was found to affect the reported numbers.
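
A quick, generic way to double-check the installed versions against the list above (just a sanity check, not code from this repo):

from __future__ import print_function
import sys
import numpy, PIL, sklearn, tensorflow

# Compare against the versions listed above.
print("python       %s" % sys.version.split()[0])   # expect 2.7.x
print("tensorflow   %s" % tensorflow.__version__)   # expect 1.2.0
print("scikit-learn %s" % sklearn.__version__)      # expect 0.18.1
print("numpy        %s" % numpy.__version__)        # expect 1.14.4
print("Pillow       %s" % PIL.PILLOW_VERSION)       # expect 4.3.0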

@YileYile

YileYile commented Sep 9, 2019

@d-acharya @txiaoyun @xiaomingdaren123

I didn't apply the patch suggested in tensorflow_patch.txt, and I am using Python 3.5.

  1. The “Loss” value has been floating in a small range, neither increasing nor decreasing.
  2. The “RegLoss” value remained constant at 0.

Could you give me some advice? Thank you.
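
For the RegLoss staying at 0, one thing worth checking is whether any regularization terms were registered at all. A minimal check, assuming RegLoss is built from TF's standard regularization collection (I have not verified that against this repo):

import tensorflow as tf  # assumes TensorFlow 1.x

# Run this after the network graph has been constructed.
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
print("%d regularization terms registered" % len(reg_losses))
# If this prints 0, no weight-decay terms made it into the collection,
# which would explain RegLoss == 0 for the whole run.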

@fredlll

fredlll commented Sep 20, 2019

@txiaoyun @xiaomingdaren123 @d-acharya @YileYile
I am facing the same problem. The loss becomes NaN after 10 epochs. Did anyone find a solution?
Thanks

@PR1706

PR1706 commented Sep 29, 2019

@txiaoyun @xiaomingdaren123 @YileYile @fredlll
You can switch to the versions of TensorFlow and the related libraries that the author listed; then you will not get NaN.

@dyt0414

dyt0414 commented Nov 23, 2020

Hi, how do I solve the problem with 'without dlpcnn'?
