
The loss tends to be NaN #5

Open
txiaoyun opened this issue Jan 19, 2019 · 8 comments

@txiaoyun

When I run the code, the training loss is NaN. Could you give me some advice? Thank you.

@xiaomingdaren123

@txiaoyun @d-acharya
I applied the patch suggested in the tensorflow_patch.txt file, but the training loss is still NaN. Could you give me some advice? Thank you.

@txiaoyun
Author

@xiaomingdaren123
The loss tending to NaN is caused by the eigenvalue decomposition.

  1. You can reduce the learning rate;
  2. You can add a clip in covpoolnet.py, as follows:

def _cal_log_cov(features):
    # Eigendecompose the batch of SPD (covariance) matrices.
    [s_f, v_f] = tf.self_adjoint_eig(features)
    # Clip the eigenvalues away from zero so tf.log stays finite.
    s_f = tf.clip_by_value(s_f, 0.0001, 10000)
    s_f = tf.log(s_f)
    s_f = tf.matrix_diag(s_f)
    # Rebuild V * diag(log(s)) * V^T for each matrix in the batch.
    features_t = tf.matmul(tf.matmul(v_f, s_f), tf.transpose(v_f, [0, 2, 1]))
    return features_t

But the loss will repeatedly grow and then fall again, and I did not reproduce the author's results.
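
For what it's worth, here is a quick sanity check of the clipped _cal_log_cov above on a toy batch (this assumes TensorFlow 1.x; the matrices are made up just for illustration):

import numpy as np
import tensorflow as tf  # assumes TensorFlow 1.x

# Toy batch of 3x3 SPD matrices; the second one is rank-deficient, so its
# zero (or numerically slightly negative) eigenvalue would turn tf.log
# into -inf/NaN without the clip.
mats = np.stack([np.diag([1.0, 2.0, 3.0]),
                 np.diag([0.0, 1.0, 2.0])]).astype(np.float32)

features = tf.placeholder(tf.float32, [None, 3, 3])
log_cov = _cal_log_cov(features)  # the clipped version above

with tf.Session() as sess:
    out = sess.run(log_cov, feed_dict={features: mats})
    print(np.isnan(out).any())  # False: the clip keeps the log finite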

@d-acharya
Owner

d-acharya commented May 11, 2019

The gradient computation of TensorFlow's eigendecomposition is most likely what produces the NaNs. The technique proposed in tensorflow_patch.txt worked previously on a different system (with occasional failures). Recently I tried it on another system and it consistently produced NaNs as well (on TensorFlow 1.13 it produces NaN after a few epochs, whereas on TensorFlow 1.2 it produces NaNs after around 600 epochs). I will check whether changing the regularization and learning rate avoids this and update here. Clipping is an alternative solution and was actually used to train model4 and model2 mentioned in the paper. However, when training again, I myself am unable to get the exact same numbers.
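
To make the failure mode concrete, here is a minimal sketch (assuming TensorFlow 1.x) of why the eigendecomposition gradient blows up: the backward pass divides by pairwise eigenvalue differences, so an input with (near-)repeated eigenvalues yields Inf/NaN gradients.

import numpy as np
import tensorflow as tf  # assumes TensorFlow 1.x

x = tf.placeholder(tf.float32, [3, 3])
s, v = tf.self_adjoint_eig(x)
loss = tf.reduce_sum(s) + tf.reduce_sum(v)  # any loss touching the eigenvectors
grad = tf.gradients(loss, x)[0]

with tf.Session() as sess:
    # The identity matrix has a triply repeated eigenvalue, so the
    # 1/(s_i - s_j) terms in the eigenvector gradient are undefined.
    g = sess.run(grad, feed_dict={x: np.eye(3, dtype=np.float32)})
    print(g)  # typically contains Inf/NaN

Note that clipping the eigenvalues in the forward pass does not remove this 1/(s_i - s_j) term, which may be why the loss can still oscillate even with the clip.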

d-acharya reopened this May 11, 2019
@d-acharya
Owner

d-acharya commented May 11, 2019

However, if you cannot get the numbers in the paper using the pretrained models, I would try the following data: https://drive.google.com/open?id=1eh93I0ndg6X-liUJDYpWveIShLd0ao_x
and make sure the following versions are used:
scikit-learn==0.18.1
tensorflow==1.2.0
numpy==1.14.4
Pillow==4.3.0
python 2.7

A different version of pickle or of the classifier was found to affect the reported numbers.
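
A quick, generic way to double-check the installed versions against the list above (just a sanity check, not code from this repo):

from __future__ import print_function
import sys
import numpy, PIL, sklearn, tensorflow

# Compare against the versions listed above.
print("python       %s" % sys.version.split()[0])   # expect 2.7.x
print("tensorflow   %s" % tensorflow.__version__)   # expect 1.2.0
print("scikit-learn %s" % sklearn.__version__)      # expect 0.18.1
print("numpy        %s" % numpy.__version__)        # expect 1.14.4
print("Pillow       %s" % PIL.PILLOW_VERSION)       # expect 4.3.0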

@YileYile

YileYile commented Sep 9, 2019

@d-acharya @txiaoyun @xiaomingdaren123

I didn't apply the patch suggested in tensorflow_patch.txt, and I am using Python 3.5.

  1. The “Loss” value has been floating in a small range, neither increasing nor decreasing.
  2. The “RegLoss” value remained constant at 0.

Could you give me some advice? Thank you.
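
For the RegLoss staying at 0, one thing worth checking is whether any regularization terms were registered at all. A minimal check, assuming RegLoss is built from TF's standard regularization collection (I have not verified that against this repo):

import tensorflow as tf  # assumes TensorFlow 1.x

# Run this after the network graph has been constructed.
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
print("%d regularization terms registered" % len(reg_losses))
# If this prints 0, no weight-decay terms made it into the collection,
# which would explain RegLoss == 0 for the whole run.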

@fredlll

fredlll commented Sep 20, 2019

@txiaoyun @xiaomingdaren123 @d-acharya @YileYile
I am facing the same problem. The loss becomes NaN after 10 epochs. Did anyone find a solution?
Thanks

@PR1706

PR1706 commented Sep 29, 2019

@txiaoyun @xiaomingdaren123 @YileYile @fredlll
You can switch to the versions of TensorFlow and the related libraries that the author listed; then you will not get NaN.

@dyt0414

dyt0414 commented Nov 23, 2020

Hi, how do I solve the problem with 'without dlpcnn'?
