
SparseCategoricalCrossentropy and Mixed Precision Training #15012

Closed
zuyezheng opened this issue Jul 28, 2021 · 2 comments
Comments


zuyezheng commented Jul 28, 2021

System information.

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version (use command below): v2.5.0-0
  • Python version: 3.8
  • Bazel version (if compiling from source):
  • GPU model and memory: A6000
  • Exact command to reproduce:

tf.keras.losses.SparseCategoricalCrossentropy()

Describe the problem.

sparse_categorical_crossentropy in losses.py performs an unnecessary cast of y_true to y_pred.dtype, since y_true is then cast to int64 inside sparse_categorical_crossentropy in keras.backend.py. The eventual call to sparse_softmax_cross_entropy_with_logits in nn_ops.py is documented to expect int64 labels as well.

This appears to be the same cast as in categorical_crossentropy, but it causes issues in the sparse case, especially with mixed precision training: with float16, the loss of precision produces incorrect label encodings, or labels outside the valid range, resulting in incorrect or NaN loss. With float16 the issues start at a couple thousand labels, and with bfloat16 at a couple hundred (see the sketch below).
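A minimal illustration of the precision loss (my own sketch, not the repro in the linked Colab): float16 represents integers exactly only up to 2048, and bfloat16 only up to 256, so round-tripping larger label indices through the cast changes their values.

```python
import tensorflow as tf

# Integer class labels are only exact up to 2048 in float16 (and 256 in bfloat16),
# so casting sparse labels to the prediction dtype corrupts larger indices.
labels = tf.constant([255, 2047, 2049, 4097], dtype=tf.int64)

fp16_roundtrip = tf.cast(tf.cast(labels, tf.float16), tf.int64)
bf16_roundtrip = tf.cast(tf.cast(labels, tf.bfloat16), tf.int64)

print(labels.numpy())          # [ 255 2047 2049 4097]
print(fp16_roundtrip.numpy())  # 2049 and 4097 round to 2048 and 4096
print(bf16_roundtrip.numpy())  # even 2047 rounds to 2048 at this range
```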

Describe the current behavior.

Loss of precision for labels.

Describe the expected behavior.

The cast of y_true to y_pred.dtype should be skipped for the sparse loss, leaving the labels as integer class indices (a possible user-side workaround is sketched below).
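One possible user-side workaround (my own sketch, assuming the problematic cast lives in the losses.py convenience function as described above; this is illustrative, not the patch that eventually landed): pass a plain loss function that forwards integer labels straight to the backend op, so the cast in losses.py is bypassed entirely.

```python
import tensorflow as tf

def sparse_ce_keep_int_labels(y_true, y_pred):
    # Keep y_true as integer class indices instead of letting
    # tf.keras.losses.sparse_categorical_crossentropy cast them to
    # y_pred.dtype (float16 under mixed precision).
    y_true = tf.cast(y_true, tf.int64)
    return tf.keras.backend.sparse_categorical_crossentropy(
        y_true, y_pred, from_logits=False)

# model.compile(optimizer="adam", loss=sparse_ce_keep_int_labels)
```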

Contributing.

  • Do you want to contribute a PR? (yes/no):
  • If yes, please read this page for instructions
  • Briefly describe your candidate solution(if contributing):

Standalone code to reproduce the issue.

https://colab.research.google.com/drive/1oRbNOnCo1i2HcXD2V4_-D1Bz2EVxiT65

@rmothukuru

@zuyezheng,
A similar issue covering other losses as well has been raised in #15014, and a PR has also been raised there. Can we close this issue so that we can track it in #15014? Thanks!


zuyezheng commented Jul 28, 2021

@rmothukuru ah thanks, looks like that one extends the findings from my original bug in the TF repo.

copybara-service bot pushed a commit that referenced this issue Jul 30, 2021
PiperOrigin-RevId: 387838298
copybara-service bot pushed a commit that referenced this issue Jul 30, 2021
PiperOrigin-RevId: 387844394