Implement input pipeline with tf.data API #1919
Conversation
Some inline comments to ease reviewing.
This code should not change much between now and landing, so reviews can start already. This PR needs the r1.13 update to be merged, and since that's going slowly I'm asking for reviews already to speed things up. I'm not super happy with the way …
(Requesting review now but I don't expect you to do it during the weekend! :P)
Seems like training run 4290, which is testing this, is having some problems. It looks like all training epochs have inf loss, though the validation losses look reasonable.
I'm going to check if that problem is due to the data, this tf.data code, or cudnn_rnn. I suspect it's unrelated to this PR.
Job 4298 is running into the same problem, without the cudnn_rnn changes. So it's either this code or the data...
In evaluate.py there are some unused imports that should be removed. There are also a few unused imports in other files.
Other than that there are a few nits here and there, but nothing too important.
Force-pushed from 8af6d36 to 767e0b1.
Rebased on top of the r1.13 changes in master and addressed review comments in 767e0b1.
Two samples are causing the problem.
Before this PR, the corresponding audio files would create 36 and 48 time steps, respectively. Because of different rounding, the new feature computation generates 35 and 47 time steps instead, which is also exactly the length of their transcripts. The TF AudioSpectrogram op only creates full windows, dropping samples if needed, whereas … You can reproduce the same problem with the code on master by downloading the audio files and training with the following train CSV and …
Note that the last letter in the transcript is repeated to bring it up to the same length as the audio features. Something about these files plus a transcript that is long enough makes the model choke on it. Some possible fixes for this:
I'm tempted towards 3 because it's the easiest without collateral damage. @kdavis-mozilla thoughts?
@reuben I'd say 3 too.
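The interaction described above (full-window-only frame counts versus CTC's minimum-length requirement) can be sketched in a few lines. The window/stride sizes and the transcript below are hypothetical stand-ins for illustration, not the actual files from the report:

```python
def full_window_frames(n_samples, window=512, stride=320):
    # TF's AudioSpectrogram op only emits complete windows, dropping
    # any trailing samples that don't fill one more window.
    if n_samples < window:
        return 0
    return 1 + (n_samples - window) // stride

def min_ctc_frames(transcript):
    # CTC needs at least one time step per label symbol, plus one
    # extra step (a mandatory blank) between each pair of adjacent
    # repeated symbols. Fewer frames than this gives infinite loss.
    repeats = sum(1 for a, b in zip(transcript, transcript[1:]) if a == b)
    return len(transcript) + repeats

# Hypothetical sample count near a window boundary: 300 leftover
# samples don't fill a full window, so they are dropped.
n = 512 + 34 * 320 + 300
frames = full_window_frames(n)

# A 35-character transcript whose last letter is doubled needs
# 36 frames (35 symbols + 1 blank for the repeat), so a 35-frame
# input makes the CTC loss infinite.
transcript = "ab" * 17          # 34 alternating chars, no repeats
transcript += transcript[-1]    # double the last letter -> 35 chars

print(frames, min_ctc_frames(transcript))  # -> 35 36
```

With the old rounding this clip produced 36 frames and the loss was finite; the new computation yields 35 frames, one short of what the padded transcript requires.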
Force-pushed from e1d5df6 to 28e8274.
Force-pushed from dc0a3ec to a0dc0fd.
Force-pushed from 5ee4daf to 2917227.
@lissyx could you take a look at the test changes I had to make to handle the versioning code? @kdavis-mozilla things are working and tests are green, so this PR should not change going forward, and I will not force push anymore, so feel free to review.
LGTM
save_path = best_dev_saver.save(session, best_dev_path, latest_filename=best_dev_filename)
log_info("Saved new best validating model with loss %f to: %s" % (best_dev_loss, save_path))

# Early stopping
Adding it here, as there is no better place: you should add some wording to the early stopping flags noting that they require validation to be enabled (since you made it optional).
The documentation for the early_stop flag already mentions it's tied to validation, but I'll try to clarify it.
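As a sketch of why the two flags are coupled: early stopping decides based on recent dev-set losses, so there is nothing for it to act on when validation is disabled. The function below is an illustrative patience-based check, not DeepSpeech's actual implementation:

```python
def should_early_stop(dev_losses, patience=4):
    # With no (or too few) validation losses there is nothing to
    # compare against, so early stopping cannot trigger.
    if len(dev_losses) <= patience:
        return False
    best = min(dev_losses)
    # Stop when the best dev loss is older than `patience` epochs,
    # i.e. none of the recent epochs improved on it.
    return min(dev_losses[-patience:]) > best

print(should_early_stop([5.0, 4.0, 3.0, 3.1, 3.2, 3.3, 3.4]))  # -> True
print(should_early_stop([5.0, 4.0, 3.0, 2.0, 1.5]))            # -> False
```

If validation never runs, `dev_losses` stays empty and the check is permanently `False`, which is exactly the dependency the flag documentation should spell out.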
Creating a PR to run tests and see how this does in its current state.