Implement input pipeline with tf.data API #1919

Merged
merged 13 commits into from
Apr 5, 2019

Conversation

reuben
Contributor

@reuben reuben commented Feb 28, 2019

Creating a PR to run tests and see how this does in its current state.

Contributor Author

@reuben reuben left a comment

Some inline comments to ease reviewing.

util/config.py (outdated)
native_client/BUILD (outdated)
native_client/BUILD (outdated)
native_client/BUILD (outdated)
native_client/deepspeech.cc
native_client/deepspeech.cc
util/feeding.py
util/feeding.py
@reuben
Contributor Author

reuben commented Mar 15, 2019

This code should not change much between now and landing, so reviews can start already. This PR needs the r1.13 update to be merged, and since that's going slowly I'm asking for reviews already to speed things up. I'm not super happy with the way evaluate.py turned out, so that file may change a bit, but I won't force push until reviews are done.

@reuben
Contributor Author

reuben commented Mar 15, 2019

(Requesting review now but I don't expect you to do it during the weekend! :P)

@kdavis-mozilla
Contributor

Seems like training run 4290, which is testing this, is having some problems. It looks like all training epochs have inf loss, though the validation losses look reasonable.

@reuben
Contributor Author

reuben commented Mar 18, 2019

I'm gonna check if that problem is due to the data, this tf.data code, or cudnn_rnn. I suspect it's unrelated to this PR.

@reuben
Contributor Author

reuben commented Mar 18, 2019

> I'm gonna check if that problem is due to the data, this tf.data code, or cudnn_rnn. I suspect it's unrelated to this PR.

Job 4298 running into the same problem, without the cudnn_rnn changes. So it's either this code or the data...

Contributor

@kdavis-mozilla kdavis-mozilla left a comment

In evaluate.py there are some unused imports that should be removed. There are also a few unused imports in other files.

Other than that there are a few nits here and there, but nothing too important.

native_client/BUILD (outdated)
native_client/BUILD (outdated)
native_client/BUILD (outdated)
native_client/deepspeech.cc
requirements.txt
evaluate.py
evaluate.py
DeepSpeech.py (outdated)
DeepSpeech.py
DeepSpeech.py
@mozilla mozilla deleted a comment from kdavis-mozilla Mar 19, 2019
DeepSpeech.py
@reuben reuben changed the title WIP implement input pipeline with tf.data API Implement input pipeline with tf.data API Mar 20, 2019
@reuben reuben force-pushed the tfdatatest branch 2 times, most recently from 8af6d36 to 767e0b1 Compare March 20, 2019 15:30
@reuben
Contributor Author

reuben commented Mar 20, 2019

Rebased on top of r1.13 changes in master and addressed review comments in 767e0b1.

@reuben
Contributor Author

reuben commented Mar 21, 2019

Two samples are causing the inf training loss problem:

wav_filename                                         transcript
fisher-2004-split-wav/fe_03_00250-585.19-585.91.wav  yeah i didn't like it much actually
fisher-2005-split-wav/fe_03_05882-158.6-159.56.wav   you know what would you do if you had a million

Before this PR, the corresponding audio files would create 36 and 48 time steps, respectively. Because of different rounding, the new feature computation generates 35 and 47 time steps instead, which is also exactly the length of their transcripts. The TF AudioSpectrogram op only creates full windows, dropping samples if needed, whereas python_speech_features uses the entire audio by padding the last window if needed.
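
The rounding difference can be checked with a few lines of arithmetic. This is only a sketch under assumptions not stated in this thread: 16 kHz audio, a 32 ms window, a 20 ms step, and clip durations of 0.72 s and 0.96 s read off the start and end times in the file names. The function names are hypothetical.

```python
def frames_full_windows_only(num_samples, window, step):
    """Count frames when only complete windows are kept (trailing samples dropped)."""
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // step


def frames_with_padded_tail(num_samples, window, step):
    """Count frames when the last partial window is padded and kept."""
    if num_samples <= window:
        return 1
    # Ceiling division on integers avoids floating-point rounding.
    return 1 + -(-(num_samples - window) // step)


# Assumed values, not taken from the thread: 16 kHz, 32 ms window, 20 ms step.
SAMPLE_RATE = 16000
WINDOW = 512  # 32 ms at 16 kHz
STEP = 320    # 20 ms at 16 kHz

for duration_ms in (720, 960):
    n = SAMPLE_RATE * duration_ms // 1000
    print(duration_ms,
          frames_with_padded_tail(n, WINDOW, STEP),
          frames_full_windows_only(n, WINDOW, STEP))
# prints:
# 720 36 35
# 960 48 47
```

With these assumed parameters the padded count gives 36 and 48 and the full-windows-only count gives 35 and 47, matching the numbers above.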

You can reproduce the same problem but with the code on master by downloading the audio files and training with the following train CSV and --n_hidden 2048:

wav_filename,wav_filesize,transcript
fe_03_00250-585.19-585.91.wav,0,yeah i didn't like it much actuallyy
fe_03_05882-158.6-159.56.wav,1,you know what would you do if you had a millionn

Note that the last letter in the transcript is repeated to bring it up to the same length as the audio features. Something about these files plus a transcript that is big enough makes the model choke on it.
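
One possible explanation, which is not raised in this thread and should be read as an assumption: CTC has to emit a blank between two identical adjacent labels, so a transcript needs at least its own length plus the number of adjacent repeated characters in time steps. With fewer time steps no valid alignment exists and the loss is infinite. A minimal sketch (hypothetical helper name) that counts this for the transcripts above:

```python
def min_ctc_time_steps(transcript):
    """Smallest number of time steps for which a CTC alignment exists.

    Each label needs one step, and each pair of identical adjacent labels
    needs one extra step for the blank that separates them.
    """
    repeats = sum(1 for a, b in zip(transcript, transcript[1:]) if a == b)
    return len(transcript) + repeats


# (transcript, number of feature time steps reported in the thread)
samples = [
    ("yeah i didn't like it much actually", 35),               # this PR
    ("you know what would you do if you had a million", 47),   # this PR
    ("yeah i didn't like it much actuallyy", 36),              # repro on master
    ("you know what would you do if you had a millionn", 48),  # repro on master
]
for transcript, time_steps in samples:
    needed = min_ctc_time_steps(transcript)
    status = "feasible" if time_steps >= needed else "no valid alignment"
    print(needed, time_steps, status)
```

Under this reading, the doubled "ll" in "actually" and "million" pushes the minimum to 36 and 48 time steps, which is exactly what the old feature computation produced and one more than the new one produces.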

Some possible fixes for this:

  1. Filter the data on len(features) > len(transcript). (Note >, not >=). This is unfortunate because it's not a real requirement, it's just these particular samples that cause trouble, and it could remove good data.
  2. Ignore inf loss when reducing the mean loss over samples and just pretend the sample doesn't exist. This will fix this problem now but can hide bigger issues in the future.
  3. Remove those two files from the dataset.

I'm tempted towards 3 because it's the easiest without collateral damage. @kdavis-mozilla thoughts?
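
For reference, fixes 1 and 2 could be sketched roughly as below. This is framework-free Python with hypothetical function names, not the code in this PR; in the training graph the same logic would have to operate on tensors. Skipping NaN as well as inf in the second function is an extra assumption.

```python
import math


def keep_sample(num_feature_frames, transcript):
    """Fix 1: keep only samples with strictly more frames than transcript characters."""
    return num_feature_frames > len(transcript)


def mean_finite_loss(per_sample_losses):
    """Fix 2: average the per-sample losses while skipping inf (and NaN) entries."""
    finite = [loss for loss in per_sample_losses if math.isfinite(loss)]
    if not finite:
        return float("nan")  # nothing usable in this batch
    return sum(finite) / len(finite)


# 35 frames against a 35-character transcript is filtered out by fix 1.
print(keep_sample(35, "yeah i didn't like it much actually"))  # False
# The inf entry is ignored by fix 2, leaving the mean of 2.0 and 4.0.
print(mean_finite_loss([2.0, float("inf"), 4.0]))  # 3.0
```

The second function also shows the downside mentioned in fix 2: a sample that produces inf for some other reason would be dropped just as silently.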

@kdavis-mozilla
Contributor

kdavis-mozilla commented Mar 22, 2019

@reuben I'd say 3 too.

@reuben reuben force-pushed the tfdatatest branch 3 times, most recently from e1d5df6 to 28e8274 Compare April 2, 2019 13:46
@reuben reuben requested a review from kdavis-mozilla April 2, 2019 14:42
@reuben reuben dismissed kdavis-mozilla’s stale review April 2, 2019 14:43

All comments addressed.

@reuben reuben force-pushed the tfdatatest branch 3 times, most recently from dc0a3ec to a0dc0fd Compare April 3, 2019 00:04
@reuben reuben force-pushed the tfdatatest branch 2 times, most recently from 5ee4daf to 2917227 Compare April 3, 2019 12:47
@reuben reuben requested a review from lissyx April 3, 2019 13:56
@reuben
Contributor Author

reuben commented Apr 3, 2019

@lissyx could you take a look at the test changes I had to make to handle the versioning code?

@kdavis-mozilla things are working and tests are green, so this PR should not change going forward, and I will not force push anymore, so feel free to review.

Contributor

@kdavis-mozilla kdavis-mozilla left a comment

LGTM

util/flags.py (outdated)
save_path = best_dev_saver.save(session, best_dev_path, latest_filename=best_dev_filename)
log_info("Saved new best validating model with loss %f to: %s" % (best_dev_loss, save_path))

# Early stopping
Contributor

Adding it here, as there is no better place: you should add some wording to the early stopping flags about the requirement of having validation enabled (as you made it optional).

Contributor Author

The documentation for the early_stop flag already mentions it's tied to validation, but I'll try to clarify it.

@reuben reuben removed the request for review from lissyx April 4, 2019 21:36
@reuben reuben merged commit 5745089 into master Apr 5, 2019
@reuben reuben deleted the tfdatatest branch April 5, 2019 03:13
@lock

lock bot commented May 5, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators May 5, 2019