Implement input pipeline with tf.data API #1919
Conversation
Some inline comments to ease reviewing.
This code should not change much between now and landing, so reviews can start already. This PR needs the r1.13 update to be merged, and since that's going slowly I'm asking for reviews already to speed things up. I'm not super happy with the way …
(Requesting review now but I don't expect you to do it during the weekend! :P)
Seems like training run 4290, which is testing this, is having some problems. It looks like all training epochs have inf loss, though the validation losses look reasonable.
I'm going to check if that problem is due to the data, this tf.data code, or cudnn_rnn. I suspect it's unrelated to this PR.
Job 4298 is running into the same problem, without the cudnn_rnn changes. So it's either this code or the data...
In evaluate.py there are some unused imports that should be removed. There are also a few unused imports in other files.
Other than that there are a few nits here and there, but nothing too important.
Force-pushed from 8af6d36 to 767e0b1.
Rebased on top of the r1.13 changes in master and addressed review comments in 767e0b1.
Two samples are causing the problem.
Before this PR, the corresponding audio files would create 36 and 48 time steps, respectively. Because of different rounding, the new feature computation generates 35 and 47 time steps instead, which is also exactly the length of their transcripts. The TF AudioSpectrogram op only creates full windows, dropping samples if needed, whereas … You can reproduce the same problem with the code on master by downloading the audio files and training with the following train CSV and …
Note that the last letter in the transcript is repeated to bring it up to the same length as the audio features. Something about these files plus a transcript that is long enough makes the model choke on it. Some possible fixes for this:
I'm tempted towards 3 because it's the easiest without collateral damage. @kdavis-mozilla thoughts?
@reuben I'd say 3 too.
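The interaction described above (full-window-only frame counts versus CTC's minimum-length requirement) can be sketched in a few lines. The window/stride sizes and the transcript below are hypothetical stand-ins for illustration, not the actual files from the report:

```python
def full_window_frames(n_samples, window=512, stride=320):
    # TF's AudioSpectrogram op only emits complete windows, dropping
    # any trailing samples that don't fill one more window.
    if n_samples < window:
        return 0
    return 1 + (n_samples - window) // stride

def min_ctc_frames(transcript):
    # CTC needs at least one time step per label symbol, plus one
    # extra step (a mandatory blank) between each pair of adjacent
    # repeated symbols. Fewer frames than this gives infinite loss.
    repeats = sum(1 for a, b in zip(transcript, transcript[1:]) if a == b)
    return len(transcript) + repeats

# Hypothetical sample count near a window boundary: 300 leftover
# samples don't fill a full window, so they are dropped.
n = 512 + 34 * 320 + 300
frames = full_window_frames(n)

# A 35-character transcript whose last letter is doubled needs
# 36 frames (35 symbols + 1 blank for the repeat), so a 35-frame
# input makes the CTC loss infinite.
transcript = "ab" * 17          # 34 alternating chars, no repeats
transcript += transcript[-1]    # double the last letter -> 35 chars

print(frames, min_ctc_frames(transcript))  # -> 35 36
```

With the old rounding this clip produced 36 frames and the loss was finite; the new computation yields 35 frames, one short of what the padded transcript requires.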
Force-pushed from e1d5df6 to 28e8274.
Force-pushed from dc0a3ec to a0dc0fd.
Force-pushed from 5ee4daf to 2917227.
@lissyx could you take a look at the test changes I had to make to handle the versioning code? @kdavis-mozilla things are working and tests are green, so this PR should not change going forward, and I will not force push anymore, so feel free to review.
LGTM
save_path = best_dev_saver.save(session, best_dev_path, latest_filename=best_dev_filename)
log_info("Saved new best validating model with loss %f to: %s" % (best_dev_loss, save_path))

# Early stopping
Adding it here, as there is no better place: you should add some wording to the early stopping flags noting that they require validation to be enabled (since you made it optional).
The documentation for the early_stop flag already mentions it's tied to validation, but I'll try to clarify it.
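As a sketch of why the two flags are coupled: early stopping decides based on recent dev-set losses, so there is nothing for it to act on when validation is disabled. The function below is an illustrative patience-based check, not DeepSpeech's actual implementation:

```python
def should_early_stop(dev_losses, patience=4):
    # With no (or too few) validation losses there is nothing to
    # compare against, so early stopping cannot trigger.
    if len(dev_losses) <= patience:
        return False
    best = min(dev_losses)
    # Stop when the best dev loss is older than `patience` epochs,
    # i.e. none of the recent epochs improved on it.
    return min(dev_losses[-patience:]) > best

print(should_early_stop([5.0, 4.0, 3.0, 3.1, 3.2, 3.3, 3.4]))  # -> True
print(should_early_stop([5.0, 4.0, 3.0, 2.0, 1.5]))            # -> False
```

If validation never runs, `dev_losses` stays empty and the check is permanently `False`, which is exactly the dependency the flag documentation should spell out.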
Creating a PR to run tests and see how this does in its current state.