
Feature request: streaming decoder (fast DS_IntermediateDecode calls) #1837

Closed
TheSeriousProgrammer opened this issue Jan 19, 2019 · 10 comments


@TheSeriousProgrammer

Can a streaming recognition service be added to the DeepSpeech client? Currently an audio file is recorded first and only transcribed by the engine afterwards. Most of the big STT services, however, offer streaming of realtime audio from the microphone with results coming back simultaneously. That feature would give a real boost to this project's applications in realtime recognition.

@kdavis-mozilla
Contributor

Is the vad_transcriber insufficient? If so, why?

@carlfm01
Collaborator

@kdavis-mozilla I think @Chidhambararajan is referring to the ability to get the transcription as soon as we have accumulated enough logits to do so.
For example: I've tried to use the .NET client to stream the Windows audio output and show the transcriptions on the screen, the way most subtitle systems work. What is the problem in my case? We can stream the audio, but we can't get the transcriptions from the stream without stopping it.
I tried to do something like "intermediateDecodeAndRelease" that would execute the decoding and throw away the old logits, but due to my limited knowledge of C++ it did not work :(

Related to #1757

@lissyx
Collaborator

lissyx commented Jan 23, 2019

I think it all boils down to the fact that the decoding step is not yet streamable.

@lissyx
Collaborator

lissyx commented Feb 20, 2019

@Chidhambararajan That being said, we already have streaming for the audio feeding, and it should run faster than realtime on a desktop with a decent CPU or a GPU, as well as on a mid-range Android smartphone with the TFLite quantized model.

So you can already build realtime transcription, just not perfectly yet; it should get much closer once we have a streaming decoder (soon).

@reuben changed the title from "Requesting feature : Realtime transcription" to "Feature request: streaming decoder (fast DS_IntermediateDecode calls)" May 10, 2019
@reuben
Contributor

reuben commented May 10, 2019

Currently, the decoder we use (native_client/ctcdecode) exposes a batch API that takes a probabilities matrix and returns a list of decoded strings. The implementation is a beam search loop over all the time steps in the input probabilities. To implement a streaming decoder, one would have to refactor the decoder API from a single ctc_beam_search_decoder() call into a state-struct style API split into three stages: decoder_init, decoder_next, and decoder_finish (or decoder_decode). At the start, you set up the decoder state with the decoder_init() call:

// dimension check
VALID_CHECK_EQ(class_dim, alphabet.GetSize() + 1,
               "The shape of probs does not match with "
               "the shape of the vocabulary");

// assign special ids
int space_id = alphabet.GetSpaceLabel();
int blank_id = alphabet.GetSize();

// init prefixes' root
PathTrie root;
root.score = root.log_prob_b_prev = 0.0;
std::vector<PathTrie *> prefixes;
prefixes.push_back(&root);

if (ext_scorer != nullptr && !ext_scorer->is_character_based()) {
  auto dict_ptr = ext_scorer->dictionary->Copy(true);
  root.set_dictionary(dict_ptr);
  auto matcher = std::make_shared<FSTMATCH>(*dict_ptr, fst::MATCH_INPUT);
  root.set_matcher(matcher);
}

This returns a decoder state struct containing all of the variables needed for the main loop. Then, as probabilities become available, you feed a batch of them into the decoder with a decoder_next() step, which performs N iterations of the main loop over time:

// prefix search over time
for (size_t time_step = 0; time_step < time_dim; ++time_step) {
  auto *prob = &probs[time_step * class_dim];

  float min_cutoff = -NUM_FLT_INF;
  bool full_beam = false;
  if (ext_scorer != nullptr) {
    size_t num_prefixes = std::min(prefixes.size(), beam_size);
    std::sort(
        prefixes.begin(), prefixes.begin() + num_prefixes, prefix_compare);
    min_cutoff = prefixes[num_prefixes - 1]->score +
                 std::log(prob[blank_id]) - std::max(0.0, ext_scorer->beta);
    full_beam = (num_prefixes == beam_size);
  }

  std::vector<std::pair<size_t, float>> log_prob_idx =
      get_pruned_log_probs(prob, class_dim, cutoff_prob, cutoff_top_n);

  // loop over chars
  for (size_t index = 0; index < log_prob_idx.size(); index++) {
    auto c = log_prob_idx[index].first;
    auto log_prob_c = log_prob_idx[index].second;

    for (size_t i = 0; i < prefixes.size() && i < beam_size; ++i) {
      auto prefix = prefixes[i];
      if (full_beam && log_prob_c + prefix->score < min_cutoff) {
        break;
      }

      // blank
      if (c == blank_id) {
        prefix->log_prob_b_cur =
            log_sum_exp(prefix->log_prob_b_cur, log_prob_c + prefix->score);
        continue;
      }

      // repeated character
      if (c == prefix->character) {
        prefix->log_prob_nb_cur = log_sum_exp(
            prefix->log_prob_nb_cur, log_prob_c + prefix->log_prob_nb_prev);
      }

      // get new prefix
      auto prefix_new = prefix->get_path_trie(c, time_step, log_prob_c);

      if (prefix_new != nullptr) {
        float log_p = -NUM_FLT_INF;

        if (c == prefix->character &&
            prefix->log_prob_b_prev > -NUM_FLT_INF) {
          log_p = log_prob_c + prefix->log_prob_b_prev;
        } else if (c != prefix->character) {
          log_p = log_prob_c + prefix->score;
        }

        // language model scoring
        if (ext_scorer != nullptr &&
            (c == space_id || ext_scorer->is_character_based())) {
          PathTrie *prefix_to_score = nullptr;

          // skip scoring the space
          if (ext_scorer->is_character_based()) {
            prefix_to_score = prefix_new;
          } else {
            prefix_to_score = prefix;
          }

          float score = 0.0;
          std::vector<std::string> ngram;
          ngram = ext_scorer->make_ngram(prefix_to_score);
          score = ext_scorer->get_log_cond_prob(ngram) * ext_scorer->alpha;
          log_p += score;
          log_p += ext_scorer->beta;
        }

        prefix_new->log_prob_nb_cur =
            log_sum_exp(prefix_new->log_prob_nb_cur, log_p);
      }
    } // end of loop over prefix
  }   // end of loop over vocabulary

Finally, you'd have a decoder_finish() or decoder_decode() step that does the final score adjustments, if necessary, and returns a list of decoded strings:

  prefixes.clear();
  // update log probs
  root.iterate_to_vec(prefixes);

  // only preserve top beam_size prefixes
  if (prefixes.size() >= beam_size) {
    std::nth_element(prefixes.begin(),
                     prefixes.begin() + beam_size,
                     prefixes.end(),
                     prefix_compare);
    for (size_t i = beam_size; i < prefixes.size(); ++i) {
      prefixes[i]->remove();
    }
  }
} // end of loop over time

// score the last word of each prefix that doesn't end with space
if (ext_scorer != nullptr && !ext_scorer->is_character_based()) {
  for (size_t i = 0; i < beam_size && i < prefixes.size(); ++i) {
    auto prefix = prefixes[i];
    if (!prefix->is_empty() && prefix->character != space_id) {
      float score = 0.0;
      std::vector<std::string> ngram = ext_scorer->make_ngram(prefix);
      score = ext_scorer->get_log_cond_prob(ngram) * ext_scorer->alpha;
      score += ext_scorer->beta;
      prefix->score += score;
    }
  }
}

size_t num_prefixes = std::min(prefixes.size(), beam_size);
std::sort(prefixes.begin(), prefixes.begin() + num_prefixes, prefix_compare);

// compute approximate ctc score as the return score, without affecting the
// return order of decoding result. To delete when decoder gets stable.
for (size_t i = 0; i < beam_size && i < prefixes.size(); ++i) {
  double approx_ctc = prefixes[i]->score;
  if (ext_scorer != nullptr) {
    std::vector<int> output;
    std::vector<int> timesteps;
    prefixes[i]->get_path_vec(output, timesteps);
    auto prefix_length = output.size();
    auto words = ext_scorer->split_labels(output);
    // remove word insertion weight
    approx_ctc = approx_ctc - prefix_length * ext_scorer->beta;
    // remove language model weight
    approx_ctc -= (ext_scorer->get_sent_log_prob(words)) * ext_scorer->alpha;
  }
  prefixes[i]->approx_ctc = approx_ctc;
}

return get_beam_search_result(prefixes, beam_size);

This step could then be called from DS_IntermediateDecode to quickly get the current decoding of the stream without having to always start from scratch. With this API in place, after a batch is computed with the acoustic model we can immediately feed the probabilities into the decoder_next() step. I'm fairly certain there are more performance gains to be had in the decoder, but this would be an amazing first step.
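For concreteness, here is a rough sketch of what the refactored surface could look like. This is only an illustration of the state-struct idea; the DecoderState layout and the exact signatures are assumptions, not the final design.

// Illustrative sketch only -- names and signatures are assumptions,
// not the final API.
struct DecoderState {
  int space_id;
  int blank_id;
  PathTrie root;                     // root of the prefix trie
  std::vector<PathTrie *> prefixes;  // current beam of candidate prefixes
};

// Stage 1: runs the setup shown above and returns the initial state.
DecoderState *decoder_init(const Alphabet &alphabet,
                           size_t beam_size,
                           Scorer *ext_scorer);

// Stage 2: consumes time_dim new frames of probabilities, advancing
// the main beam search loop without revisiting earlier frames.
void decoder_next(const double *probs,
                  size_t time_dim,
                  size_t class_dim,
                  double cutoff_prob,
                  size_t cutoff_top_n,
                  size_t beam_size,
                  Scorer *ext_scorer,
                  DecoderState *state);

// Stage 3: applies the final scoring adjustments and returns the
// n-best decodings. It should leave the state usable so that an
// intermediate decode can be requested repeatedly mid-stream.
std::vector<Output> decoder_decode(DecoderState *state,
                                   size_t beam_size,
                                   Scorer *ext_scorer);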

@reuben
Contributor

reuben commented May 10, 2019

In the end, here's how the API would be used:

DS_SetupStream() -> decoder_init
DS_FeedAudioContent() -> ... -> StreamingState::processBatch -> decoder_next
DS_IntermediateDecode() and DS_FinishStream() -> decoder_decode
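From the application side, a realtime transcription loop would then look roughly like the sketch below. The exact DS_SetupStream() parameters vary between releases, and mic and show_subtitle() are hypothetical stand-ins for an audio source and a display routine:

// Sketch of a realtime caller, assuming the v0.5-era C API; treat the
// DS_SetupStream() parameters as an assumption.
StreamingState *stream = nullptr;
DS_SetupStream(model, /* aPreAllocFrames */ 150,
               /* aSampleRate */ 16000, &stream);        // -> decoder_init

while (mic.has_audio()) {
  std::vector<short> pcm = mic.read();                   // 16-bit mono PCM
  DS_FeedAudioContent(stream, pcm.data(), pcm.size());   // -> decoder_next

  // Cheap once the decoder streams: no re-decoding from scratch.
  char *partial = DS_IntermediateDecode(stream);         // -> decoder_decode
  show_subtitle(partial);
  DS_FreeString(partial);
}

char *final_text = DS_FinishStream(stream);              // -> decoder_decode
show_subtitle(final_text);
DS_FreeString(final_text);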

@dabinat
Collaborator

dabinat commented May 18, 2019

I refactored all of this and got it working. Just need to tidy it up a bit and then I'll post a PR.

@dabinat
Collaborator

dabinat commented May 20, 2019

See PR #2121.

@dabinat
Collaborator

dabinat commented May 22, 2019

This has now been merged into master, so I'll close this issue.

@dabinat dabinat closed this as completed May 22, 2019
@lock

lock bot commented Jun 21, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Jun 21, 2019