
Feature request: streaming decoder (fast DS_IntermediateDecode calls) #1837

Closed
TheSeriousProgrammer opened this issue Jan 19, 2019 · 10 comments


@TheSeriousProgrammer

Can a streaming recognition service be added to the DeepSpeech client? Currently an audio file is recorded first and only transcribed by the engine afterwards. Most of the big STT services, however, offer streaming of realtime audio from the microphone with results coming back simultaneously. That feature would give a real boost to this project's applications in realtime recognition.

@kdavis-mozilla
Contributor

Is the vad_transcriber insufficient? If so, why?

@carlfm01
Collaborator

@kdavis-mozilla I think @Chidhambararajan is referring to the ability to get the transcription as soon as we have accumulated enough logits to do so.
For example: I've tried to use the .NET client to stream the Windows audio output and show the transcriptions on the screen, the way most subtitle systems work. What is the problem in my case? We can stream the audio, but we can't get the transcriptions from the stream without stopping it.
I tried to do something like "intermediateDecodeAndRelease" that would execute the decoding and throw away the old logits, but due to my limited knowledge of C++ it did not work :(

Related to #1757

@lissyx
Collaborator

lissyx commented Jan 23, 2019

I think it all boils down to the fact that the decoding step is not yet streamable.

@lissyx
Collaborator

lissyx commented Feb 20, 2019

@Chidhambararajan That being said, we already have streaming for the audio feeding, and it should run faster than realtime on a desktop with a decent CPU or a GPU, as well as on a mid-range Android smartphone with the TFLite quantized model.

So you can already build realtime transcription, just not perfectly yet; it should get much closer once we have a streaming decoder (soon).

@reuben changed the title from "Requesting feature : Realtime transcription" to "Feature request: streaming decoder (fast DS_IntermediateDecode calls)" May 10, 2019
@reuben
Contributor

reuben commented May 10, 2019

Currently, the decoder we use (native_client/ctcdecode) exposes a batch API that takes a probabilities matrix and returns a list of decoded strings. The implementation is a beam search loop over all the time steps in the input probabilities. To implement a streaming decoder, one would have to refactor the decoder API from a single ctc_beam_search_decoder() call into a state-struct style API split into three stages: decoder_init, decoder_next, and decoder_finish (or decoder_decode). At the start, you set up the decoder state with the decoder_init() call:

// dimension check
VALID_CHECK_EQ(class_dim, alphabet.GetSize() + 1,
               "The shape of probs does not match with "
               "the shape of the vocabulary");

// assign special ids
int space_id = alphabet.GetSpaceLabel();
int blank_id = alphabet.GetSize();

// init prefixes' root
PathTrie root;
root.score = root.log_prob_b_prev = 0.0;
std::vector<PathTrie *> prefixes;
prefixes.push_back(&root);

if (ext_scorer != nullptr && !ext_scorer->is_character_based()) {
  auto dict_ptr = ext_scorer->dictionary->Copy(true);
  root.set_dictionary(dict_ptr);
  auto matcher = std::make_shared<FSTMATCH>(*dict_ptr, fst::MATCH_INPUT);
  root.set_matcher(matcher);
}

This returns a decoder state struct containing all of the variables needed for the main loop. Then, as probabilities become available, you feed a batch of them into the decoder with a decoder_next() step, which performs N iterations of the main loop over time:

// prefix search over time
for (size_t time_step = 0; time_step < time_dim; ++time_step) {
  auto *prob = &probs[time_step * class_dim];

  float min_cutoff = -NUM_FLT_INF;
  bool full_beam = false;
  if (ext_scorer != nullptr) {
    size_t num_prefixes = std::min(prefixes.size(), beam_size);
    std::sort(
        prefixes.begin(), prefixes.begin() + num_prefixes, prefix_compare);
    min_cutoff = prefixes[num_prefixes - 1]->score +
                 std::log(prob[blank_id]) - std::max(0.0, ext_scorer->beta);
    full_beam = (num_prefixes == beam_size);
  }

  std::vector<std::pair<size_t, float>> log_prob_idx =
      get_pruned_log_probs(prob, class_dim, cutoff_prob, cutoff_top_n);

  // loop over chars
  for (size_t index = 0; index < log_prob_idx.size(); index++) {
    auto c = log_prob_idx[index].first;
    auto log_prob_c = log_prob_idx[index].second;

    for (size_t i = 0; i < prefixes.size() && i < beam_size; ++i) {
      auto prefix = prefixes[i];
      if (full_beam && log_prob_c + prefix->score < min_cutoff) {
        break;
      }

      // blank
      if (c == blank_id) {
        prefix->log_prob_b_cur =
            log_sum_exp(prefix->log_prob_b_cur, log_prob_c + prefix->score);
        continue;
      }

      // repeated character
      if (c == prefix->character) {
        prefix->log_prob_nb_cur = log_sum_exp(
            prefix->log_prob_nb_cur, log_prob_c + prefix->log_prob_nb_prev);
      }

      // get new prefix
      auto prefix_new = prefix->get_path_trie(c, time_step, log_prob_c);

      if (prefix_new != nullptr) {
        float log_p = -NUM_FLT_INF;

        if (c == prefix->character &&
            prefix->log_prob_b_prev > -NUM_FLT_INF) {
          log_p = log_prob_c + prefix->log_prob_b_prev;
        } else if (c != prefix->character) {
          log_p = log_prob_c + prefix->score;
        }

        // language model scoring
        if (ext_scorer != nullptr &&
            (c == space_id || ext_scorer->is_character_based())) {
          PathTrie *prefix_to_score = nullptr;

          // skip scoring the space
          if (ext_scorer->is_character_based()) {
            prefix_to_score = prefix_new;
          } else {
            prefix_to_score = prefix;
          }

          float score = 0.0;
          std::vector<std::string> ngram;
          ngram = ext_scorer->make_ngram(prefix_to_score);
          score = ext_scorer->get_log_cond_prob(ngram) * ext_scorer->alpha;
          log_p += score;
          log_p += ext_scorer->beta;
        }

        prefix_new->log_prob_nb_cur =
            log_sum_exp(prefix_new->log_prob_nb_cur, log_p);
      }
    } // end of loop over prefix
  }   // end of loop over vocabulary

Finally, you'd have a decoder_finish() or decoder_decode() step that does the final score adjustments, if necessary, and returns a list of decoded strings:

  prefixes.clear();
  // update log probs
  root.iterate_to_vec(prefixes);

  // only preserve top beam_size prefixes
  if (prefixes.size() >= beam_size) {
    std::nth_element(prefixes.begin(),
                     prefixes.begin() + beam_size,
                     prefixes.end(),
                     prefix_compare);
    for (size_t i = beam_size; i < prefixes.size(); ++i) {
      prefixes[i]->remove();
    }
  }
} // end of loop over time

// score the last word of each prefix that doesn't end with space
if (ext_scorer != nullptr && !ext_scorer->is_character_based()) {
  for (size_t i = 0; i < beam_size && i < prefixes.size(); ++i) {
    auto prefix = prefixes[i];
    if (!prefix->is_empty() && prefix->character != space_id) {
      float score = 0.0;
      std::vector<std::string> ngram = ext_scorer->make_ngram(prefix);
      score = ext_scorer->get_log_cond_prob(ngram) * ext_scorer->alpha;
      score += ext_scorer->beta;
      prefix->score += score;
    }
  }
}

size_t num_prefixes = std::min(prefixes.size(), beam_size);
std::sort(prefixes.begin(), prefixes.begin() + num_prefixes, prefix_compare);

// compute approximate ctc score as the return score, without affecting the
// return order of decoding result. To delete when decoder gets stable.
for (size_t i = 0; i < beam_size && i < prefixes.size(); ++i) {
  double approx_ctc = prefixes[i]->score;
  if (ext_scorer != nullptr) {
    std::vector<int> output;
    std::vector<int> timesteps;
    prefixes[i]->get_path_vec(output, timesteps);
    auto prefix_length = output.size();
    auto words = ext_scorer->split_labels(output);
    // remove word insertion weight
    approx_ctc = approx_ctc - prefix_length * ext_scorer->beta;
    // remove language model weight
    approx_ctc -= (ext_scorer->get_sent_log_prob(words)) * ext_scorer->alpha;
  }
  prefixes[i]->approx_ctc = approx_ctc;
}

return get_beam_search_result(prefixes, beam_size);

This step could then be called from DS_IntermediateDecode to quickly get the current decoding of the stream without having to always start from scratch. With this API in place, after a batch is computed with the acoustic model we can immediately feed the probabilities into the decoder_next() step. I'm fairly certain there are more performance gains to be had in the decoder, but this would be an amazing first step.
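For concreteness, here is a rough sketch of what the refactored surface could look like. This is only an illustration of the state-struct idea; the DecoderState layout and the exact signatures are assumptions, not the final design.

// Illustrative sketch only -- names and signatures are assumptions,
// not the final API.
struct DecoderState {
  int space_id;
  int blank_id;
  PathTrie root;                     // root of the prefix trie
  std::vector<PathTrie *> prefixes;  // current beam of candidate prefixes
};

// Stage 1: runs the setup shown above and returns the initial state.
DecoderState *decoder_init(const Alphabet &alphabet,
                           size_t beam_size,
                           Scorer *ext_scorer);

// Stage 2: consumes time_dim new frames of probabilities, advancing
// the main beam search loop without revisiting earlier frames.
void decoder_next(const double *probs,
                  size_t time_dim,
                  size_t class_dim,
                  double cutoff_prob,
                  size_t cutoff_top_n,
                  size_t beam_size,
                  Scorer *ext_scorer,
                  DecoderState *state);

// Stage 3: applies the final scoring adjustments and returns the
// n-best decodings. It should leave the state usable so that an
// intermediate decode can be requested repeatedly mid-stream.
std::vector<Output> decoder_decode(DecoderState *state,
                                   size_t beam_size,
                                   Scorer *ext_scorer);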

@reuben
Contributor

reuben commented May 10, 2019

In the end, here's how the API would be used:

DS_SetupStream() -> decoder_init
DS_FeedAudioContent() -> ... -> StreamingState::processBatch -> decoder_next
DS_IntermediateDecode() and DS_FinishStream() -> decoder_decode
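From the application side, a realtime transcription loop would then look roughly like the sketch below. The exact DS_SetupStream() parameters vary between releases, and mic and show_subtitle() are hypothetical stand-ins for an audio source and a display routine:

// Sketch of a realtime caller, assuming the v0.5-era C API; treat the
// DS_SetupStream() parameters as an assumption.
StreamingState *stream = nullptr;
DS_SetupStream(model, /* aPreAllocFrames */ 150,
               /* aSampleRate */ 16000, &stream);        // -> decoder_init

while (mic.has_audio()) {
  std::vector<short> pcm = mic.read();                   // 16-bit mono PCM
  DS_FeedAudioContent(stream, pcm.data(), pcm.size());   // -> decoder_next

  // Cheap once the decoder streams: no re-decoding from scratch.
  char *partial = DS_IntermediateDecode(stream);         // -> decoder_decode
  show_subtitle(partial);
  DS_FreeString(partial);
}

char *final_text = DS_FinishStream(stream);              // -> decoder_decode
show_subtitle(final_text);
DS_FreeString(final_text);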

@dabinat
Collaborator

dabinat commented May 18, 2019

I refactored all of this and got it working. Just need to tidy it up a bit and then I'll post a PR.

@dabinat
Collaborator

dabinat commented May 20, 2019

See PR #2121.

@dabinat
Collaborator

dabinat commented May 22, 2019

This has now been merged into master, so I'll close this issue.

@dabinat dabinat closed this as completed May 22, 2019
@lock

lock bot commented Jun 21, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Jun 21, 2019