Simplify stream's pcmf32 handling #2693
Closed
Use one deque instead of two vectors (old and new). `old` and `new` are length variables now.

Basically: get `step - new` samples every time, then substitute `new = (around) step;`. The new audio data is simply appended to the deque. (Limit the deque size to 30 seconds.)
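A minimal sketch of that append step, assuming the buffer is a `std::deque<float>` of 16 kHz mono samples (the names `pcmf32` and `append_audio` are illustrative, not necessarily the PR's identifiers):

```cpp
#include <cstddef>
#include <deque>
#include <vector>

constexpr size_t k_sample_rate = 16000;              // whisper works on 16 kHz mono audio
constexpr size_t k_max_samples = 30 * k_sample_rate; // keep at most 30 seconds in the deque

void append_audio(std::deque<float> & pcmf32, const std::vector<float> & captured) {
    // the new audio data is simply appended to the deque
    pcmf32.insert(pcmf32.end(), captured.begin(), captured.end());

    // limit the deque size to 30 seconds by dropping the oldest samples
    if (pcmf32.size() > k_max_samples) {
        pcmf32.erase(pcmf32.begin(), pcmf32.begin() + (pcmf32.size() - k_max_samples));
    }
}
```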
Pass `old + new` samples to whisper inference. If the data has been consumed, let `old = 0; new = 0;`. If some of the data should be kept for the next iteration, `old = keep;`. If you want to get only N samples next time, `new = step - N;`.
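A sketch of how that length bookkeeping could look, assuming `old`/`new` count samples and the freshest audio sits at the back of the deque (the helper name is an assumption, not the PR's exact code):

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// copy the last old + new samples out of the deque for the whisper_full() call
std::vector<float> take_window(const std::deque<float> & pcmf32, size_t n_old, size_t n_new) {
    const size_t n_use = std::min(n_old + n_new, pcmf32.size());
    return std::vector<float>(pcmf32.end() - n_use, pcmf32.end());
}

// after inference, the caller adjusts the two lengths:
//   old = 0; new = 0;   // the data has been consumed
//   old = keep;         // keep some samples for the next iteration
//   new = step - N;     // request only N samples next time
```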
In VAD mode:

`stream --interim --step -3000` will get 3000ms of audio and run `vad_simple(step_ms)`. If nothing is detected, get 100ms more audio and retry.
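A sketch of that capture/VAD retry loop; `vad_simple()` is the helper from whisper.cpp's `examples/common.h` (signature reproduced from memory), while `capture_audio()` is a hypothetical stand-in for the SDL `audio_async` capture used by `stream.cpp`:

```cpp
#include <vector>

// from examples/common.h (signature assumed)
bool vad_simple(std::vector<float> & pcmf32, int sample_rate, int last_ms,
                float vad_thold, float freq_thold, bool verbose);

// hypothetical capture helper: appends `ms` milliseconds of microphone audio
void capture_audio(std::vector<float> & pcmf32, int ms);

void wait_for_speech(std::vector<float> & pcmf32, int step_ms) {
    capture_audio(pcmf32, step_ms); // get step_ms (here 3000 ms) of audio

    // if nothing is detected, get 100 ms more audio and retry
    // (the PR additionally switches to interim mode after another 3000 ms with no detection)
    while (!vad_simple(pcmf32, 16000, step_ms, 0.6f, 100.0f, false)) {
        capture_audio(pcmf32, 100);
    }
}
```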
If nothing is detected and 3000ms has passed, go into the interim mode, where `n_segments - 1` segments will be confirmed (`old -= confirmed_t1`). If `n_segments == 1`, only show the first half of the result.
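A rough sketch of the interim confirmation step, using whisper.cpp's `whisper_full_*` getters; the conversion from segment timestamps (10 ms units) back to a sample count is an assumption about how `old -= confirmed_t1` is applied:

```cpp
#include <cstdint>
#include <cstdio>

#include "whisper.h"

// confirm all but the last segment and drop the confirmed audio from `old`
void confirm_segments(struct whisper_context * ctx, int64_t & n_old_samples) {
    const int n_segments = whisper_full_n_segments(ctx);
    if (n_segments <= 1) {
        return; // with a single segment, only the first half of its text is shown
    }

    for (int i = 0; i < n_segments - 1; ++i) {
        printf("%s", whisper_full_get_segment_text(ctx, i)); // confirmed text
    }

    // t1 of the last confirmed segment, in 10 ms units
    const int64_t confirmed_t1 = whisper_full_get_segment_t1(ctx, n_segments - 2);

    // old -= confirmed_t1, converted to samples at 16 kHz
    n_old_samples -= confirmed_t1 * 16000 / 100;
}
```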
Misc:

- Increase the default `max_tokens` because 32 is too small for 10 seconds. (Some Japanese speech was garbled.)
- Write wav as soon as the data is available.
- `no_timestamps` is the default even for VAD, because it is more useful to show to the hard-of-hearing.