Simplify stream's pcmf32 handling #2693

Closed
wants to merge 3 commits
Conversation

tamo
Contributor

@tamo tamo commented Jan 2, 2025

Use one deque instead of two vectors (old and new).
`old` and `new` are length variables now.

Basically: get `step - new` samples every time,
then set `new = (around) step;`.
The new audio data is simply appended to the deque
(the deque size is capped at 30 seconds).
Pass `old + new` samples to whisper inference.

If the data has been consumed, let `old = 0; new = 0;`.
If some of the data should be kept for the next iteration, let `old = keep;`.
If you want to get only N samples next time, let `new = step - N;`.
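The bookkeeping above could be sketched roughly like this (a hypothetical illustration, not the PR's actual code; names such as `stream_buffer` are invented):

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

struct stream_buffer {
    std::deque<float> pcmf32;   // all buffered audio, oldest samples first
    std::size_t old_n = 0;      // samples kept over from the last iteration
    std::size_t new_n = 0;      // freshly captured samples

    // Append a captured chunk and cap the deque (e.g. at 30 s of audio).
    void append(const std::vector<float> & chunk, std::size_t max_samples) {
        pcmf32.insert(pcmf32.end(), chunk.begin(), chunk.end());
        new_n += chunk.size();
        while (pcmf32.size() > max_samples) {
            pcmf32.pop_front();
        }
    }

    // Samples handed to inference: the last old_n + new_n entries.
    std::size_t infer_len() const { return old_n + new_n; }

    // After inference: keep == 0 means fully consumed; keep > 0 retains
    // that many samples as context for the next iteration.
    void consume(std::size_t keep) {
        old_n = keep;
        new_n = 0;
    }
};
```

Because `old` and `new` are only lengths counted from the back of the deque, "keeping" audio for the next iteration costs nothing: no samples are copied between buffers.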

In VAD mode, `stream --interim --step -3000` will:
get 3000 ms of audio,
run `vad_simple(step_ms)`,
and, if nothing is detected, get 100 ms more audio and retry.
If nothing is detected after 3000 ms have passed,
it goes into interim mode,
where `n_segments - 1` segments are confirmed
(`old -= confirmed_t1`).
If `n_segments == 1`, only the first half of the result is shown.
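The retry/interim decision above might be sketched as follows (a simplified, hypothetical sketch: `vad_detected` and `elapsed_ms` stand in for the real VAD result and timer, which the actual stream example derives from the audio buffer):

```cpp
#include <cassert>

enum class action {
    transcribe,   // a pause was detected: run inference on the whole chunk
    retry,        // no pause yet, under step_ms: fetch 100 ms more and retry
    interim       // step_ms elapsed with no pause: show an interim result
                  // and confirm the first n_segments - 1 segments
};

action next_action(bool vad_detected, int elapsed_ms, int step_ms) {
    if (vad_detected) {
        return action::transcribe;
    }
    if (elapsed_ms < step_ms) {
        return action::retry;
    }
    return action::interim;
}
```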

Misc:
Increase the default `max_tokens`, because 32 is too small for 10 seconds
(some Japanese speech was garbled).
Write the wav as soon as the data is available.

`no_timestamps` is the default even in VAD mode,
because that is more useful for hard-of-hearing viewers.

tamo added 3 commits January 2, 2025 00:18
Without it, `stream --save-audio` produces a somewhat choppy wav:
`stream` calculates t_diff in milliseconds
and combines audio pieces which are about step_ms long.

WHISPER_SAMPLE_RATE / 1000 == only 16,

but surprisingly, human ears seem to be able to hear the gap
as noise.
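A rough illustration of the rounding problem described above (assuming a 16 kHz sample rate, i.e. only 16 samples per millisecond): measuring a chunk's length in whole milliseconds and converting back truncates the fractional part, silently dropping up to 15 samples per chunk, and stitching such chunks into one wav leaves tiny but audible gaps.

```cpp
#include <cassert>

constexpr int k_sample_rate = 16000; // WHISPER_SAMPLE_RATE

int ms_to_samples(int ms) {
    return ms * (k_sample_rate / 1000); // 16 samples per ms
}

// Samples dropped by a round-trip through whole milliseconds.
int samples_lost(int n_samples) {
    const int ms = n_samples * 1000 / k_sample_rate; // truncating division
    return n_samples - ms_to_samples(ms);
}
```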
@tamo
Contributor Author

tamo commented Jan 3, 2025

Closing, because a number of fixes are in #2694 now.

@tamo tamo closed this Jan 3, 2025