Simplify stream's pcmf32 handling #2693

Closed
wants to merge 3 commits
Conversation

tamo
Contributor

@tamo tamo commented Jan 2, 2025

Use one deque instead of two vectors (old and new).
`old` and `new` are length variables now.

Basically: get `step - new` samples every time,
then set `new = (around) step;`.
The new audio data is simply appended to the deque
(the deque size is capped at 30 seconds).
Pass `old + new` samples to whisper inference.

If the data has been consumed, let `old = 0; new = 0;`.
If some of the data should be kept for the next iteration, let `old = keep;`.
If you want to get only N samples next time, let `new = step - N;`.
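The bookkeeping above could be sketched roughly like this (a hypothetical illustration, not the PR's actual code; names such as `stream_buffer` are invented):

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

struct stream_buffer {
    std::deque<float> pcmf32;   // all buffered audio, oldest samples first
    std::size_t old_n = 0;      // samples kept over from the last iteration
    std::size_t new_n = 0;      // freshly captured samples

    // Append a captured chunk and cap the deque (e.g. at 30 s of audio).
    void append(const std::vector<float> & chunk, std::size_t max_samples) {
        pcmf32.insert(pcmf32.end(), chunk.begin(), chunk.end());
        new_n += chunk.size();
        while (pcmf32.size() > max_samples) {
            pcmf32.pop_front();
        }
    }

    // Samples handed to inference: the last old_n + new_n entries.
    std::size_t infer_len() const { return old_n + new_n; }

    // After inference: keep == 0 means fully consumed; keep > 0 retains
    // that many samples as context for the next iteration.
    void consume(std::size_t keep) {
        old_n = keep;
        new_n = 0;
    }
};
```

Because `old` and `new` are only lengths counted from the back of the deque, "keeping" audio for the next iteration costs nothing: no samples are copied between buffers.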

In VAD mode, `stream --interim --step -3000` will:
get 3000 ms of audio,
run `vad_simple(step_ms)`,
and, if nothing is detected, get 100 ms more audio and retry.
If nothing is detected after 3000 ms have passed,
it goes into interim mode,
where `n_segments - 1` segments are confirmed
(`old -= confirmed_t1`).
If `n_segments == 1`, only the first half of the result is shown.
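The retry/interim decision above might be sketched as follows (a simplified, hypothetical sketch: `vad_detected` and `elapsed_ms` stand in for the real VAD result and timer, which the actual stream example derives from the audio buffer):

```cpp
#include <cassert>

enum class action {
    transcribe,   // a pause was detected: run inference on the whole chunk
    retry,        // no pause yet, under step_ms: fetch 100 ms more and retry
    interim       // step_ms elapsed with no pause: show an interim result
                  // and confirm the first n_segments - 1 segments
};

action next_action(bool vad_detected, int elapsed_ms, int step_ms) {
    if (vad_detected) {
        return action::transcribe;
    }
    if (elapsed_ms < step_ms) {
        return action::retry;
    }
    return action::interim;
}
```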

Misc:
Increase the default `max_tokens`, because 32 is too small for 10 seconds
(some Japanese speech was garbled).
Write the wav as soon as the data is available.

`no_timestamps` is the default even in VAD mode,
because that is more useful for hard-of-hearing viewers.

tamo added 3 commits January 2, 2025 00:18
Without it, `stream --save-audio` produces a somewhat choppy wav:
`stream` calculates t_diff in milliseconds
and combines audio pieces which are about step_ms long.

WHISPER_SAMPLE_RATE / 1000 == only 16,

but surprisingly, human ears seem to be able to hear the gap
as noise.
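A rough illustration of the rounding problem described above (assuming a 16 kHz sample rate, i.e. only 16 samples per millisecond): measuring a chunk's length in whole milliseconds and converting back truncates the fractional part, silently dropping up to 15 samples per chunk, and stitching such chunks into one wav leaves tiny but audible gaps.

```cpp
#include <cassert>

constexpr int k_sample_rate = 16000; // WHISPER_SAMPLE_RATE

int ms_to_samples(int ms) {
    return ms * (k_sample_rate / 1000); // 16 samples per ms
}

// Samples dropped by a round-trip through whole milliseconds.
int samples_lost(int n_samples) {
    const int ms = n_samples * 1000 / k_sample_rate; // truncating division
    return n_samples - ms_to_samples(ms);
}
```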
@tamo
Contributor Author

tamo commented Jan 3, 2025

Closing, because a number of fixes are in #2694 now.

@tamo tamo closed this Jan 3, 2025