Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calculate duration of input files in order to decide which engine to use #136

Merged
merged 9 commits into from
Feb 25, 2025

Conversation

philmcmahon
Copy link
Contributor

What does this change?

Currently we have two available 'engines' for transcription:

  • whisperx - supports diarization, is very fast once it gets started, better output formatting, very slow startup (~10mins)
  • whisper.cpp - fast(er) startup, less pretty output formatting, no diarization

It seems a shame to wait to spin up whisperx for a file that is only a few minutes long. This PR gets the API to make a decision on which engine to use based on the duration of the file - any files under 10 minutes long where the user hasn't requested diarization will be sent to whisper.cpp, everything else goes to whisperX

To get ffprobe running on lambda I had to create a lambda layer - this is basically just a zip file that gets added to the disk of the lambda vm.

How to test

Tested on CODE

@philmcmahon philmcmahon requested a review from a team as a code owner February 5, 2025 16:17
return parseFloat(stdout);
} catch (error) {
logger.error(`Error during ffprobe file duration detection`, error);
throw error;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will we return a 500 if we can't find file duration here? might be better to default to whisperx

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good shout, implemented in 5d7709a

@@ -69,6 +69,8 @@ export const TranscriptionJob = z.object({
engine: z.nativeEnum(TranscriptionEngine),
// we can get rid of this when we switch to using a zip
translationOutputBucketUrls: z.optional(OutputBucketUrls),
// this is optional because giant currently doesn't know file duration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do ffmpeg or the whisper models output file duration? Or would it be worth having the worker calculate duration with ffprobe? That would unify the way we're finding duration and make it so Giant isn't a special case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I was more just shoving this in in case useful for logging - the only place it was used was in setting a 'duration' field in the transcription failure message, which isn't really necessary. So I've removed this field in 244a6d2

@philmcmahon philmcmahon merged commit 0889e0a into main Feb 25, 2025
6 checks passed
@philmcmahon philmcmahon deleted the api-ffmpeg branch February 25, 2025 10:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants