Calculate duration of input files in order to decide which engine to use #136

philmcmahon · 2025-02-05T16:17:33Z

What does this change?

Currently we have two available 'engines' for transcription:

- whisperx: supports diarization, is very fast once it gets started, better output formatting, very slow startup (~10 mins)
- whisper.cpp: fast(er) startup, less pretty output formatting, no diarization

It seems a shame to wait for whisperx to spin up for a file that is only a few minutes long. This PR gets the API to decide which engine to use based on the duration of the file: any file under 10 minutes long where the user hasn't requested diarization is sent to whisper.cpp, and everything else goes to whisperx.
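The decision rule described above could be sketched roughly as follows. This is an illustration only, not the code from this PR: the enum values, the constant `SHORT_FILE_THRESHOLD_SECONDS`, and the function `selectEngine` are hypothetical names, and treating an unknown duration as "use whisperx" is an assumption that follows from "everything else goes to whisperx".

```typescript
// Hypothetical enum values: the real TranscriptionEngine members are not shown in this PR thread.
enum TranscriptionEngine {
	WHISPER_X = 'whisperx',
	WHISPER_CPP = 'whispercpp',
}

// Threshold from the PR description: files under 10 minutes count as short.
const SHORT_FILE_THRESHOLD_SECONDS = 10 * 60;

// Hypothetical helper that picks an engine from the file duration and the diarization flag.
const selectEngine = (
	durationSeconds: number | undefined,
	diarizationRequested: boolean,
): TranscriptionEngine => {
	// whisper.cpp has no diarization, so any diarization request needs whisperx.
	if (diarizationRequested) {
		return TranscriptionEngine.WHISPER_X;
	}
	// Assumption: an unknown or unparseable duration falls into "everything else".
	if (durationSeconds === undefined || Number.isNaN(durationSeconds)) {
		return TranscriptionEngine.WHISPER_X;
	}
	// Short files skip the slow whisperx startup.
	return durationSeconds < SHORT_FILE_THRESHOLD_SECONDS
		? TranscriptionEngine.WHISPER_CPP
		: TranscriptionEngine.WHISPER_X;
};
```

Under these assumptions, a 2 minute file without diarization goes to whisper.cpp, while the same file with diarization requested, or any file of 10 minutes or more, goes to whisperx.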

To get ffprobe running on Lambda I had to create a Lambda layer, which is basically just a zip file that gets added to the disk of the Lambda VM.

How to test

Tested on CODE

zekehuntergreen · 2025-02-06T14:06:58Z

packages/backend-common/src/ffmpeg.ts

+		return parseFloat(stdout);
+	} catch (error) {
+		logger.error(`Error during ffprobe file duration detection`, error);
+		throw error;


Will we return a 500 if we can't find the file duration here? It might be better to default to whisperx.

philmcmahon · 2025-02-20

Good shout, implemented in 5d7709a
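For illustration, the "don't throw, default to whisperx" behaviour discussed in this thread might look something like the sketch below. It is not the code from 5d7709a: the function name `getFileDuration`, the `ffprobePath` parameter, and the use of `console.error` in place of the project's `logger` are all assumptions. The ffprobe flags shown are standard options that print only the container duration in seconds.

```typescript
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const execFileAsync = promisify(execFile);

// Hypothetical helper: returns the duration in seconds, or undefined when
// ffprobe is missing, fails, or prints something that isn't a number.
const getFileDuration = async (
	filePath: string,
	ffprobePath = 'ffprobe', // assumption: ffprobe is on the PATH unless overridden
): Promise<number | undefined> => {
	try {
		const { stdout } = await execFileAsync(ffprobePath, [
			'-v',
			'error',
			'-show_entries',
			'format=duration',
			'-of',
			'default=noprint_wrappers=1:nokey=1',
			filePath,
		]);
		const duration = parseFloat(stdout);
		return Number.isFinite(duration) ? duration : undefined;
	} catch (error) {
		// Log and swallow the error so the caller can fall back to whisperx
		// rather than returning a 500.
		console.error('Error during ffprobe file duration detection', error);
		return undefined;
	}
};
```

A caller would then treat an `undefined` duration the same as a long file and send the job to whisperx.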

zekehuntergreen · 2025-02-06T14:17:15Z

packages/common/src/types.ts

@@ -69,6 +69,8 @@ export const TranscriptionJob = z.object({
 	engine: z.nativeEnum(TranscriptionEngine),
 	// we can get rid of this when we switch to using a zip
 	translationOutputBucketUrls: z.optional(OutputBucketUrls),
+	// this is optional because giant currently doesn't know file duration


Do ffmpeg or the whisper models output file duration? Or would it be worth having the worker calculate duration with ffprobe? That would unify the way we're finding duration and make it so Giant isn't a special case.

philmcmahon · 2025-02-20

Good point. I was mostly just shoving this in in case it was useful for logging. The only place it was used was in setting a 'duration' field in the transcription failure message, which isn't really necessary, so I've removed this field in 244a6d2

github-actions · 2025-02-20T11:05:22Z

Deploy build 801 of `investigations::transcription-service-whisperx-model-fetch` to CODE

All deployment options

From guardian/actions-riff-raff.

github-actions · 2025-02-20T11:07:17Z

Deploy build 921 of `investigations::transcription-service` to CODE

All deployment options

From guardian/actions-riff-raff.

github-actions · 2025-02-20T11:07:19Z

Deploy build 801 of `investigations::transcription-service-repository` to CODE

All deployment options

From guardian/actions-riff-raff.

philmcmahon requested a review from a team as a code owner February 5, 2025 16:17

zekehuntergreen reviewed Feb 6, 2025

philmcmahon added 9 commits February 24, 2025 12:22

Add ffmpeg layer infra

bb28310

Add test-ffmpeg endpoint

1e7a5df

Bump api ephemeral storage

9397f74

Calculate duration of file in API, use to decide which engine to use

2411d4c

Use whisperX when files > 10mins

f080df5

Use specific hashed version of ffmpeg zip

4e3ea53

Add documentation, tidy up

ae48da5

Don't throw an error if ffprobe fails - instead default to whisperx

c8302cc

Remove duration from transcript job

1833874

philmcmahon force-pushed the api-ffmpeg branch from 244a6d2 to 1833874 Compare February 24, 2025 12:22

zekehuntergreen approved these changes Feb 24, 2025

philmcmahon merged commit 0889e0a into main Feb 25, 2025
6 checks passed

philmcmahon deleted the api-ffmpeg branch February 25, 2025 10:31
