[feature request] Support for batch transcription #26

G-Thor · 2022-08-23T11:42:39Z

The current state of this package does not offer batch transcription, with sentences being passed through one at a time.
This would be completely fine, but it seems like some dependencies, such as IceNLP, are being initialized on every call.
This severely slows down transcription, making batch transcriptions (in the range of thousands of utterances) very time-consuming.

It would be nice to be able to pass in a list of strings and get a list of lists of Tokens, with matching indices, for example.
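
To make the request concrete, here is a minimal sketch of the shape such a batch interface could take. It assumes nothing about the package's real API: `ExpensiveTagger` is a hypothetical stand-in for a dependency that is costly to start (such as IceNLP), and `BatchTranscriber` is a hypothetical wrapper that creates it once and reuses it for every sentence, so the output list lines up index-for-index with the input list.

```python
from typing import List


class ExpensiveTagger:
    """Hypothetical stand-in for a slow-to-start dependency (e.g. IceNLP)."""

    instances_created = 0  # counts how often the costly setup runs

    def __init__(self) -> None:
        ExpensiveTagger.instances_created += 1

    def tag(self, sentence: str) -> List[str]:
        # Placeholder processing: whitespace tokenization only.
        return sentence.split()


class BatchTranscriber:
    """Hypothetical wrapper: set up the dependency once, reuse it per sentence."""

    def __init__(self) -> None:
        self._tagger = ExpensiveTagger()  # costly setup happens only here

    def transcribe_batch(self, sentences: List[str]) -> List[List[str]]:
        # One output list per input sentence, in the same order.
        return [self._tagger.tag(s) for s in sentences]
```

With this shape, the setup cost is paid once per `BatchTranscriber` rather than once per utterance, which is the part that matters for thousands of sentences.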

bnika · 2022-08-23T17:04:59Z

We have it on our issue list to daemonize IceNLP; the current implementation does indeed slow the process down. A temporary solution would be to set phrasing=False in textprocessing_manager.transcribe() if you can do without it, since IceNLP is only used in the phrasing step.
The LSTM model is also too slow in general; we need to do some research on that part as well.

G-Thor · 2022-08-24T11:24:14Z

Unfortunately, it's the phrasing part I was most interested in applying, since my data is already normalized and I have other options for G2P.
Perhaps I'll just use a more naïve phrasing approach while the issues with this dependency get worked out.
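
For illustration, a naive phrasing approach of this kind might simply break the text at punctuation. The sketch below is an assumption about what "naive" could mean here, not the package's phrasing step: `naive_phrases` is a hypothetical helper that splits on whitespace following `, ; : . ! ?` and does not use IceNLP or any syntactic analysis.

```python
import re
from typing import List

# Split at whitespace that directly follows a punctuation mark.
# The lookbehind keeps the punctuation attached to the preceding phrase.
_PHRASE_BREAK = re.compile(r"(?<=[,;:.!?])\s+")


def naive_phrases(text: str) -> List[str]:
    """Hypothetical helper: split normalized text into punctuation-delimited phrases."""
    return [p for p in _PHRASE_BREAK.split(text.strip()) if p]
```

This ignores abbreviations and long stretches without punctuation, so it is only a stopgap compared to a tagger-based phrasing step.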
