Please update the blog to fix: "Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription" #1948

Closed
d2a-raudenaerde opened this issue Apr 1, 2024 · 9 comments


@d2a-raudenaerde

I'm trying to follow the "https://huggingface.co/blog/fine-tune-whisper" blog post, so I set up a GPU-supported Jupyter instance.

However, after the first evaluation print (1000 steps), I get this error:

Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass language='en'.

I'm sure there is a workaround, but could you please update the blog with these new settings?

@d2a-raudenaerde
Author

OK, I was too fast. Maybe this is already fixed in #1944?

@pcuenca
Member

pcuenca commented Apr 1, 2024

Yes, that PR was just merged :) Can you give it a try and see if it works for you?

@d2a-raudenaerde
Author

d2a-raudenaerde commented Apr 1, 2024

ValueError: Unsupported language: ('hindi',). Language should be one of: ['english', 'chinese', 'german', 'spanish', 'russian', 'korean', 'french', 'japanese', 'portuguese', 'turkish', 'polish', 'catalan', 'dutch', 'arabic', 'swedish', 'italian', 'indonesian', 'hindi', 'finnish', 'vietnamese', 'hebrew', 'ukrainian', 'greek', 'malay', 'czech', 'romanian', 'danish', 'hungarian', 'tamil', 'norwegian', 'thai', 'urdu', 'croatian', 'bulgarian', 'lithuanian', 'latin', 'maori', 'malayalam', 'welsh', 'slovak', 'telugu', 'persian', 'latvian', 'bengali', 'serbian', 'azerbaijani', 'slovenian', 'kannada', 'estonian', 'macedonian', 'breton', 'basque', 'icelandic', 'armenian', 'nepali', 'mongolian', 'bosnian', 'kazakh', 'albanian', 'swahili', 'galician', 'marathi', 'punjabi', 'sinhala', 'khmer', 'shona', 'yoruba', 'somali', 'afrikaans', 'occitan', 'georgian', 'belarusian', 'tajik', 'sindhi', 'gujarati', 'amharic', 'yiddish', 'lao', 'uzbek', 'faroese', 'haitian creole', 'pashto', 'turkmen', 'nynorsk', 'maltese', 'sanskrit', 'luxembourgish', 'myanmar', 'tibetan', 'tagalog', 'malagasy', 'assamese', 'tatar', 'hawaiian', 'lingala', 'hausa', 'bashkir', 'javanese', 'sundanese', 'cantonese', 'burmese', 'valencian', 'flemish', 'haitian', 'letzeburgesch', 'pushto', 'panjabi', 'moldavian', 'moldovan', 'sinhalese', 'castilian', 'mandarin'].


@d2a-raudenaerde
Author

I only copied the model config part (I searched for 'hindi' in the source):

model.generation_config.language = "hindi",
model.generation_config.task = "transcribe",
model.generation_config.forced_decoder_ids = None

Maybe I forgot it somewhere.. Will check!

@d2a-raudenaerde
Author

Ah, the error shows it is a tuple, ('hindi',), while the list contains 'hindi' as a regular string.

@pcuenca
Member

pcuenca commented Apr 1, 2024

Yes, I think those trailing commas in your code snippet should not be there. This is how the blog shows up for me:

Screenshot 2024-04-01 at 12 37 01
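
For anyone hitting the same error, here is a minimal sketch of why the trailing commas matter. It uses a `SimpleNamespace` as a hypothetical stand-in for `model.generation_config`, so it runs without loading a Whisper model. The attribute names are the ones from the snippet above; everything else is an assumption for illustration.

```python
from types import SimpleNamespace

# Hypothetical stand-in for model.generation_config (no model is loaded here).
generation_config = SimpleNamespace()

# With a trailing comma, Python builds a one-element tuple, not a string.
generation_config.language = "hindi",
broken_language = generation_config.language

# Without the trailing commas, each attribute gets the plain value.
generation_config.language = "hindi"
generation_config.task = "transcribe"
generation_config.forced_decoder_ids = None
fixed_language = generation_config.language

print(type(broken_language).__name__, broken_language)  # tuple ('hindi',)
print(type(fixed_language).__name__, fixed_language)    # str hindi
```

The tuple `('hindi',)` is what shows up in the `ValueError` message, which is why the value does not match the plain string `'hindi'` in the supported-language list.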

@d2a-raudenaerde
Author

Oh, I see, it is a copy-paste error :S

@d2a-raudenaerde
Author

d2a-raudenaerde commented Apr 1, 2024

OK, now it gets further. I reduced the evaluation steps to 100 to see errors sooner. However, the output seems empty?

@d2a-raudenaerde
Author

OK, I see this is to be expected: the second bar is the 'evaluation' process. It seems to work, as I now have a first evaluation after 100 steps. (Evaluating this often is not recommended, as one evaluation takes about 10 minutes on my system.)
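
As a rough back-of-the-envelope sketch of that cost: the 10 minutes per evaluation and the 100 and 1000 step intervals come from this thread, while the total of 4000 training steps is an assumption and should be replaced with your own `max_steps`. The helper name is hypothetical.

```python
def evaluation_overhead_minutes(max_steps: int, eval_steps: int,
                                minutes_per_eval: float) -> float:
    """Total minutes spent evaluating when evaluation runs every `eval_steps` steps."""
    num_evaluations = max_steps // eval_steps
    return num_evaluations * minutes_per_eval


MAX_STEPS = 4000          # assumption: total training steps, adjust to your run
MINUTES_PER_EVAL = 10.0   # from the thread: about 10 minutes per evaluation

every_100 = evaluation_overhead_minutes(MAX_STEPS, 100, MINUTES_PER_EVAL)
every_1000 = evaluation_overhead_minutes(MAX_STEPS, 1000, MINUTES_PER_EVAL)

print(f"eval every 100 steps:  {every_100:.0f} minutes of evaluation")
print(f"eval every 1000 steps: {every_1000:.0f} minutes of evaluation")
```

Under these assumptions, evaluating every 100 steps spends ten times as long on evaluation as evaluating every 1000 steps, so the short interval is mainly useful as a quick smoke test.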
