How can I use the fine-tune code for Vietnamese? #221

Closed
PhamDangNguyen opened this issue Oct 23, 2024 · 5 comments

Comments

@PhamDangNguyen

I want to use this code to train a model for Vietnamese. I am using phonemes for the vocab. How can I get such a vocab?

@vu-the-dung

vu-the-dung commented Oct 23, 2024

#57 (comment)
This is a good starting point, I'm currently following this to train for Vietnamese.

You'll need to edit vocab.txt to include the missing Vietnamese characters, mainly accented vowels in both uppercase and lowercase (ắ, ấ, ồ, Ố...). Put them in place of unused characters (such as Chinese or Korean characters).

Edit convert_char_to_pinyin in /model/utils.py so that it only converts a string to a character array, like:

def convert_char_to_pinyin(text_list, polyphone=True):
    final_text_list = [char for char in text_list if char not in "。,、;:?!《》【】—…:;?!\"()[]{}"]
    return final_text_list
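
If the callers pass a batch (a list of strings) and expect one character list per string, the snippet above would need to loop over each string. That is an assumption about the surrounding code, not something confirmed in this thread. A minimal sketch under that assumption, with the punctuation string copied from the snippet above and `PUNCTUATION` as a name introduced only for illustration:

```python
# Punctuation to drop, copied unchanged from the snippet above.
PUNCTUATION = "。,、;:?!《》【】—…:;?!\"()[]{}"


def convert_char_to_pinyin(text_list, polyphone=True):
    # Assumption: text_list is a list of strings (one per utterance).
    # `polyphone` is kept only so existing call sites keep working; it is unused.
    return [[ch for ch in text if ch not in PUNCTUATION] for text in text_list]
```

Each input string becomes its own list of characters, so the output has the same length as the input batch.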

@PhamDangNguyen
Author

Could you provide me with your Vietnamese vocab.txt? I haven't set it up yet, and I really need it for testing.

@vu-the-dung

I don't have access to my training PC right now. Basically, you can make a list of all uppercase and lowercase Vietnamese vowels and their accented combinations ['a', 'á', 'à', 'ã', 'ạ', 'ả', 'A', 'Á', 'À', 'Ã', 'Ạ', 'Ả',...], check whether each one already exists in vocab.txt, and then replace some of the unused characters in vocab.txt with the missing Vietnamese ones. It should be easy to write a script to do this.
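
The script described above could look roughly like the sketch below. It is not the script used for the training run mentioned here. It assumes vocab.txt holds one token per line, and it treats Hangul syllables as the "unused" entries; that choice, along with the helper names `vietnamese_vowels`, `is_hangul_syllable`, and `patch_vocab`, is hypothetical and should be adapted to your own data.

```python
import unicodedata

# The 12 Vietnamese base vowels, normalized so each is a single code point.
BASE_VOWELS = unicodedata.normalize("NFC", "aăâeêioôơuưy")
# Combining marks: none, acute, grave, hook above, tilde, dot below.
TONE_MARKS = ["", "\u0301", "\u0300", "\u0309", "\u0303", "\u0323"]


def vietnamese_vowels():
    """Return every lowercase and uppercase vowel + tone combination (NFC)."""
    lower = [unicodedata.normalize("NFC", v + t) for v in BASE_VOWELS for t in TONE_MARKS]
    return lower + [c.upper() for c in lower]


def is_hangul_syllable(token):
    # Assumption: Hangul syllables are unused when training only on Vietnamese.
    return len(token) == 1 and "\uac00" <= token <= "\ud7a3"


def patch_vocab(tokens, needed, replaceable=is_hangul_syllable):
    """Overwrite replaceable tokens with the needed characters that are missing.

    The list length and the position of every other token stay unchanged.
    """
    present = set(tokens)
    missing = [c for c in needed if c not in present]
    patched = list(tokens)
    slots = [i for i, tok in enumerate(patched) if replaceable(tok)]
    if len(slots) < len(missing):
        raise ValueError(f"need {len(missing)} slots, found {len(slots)}")
    for i, char in zip(slots, missing):
        patched[i] = char
    return patched


# Hypothetical usage; the path is a placeholder. Only the newline is stripped
# so that a token consisting of a single space survives.
# with open("vocab.txt", encoding="utf-8") as f:
#     tokens = [line.rstrip("\n") for line in f]
# tokens = patch_vocab(tokens, vietnamese_vowels())
# with open("vocab.txt", "w", encoding="utf-8") as f:
#     f.write("\n".join(tokens) + "\n")
```

Overwriting entries in place, rather than appending, keeps the vocab size the same, which matches the replace-unused-characters approach suggested above. Training text should be normalized to NFC as well, otherwise accented vowels typed as base letter plus combining mark will not match the vocab entries.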

@PhamDangNguyen
Author

Thank you for your help. This is very useful to me.

@khoavietcode

Where can I find the Vietnamese version?
