How can I use the fine-tune code for Vietnamese? #221

PhamDangNguyen · 2024-10-23T04:54:04Z

I want to use the fine-tuning code to train a model for Vietnamese. I'm using phonemes for the vocab. How can I do that?

vu-the-dung · 2024-10-23T06:17:53Z

#57 (comment)
This is a good starting point; I'm currently following it to train for Vietnamese.

You'll need to edit vocab.txt to include the missing Vietnamese characters, mainly the accented vowels in both uppercase and lowercase (ắ, ấ, ồ, Ố...). Put them in place of unused characters (such as Chinese or Korean characters).

Edit convert_char_to_pinyin in /model/utils.py so that it only converts each string to a character array, like:

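def convert_char_to_pinyin(text_list, polyphone=True):
    # Skip the pinyin step: split each text into characters, dropping this punctuation.
    # Assumes text_list is a list of strings, one per utterance.
    skip = "。，、；：？！《》【】—…:;?!\"()[]{}"
    final_text_list = [[char for char in text if char not in skip] for text in text_list]
    return final_text_list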
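
One thing to watch when splitting Vietnamese text into characters: if the training text is stored in decomposed Unicode form, an accented vowel becomes a base letter plus separate combining marks, and those marks will not match the precomposed entries in vocab.txt. Below is a minimal sketch of the same splitting idea with NFC normalization added. `split_to_chars` and `SKIP` are hypothetical names used only for illustration, and the input is assumed to be a list of strings.

```python
import unicodedata

# Punctuation dropped by the snippet above.
SKIP = "。，、；：？！《》【】—…:;?!\"()[]{}"


def split_to_chars(text_list):
    """Hypothetical helper: NFC-normalize each text, then split it into characters."""
    result = []
    for text in text_list:
        # Compose base letter + combining marks into single code points.
        text = unicodedata.normalize("NFC", text)
        result.append([char for char in text if char not in SKIP])
    return result
```

The same normalization would need to be applied to the characters written into vocab.txt so both sides agree.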

PhamDangNguyen · 2024-10-23T06:57:48Z

Could you provide me with your Vietnamese vocab.txt? I haven't set it up yet, and I really need it for testing.

vu-the-dung · 2024-10-23T07:14:07Z

I don't have access to my training PC right now. Basically, you can make a list of all uppercase and lowercase Vietnamese vowels and their accented combinations ['a', 'á', 'à', 'ã', 'ạ', 'ả', 'A', 'Á', 'À', 'Ã', 'Ạ', 'Ả',...], check whether each one already exists in vocab.txt, and then replace some of the unused characters in vocab.txt with the missing Vietnamese ones. It should be easy to write a script to do this.
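
A rough sketch of such a script, not the one used in this thread. It assumes vocab.txt holds one token per line, treats single characters in the CJK Unified Ideographs block as the "unused" entries, and overwrites them starting from the end of the file; all three are assumptions, and the function names are hypothetical. It also adds đ/Đ, which the list above does not mention.

```python
import unicodedata

# Base vowels and the tone marks as combining characters:
# none, acute, grave, hook above, tilde, dot below.
BASE_VOWELS = unicodedata.normalize("NFC", "aăâeêioôơuưy")
TONE_MARKS = ["", "\u0301", "\u0300", "\u0309", "\u0303", "\u0323"]


def vietnamese_chars():
    """Every lowercase/uppercase vowel with every tone, plus đ/Đ, in NFC form."""
    chars = []
    for base in BASE_VOWELS:
        for mark in TONE_MARKS:
            lower = unicodedata.normalize("NFC", base + mark)
            chars.extend([lower, lower.upper()])
    chars.extend(["đ", "Đ"])
    return chars


def is_cjk(token):
    """True for one character in the CJK Unified Ideographs block (assumed unused)."""
    return len(token) == 1 and "\u4e00" <= token <= "\u9fff"


def patch_vocab(tokens, wanted):
    """Overwrite CJK entries, last first, with the wanted characters that are missing."""
    present = set(tokens)
    missing = [c for c in wanted if c not in present]
    slots = [i for i in reversed(range(len(tokens))) if is_cjk(tokens[i])]
    if len(slots) < len(missing):
        raise ValueError("not enough replaceable entries in the vocabulary")
    patched = list(tokens)
    for slot, char in zip(slots, missing):
        patched[slot] = char
    return patched, missing


def patch_vocab_file(src, dst):
    """Read src (one token per line), patch it, write dst, return the added characters."""
    with open(src, encoding="utf-8") as f:
        # Strip only the newline so a token that is a single space survives.
        tokens = [line.rstrip("\n") for line in f]
    patched, missing = patch_vocab(tokens, vietnamese_chars())
    with open(dst, "w", encoding="utf-8") as f:
        f.write("\n".join(patched) + "\n")
    return missing
```

Calling `patch_vocab_file("vocab.txt", "vocab_vi.txt")` (both paths are placeholders) writes a patched copy with the same number of lines and returns the characters it had to add, so the result can be checked by hand before training.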

PhamDangNguyen · 2024-10-23T07:16:53Z

Thank you for your help. This is very useful to me.

khoavietcode · 2025-01-04T18:54:53Z

Where can I find the Vietnamese version?

PhamDangNguyen closed this as completed Oct 25, 2024
