task: Pretraining dataset for Multilingual Quantizer training (Phase 1) #118
Comments
These datasets will be stored on S3 (useful for connecting RunPod later) and on disk on the A6000 cluster, for training the Quantizer Whisper Encoder and Ichigo.
Downloaded the full Vietnamese datasets from HF and uploaded them to S3. Please check!
Is this a duplicate?
Can we add the license information next to each dataset? I am concerned that some of the datasets we are using have non-commercial licenses.
Just updated the license information for each dataset. I think this issue is done and we can close it.
All good, thanks @tuanlda78202
Goal
Gather all available open-source data for the first multilingual training run. This run will focus mostly on Vietnamese datasets.
Tasklist
Multi-lingual Base
Vietnamese
Singlish