task: Pretraining dataset for Multilingual Quantizer training (Phase 1) #118

Closed
9 tasks done
Tracked by #116
hahuyhoang411 opened this issue Nov 19, 2024 · 6 comments
Labels
type: epic A major feature or initiative

Comments

@hahuyhoang411
Contributor

hahuyhoang411 commented Nov 19, 2024

Goal

Gather all the available open-source data for the first multilingual training run. This test will focus mostly on Vietnamese datasets.

Tasklist

Multi-lingual Base

Vietnamese

  • ViVoice (1000+ hours, CC-BY-NC-SA-4.0)
  • BUD500 (500+ hours, CC-BY-NC-SA-4.0)
  • VLSP (100 hours)

Singlish

@tuanlda78202
Contributor

These datasets will be stored on S3 (which will be useful for connecting RunPod later) and on disk on the A6000 cluster for training the Quantizer Whisper Encoder and Ichigo.

@tuanlda78202
Contributor

tuanlda78202 commented Nov 22, 2024

Downloaded the full Vietnamese datasets from HF and uploaded them to S3. Please check!

@hiento09 hiento09 added this to Menlo Nov 22, 2024
@github-project-automation github-project-automation bot moved this to Investigating in Menlo Nov 22, 2024
@tikikun tikikun moved this from Investigating to In Progress in Menlo Nov 25, 2024
@tikikun
Collaborator

tikikun commented Nov 25, 2024

Is this a duplicate?

@dan-menlo dan-menlo changed the title task: Dataset Collection & Preprocessing task: Multilingual Dataset Collection & Preprocessing Nov 27, 2024
@dan-menlo dan-menlo changed the title task: Multilingual Dataset Collection & Preprocessing task: Multilingual Dataset Collection & Preprocessing (Phase 1) Nov 27, 2024
@dan-menlo
Contributor

Can we add the license information next to each dataset? I am concerned that we are using some datasets that are licensed for non-commercial use only.
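
One way to make this check repeatable is a small script over the tasklist. The following is only a minimal sketch: the dataset names, hours, and licenses are copied from the tasklist in this issue, while `DatasetEntry`, `license_status`, and `needs_review` are hypothetical helpers invented for illustration. The heuristic assumes Creative Commons style identifiers, where an `NC` element marks a non-commercial clause. A result of `no-nc-clause` only means the identifier has no `NC` element, not that commercial use has been cleared.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DatasetEntry:
    """One row of the tasklist (hypothetical structure, for illustration)."""
    name: str
    hours: str
    license: Optional[str]  # None when the tasklist states no license


# Entries copied from the tasklist in this issue.
DATASETS = [
    DatasetEntry("ViVoice", "1000+", "CC-BY-NC-SA-4.0"),
    DatasetEntry("BUD500", "500+", "CC-BY-NC-SA-4.0"),
    DatasetEntry("VLSP", "100", None),
]


def license_status(license_id: Optional[str]) -> str:
    """Classify a license identifier with a simple heuristic.

    Assumption: identifiers follow the Creative Commons pattern, where
    an "NC" element marks a non-commercial clause.
    """
    if license_id is None:
        return "unknown"
    elements = license_id.upper().split("-")
    return "non-commercial" if "NC" in elements else "no-nc-clause"


def needs_review(datasets: List[DatasetEntry]) -> List[str]:
    """Return names of datasets whose license is non-commercial or unknown."""
    return [d.name for d in datasets if license_status(d.license) != "no-nc-clause"]


if __name__ == "__main__":
    for d in DATASETS:
        print(f"{d.name}: {d.hours} hours, license status: {license_status(d.license)}")
    print("Needs review:", ", ".join(needs_review(DATASETS)))
```

Under this heuristic, datasets with no stated license are flagged alongside the non-commercial ones, since an unstated license cannot be assumed to permit commercial use.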

@dan-menlo dan-menlo changed the title task: Multilingual Dataset Collection & Preprocessing (Phase 1) task: Pretraining dataset for Multilingual (Phase 1) Nov 27, 2024
@dan-menlo dan-menlo changed the title task: Pretraining dataset for Multilingual (Phase 1) task: Pretraining dataset for Multilingual Quantizer training (Phase 1) Nov 27, 2024
@tuanlda78202
Contributor

tuanlda78202 commented Nov 28, 2024

Just updated license information for each dataset. I think this issue is done, and we can close.

@hahuyhoang411 hahuyhoang411 moved this from In Progress to Completed in Menlo Nov 29, 2024
@hahuyhoang411
Contributor Author

All good, thanks @tuanlda78202

@github-project-automation github-project-automation bot moved this from Completed to Review + QA in Menlo Nov 29, 2024
@hahuyhoang411 hahuyhoang411 moved this from Review + QA to Completed in Menlo Nov 29, 2024
4 participants