task: Pretraining dataset for Multilingual Quantizer training (Phase 1) #118

hahuyhoang411 · 2024-11-19T16:54:34Z

Goal

Gather all the openly available (open-source) data for the first multilingual training run. This test will focus mostly on Vietnamese datasets.

Tasklist

Multi-lingual Base

LibriSpeech (CC-BY-4.0), or check parler-tts for the cleaned version
Mozilla Common Voice (CC-0)
GigaSpeech2 Refined (Vie, Thai & Indo, Apache-2.0)
FLEURS (CC-BY-4.0)
Emilia (CC-BY-NC-4.0)
...

Vietnamese

ViVoice (1000+ hours, CC-BY-NC-SA-4.0)
BUD500 (500+ hours, CC-BY-NC-SA-4.0)
VLSP (100 hours)

Singlish

National Speech Corpus (Singapore Open Data License)

tuanlda78202 · 2024-11-20T16:15:10Z

These datasets will be stored on S3 (which will be useful for connecting RunPod later) and also saved to disk on the A6000 cluster for training the Quantizer Whisper Encoder and Ichigo.

tuanlda78202 · 2024-11-22T02:38:48Z

Downloaded the full Vietnamese datasets from HF and uploaded them to S3. Please check!
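For anyone repeating this step, here is a minimal sketch of mirroring one Hugging Face dataset repository to S3. It is not the script used for this issue. The function names (`s3_key_for`, `mirror_dataset`), the repo ID, the bucket, and the key prefix are all hypothetical placeholders, and it assumes `huggingface_hub` and `boto3` are installed with credentials already configured.

```python
from pathlib import Path, PurePosixPath


def s3_key_for(local_file, local_root, prefix):
    """Build the S3 object key for a file, keeping its path relative to local_root."""
    rel = Path(local_file).relative_to(local_root)
    return str(PurePosixPath(prefix, *rel.parts))


def mirror_dataset(repo_id, bucket, prefix, local_root):
    """Download a dataset repo from the Hugging Face Hub, then upload every file to S3.

    All arguments are placeholders chosen by the caller (assumption: nothing
    here reflects the actual bucket layout used for this issue).
    """
    # Third-party imports are kept local so s3_key_for works without them.
    import boto3
    from huggingface_hub import snapshot_download

    local_dir = Path(local_root) / repo_id.replace("/", "__")
    snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=str(local_dir))

    s3 = boto3.client("s3")
    for path in sorted(local_dir.rglob("*")):
        # Skip directories and the Hub's local download metadata folder.
        if not path.is_file() or ".cache" in path.parts:
            continue
        s3.upload_file(str(path), bucket, s3_key_for(path, local_dir, prefix))


if __name__ == "__main__":
    # Hypothetical values; replace with a real repo ID, bucket, and prefix.
    mirror_dataset("some-org/some-vietnamese-dataset", "my-bucket", "datasets/vi", "/tmp/hf")
```

Keeping the relative path inside the key means the S3 layout mirrors the repository layout, so the same files can later be synced back to the cluster disks without renaming.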

tikikun · 2024-11-25T03:30:21Z

Is this a duplicate?

dan-menlo · 2024-11-27T06:25:46Z

Can we add the license information next to each dataset? I am concerned that some of the datasets we are using are licensed for non-commercial use only.
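One way to check for this concern is a small audit over the task list. The sketch below copies the dataset names and license strings from the issue description as it currently stands. The `is_non_commercial` rule (treating an `NC` component in the license ID as non-commercial) is a simplifying assumption, and it is not a substitute for reading the license texts.

```python
# Licenses as listed in the task list of this issue; None means none was stated.
DATASETS = {
    "LibriSpeech": "CC-BY-4.0",
    "Mozilla Common Voice": "CC-0",
    "GigaSpeech2 Refined": "Apache-2.0",
    "FLEURS": "CC-BY-4.0",
    "Emilia": "CC-BY-NC-4.0",
    "ViVoice": "CC-BY-NC-SA-4.0",
    "BUD500": "CC-BY-NC-SA-4.0",
    "VLSP": None,
    "National Speech Corpus": "Singapore Open Data License",
}


def is_non_commercial(license_id):
    """Heuristic (assumption): a hyphen-separated 'NC' component marks a non-commercial license."""
    if license_id is None:
        return False
    return "NC" in license_id.upper().split("-")


def audit(datasets):
    """Group dataset names into non-commercial, unspecified, and everything else."""
    report = {"non_commercial": [], "unspecified": [], "other": []}
    for name, license_id in datasets.items():
        if license_id is None:
            report["unspecified"].append(name)
        elif is_non_commercial(license_id):
            report["non_commercial"].append(name)
        else:
            # "other" only means no NC marker was found, not that use is cleared.
            report["other"].append(name)
    return report


if __name__ == "__main__":
    for group, names in audit(DATASETS).items():
        print(f"{group}: {', '.join(names) or '-'}")
```

Datasets in the `unspecified` and `other` groups still need a manual review, since the heuristic only recognizes Creative Commons style identifiers.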

tuanlda78202 · 2024-11-28T05:19:21Z

Just updated license information for each dataset. I think this issue is done, and we can close.

hahuyhoang411 · 2024-11-29T06:49:36Z

All good, thanks @tuanlda78202

hahuyhoang411 added the "type: epic" label (A major feature or initiative) Nov 19, 2024

hahuyhoang411 assigned tuanlda78202 Nov 19, 2024

hahuyhoang411 mentioned this issue Nov 19, 2024

milestone: Ichigo v0.5 Multi-lingual #116

hiento09 added this to Menlo Nov 22, 2024

github-project-automation bot moved this to Investigating in Menlo Nov 22, 2024

tikikun moved this from Investigating to In Progress in Menlo Nov 25, 2024

hahuyhoang411 added this to the Ichigo v0.5 - Multilingual milestone Nov 25, 2024

dan-menlo changed the title ~~task: Dataset Collection & Preprocessing~~ task: Multilingual Dataset Collection & Preprocessing Nov 27, 2024

dan-menlo changed the title ~~task: Multilingual Dataset Collection & Preprocessing~~ task: Multilingual Dataset Collection & Preprocessing (Phase 1) Nov 27, 2024

dan-menlo changed the title ~~task: Multilingual Dataset Collection & Preprocessing (Phase 1)~~ task: Pretraining dataset for Multilingual (Phase 1) Nov 27, 2024

dan-menlo changed the title ~~task: Pretraining dataset for Multilingual (Phase 1)~~ task: Pretraining dataset for Multilingual Quantizer training (Phase 1) Nov 27, 2024

hahuyhoang411 moved this from In Progress to Completed in Menlo Nov 29, 2024

hahuyhoang411 closed this as completed Nov 29, 2024

github-project-automation bot moved this from Completed to Review + QA in Menlo Nov 29, 2024

hahuyhoang411 moved this from Review + QA to Completed in Menlo Nov 29, 2024
