These are the scripts used by the Language Technology Unit for creating alternative training and test sets from the Mozilla Common Voice corpus.

techiaith/docker-commonvoice-custom-splits-builder

Common Voice Custom Splits Builder

Introduction

These scripts are used by the Language Technology Unit to create alternative training and test sets from the Mozilla Common Voice corpus. They were developed primarily for Welsh (and English) models, but it should be possible to adapt them for other languages.

How to use

A Mac or Linux computer with Docker installed is required to use these scripts.

You will need to download the Welsh and English sets from the Common Voice website and host them on your own HTTP server (ideally a local, private one). The addresses and some other details need to be entered in a python/data_urls.py file; the python/data_urls.template.py file can be copied and modified for this.
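
Before building the environment, it can help to sanity-check the addresses you intend to put in python/data_urls.py. The sketch below is only an illustration: the dictionary name, its keys, the host and the English archive name are hypothetical, and the real variable names and fields are whatever python/data_urls.template.py defines. It checks only that each address is a well-formed http(s) URL pointing at a .tar.gz file and makes no network request.

```python
from urllib.parse import urlparse

# Hypothetical entries for illustration only. The real variable names and
# fields are defined in python/data_urls.template.py and may differ.
EXAMPLE_DATA_URLS = {
    "cy": "http://localhost:8000/cv-corpus-11.0-2022-09-21-cy.tar.gz",
    "en": "http://localhost:8000/cv-corpus-11.0-2022-09-21-en.tar.gz",
}


def find_bad_urls(urls):
    """Return the keys whose value is not an http(s) URL to a .tar.gz file."""
    bad = []
    for key, url in urls.items():
        parts = urlparse(url)
        ok = (
            parts.scheme in ("http", "https")
            and bool(parts.netloc)
            and parts.path.endswith(".tar.gz")
        )
        if not ok:
            bad.append(key)
    return bad
```

Calling `find_bad_urls(EXAMPLE_DATA_URLS)` returns an empty list when every entry passes the check.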

Once all of the above is in place, the environment can be built and started with:

$ make

$ make run

Within the Docker environment, run the following command to download the complete Common Voice sets:

# python3 download_commonvoice.py --target_dir /data/download_commonvoice

The Common Voice data will be found in /data/commonvoice, along with a cv.db file that contains the metadata for all the files.
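
To see what the metadata file holds, something like the following could be run inside the container. This is a sketch under one important assumption that the README does not confirm: that cv.db is a SQLite database. The function name `list_tables` is made up for this example.

```python
import os
import sqlite3


def list_tables(db_path):
    """Return {table name: row count}, ASSUMING db_path is a SQLite file."""
    # sqlite3.connect would silently create a missing file, so check first.
    if not os.path.exists(db_path):
        raise FileNotFoundError(db_path)
    con = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in con.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Table names come from the database itself; quote them as identifiers.
        return {
            name: con.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        con.close()
```

For example, `list_tables("/data/commonvoice/cv.db")` would list each table with its row count if the assumption holds; if the file is in another format, sqlite3 raises a `DatabaseError` instead.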

Build alternative sets

To build sets for Welsh speech recognition models, use:

# python3 build.py

To build training sets for bilingual models, use:

# python build_biling.py

Both scripts create new .tsv files within your Common Voice folder under /data.
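
A quick way to check a newly created split is to count its clips and distinct speakers. The sketch below assumes the new files keep the standard Common Voice .tsv layout, with a header row that includes a `client_id` column; the function name and the path in the usage note are illustrative only.

```python
import csv
from collections import Counter


def summarise_split(tsv_path):
    """Count clips and distinct speakers in a Common Voice style .tsv file."""
    speakers = Counter()
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quotation marks inside sentences from being
        # interpreted as CSV field quoting.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            speakers[row["client_id"]] += 1
    return {"clips": sum(speakers.values()), "speakers": len(speakers)}
```

Calling `summarise_split` on one of the generated .tsv files returns a dictionary with the two counts, which makes it easy to compare the alternative splits against the official ones.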

You can create your own .tar.gz file containing the .tsv and mp3 files. For example:

/data/commonvoice/CV11_CY# tar zcvf cv-corpus-11.0-2022-09-21-cy.tar.gz cv-corpus-11.0-2022-09-21/
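
If you would rather build the archive from Python, the standard `tarfile` module can produce an equivalent gzip-compressed file. This is a minimal sketch, not part of the repository's scripts; the function name `make_archive` is hypothetical.

```python
import tarfile
from pathlib import Path


def make_archive(corpus_dir, archive_path):
    """Pack corpus_dir into a gzip-compressed tar, similar to `tar zcf`."""
    corpus_dir = Path(corpus_dir)
    with tarfile.open(str(archive_path), "w:gz") as archive:
        # arcname keeps only the top-level folder name inside the archive,
        # matching what tar does when run from the parent directory.
        archive.add(str(corpus_dir), arcname=corpus_dir.name)
    return archive_path
```

Run from /data/commonvoice/CV11_CY, `make_archive("cv-corpus-11.0-2022-09-21", "cv-corpus-11.0-2022-09-21-cy.tar.gz")` would mirror the tar command above.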
