Click here to read this in English
Dyma'r sgriptiau a ddefnyddir gan Uned Technolegau Iaith ar gyfer creu setiau hyfforddi a phrofi amgen o gorpws Mozilla Common Voice. Mae'r sgriptiau wedi eu datblygu yn bennaf ar gyfer modelau Cymraeg (a Saesneg) ond dyle bod modd i'w haddasu ar gyfer ieithoedd eraill.
Mae angen cyfrifiadur Mac neu Linux gyda Docker wedi'i osod er mwyn defnyddio'r sgriptiau hyn.
Byddwch angen llwytho i lawr setiau Cymraeg a Saesneg o wefan Common Voice a'i osod ar weinydd HTTP (yn delfrydol un lleol a phreifat) eich hunain. Mae angen rhoi'r cyfeiriadau a rhai manylion eraill o fewn ffeil python/data_urls.py
- gellir copio ac addasu'r ffeil python/data_urls.template.py
.
Wedi i chi osod yr uchod i gyd yn eu le, yna mae modd adeiladu'r amgylchedd drwy...
$ make
$ make run
O fewn yr amgylchedd docker, rhedwch y gorchymyn canlynol i llwytho i lawr setiau cyfan Common Voice.
# python3 download_commonvoice.py --target_dir /data/download_commonvoice
Bydd data Common Voice i'w weld yn /data/commonvoice
yn ogystal â ffeil cv.db
sy'n cynnwys metadata'r oll ffeiliau.
I adeiladu setiau modelau adnabod lleferydd Cymraeg, dylid defnyddio:
# python3 build.py
I adeiladu setiau hyfforddi modelau ddwyieithog, dylid defnyddio:
# python build_biling.py
Mae'r ddwy ffeil sgript uchod yn creu ffeiliau .tsv
newydd o fewn eich ffolder Common voice o dan /data
Mae modd creu ffeil .tar.gz
eich hunain i gynnwys y ffeiliau .tsv
ac mp3 . E.e.
/data/commonvoice/CV11_CY# tar zcvf cv-corpus-11.0-2022-09-21-cy.tar.gz cv-corpus-11.0-2022-09-21/
A Mac or Linux computer with Docker installed is required to use these scripts.
You will need to download the Welsh and English sets from the Common Voice website and install it on your own (ideally local and private) HTTP server. The addresses and some other details need to be entered within a python/data_urls.py
file - the python/data_urls.template.py
file can be copied and modified.
Once you have all the above in place, then the environment can be built by...
$ make
$ make run
Within the docker environment, run the following command to download the entire Common Voice sets.
# python3 download_commonvoice.py --target_dir /data/download_commonvoice
Common Voice data will be found in /data/commonvoice
as well as a cv.db
file which contains the metadata of all the files.
To build sets of Welsh speech recognition models, you should use:
# python3 build.py
To build training sets for bilingual models, use should be made of:
# python build_billing.py
The two script files above create new .tsv
files within your Common voice folder under /data
It is possible to create your own .tar.gz
file to contain the .tsv
and mp3 files. E.g.
/data/commonvoice/CV11_CY# tar zcvf cv-corpus-11.0-2022-09-21-cy.tar.gz cv-corpus-11.0-2022-09-21/