In-depth Eyra Data Collection Guidelines
When collecting audio recordings from users, it is important to give them a prompt to read. Make an SQL list of prompts in the chosen language. Ideally, the prompts cover as many of the sounds in the language as possible. It is also wise to include the organization's name and the names of the people working on the collection in the prompt list, so that speech systems trained on the data can recognize or pronounce those names correctly.
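Choosing prompts that cover many sounds can be approximated with a greedy selection. The sketch below is only an illustration and is not part of Eyra: it uses letters as a rough stand-in for sounds (a real setup would use a pronunciation lexicon), and the function names `letters` and `select_covering_prompts` are hypothetical.

```python
def letters(text):
    """Return the set of alphabetic characters in a prompt (a crude proxy for sounds)."""
    return {ch for ch in text.lower() if ch.isalpha()}


def select_covering_prompts(candidates, max_prompts):
    """Greedily pick prompts that each add the most letters not yet covered.

    Stops when max_prompts is reached or no remaining prompt adds coverage.
    """
    covered = set()
    chosen = []
    remaining = list(candidates)
    while remaining and len(chosen) < max_prompts:
        # Pick the candidate contributing the most unseen letters.
        best = max(remaining, key=lambda p: len(letters(p) - covered))
        if not letters(best) - covered:
            break  # nothing left improves coverage
        chosen.append(best)
        covered |= letters(best)
        remaining.remove(best)
    return chosen, covered
```

For example, `select_covering_prompts(["aaa", "abc", "cd", "ab"], 3)` picks `"abc"` first and then `"cd"`, and stops because the remaining prompts add no new letters.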
The following scripts help turn regular text into tokens which can be inserted into the database and used in the Eyra app.
- Make_Tokens_From_Text
- tokens_to_sql.py
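To give a feel for what such a conversion involves, here is a minimal sketch of turning text into tokens and then into SQL. It is not the code of the scripts above. The table name `token` and the column name `inputToken` are assumptions made for illustration, and the escaping assumes MySQL-style string literals; check the actual schema in Backend/db/ before using anything like this.

```python
import re


def text_to_tokens(text):
    """Return one prompt per non-empty line, whitespace-normalized and de-duplicated."""
    tokens = []
    seen = set()
    for line in text.splitlines():
        token = re.sub(r"\s+", " ", line).strip()
        if token and token not in seen:
            seen.add(token)
            tokens.append(token)
    return tokens


def tokens_to_sql(tokens, table="token", column="inputToken"):
    """Render one INSERT statement per token.

    The default table and column names are hypothetical placeholders.
    """
    statements = []
    for token in tokens:
        # Escape backslashes and single quotes (MySQL-style literals assumed).
        escaped = token.replace("\\", "\\\\").replace("'", "''")
        statements.append(f"INSERT INTO {table} ({column}) VALUES ('{escaped}');")
    return "\n".join(statements)
```

The resulting text can be saved to a .sql file and loaded in whichever way the two situations described next require.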
If Eyra is already up and running, add the tokens to the database with a script similar to add_extra_2016-04-28.sh.
Otherwise, add the token file to the Eyra software in Backend/db/ and to erase_and_rewind.sql.
The tokens should now be in the database, where the Eyra app can fetch them.
- Insert the agreement into the recording_agreement table in the database.
- When adding the agreement to the app, in Frontend/da-webapp/src/view/recording-agreement.html, change the agreement-id to match the agreement's number in the database.
- Label the phones with stickers (e.g. A1, A2, etc) until the devices themselves can be registered and properly tracked within Eyra by the field administrators
- Before recording, users should read Docs/UserGuideInstructions.pdf.
- While recording, users should follow the instructions in DataUploadingInstructions.pdf.
- After each collection on the same phone, administrators make sure all audio is synced to the database and verify that at least 1000 prompts are loaded on the phone. If fewer are loaded, they go to the admin panel and select Get tokens(dev).
During a collection, administrators can act as evaluators, judging whether the audio is good or whether the collection procedure needs to change.
- Adults can consent in the app if Consent is set to true.
- Children will need a slightly different procedure.
About a week after the data collection has started (~500 hrs), run marosijo with Kaldi. Then use the app with quality assurance, too.
During and after the collection, people evaluate the quality of the recordings using the Evaluate tab in the menu. Test sets and training sets need to be kept separate, since they are evaluated differently. All of this assumes that the data is being collected for machine-learning-based ASR.
Sanitize the data of any identifying personal details.
- run backup_db_and_recs
- removePersonalInformation can do the majority of the sanitization.
If you want the data to be usable for ASR and TTS, create a TSV file with the fileid, speakerId, and prompt for each recording. eyraDataToTsv.py will do that and more.
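The basic shape of such a file can be sketched with Python's standard csv module. This is not the code of eyraDataToTsv.py. The function name `write_tsv` and the example values are hypothetical, and the sketch assumes the three fields are already available as tuples.

```python
import csv
import io


def write_tsv(rows, out):
    """Write (fileid, speakerId, prompt) tuples to a file-like object as TSV."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["fileid", "speakerId", "prompt"])  # header row
    for file_id, speaker_id, prompt in rows:
        writer.writerow([file_id, speaker_id, prompt])


# Example with made-up values, written to an in-memory buffer.
buffer = io.StringIO()
write_tsv([("rec_0001", "spk_01", "hello world")], buffer)
tsv_text = buffer.getvalue()
```

When writing to a real file, open it with `newline=""` and an explicit encoding such as UTF-8 so that non-ASCII prompts are preserved.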
Open source the data on OpenSLR.org and malfong.is. Consider distributing it on other platforms, too.