TUBELEX is a multi-lingual YouTube subtitle corpus. It currently provides data for Chinese, English, Indonesian, Japanese, and Spanish.
Word frequencies in TUBELEX approximate everyday language exposure comparably to, and often better than, other resources such as written corpora, Wikipedia, or film subtitle corpora (SUBTLEX, OpenSubtitles).
You may find TUBELEX useful for NLP applications that model human familiarity with words, e.g. readability assessment, text simplification, or language learning. TUBELEX log-frequencies are highly correlated with psycholinguistic data (lexical decision times, word familiarity) and lexical complexity.
Read our paper for more details:
```bibtex
@inproceedings{nohejl-etal-2025-beyond,
    title = "Beyond {Film Subtitles}: Is {YouTube} the {Best Approximation} of {Spoken Vocabulary}?",
    author = "Nohejl, Adam and Hudi, Frederikus and Kardinata, Eunike Andriani and Ozaki, Shintaro and Riera Machin, Maria Angelica and Sun, Hongyu and Vasselli, Justin and Watanabe, Taro",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
    year = "2025",
    url = "https://aclanthology.org/2025.coling-main.641/"
}
```
This repository provides the full source code for the project and the word frequency lists. We also provide the following models on Hugging Face Hub:

Note that the full text of the corpus cannot be published for copyright reasons. To enable the use of TUBELEX in a wide range of applications, we offer the frequency lists in multiple variants and the two above-mentioned types of basic language models. The frequency lists also include frequencies by video category and dispersion (range or “contextual diversity”).
Frequency lists in the default tokenization and normalization for the impatient:
All TUBELEX frequency files are TSV files compressed with LZMA (`xz`) with the following columns:

- `word` – the word (see below for tokenization/lemmatization and normalization),
- `count` – number of occurrences of the word,
- `videos` – number of videos containing the word,
- `channels` – number of channels the word occurs in,
- `count:C` – number of occurrences of the word in the YouTube video category C.

All files also provide a row of totals as the last row (`[TOTAL]`).

The columns `videos` and `channels` provide dispersion information as the number of corpus parts in which each word occurs. The same information can easily be determined for video categories from the `count:C` columns. This measure of dispersion is called range, contextual diversity, or document frequency. You might want to try the logarithm of `channels` as a feature for your model instead of log-frequency. (Learn more in our preprint about dispersion.)
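For example, a frequency file can be loaded and turned into model features with a few lines of Python. This is only a minimal sketch using pandas (which decompresses `.xz` files transparently); the add-one smoothing, the per-million scaling, and the feature names are our own illustrative choices, not part of TUBELEX:

```python
import csv

import numpy as np
import pandas as pd

# Load the default English frequency list (a TSV file compressed with xz).
# QUOTE_NONE keeps quote characters literal in case they occur as tokens.
df = pd.read_csv("tubelex-en.tsv.xz", sep="\t", quoting=csv.QUOTE_NONE)

# The last row holds the totals ([TOTAL]); separate it from the word rows.
total = df[df["word"] == "[TOTAL]"].iloc[0]
words = df[df["word"] != "[TOTAL]"].copy()

# Log-frequency per million tokens (add-one smoothing is an arbitrary choice here):
words["log_fpm"] = np.log10(1 + words["count"] * 1_000_000 / total["count"])

# Dispersion ("range" / "contextual diversity") as the log of the channel count:
words["log_channels"] = np.log10(1 + words["channels"])

# Dispersion over video categories: in how many categories does each word occur?
category_cols = [c for c in words.columns if c.startswith("count:")]
words["n_categories"] = (words[category_cols] > 0).sum(axis=1)

print(words[["word", "log_fpm", "log_channels", "n_categories"]].head())
```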
We provide the following word frequency lists described in our paper for each language, identified by a 2-letter ISO code L:

- default: `tubelex-L.tsv.xz` (for L in `en`, `es`, `id`, `ja`, `zh`; direct links to each above)
- base: `tubelex-L-base-pos.tsv.xz` (for L = `ja`)
- lemma: `tubelex-L-lemma-pos.tsv.xz` (for L in `en`, `es`, `id`, `ja`)
- regex: `tubelex-L-regex.tsv.xz` (for L in `en`, `es`, `id`)

Note that the `lemma` and `base` variants contain the majority POS for each lemma (base form) in the additional column `pos`.
Additionally, we provide:

- segmentation using UniDic 3.1.0 (instead of `unidic-lite`) for Japanese as `tubelex-ja-310.tsv.xz`, `tubelex-ja-310-lemma-pos.tsv.xz`, and `tubelex-ja-310-base-pos.tsv.xz`,
- Penn Treebank segmentation for English as `tubelex-en-treebank.tsv.xz`,
- frequencies with majority POS information for each word for Chinese as `tubelex-zh-pos.tsv.xz`.
All of the above are lowercased and Unicode NFKC-normalized (as described in our paper). We also provide variants of the above files with alternative normalizations, indicated by the following filename suffixes:

- only lowercased: `_lower.tsv.xz`,
- only NFKC-normalized: `_nfkc.tsv.xz`,
- no normalization: `_no-normalization.tsv.xz`.
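To look up words from your own text in the default lists, apply the same normalization first. The sketch below assumes only what is stated above (lowercasing plus NFKC); the order of the two operations is our assumption, and `tubelex_normalize` is a hypothetical helper, not a function from this repository:

```python
import unicodedata

def tubelex_normalize(word: str) -> str:
    """Approximate the normalization of the default TUBELEX lists:
    Unicode NFKC normalization followed by lowercasing (assumed order)."""
    return unicodedata.normalize("NFKC", word).lower()

# NFKC folds full-width and other compatibility variants; lowercasing handles the rest:
print(tubelex_normalize("Ｃａｆé"))        # -> 'café'
print(tubelex_normalize("ＴＵＢＥＬＥＸ"))  # -> 'tubelex'
```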
We are currently working on:
- extending TUBELEX to more languages,
- acquiring larger corpora for each language,
- acquiring more metadata and information that we could make public,
- investigating dispersion measures based on TUBELEX (preprint).
You can reconstruct TUBELEX by following the steps below. By modifying the scripts, it is possible to construct corpora for other languages or with different parameters (larger size, different tokenization, etc.).
- Install the Git submodule for scraping data from YouTube (forked from JTubeSpeech):

  ```
  git submodule init && git submodule update
  ```

  Note that the forked submodule is substantially different from JTubeSpeech and can run most of the steps in parallel.
- Install the requirements (see `requirements.txt`). The `unidic` package (as opposed to `unidic-lite`) requires an additional installation step:

  ```
  python -m unidic download
  ```
- Scrape manual subtitles. The process consists of several substeps, which we have parallelized using shell scripts and GNU `parallel`. To adjust it to your environment, inspect the shell scripts and change the parameters as necessary. Although we have slightly changed the internal workings of the original JTubeSpeech scripts, you may also find their outline of the process helpful.

  Do the following substeps in the `jtubespeech-subtitles` subdirectory:

  a. Make search words based on Wikipedia:

     ```
     bash make_search_words.sh
     ```

     This creates the file `word/tasks.csv` with chunks of the search word lists for the next step.

  b. Get video IDs by searching for the collected words:

     ```
     bash obtain_video_id_parallel.sh
     ```

     This automatically runs Python scripts in parallel (using GNU `parallel`), one for each of your CPUs.

  c. Prepare tasks for the next step:

     ```
     bash prepare_tasks_from_obtained.sh
     ```

     This creates files named `videoid/tasks_enesidjazh_partXXXXXX`, where XXXXXX are numbers from 0 to N − 1 (depending on the number of videos found).

  d. Retrieve subtitle metadata. As this takes a relatively long time, we have divided this step into many tasks that you can run (optionally in parallel). Each task can be expected to run for a few hours. For i in 0 to N − 1, run the tasks prepared in the previous step:

     ```
     bash retrieve_subtitle_exists.sh i
     ```

  e. Sample 120,000 subtitle files fulfilling the inclusion criteria for each language:

     ```
     sample.sh
     ```

  f. Download the subtitles:

     ```
     bash download_video_parallel.sh
     ```
- Clean the subtitles, remove duplicates, compute frequencies, and train models. Note that this involves tokenization/lemmatization and the creation of all the variants, so it can be a lengthy process. There are two options:

  a. Adjust and then run the `make.sh` script, which is based on Slurm:

     ```
     bash make.sh
     ```

  b. Build individual corpora and frequency files using `tubelex.py`. See, for instance, the script for Japanese. You can also consult the help and process the files as you see fit:

     ```
     python tubelex.py --help
     ```
- Optionally, remove the language identification model, intermediate files, and the downloaded subtitles to save disk space:

  ```
  rm *.ftz *.zip; rm -r jtubespeech/video
  ```
To replicate the experiments in our paper, you will need the following files placed in the `data` directory. We could not distribute them because their licenses were unclear or did not allow redistribution:
- Word GINI files `GINI_en.csv` and `GINI_ja.csv`,
- `elexicon.csv`, available via the word generation form at the English Lexicon Project,
- `MELD-SCH.csv`, the MELD-SCH database, available online as the supplementary Excel file “ESM 1”, converted to UTF-8 CSV (using Excel),
- `Clark-BRMIC-2004` (we use only the `Clark-BRMIC-2004/cp2004b.txt` file), an expanded zip archive, available online as supplementary material for the English norms (Clark and Paivio, 2004),
- `en-glasgow.csv`, the Glasgow norms, available online as the supplementary CSV file “ESM 2”,
- `es-alonso-oral-freq.tsv`, available online as supplementary material for the Spanish oral frequencies by Alonso et al. (2011), with the two “columns” concatenated into one and exported to UTF-8 TSV,
- `es-guasch.csv`, the Spanish norms (Guasch et al., 2014) database, available online as the supplementary Excel file “ESM 1”, converted to UTF-8 CSV (using Excel),
- `es-moreno-martinez.csv`, the Spanish norms (Moreno-Martínez et al., 2014) database, available online as the supplementary Excel file “ESM 1”, converted to UTF-8 CSV (using Excel),
- `amano-kondo-1999-ntt/*.csv`, CSV files of the tables from the Amano-Kondo NTT database (1999). You can extract the files from the first CD-ROM containing a Windows installer like this:

  ```
  # The CD contains a Win98 installer, which decompresses and installs files on your
  # computer. The files are a database and a program to browse the database.
  # Here we just decompress the database (DB0001.MDB) and extract tables from it as CSV.
  # To do so, you will first need to install two software packages (via brew).

  # Install 7zz (sevenzip) and mdbtools:
  brew install sevenzip mdbtools

  # Decompress:
  7zz x CD1/DB0001.MD_ -so > DB0001.MDB

  # Extract tables as CSV:
  mkdir amano-kondo-1999-ntt
  for t in $(mdb-tables DB0001.MDB); do
      mdb-export DB0001.MDB $t > "amano-kondo-1999-ntt/${t}.csv"
  done
  ```
- `subimdb.tsv`, which you can generate by first downloading and extracting the SubIMDB corpus into the `SubIMDB_All_Individual` directory, and then compiling the frequency list with the following command:

  ```
  python tubelex.py --lang en --frequencies --tokenized-files SubIMDB_All_Individual/subtitles -o data/subimdb.tsv
  ```
- `laborotvspeech.tsv`, which you can generate by first downloading and extracting LaboroTVSpeech and LaboroTVSpeech2 (both are free for academic use; you do not need to extract the `*.wav` files) as the `laborotvspeech/LaboroTVSpeech_v1.0b` and `laborotvspeech/LaboroTVSpeech_v2.0b` directories, and then compiling the frequency list with the following command:

  ```
  python tubelex.py --lang ja --frequencies --laborotv --tokenized-files laborotvspeech -o data/laborotvspeech.tsv
  ```
- `hkust-mtcs.tsv`, which you can generate by first downloading and extracting the transcripts of the HKUST/MTSC corpus into the `LDC2005T32` directory, and then compiling the frequency list with the following command:

  ```
  python tubelex.py --lang zh --frequencies --hkust-mtsc --tokenized-files LDC2005T32/hkust_mcts_p1tr/data -o data/hkust-mtsc.tsv
  ```
- `espal.tsv`, created by following these steps (the post-processing in steps 9 to 13 can also be scripted; see the sketch after this list):

  1. Go to the EsPal website.
  2. Select “Subtitle Tokens (2012-10-05)”. (Phonology doesn't matter.)
  3. Click “Words to Properties”.
  4. Select “Word Frequency” > “Count”.
  5. For N in 1 to 5, repeat steps 6 to 8:
  6. Click “File with Items: Choose File” and select the file `data/es-words.N.txt`.
  7. Click “Download”.
  8. Click “Search Again…”.
  9. Remove the UTF-8 BOM (bytes 0xEFBBBF) from each downloaded file, and remove the header line `word\tcnt` from each file except the first one.
  10. Concatenate the edited files into `data/espal.txt`.
  11. Remove lines not containing any count.
  12. Remove trailing tabs from all lines.
  13. Add `[TOTAL]\t462611693` as the last line (`\t` is the tab character).
  14. The resulting file should have 35285 lines and 448608 bytes.

We use a number of other files (e.g. SPALEX, Wikipedia frequencies, SUBTLEX-US, SUBTLEX-ESP), which are either included or downloaded automatically.
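Here is a rough Python sketch of the EsPal post-processing above (steps 9 to 13). The input names `espal-1.txt` … `espal-5.txt` are placeholders for however you saved the five downloads, and the exact filtering is our interpretation of the steps; the line and byte counts in the last step can be used to verify the result:

```python
import codecs

in_paths = [f"espal-{n}.txt" for n in range(1, 6)]  # placeholder names for the 5 downloads
out_path = "data/espal.txt"

lines: list[str] = []
for i, path in enumerate(in_paths):
    with open(path, "rb") as f:
        data = f.read()
    # Remove the UTF-8 BOM (bytes 0xEFBBBF) if present:
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    for j, line in enumerate(data.decode("utf-8").splitlines()):
        # Keep the `word\tcnt` header only from the first file:
        if j == 0 and line.startswith("word\t"):
            if i == 0:
                lines.append(line)
            continue
        # Remove trailing tabs, then drop lines without any count:
        line = line.rstrip("\t")
        if "\t" not in line:
            continue
        lines.append(line)

# Add the [TOTAL] line and write the concatenated result:
lines.append("[TOTAL]\t462611693")
with open(out_path, "w", encoding="utf-8", newline="\n") as f:
    f.write("\n".join(lines) + "\n")
```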