An estimate of the relative frequencies of English phonemes, along with an estimate of the relative frequencies of the phonemes that follow /w/.
Reproducing the work of Doug Blumeyer, I correlated the CMU Pronouncing Dictionary ("CMUdict") and Adam Kilgarriff's unlemmatized frequency list for the British National Corpus to find phoneme frequencies generally. I extended this technique to estimate post-/w/ phoneme frequencies as well.
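The weighting idea can be sketched in a few lines of Python. This is a minimal illustration, not the code in this repository: the function names, the toy data, the choice to strip CMUdict's stress digits, and the choice to count only phonemes that follow /w/ within the same word are all assumptions made for the example. CMUdict writes /w/ as the ARPABET symbol `W`.

```python
from collections import Counter


def strip_stress(phone):
    """Drop the trailing stress digit CMUdict attaches to vowels (AH0 -> AH)."""
    return phone.rstrip("012")


def _normalize(totals):
    """Turn raw weighted counts into relative frequencies that sum to 1."""
    grand_total = sum(totals.values())
    if not grand_total:
        return {}
    return {phone: n / grand_total for phone, n in totals.items()}


def phoneme_frequencies(pronunciations, word_counts):
    """Relative phoneme frequencies, weighting each word by its corpus count.

    pronunciations: dict mapping a lowercase word to a list of phones
    word_counts: iterable of (word, count) pairs from a frequency list
    """
    totals = Counter()
    for word, count in word_counts:
        phones = pronunciations.get(word.lower())
        if phones is None:
            continue  # no dictionary entry for this word; skip it
        for phone in phones:
            totals[strip_stress(phone)] += count
    return _normalize(totals)


def following_frequencies(pronunciations, word_counts, target="W"):
    """Relative frequencies of the phoneme right after `target` within a word."""
    totals = Counter()
    for word, count in word_counts:
        phones = pronunciations.get(word.lower())
        if phones is None:
            continue
        stripped = [strip_stress(p) for p in phones]
        # Pair each phone with its successor and keep pairs that start with target.
        for current, nxt in zip(stripped, stripped[1:]):
            if current == target:
                totals[nxt] += count
    return _normalize(totals)


# Toy data invented for illustration; not taken from either dataset.
pronunciations = {
    "we": ["W", "IY1"],
    "was": ["W", "AA1", "Z"],
    "see": ["S", "IY1"],
}
word_counts = [("we", 3), ("was", 1), ("see", 2), ("xyzzy", 5)]

overall = phoneme_frequencies(pronunciations, word_counts)
after_w = following_frequencies(pronunciations, word_counts)
```

Words missing from the dictionary (here the made-up "xyzzy") contribute nothing, so the result describes only the part of the corpus that the dictionary covers.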
I used a combination of Python and Unix tools (grep, sed) for text processing. Because the pipeline has many steps with dependencies between them, GNU Make was a decent fit for modeling those dependencies; it was easy enough to just run make after each modification to my code.
As Blumeyer notes, the source datasets have some limitations. CMUdict conflates "schwa with the near-open central vowel" and has "several noticeable errors." Kilgarriff's frequency list has some formatting issues that make it hard to work with words containing accents or apostrophes, including common contractions. At this time, I've completely ignored this issue.
Blumeyer did manual error checking on several hundred of the most common words. I have not done this.
CMUdict has multiple pronunciations for some words. For these words, I used only the first pronunciation given. It's not clear to me whether the pronunciations are listed in some meaningful order or just arbitrarily.
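Keeping only the first pronunciation might look like the sketch below. It relies on the CMUdict layout, where comment lines start with `;;;` and alternate pronunciations carry a numeric suffix such as `WORD(1)`. The function name is hypothetical, and the sample lines are written in that layout for illustration rather than copied from the 0.7b file.

```python
import re

# Alternate entries carry a numeric suffix on the headword, e.g. "READ(1)".
ALTERNATE = re.compile(r"\(\d+\)$")


def first_pronunciations(lines):
    """Map each lowercase word to the phones of its first listed pronunciation."""
    result = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comment lines
        head, *phones = line.split()
        if ALTERNATE.search(head):
            continue  # a later variant such as WORD(1); ignore it
        # setdefault keeps the earliest entry if a headword repeats.
        result.setdefault(head.lower(), phones)
    return result


# Illustrative lines in the CMUdict layout (an assumption, not verified data).
sample = [
    ";;; comment line",
    "READ  R EH1 D",
    "READ(1)  R IY1 D",
    "WATER  W AO1 T ER0",
]
entries = first_pronunciations(sample)
```

Because the ordering of variants is unknown, this choice can systematically favor one reading of a heteronym over another.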
While the Kilgarriff list is for the British National Corpus, a quick inspection suggests that CMUdict uses American pronunciations over British ones, so the word frequencies and the pronunciations come from different varieties of English.
- Doug Blumeyer, "Relative Frequencies of English Phonemes"
- CMU Pronouncing Dictionary (Local copy at version 0.7b. Retrieved May 28, 2018.)
- Adam Kilgarriff, word frequencies for the BNC (Local copy retrieved May 28, 2018.)