prendradjaja/phoneme-frequencies

Summary

An estimate of the relative frequencies of English phonemes, together with an estimate of the relative frequencies of the phonemes that follow /w/.

Methodology

Reproducing the work of Doug Blumeyer, I correlated the CMU Pronouncing Dictionary ("CMUdict") and Adam Kilgarriff's unlemmatized frequency list for the British National Corpus to find phoneme frequencies generally. I extended this technique to estimate post-/w/ phoneme frequencies as well.
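As a rough illustration of the correlation step, here is a minimal Python sketch. It is not the repository's actual code. It assumes the two datasets have already been loaded into plain dicts (`pronunciations` mapping a word to its list of CMUdict phones, `word_counts` mapping a word to its corpus count), and it assumes stress digits on vowels should be dropped; the function names are hypothetical.

```python
from collections import Counter


def strip_stress(phone):
    # CMUdict marks vowel stress with a trailing digit (e.g. "AH0").
    # Dropping it is an assumption of this sketch, not a documented step.
    return phone.rstrip("012")


def phoneme_frequencies(pronunciations, word_counts):
    """Weight each word's phones by how often the word occurs in the corpus.

    Returns two Counters: overall phoneme counts, and counts of the
    phoneme that immediately follows W within a word.
    """
    totals = Counter()
    after_w = Counter()
    for word, count in word_counts.items():
        phones = pronunciations.get(word)
        if phones is None:
            # Word is in the frequency list but not in the dictionary: skip it.
            continue
        phones = [strip_stress(p) for p in phones]
        for phone in phones:
            totals[phone] += count
        # Look at adjacent pairs to find what follows W.
        for prev, nxt in zip(phones, phones[1:]):
            if prev == "W":
                after_w[nxt] += count
    return totals, after_w


def normalize(counter):
    """Turn raw counts into relative frequencies that sum to 1."""
    total = sum(counter.values())
    if total == 0:
        return {}
    return {phone: n / total for phone, n in counter.items()}
```

Words that appear in the frequency list but not in the dictionary are simply skipped here, so the resulting frequencies describe only the overlap between the two datasets.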

Tools used

I used a combination of Python and Unix tools (grep, sed) for text processing.

Since there are many steps with dependencies between them, GNU Make was a decent fit for modeling those dependencies: I could just run make after each modification to my code.

Limitations

As Blumeyer notes, the source datasets have some limitations. CMUdict conflates "schwa with the near-open central vowel" and has "several noticeable errors." Kilgarriff's frequency list has some formatting issues that make it hard to work with words containing accents and apostrophes, including common contractions. At this time, I've completely ignored this issue.

Blumeyer did manual error checking on several hundred of the most common words. I have not done this.

CMUdict has multiple pronunciations for some words. For these words, I used only the first pronunciation given. It's not clear to me whether the multiple pronunciations are ordered in some meaningful way or just arbitrarily.
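The "first pronunciation only" rule can be sketched as follows. This is an illustration rather than the repository's code. It assumes the classic CMUdict text layout, in which comment lines start with `;;;` and alternate pronunciations are written with a numeric suffix such as `READ(1)`; other releases of the dictionary may differ. Lowercasing the headword is also an assumption, made so that the keys can be matched against a lowercase frequency list. The function name is hypothetical.

```python
import re

# Matches the "(1)", "(2)", ... suffix that marks an alternate pronunciation.
ALT_SUFFIX = re.compile(r"\(\d+\)$")


def first_pronunciations(lines):
    """Map each word to the first pronunciation listed for it."""
    pronunciations = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            # Skip blank lines and comment lines.
            continue
        entry, *phones = line.split()
        word = ALT_SUFFIX.sub("", entry).lower()
        # setdefault keeps the first entry seen in file order
        # and ignores any later alternates for the same word.
        pronunciations.setdefault(word, phones)
    return pronunciations
```

Because the choice depends only on file order, the result is exactly as meaningful as the ordering of alternates in the dictionary itself.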

Other notes

While the Kilgarriff list is for the British National Corpus, a quick inspection suggests that CMUdict uses American pronunciations over British ones.

References
