ogpetrov/sakha-nlp

sakha-nlp

Various tools and data for Sakha language NLP.

NB: I'm not a linguist or NLP engineer. This is just a small open-source pet project I'm working on to benefit my native language. I would welcome any comments and suggestions. Thanks :)

Already available:

  • A dataset of Sakha words from the GEDSL (Great Explanatory Dictionary of the Sakha Language, link) with parts of speech, variants, homonyms, and Russian translations, which are probably the most useful part (n=18261). Also included is a parser for the source GEDSL HTML page.
    (Visualization of the Russian translation embeddings, computed with SBERT and reduced with UMAP.)

  • A simple phonemizer (phonemic analysis tool) for the Sakha language. It uses n-grams and Sakha phonetic rules (phonotactics) to convert Sakha text in the official Cyrillic orthography to phonemes, written either in IPA or as a V/C (vowel/consonant) pattern.
  • A brief historical overview of the Novgorodov alphabet (which has become popular lately among Sakha-speaking people), in Russian.
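
To give a feel for the V/C output mentioned above, here is a minimal sketch. It is not the phonemizer from this repository (which uses n-grams and phonotactic rules); the function name `to_vc`, the vowel set, and the handling of the digraphs дь and нь are assumptions made for illustration only.

```python
# Hypothetical illustration, not the repository's phonemizer.
# Assumption: the eight native Sakha vowel letters; Russian-only
# vowel letters (е, ё, ю, я) are not handled in this sketch.
SAKHA_VOWELS = set("аэыиоөуү")
# Assumption: дь and нь each stand for a single consonant.
DIGRAPHS = ("дь", "нь")
# Soft and hard signs outside the digraphs carry no segment of their own.
SIGNS = set("ьъ")


def to_vc(word: str) -> str:
    """Return a V/C pattern for one word written in Sakha Cyrillic."""
    word = word.lower()
    pattern = []
    i = 0
    while i < len(word):
        # Two-letter consonants are matched first and count as one C.
        if word[i:i + 2] in DIGRAPHS:
            pattern.append("C")
            i += 2
            continue
        ch = word[i]
        if ch in SAKHA_VOWELS:
            pattern.append("V")
        elif ch.isalpha() and ch not in SIGNS:
            pattern.append("C")
        # Signs, digits, and punctuation are skipped.
        i += 1
    return "".join(pattern)


if __name__ == "__main__":
    for w in ("саха", "дьон", "ньургуһун"):
        print(w, to_vc(w))
```

Long vowels written as doubled letters (for example аа) come out as `VV` here; whether to collapse them into one unit is a design choice this sketch leaves open.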

Planned for the future:

  • Expand the phonemizer to support comprehensive phonetic analysis and to check phonotactic rules (i.e., spell-checking against strict Sakha language rules).
  • Improve the parser to extract all information from GEDSL. At the moment, it parses only single-word entries: it skips phrases and word combinations (more than 5k records) and cannot handle the remaining unstructured patterns (approximately 4-5k records).
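
As a rough idea of what a phonotactic check could look like, the sketch below tests only one rule, front/back vowel harmony, under the assumption that a native Sakha word does not mix front and back vowels. The function name and vowel groupings are hypothetical and nothing here is implemented in the repository; loanwords and compounds will often be flagged, so a flag is a hint rather than an error.

```python
# Hypothetical sketch of a single phonotactic check (front/back harmony).
# Assumption: back vowels а, ы, о, у and front vowels э, и, ө, ү.
BACK_VOWELS = set("аыоу")
FRONT_VOWELS = set("эиөү")


def mixes_front_and_back(word: str) -> bool:
    """Return True if the word contains both front and back vowels."""
    letters = word.lower()
    has_back = any(ch in BACK_VOWELS for ch in letters)
    has_front = any(ch in FRONT_VOWELS for ch in letters)
    return has_back and has_front


if __name__ == "__main__":
    # "институт" is a Russian loanword and is expected to be flagged.
    for w in ("саха", "киһи", "үөрэх", "институт"):
        status = "mixed" if mixes_front_and_back(w) else "ok"
        print(w, status)
```

A fuller checker would also need labial (rounding) harmony and consonant cluster rules, which this sketch does not attempt.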
