Skip to content

pinyintokenizer, 拼音分词器,将连续的拼音切分为单字拼音列表。

License

Notifications You must be signed in to change notification settings

shibing624/pinyin-tokenizer

Repository files navigation

PyPI version Downloads Contributions welcome GitHub contributors License Apache 2.0 python_vesion GitHub issues Wechat Group

Pinyin Tokenizer

pinyin tokenizer(拼音分词器),将连续的拼音切分为单字拼音列表,开箱即用。python3开发。

Guide

Feature

  • 基于前缀树(PyTrie)高效快速把连续拼音切分为单字拼音列表,便于后续拼音转汉字等处理。

Install

  • Requirements and Installation
pip install pinyintokenizer

or

git clone https://github.com/shibing624/pinyin-tokenizer.git
cd pinyin-tokenizer
python setup.py install

Usage

拼音切分(Pinyin Tokenizer)

example:examples/pinyin_tokenize_demo.py:

import sys

sys.path.append('..')
from pinyintokenizer import PinyinTokenizer

if __name__ == '__main__':
    m = PinyinTokenizer()
    print(f"{m.tokenize('wo3')}")
    print(f"{m.tokenize('nihao')}")
    print(f"{m.tokenize('lv3you2')}")  # 旅游
    print(f"{m.tokenize('liudehua')}")
    print(f"{m.tokenize('liu de hua')}")  # 刘德华
    print(f"{m.tokenize('womenzuogelvyougongnue')}")  # 我们做个旅游攻略
    print(f"{m.tokenize('xi anjiaotongdaxue')}")  # 西安交通大学

    # not support english
    print(f"{m.tokenize('good luck')}")

output:

(['wo'], ['3'])
(['ni', 'hao'], [])
(['lv', 'you'], ['3', '2'])
(['liu', 'de', 'hua'], [])
(['liu', 'de', 'hua'], [' ', ' '])
(['wo', 'men', 'zuo', 'ge', 'lv', 'you', 'gong', 'nue'], [])
(['xi', 'an', 'jiao', 'tong', 'da', 'xue'], [' '])
(['o', 'o', 'lu'], ['g', 'd', ' ', 'c', 'k'])
  • tokenize方法返回两个结果,第一个为拼音列表,第二个为非法拼音列表。

连续拼音转汉字(Pinyin2Hanzi)

先使用本库pinyintokenizer把连续拼音切分,再使用Pinyin2Hanzi库把拼音转汉字。

example:examples/pinyin2hanzi_demo.py:

import sys
from Pinyin2Hanzi import DefaultDagParams
from Pinyin2Hanzi import dag

sys.path.append('..')
from pinyintokenizer import PinyinTokenizer

dagparams = DefaultDagParams()


def pinyin2hanzi(pinyin_sentence):
    pinyin_list, _ = PinyinTokenizer().tokenize(pinyin_sentence)
    result = dag(dagparams, pinyin_list, path_num=1)
    return ''.join(result[0].path)


if __name__ == '__main__':
    print(f"{pinyin2hanzi('wo3')}")
    print(f"{pinyin2hanzi('jintianxtianqibucuo')}")
    print(f"{pinyin2hanzi('liudehua')}")

output:

我
今天天气不错
刘德华

Contact

  • Issue(建议):GitHub issues
  • 邮件我:xuming: xuming624@qq.com
  • 微信我:加我微信号:xuming624, 进Python-NLP交流群,备注:姓名-公司名-NLP

Citation

如果你在研究中使用了pinyin-tokenizer,请按如下格式引用:

APA:

Xu, M. pinyin-tokenizer: Chinese Pinyin tokenizer toolkit for NLP (Version 0.0.1) [Computer software]. https://github.com/shibing624/pinyin-tokenizer

BibTeX:

@misc{pinyin-tokenizer,
  title={pinyin-tokenizer: Chinese Pinyin tokenizer toolkit for NLP},
  author={Xu Ming},
  year={2022},
  howpublished={\url{https://github.com/shibing624/pinyin-tokenizer}},
}

License

授权协议为 The Apache License 2.0,可免费用做商业用途。请在产品说明中附加pinyin-tokenizer的链接和授权协议。

Contribute

项目代码还很粗糙,如果大家对代码有所改进,欢迎提交回本项目,在提交之前,注意以下两点:

  • tests添加相应的单元测试
  • 使用python -m pytest来运行所有单元测试,确保所有单测都是通过的

之后即可提交PR。

Related Projects

About

pinyintokenizer, 拼音分词器,将连续的拼音切分为单字拼音列表。

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages