These Chinese word embeddings are built from PTT (a popular bulletin board system (BBS) in Taiwan) and Chinese Wikipedia corpora, using both count-based and prediction-based methods.
On Chinese word similarity/relatedness benchmarks, they outperform other publicly available pre-trained Chinese word embeddings.
Download
Chinese_word_embedding_count_based
| Hyperparameter | Setting |
| --- | --- |
| Frequency weighting | SPPMI_k10 |
| Window size | 3 |
| Dimensions | 700 |
| Remove first k dimensions | 6 |
| Weighting exponent | 0.5 |
| Discover new words | no |
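For reference, the count-based settings above correspond to the standard SPPMI + SVD pipeline (shifted positive PMI with k = 10, truncated SVD, singular values weighted by an exponent of 0.5, and the first dimensions dropped). The sketch below is a minimal, illustrative NumPy version on a toy corpus; it is not the thesis's actual code, and the corpus and variable names are assumptions.

```python
import numpy as np

# Settings from the table; the toy corpus below is illustrative only.
WINDOW = 3      # context window size
K_SHIFT = 10    # SPPMI shift: max(PMI - log k, 0)
DIM = 700       # target SVD dimensions (capped by vocabulary size here)
DROP_TOP = 6    # remove the first k dimensions
EXPONENT = 0.5  # weighting exponent applied to the singular values

corpus = [["我", "喜歡", "自然", "語言", "處理"],
          ["語言", "模型", "處理", "文字"]]

# 1. Count word-context co-occurrences within the window.
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - WINDOW), min(len(sent), i + WINDOW + 1)):
            if j != i:
                counts[idx[w], idx[sent[j]]] += 1

# 2. Shifted positive PMI: max(log(P(w,c) / (P(w) P(c))) - log k, 0).
total = counts.sum()
pw = counts.sum(axis=1, keepdims=True) / total
pc = counts.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):
    pmi = np.log((counts / total) / (pw * pc))
sppmi = np.maximum(np.nan_to_num(pmi, neginf=0.0) - np.log(K_SHIFT), 0.0)

# 3. Truncated SVD; weight singular values by the exponent and drop the
#    first few dimensions (they tend to encode raw frequency).
u, s, _ = np.linalg.svd(sppmi)
dim = min(DIM, len(s))
vectors = (u[:, :dim] * s[:dim] ** EXPONENT)[:, DROP_TOP:]
```

On the real corpora the co-occurrence matrix would be sparse and far larger, so a sparse truncated SVD (e.g. `scipy.sparse.linalg.svds`) would replace the dense `np.linalg.svd` call.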
Chinese_word_embedding_CBOW
| Hyperparameter | Setting |
| --- | --- |
| Window size | 2 |
| Dimensions | 500 |
| Model | CBOW |
| Learning rate | 0.025 |
| Sampling rate | 0.00001 |
| Negative samples | 2 |
| Discover new words | no |
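The CBOW settings above follow the usual word2vec scheme: predict a target word from the average of its context vectors, trained with negative sampling. The following is a minimal NumPy sketch of that training loop on a toy corpus, assuming the table's hyperparameters; it is not the released training code, and on this tiny corpus the subsampling rate has no effect.

```python
import numpy as np

# Settings from the table; corpus and epoch count are illustrative.
DIM = 500      # embedding dimensions
WINDOW = 2     # context window size
LR = 0.025     # learning rate
NEGATIVE = 2   # negative samples per target
EPOCHS = 20

rng = np.random.default_rng(0)
corpus = [["我", "喜歡", "語言", "處理"], ["語言", "模型", "很", "好"]]
vocab = sorted({w for s in corpus for w in s})
idx = {w: i for i, w in enumerate(vocab)}

W_in = rng.normal(scale=0.1, size=(len(vocab), DIM))  # context vectors
W_out = np.zeros((len(vocab), DIM))                   # target vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(EPOCHS):
    for sent in corpus:
        for i, target in enumerate(sent):
            ctx = [idx[sent[j]]
                   for j in range(max(0, i - WINDOW),
                                  min(len(sent), i + WINDOW + 1))
                   if j != i]
            h = W_in[ctx].mean(axis=0)  # CBOW: average the context vectors
            # One positive target plus NEGATIVE randomly sampled words.
            samples = [(idx[target], 1.0)] + [
                (int(rng.integers(len(vocab))), 0.0) for _ in range(NEGATIVE)]
            grad_h = np.zeros(DIM)
            for w, label in samples:
                g = (sigmoid(W_out[w] @ h) - label) * LR
                grad_h += g * W_out[w]
                W_out[w] -= g * h
            W_in[ctx] -= grad_h / len(ctx)
```

In practice the same hyperparameters map directly onto gensim's `Word2Vec` (`vector_size=500, window=2, sg=0, alpha=0.025, sample=1e-5, negative=2`), which is the usual way to train CBOW embeddings at corpus scale.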
If you use these Chinese word embeddings in your work, please cite this paper:
Ying-Ren Chen (2021). Generate coherent text using semantic embedding, common sense templates and Monte-Carlo tree search methods (Master's thesis, National Tsing Hua University, Hsinchu, Taiwan).
This work is licensed under a
Creative Commons Attribution-ShareAlike 4.0 International License.