word2vec accuracy test: seemingly no different from the C version #699
Comments
Thanks for the feedback; this is a very valuable test.
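I ran the test again with gensim's default limit of 30000 words. Because the first vocabulary entry in the C version of word2vec is "</s>" by default, I also tested with 30001 and found no difference at all in the results. The test results follow.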
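My 52.7% (google_c, hs=0) is close to your 53.32% (google_c, hs=0), and my 40.80% (hanlp, hs=1) is close to your 41.03% (kojisekig/word2vec-lucene, hs=0), so I guessed that the gap comes from different parameter configurations. I went straight to his source code, and it confirmed my guess.
Conclusion: the gap arises because the Java configuration module's code is not fully consistent with the C version.
Reflection: the conclusion is this simple, yet I read through several codebases and ran countless tests. I had noticed the problem early on, but out of admiration for my queen I did not give it much thought.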
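The 30000 versus 30001 check described in this comment can be illustrated with a minimal sketch. It assumes a frequency-sorted vocabulary list in which the C output carries an extra leading "</s>" entry; the helper name `restricted_vocab` and the toy word lists are hypothetical and are not part of gensim or HanLP.

```python
def restricted_vocab(words, limit, drop=("</s>",)):
    """Return the real words among the first `limit` entries of a
    frequency-sorted vocabulary, ignoring marker tokens such as "</s>"."""
    return [w for w in words[:limit] if w not in drop]


# Toy vocabularies (hypothetical): the C-style list has one extra leading entry.
c_vocab = ["</s>", "the", "of", "and", "one"]
other_vocab = ["the", "of", "and", "one"]

# With the same limit, the C-style list covers one real word fewer.
print(restricted_vocab(c_vocab, 3))
print(restricted_vocab(other_vocab, 3))

# Raising the limit by one for the C-style list aligns the two word sets.
print(restricted_vocab(c_vocab, 4) == restricted_vocab(other_vocab, 3))
```

A one-word shift at the tail of a 30000-word vocabulary touches only the rarest entry, which fits the observation that both limits gave the same results.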
TylunasLi pushed a commit to TylunasLi/HanLP that referenced this issue on Dec 30, 2017
Notes
Please confirm the following:
Version
The current latest version is: 1.5.2
The version I am using is: 1.5.2
My question
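Regarding item 1: in the source code, whenever the cbow or hs argument is present it is set straight to true, regardless of whether its value is 0 or 1. So in the hs=0 test, HanLP actually used hierarchical softmax while the C version did not. In 《word2vec原理推导与代码分析》 (word2vec: Derivation of Principles and Code Analysis), the parameters were the same but the actual training procedures differed. I don't know whether this is what caused the fairly large accuracy gap. I tested HanLP's accuracy both with hs=1 and with the hs argument omitted.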
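The difference between the two ways of reading the flag can be shown with a minimal sketch. This is not HanLP's or the C tool's actual parsing code; the function names `flag_by_presence` and `flag_by_value` are hypothetical and only model the behaviour described above.

```python
def flag_by_presence(args, name):
    """Presence-based reading: the flag counts as enabled whenever it appears."""
    return name in args


def flag_by_value(args, name, default=False):
    """Value-based reading: the flag is enabled only if the value after it is non-zero."""
    if name not in args:
        return default
    index = args.index(name)
    if index + 1 >= len(args):
        raise ValueError("missing value for " + name)
    return int(args[index + 1]) != 0


# Arguments modelled on the HanLP run in this issue that passes "-hs 0".
args = ["-size", "200", "-window", "8", "-negative", "25", "-hs", "0", "-cbow", "1"]

print(flag_by_presence(args, "-hs"))  # True: hierarchical softmax is switched on
print(flag_by_value(args, "-hs"))     # False: hierarchical softmax stays off
```

Under the presence-based reading, the only way to turn hs off is to leave the argument out entirely, which is what the second HanLP run in the results does.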
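Regarding item 2: for the C version, I trained with the C code and computed accuracy with gensim. I read the source and also ran the C version's accuracy tool; the two give the same results, so there is no problem there. gensim is faster and its log is clearer, so I used gensim.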
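For readers unfamiliar with what the accuracy score measures, here is a minimal pure-Python sketch of analogy evaluation. It assumes the common vector-offset method (take b - a + c on unit vectors, pick the nearest remaining word by cosine similarity, skip questions with unknown words). It is neither gensim's nor the C tool's implementation, and the toy vectors are invented for illustration.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def analogy_accuracy(vectors, questions):
    """Count correct answers for questions of the form (a, b, c, expected).

    vectors: dict mapping word -> list of floats.
    Returns (correct, total), skipping questions with out-of-vocabulary words.
    """
    unit = {word: normalize(vec) for word, vec in vectors.items()}
    correct = total = 0
    for a, b, c, expected in questions:
        if any(word not in unit for word in (a, b, c, expected)):
            continue
        target = [y - x + z for x, y, z in zip(unit[a], unit[b], unit[c])]
        candidates = [word for word in unit if word not in (a, b, c)]
        # Candidates are unit vectors, so the largest dot product is the largest cosine.
        best = max(candidates, key=lambda w: sum(p * q for p, q in zip(unit[w], target)))
        total += 1
        correct += best == expected
    return correct, total


# Toy 3-dimensional vectors (hypothetical, not trained embeddings).
toy = {
    "man": [1, 0, 0],
    "woman": [1, 1, 0],
    "king": [1, 0, 1],
    "queen": [1, 1, 1],
    "apple": [0, 0, 1],
}
questions = [
    ("man", "woman", "king", "queen"),
    ("man", "king", "woman", "apple"),    # deliberately wrong expected answer
    ("man", "woman", "king", "empress"),  # skipped: "empress" is out of vocabulary
]
print(analogy_accuracy(toy, questions))
```

Each section line in the logs, for example `family: 61.7% (259/420)`, reports this kind of correct/total pair for one group of questions.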
Here are the test results. They are 10% lower than the C results in "Accuracy rate seems to be 10% lower than the original version", and I don't know why.
:gensim
model = word2vec.Word2Vec(sentences, size=200, window=8, negative=25, hs=1, sample=0.0001, workers=8, iter=15)
2017-11-29 11:49:46,647 : INFO : loading projection weights from E:/data/word2vec/text8.gensim.word2vec.txt
2017-11-29 11:50:00,520 : INFO : loaded (71290L, 200L) matrix from E:/data/word2vec/text8.gensim.word2vec.txt
2017-11-29 11:50:00,599 : INFO : precomputing L2-norms of word weight vectors
2017-11-29 11:50:04,786 : INFO : capital-common-countries: 76.5% (387/506)
2017-11-29 11:50:33,871 : INFO : capital-world: 37.9% (1349/3564)
2017-11-29 11:50:38,687 : INFO : currency: 7.0% (42/596)
2017-11-29 11:50:57,526 : INFO : city-in-state: 40.4% (942/2330)
2017-11-29 11:51:01,313 : INFO : family: 47.4% (199/420)
2017-11-29 11:51:09,776 : INFO : gram1-adjective-to-adverb: 10.8% (107/992)
2017-11-29 11:51:16,038 : INFO : gram2-opposite: 9.0% (68/756)
2017-11-29 11:51:26,976 : INFO : gram3-comparative: 51.4% (685/1332)
2017-11-29 11:51:34,859 : INFO : gram4-superlative: 19.8% (196/992)
2017-11-29 11:51:43,236 : INFO : gram5-present-participle: 25.5% (269/1056)
2017-11-29 11:51:55,519 : INFO : gram6-nationality-adjective: 73.0% (1111/1521)
2017-11-29 11:52:07,953 : INFO : gram7-past-tense: 35.5% (554/1560)
2017-11-29 11:52:18,648 : INFO : gram8-plural: 49.2% (655/1332)
2017-11-29 11:52:25,628 : INFO : gram9-plural-verbs: 21.8% (190/870)
2017-11-29 11:52:25,628 : INFO : total: 37.9% (6754/17827)
model = word2vec.Word2Vec(sentences, size=200, window=8, negative=25, hs=0, sample=0.0001, workers=8, iter=15)
2017-11-29 11:53:14,415 : INFO : loading projection weights from E:/data/word2vec/text8.gensim.word2vec.txt_1
2017-11-29 11:53:27,427 : INFO : loaded (71290L, 200L) matrix from E:/data/word2vec/text8.gensim.word2vec.txt_1
2017-11-29 11:53:27,505 : INFO : precomputing L2-norms of word weight vectors
2017-11-29 11:53:31,894 : INFO : capital-common-countries: 72.9% (369/506)
2017-11-29 11:54:01,937 : INFO : capital-world: 51.1% (1822/3564)
2017-11-29 11:54:06,974 : INFO : currency: 18.0% (107/596)
2017-11-29 11:54:26,329 : INFO : city-in-state: 41.5% (966/2330)
2017-11-29 11:54:29,640 : INFO : family: 59.3% (249/420)
2017-11-29 11:54:37,565 : INFO : gram1-adjective-to-adverb: 14.3% (142/992)
2017-11-29 11:54:43,559 : INFO : gram2-opposite: 13.6% (103/756)
2017-11-29 11:54:54,144 : INFO : gram3-comparative: 64.3% (857/1332)
2017-11-29 11:55:02,068 : INFO : gram4-superlative: 23.1% (229/992)
2017-11-29 11:55:10,453 : INFO : gram5-present-participle: 36.0% (380/1056)
2017-11-29 11:55:22,509 : INFO : gram6-nationality-adjective: 73.7% (1121/1521)
2017-11-29 11:55:34,861 : INFO : gram7-past-tense: 34.3% (535/1560)
2017-11-29 11:55:45,290 : INFO : gram8-plural: 49.8% (664/1332)
2017-11-29 11:55:52,154 : INFO : gram9-plural-verbs: 31.5% (274/870)
2017-11-29 11:55:52,155 : INFO : total: 43.9% (7818/17827)
:hanlp
-input E:\data\word2vec\text8 -output E:\data\word2vec\text8.hanlp.word2vec.txt -size 200 -window 8 -negative 25 -hs 0 -cbow 1 -sample 1e-4 -threads 8 -binary 1 -iter 15
2017-11-28 16:53:03,293 : INFO : loading projection weights from E:/data/word2vec/text8.hanlp.word2vec.txt
2017-11-28 16:53:15,493 : INFO : loaded (71290L, 200L) matrix from E:/data/word2vec/text8.hanlp.word2vec.txt
2017-11-28 16:53:15,553 : INFO : precomputing L2-norms of word weight vectors
2017-11-28 16:53:19,831 : INFO : capital-common-countries: 69.8% (353/506)
2017-11-28 16:53:49,194 : INFO : capital-world: 30.3% (1079/3564)
2017-11-28 16:53:54,053 : INFO : currency: 4.9% (29/596)
2017-11-28 16:54:12,895 : INFO : city-in-state: 35.7% (831/2330)
2017-11-28 16:54:16,322 : INFO : family: 31.9% (134/420)
2017-11-28 16:54:24,401 : INFO : gram1-adjective-to-adverb: 7.7% (76/992)
2017-11-28 16:54:30,487 : INFO : gram2-opposite: 9.9% (75/756)
2017-11-28 16:54:41,328 : INFO : gram3-comparative: 38.3% (510/1332)
2017-11-28 16:54:49,278 : INFO : gram4-superlative: 13.5% (134/992)
2017-11-28 16:54:58,219 : INFO : gram5-present-participle: 21.6% (228/1056)
2017-11-28 16:55:10,444 : INFO : gram6-nationality-adjective: 72.4% (1101/1521)
2017-11-28 16:55:22,950 : INFO : gram7-past-tense: 28.5% (445/1560)
2017-11-28 16:55:33,730 : INFO : gram8-plural: 45.9% (612/1332)
2017-11-28 16:55:40,694 : INFO : gram9-plural-verbs: 17.1% (149/870)
2017-11-28 16:55:40,696 : INFO : total: 32.3% (5756/17827)
-input E:\data\word2vec\text8 -output E:\data\word2vec\text8.hanlp.word2vec.txt_1 -size 200 -window 8 -negative 25 -cbow 1 -sample 1e-4 -threads 8 -binary 1 -iter 15
2017-11-29 11:15:27,628 : INFO : loading projection weights from E:/data/word2vec/text8.hanlp.word2vec.txt_1
2017-11-29 11:15:42,361 : INFO : loaded (71290L, 200L) matrix from E:/data/word2vec/text8.hanlp.word2vec.txt_1
2017-11-29 11:15:42,461 : INFO : precomputing L2-norms of word weight vectors
2017-11-29 11:15:47,365 : INFO : capital-common-countries: 80.0% (405/506)
2017-11-29 11:16:20,013 : INFO : capital-world: 46.2% (1647/3564)
2017-11-29 11:16:25,338 : INFO : currency: 14.4% (86/596)
2017-11-29 11:16:46,128 : INFO : city-in-state: 46.4% (1081/2330)
2017-11-29 11:16:49,861 : INFO : family: 53.1% (223/420)
2017-11-29 11:16:58,723 : INFO : gram1-adjective-to-adverb: 15.7% (156/992)
2017-11-29 11:17:05,424 : INFO : gram2-opposite: 9.9% (75/756)
2017-11-29 11:17:17,216 : INFO : gram3-comparative: 51.1% (680/1332)
2017-11-29 11:17:26,082 : INFO : gram4-superlative: 20.0% (198/992)
2017-11-29 11:17:35,536 : INFO : gram5-present-participle: 29.9% (316/1056)
2017-11-29 11:17:49,177 : INFO : gram6-nationality-adjective: 82.4% (1254/1521)
2017-11-29 11:18:03,059 : INFO : gram7-past-tense: 32.5% (507/1560)
2017-11-29 11:18:15,029 : INFO : gram8-plural: 53.7% (715/1332)
2017-11-29 11:18:22,894 : INFO : gram9-plural-verbs: 26.7% (232/870)
2017-11-29 11:18:22,894 : INFO : total: 42.5% (7575/17827)
:deeplearning4j
Word2Vec vec = new Word2Vec.Builder()
        .layerSize(200)
        .windowSize(8)
        .negativeSample(25)
        .minWordFrequency(5)
        .useHierarchicSoftmax(true)
        .sampling(0.0001)
        .workers(8)
        .iterations(15)
        .epochs(15)
        .iterate(iter)
        .elementsLearningAlgorithm("org.deeplearning4j.models.embeddings.learning.impl.elements.CBOW")
        .tokenizerFactory(t)
        .build();
2017-11-28 16:46:26,894 : INFO : loading projection weights from E:/data/word2vec/text8.deeplearning4j.word2vec.txt
2017-11-28 16:46:39,391 : INFO : loaded (71290L, 200L) matrix from E:/data/word2vec/text8.deeplearning4j.word2vec.txt
2017-11-28 16:46:39,453 : INFO : precomputing L2-norms of word weight vectors
2017-11-28 16:46:43,596 : INFO : capital-common-countries: 67.4% (341/506)
2017-11-28 16:47:12,592 : INFO : capital-world: 33.9% (1208/3564)
2017-11-28 16:47:17,515 : INFO : currency: 6.0% (36/596)
2017-11-28 16:47:36,332 : INFO : city-in-state: 36.6% (852/2330)
2017-11-28 16:47:39,834 : INFO : family: 38.3% (161/420)
2017-11-28 16:47:47,898 : INFO : gram1-adjective-to-adverb: 9.0% (89/992)
2017-11-28 16:47:53,953 : INFO : gram2-opposite: 7.0% (53/756)
2017-11-28 16:48:04,632 : INFO : gram3-comparative: 38.7% (515/1332)
2017-11-28 16:48:12,653 : INFO : gram4-superlative: 11.8% (117/992)
2017-11-28 16:48:21,220 : INFO : gram5-present-participle: 23.0% (243/1056)
2017-11-28 16:48:33,519 : INFO : gram6-nationality-adjective: 76.7% (1166/1521)
2017-11-28 16:48:46,165 : INFO : gram7-past-tense: 27.2% (424/1560)
2017-11-28 16:48:56,894 : INFO : gram8-plural: 48.2% (642/1332)
2017-11-28 16:49:03,973 : INFO : gram9-plural-verbs: 19.2% (167/870)
2017-11-28 16:49:03,974 : INFO : total: 33.7% (6014/17827)
:google_c
./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 1 -sample 1e-4 -threads 8 -binary 0 -iter 15
2017-11-28 16:49:29,132 : INFO : loading projection weights from E:/data/word2vec/text8.google_c.word2vec.txt
2017-11-28 16:49:41,848 : INFO : loaded (71291L, 200L) matrix from E:/data/word2vec/text8.google_c.word2vec.txt
2017-11-28 16:49:41,914 : INFO : precomputing L2-norms of word weight vectors
2017-11-28 16:49:46,154 : INFO : capital-common-countries: 75.7% (383/506)
2017-11-28 16:50:15,078 : INFO : capital-world: 33.2% (1184/3564)
2017-11-28 16:50:19,993 : INFO : currency: 6.0% (36/596)
2017-11-28 16:50:38,967 : INFO : city-in-state: 36.0% (838/2330)
2017-11-28 16:50:42,348 : INFO : family: 47.4% (199/420)
2017-11-28 16:50:50,315 : INFO : gram1-adjective-to-adverb: 10.6% (105/992)
2017-11-28 16:50:56,355 : INFO : gram2-opposite: 7.8% (59/756)
2017-11-28 16:51:07,065 : INFO : gram3-comparative: 48.3% (644/1332)
2017-11-28 16:51:14,905 : INFO : gram4-superlative: 18.0% (179/992)
2017-11-28 16:51:23,299 : INFO : gram5-present-participle: 29.0% (306/1056)
2017-11-28 16:51:35,345 : INFO : gram6-nationality-adjective: 70.1% (1066/1521)
2017-11-28 16:51:47,733 : INFO : gram7-past-tense: 31.9% (498/1560)
2017-11-28 16:51:58,316 : INFO : gram8-plural: 50.1% (667/1332)
2017-11-28 16:52:05,321 : INFO : gram9-plural-verbs: 20.0% (174/870)
2017-11-28 16:52:05,322 : INFO : total: 35.6% (6338/17827)
./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 8 -binary 0 -iter 15
2017-11-28 17:29:30,471 : INFO : loading projection weights from E:/data/word2vec/text8.google_c.word2vec.txt_1
2017-11-28 17:29:42,375 : INFO : loaded (71291L, 200L) matrix from E:/data/word2vec/text8.google_c.word2vec.txt_1
2017-11-28 17:29:42,436 : INFO : precomputing L2-norms of word weight vectors
2017-11-28 17:29:46,578 : INFO : capital-common-countries: 77.5% (392/506)
2017-11-28 17:30:15,301 : INFO : capital-world: 45.6% (1626/3564)
2017-11-28 17:30:20,082 : INFO : currency: 19.5% (116/596)
2017-11-28 17:30:38,799 : INFO : city-in-state: 41.2% (959/2330)
2017-11-28 17:30:42,157 : INFO : family: 61.7% (259/420)
2017-11-28 17:30:50,121 : INFO : gram1-adjective-to-adverb: 13.8% (137/992)
2017-11-28 17:30:56,214 : INFO : gram2-opposite: 13.1% (99/756)
2017-11-28 17:31:07,010 : INFO : gram3-comparative: 60.6% (807/1332)
2017-11-28 17:31:14,960 : INFO : gram4-superlative: 25.0% (248/992)
2017-11-28 17:31:23,447 : INFO : gram5-present-participle: 38.6% (408/1056)
2017-11-28 17:31:35,607 : INFO : gram6-nationality-adjective: 77.6% (1181/1521)
2017-11-28 17:31:48,147 : INFO : gram7-past-tense: 34.8% (543/1560)
2017-11-28 17:31:58,815 : INFO : gram8-plural: 49.5% (659/1332)
2017-11-28 17:32:05,812 : INFO : gram9-plural-verbs: 30.8% (268/870)
2017-11-28 17:32:05,812 : INFO : total: 43.2% (7702/17827)
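To compare runs without reading the logs line by line, the per-section counts can be pulled out with a small parser. This is a sketch that assumes the log format shown in this issue (`INFO : <section>: <pct>% (<correct>/<total>)`); the function names and the regular expression are hypothetical helpers, not part of gensim.

```python
import re

# Matches lines such as "... : INFO : total: 43.2% (7702/17827)".
SECTION_RE = re.compile(r"INFO : (?P<section>[\w-]+): [\d.]+% \((?P<correct>\d+)/(?P<total>\d+)\)")


def parse_accuracy_log(lines):
    """Return {section: (correct, total)} for every accuracy line in the log."""
    results = {}
    for line in lines:
        match = SECTION_RE.search(line)
        if match:
            results[match.group("section")] = (int(match.group("correct")), int(match.group("total")))
    return results


def accuracy_delta(run_a, run_b):
    """Percentage-point difference (run_a minus run_b) for sections present in both runs."""
    return {
        section: round(100.0 * (run_a[section][0] / run_a[section][1]
                                - run_b[section][0] / run_b[section][1]), 1)
        for section in run_a
        if section in run_b
    }


# Lines copied from the two HanLP runs in this issue.
hanlp_hs_flag = [
    "2017-11-28 16:53:15,553 : INFO : precomputing L2-norms of word weight vectors",
    "2017-11-28 16:54:16,322 : INFO : family: 31.9% (134/420)",
    "2017-11-28 16:55:40,696 : INFO : total: 32.3% (5756/17827)",
]
hanlp_no_hs = [
    "2017-11-29 11:16:49,861 : INFO : family: 53.1% (223/420)",
    "2017-11-29 11:18:22,894 : INFO : total: 42.5% (7575/17827)",
]

delta = accuracy_delta(parse_accuracy_log(hanlp_no_hs), parse_accuracy_log(hanlp_hs_flag))
print(delta["total"])  # 10.2 percentage points in favour of omitting the hs argument
```

The same comparison applied to the two google_c runs (7702 versus 6338 correct out of 17827) shows hs=0 ahead of hs=1 as well, which is consistent with the hs setting explaining most of the gap.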