Accuracy rate seems to be 10% lower than the original version #21

hankcs · 2016-07-20T04:27:03Z

Hello, kojisekig.
Thank you for your nice Java code. It is the closest port of Google's original C version that I have found.
However, I measured the accuracy, and it is about 10% lower than the original version.
I trained on text8 with exactly the same parameters:

com.rondhuit.w2v.demo.TextFileCreateVectors -input text8.txt -output vectors.txt -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 8 -binary 1 -iter 15
./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 8 -binary 1 -iter 15

Note that I used your com.rondhuit.w2v.Text8Splitter to split text8 into multiple lines. I don't think this affects the result, since MAX_WORDS is 1000 in both implementations.

Then I translated compute-accuracy.c from Google's C code into Java, and ran the test with the same parameters:

com.rondhuit.w2v.demo.ComputeAccuracy vectors.txt 30000 questions-words.txt
./compute-accuracy vectors.bin 30000 < questions-words.txt
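
For readers unfamiliar with the test: each line of questions-words.txt holds four words a b c d, and the tool predicts d as the word whose vector is closest to vec(b) - vec(a) + vec(c), skipping the three question words; the 30000 argument restricts the search to the 30000 most frequent words. Below is a minimal sketch of that scoring step, assuming unit-length vectors. The class name, method names, and toy vectors are made up for illustration and are not taken from either code base.

```java
public class AnalogySketch {
    // Scale a vector to unit length in place, so a dot product equals cosine similarity.
    static void normalize(float[] v) {
        double sum = 0;
        for (float x : v) sum += x * x;
        float len = (float) Math.sqrt(sum);
        for (int i = 0; i < v.length; i++) v[i] /= len;
    }

    // Return the index of the word closest to vec(b) - vec(a) + vec(c),
    // ignoring the three question words themselves.
    static int predict(float[][] m, int a, int b, int c) {
        int dim = m[0].length;
        float[] q = new float[dim];
        for (int d = 0; d < dim; d++) q[d] = m[b][d] - m[a][d] + m[c][d];
        int best = -1;
        float bestScore = -Float.MAX_VALUE;
        for (int w = 0; w < m.length; w++) {
            if (w == a || w == b || w == c) continue;
            float score = 0;
            for (int d = 0; d < dim; d++) score += q[d] * m[w][d];
            if (score > bestScore) {
                bestScore = score;
                best = w;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical toy vocabulary: man, king, woman, queen, apple.
        String[] words = {"man", "king", "woman", "queen", "apple"};
        float[][] m = {{1, 0, 0}, {1, 0, 1}, {0, 1, 0}, {0, 1, 1}, {0, 0, 1}};
        for (float[] v : m) normalize(v);
        // Question "man king woman queen": the prediction is counted as correct if it equals d.
        System.out.println(words[predict(m, 0, 1, 2)]);
    }
}
```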

The result is really surprising.
Your Java implementation:

CAPITAL-COMMON-COUNTRIES:
ACCURACY TOP1: 71.15 %  (360 / 506)
Total accuracy: 71.15 %   Semantic accuracy: 71.15 %   Syntactic accuracy: NaN % 
CAPITAL-WORLD:
ACCURACY TOP1: 46.42 %  (674 / 1452)
Total accuracy: 52.81 %   Semantic accuracy: 52.81 %   Syntactic accuracy: NaN % 
CURRENCY:
ACCURACY TOP1: 4.48 %  (12 / 268)
Total accuracy: 46.99 %   Semantic accuracy: 46.99 %   Syntactic accuracy: NaN % 
CITY-IN-STATE:
ACCURACY TOP1: 41.37 %  (650 / 1571)
Total accuracy: 44.67 %   Semantic accuracy: 44.67 %   Syntactic accuracy: NaN % 
FAMILY:
ACCURACY TOP1: 45.42 %  (139 / 306)
Total accuracy: 44.72 %   Semantic accuracy: 44.72 %   Syntactic accuracy: NaN % 
GRAM1-ADJECTIVE-TO-ADVERB:
ACCURACY TOP1: 10.32 %  (78 / 756)
Total accuracy: 39.37 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 10.32 % 
GRAM2-OPPOSITE:
ACCURACY TOP1: 13.40 %  (41 / 306)
Total accuracy: 37.83 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 11.21 % 
GRAM3-COMPARATIVE:
ACCURACY TOP1: 42.46 %  (535 / 1260)
Total accuracy: 38.74 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 28.17 % 
GRAM4-SUPERLATIVE:
ACCURACY TOP1: 18.38 %  (93 / 506)
Total accuracy: 37.25 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 26.41 % 
GRAM5-PRESENT-PARTICIPLE:
ACCURACY TOP1: 26.31 %  (261 / 992)
Total accuracy: 35.88 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 26.39 % 
GRAM6-NATIONALITY-ADJECTIVE:
ACCURACY TOP1: 75.13 %  (1030 / 1371)
Total accuracy: 41.67 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 39.26 % 
GRAM7-PAST-TENSE:
ACCURACY TOP1: 31.53 %  (420 / 1332)
Total accuracy: 40.40 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 37.68 % 
GRAM8-PLURAL:
ACCURACY TOP1: 61.09 %  (606 / 992)
Total accuracy: 42.17 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 40.77 % 
GRAM9-PLURAL-VERBS:
ACCURACY TOP1: 20.62 %  (134 / 650)
Total accuracy: 41.03 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 39.17 % 
Questions seen / total: 12268 19544   62.77 %

Google's C implementation:

capital-common-countries:
ACCURACY TOP1: 82.81 %  (419 / 506)
Total accuracy: 82.81 %   Semantic accuracy: 82.81 %   Syntactic accuracy: nan % 
capital-world:
ACCURACY TOP1: 62.26 %  (904 / 1452)
Total accuracy: 67.57 %   Semantic accuracy: 67.57 %   Syntactic accuracy: nan % 
currency:
ACCURACY TOP1: 23.13 %  (62 / 268)
Total accuracy: 62.22 %   Semantic accuracy: 62.22 %   Syntactic accuracy: nan % 
city-in-state:
ACCURACY TOP1: 44.68 %  (702 / 1571)
Total accuracy: 54.96 %   Semantic accuracy: 54.96 %   Syntactic accuracy: nan % 
family:
ACCURACY TOP1: 75.82 %  (232 / 306)
Total accuracy: 56.52 %   Semantic accuracy: 56.52 %   Syntactic accuracy: nan % 
gram1-adjective-to-adverb:
ACCURACY TOP1: 17.20 %  (130 / 756)
Total accuracy: 50.40 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 17.20 % 
gram2-opposite:
ACCURACY TOP1: 21.90 %  (67 / 306)
Total accuracy: 48.71 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 18.55 % 
gram3-comparative:
ACCURACY TOP1: 64.60 %  (814 / 1260)
Total accuracy: 51.83 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 43.54 % 
gram4-superlative:
ACCURACY TOP1: 39.72 %  (201 / 506)
Total accuracy: 50.95 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 42.86 % 
gram5-present-participle:
ACCURACY TOP1: 39.52 %  (392 / 992)
Total accuracy: 49.51 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 41.99 % 
gram6-nationality-adjective:
ACCURACY TOP1: 87.24 %  (1196 / 1371)
Total accuracy: 55.08 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 53.94 % 
gram7-past-tense:
ACCURACY TOP1: 38.21 %  (509 / 1332)
Total accuracy: 52.96 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 50.73 % 
gram8-plural:
ACCURACY TOP1: 67.54 %  (670 / 992)
Total accuracy: 54.21 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 52.95 % 
gram9-plural-verbs:
ACCURACY TOP1: 37.38 %  (243 / 650)
Total accuracy: 53.32 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 51.71 % 
Questions seen / total: 12268 19544   62.77 %
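
A note on reading the two listings: "ACCURACY TOP1" is per section, while "Total accuracy" is a running figure over all sections printed so far, and "Syntactic accuracy" stays NaN until the first gram section because it is 0 / 0 up to that point. The sketch below (class and method names are only illustrative) reproduces the first three "Total accuracy" values of the Java run, 71.15, 52.81 and 46.99, from the per-section counts.

```java
import java.util.Locale;

public class RunningAccuracy {
    // Percentage of correct answers; 0 / 0 yields NaN, as in the listings.
    static double percent(int correct, int seen) {
        return 100.0 * correct / seen;
    }

    public static void main(String[] args) {
        // {correct, seen} for the first three sections of the Java run.
        int[][] sections = {{360, 506}, {674, 1452}, {12, 268}};
        int correct = 0;
        int seen = 0;
        for (int[] s : sections) {
            correct += s[0];
            seen += s[1];
            System.out.println(String.format(Locale.ROOT,
                    "Total accuracy: %.2f %%   Syntactic accuracy: %.2f %%",
                    percent(correct, seen), percent(0, 0)));
        }
    }
}
```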

Can you give me any suggestions or ideas about this? I am ready to help you if needed. I think we both want to make this the best Java word2vec!

You can re-run this test after merging my pull request #20.

Thank you!


kojisekig · 2016-07-22T09:54:02Z

Hi hankcs! Thank you for your feedback.

I implemented this almost two years ago and have forgotten the details. I used Lucene in some places, and in doing so I made some compromises, so the results cannot be exactly the same. But as you kindly reported, the results are not good.

hankcs · 2016-07-22T10:04:52Z

Thank you for your reply. This implementation is the best one in Java, since the others yield worse accuracy rates.

I will look into your early commits and compare them carefully with the original C code. There must be some difference.

kojisekig · 2016-07-23T03:44:12Z

Thanks for your comment again.

I don't think I can take proactive action on this issue because I'm busy with my current project and don't have time, but I'm happy to help if you find a difference and ask me about it.

I'll do my best to remember why I implemented it differently, and I will improve it if possible.

Keep in touch!

hankcs · 2017-12-09T03:55:45Z

Hi Mr. Sekiguchi,

After a long time, my friend @tiandiweizun and I finally found the difference between this version and Google's. The scores differ because the two versions parse the command line differently.

When passing -hs 0, users want to turn hierarchical softmax off, but your code actually activates it whether 0 or 1 follows -hs. This differs from Google's logic. After fixing it, we found that the two versions reach similar accuracy.

I've submitted a pull request; you may consider merging it at your convenience.
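
To illustrate the kind of mismatch described above, here is a minimal sketch contrasting a presence-only check with a value-based check. It is not the project's actual parser: the class and method names are hypothetical, and the presence-only variant is only an assumption about how "-hs 0" could end up enabling the option.

```java
import java.util.Arrays;
import java.util.List;

public class HsFlagDemo {
    // Presence-only check (assumed shape of the bug): any occurrence of "-hs"
    // enables the option, so "-hs 0" still turns hierarchical softmax on.
    static boolean hsByPresence(String[] args) {
        return Arrays.asList(args).contains("-hs");
    }

    // Value-based check in the style of Google's C tool: read the integer
    // that follows "-hs" and treat 0 as off. If the flag or its value is
    // missing, fall back to the given default.
    static boolean hsByValue(String[] args, boolean defaultValue) {
        List<String> list = Arrays.asList(args);
        int i = list.indexOf("-hs");
        if (i < 0 || i + 1 >= args.length) {
            return defaultValue;
        }
        return Integer.parseInt(args[i + 1]) != 0;
    }

    public static void main(String[] args) {
        String[] cli = {"-cbow", "1", "-negative", "25", "-hs", "0"};
        System.out.println("presence-only: " + hsByPresence(cli));
        System.out.println("value-based:   " + hsByValue(cli, false));
    }
}
```

With the arguments used in this issue, the presence-only check enables hierarchical softmax alongside negative sampling, while the value-based check leaves it off.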

Thank you.

tiandiweizun mentioned this issue Nov 29, 2017

word2vector accuracy test: there seems to be no real difference from the C version hankcs/HanLP#699

Closed

1 task

hankcs mentioned this issue Dec 9, 2017

make command line interface compatible with Google's C code #22

Merged

hankcs closed this as completed Dec 12, 2017
