Accuracy rate seems to be 10% lower than the original version #21

Closed
hankcs opened this issue Jul 20, 2016 · 4 comments

Comments

@hankcs
Contributor

hankcs commented Jul 20, 2016

Hello, kojisekig.
Thank you for your nice Java code. It is the closest port of Google's original C version that I have found.
However, I measured its accuracy, and it is about 10% lower than the original version's.
I trained both on text8 with exactly the same parameters:

com.rondhuit.w2v.demo.TextFileCreateVectors -input text8.txt -output vectors.txt -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 8 -binary 1 -iter 15
./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 8 -binary 1 -iter 15

Note that I used your com.rondhuit.w2v.Text8Splitter to split text8 into multiple lines. I don't think this affects the result, since MAX_WORDS is 1000 in both implementations.

Then I translated compute-accuracy.c from Google's C code into Java and ran the test with the same parameters:

com.rondhuit.w2v.demo.ComputeAccuracy vectors.txt 30000 questions-words.txt
./compute-accuracy vectors.bin 30000 < questions-words.txt
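For readers unfamiliar with the analogy test, the following is a minimal Java sketch of the top-1 check that compute-accuracy.c performs for each question "a b c d": build the target b - a + c from unit-length vectors, pick the vocabulary word with the largest dot product (excluding a, b and c), and count a hit if that word is d. The class and method names (`AnalogySketch`, `normalize`, `answer`) and the toy vectors are hypothetical and are not taken from either code base.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a top-1 analogy lookup; not the actual ComputeAccuracy code. */
final class AnalogySketch {

    /** Returns a unit-length copy of v. */
    static float[] normalize(float[] v) {
        double sum = 0;
        for (float x : v) sum += x * x;
        float len = (float) Math.sqrt(sum);
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) out[i] = v[i] / len;
        return out;
    }

    /**
     * Answers "a is to b as c is to ?" over unit-length vectors.
     * Returns null if a, b or c is out of vocabulary (such questions are skipped).
     */
    static String answer(Map<String, float[]> unitVectors, String a, String b, String c) {
        float[] va = unitVectors.get(a), vb = unitVectors.get(b), vc = unitVectors.get(c);
        if (va == null || vb == null || vc == null) return null;

        // Target vector: b - a + c
        float[] target = new float[va.length];
        for (int i = 0; i < target.length; i++) target[i] = vb[i] - va[i] + vc[i];

        String best = null;
        float bestScore = Float.NEGATIVE_INFINITY;
        for (Map.Entry<String, float[]> e : unitVectors.entrySet()) {
            String w = e.getKey();
            if (w.equals(a) || w.equals(b) || w.equals(c)) continue; // question words are excluded
            float dot = 0;
            for (int i = 0; i < target.length; i++) dot += target[i] * e.getValue()[i];
            if (dot > bestScore) { bestScore = dot; best = w; }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy 2-dimensional vectors, invented for illustration only.
        Map<String, float[]> m = new HashMap<>();
        m.put("man", normalize(new float[] {1f, 0f}));
        m.put("woman", normalize(new float[] {0f, 1f}));
        m.put("king", normalize(new float[] {1f, 0.2f}));
        m.put("queen", normalize(new float[] {0.2f, 1f}));
        m.put("apple", normalize(new float[] {-1f, 0f}));
        System.out.println(answer(m, "man", "woman", "king")); // prints "queen"
    }
}
```

Because both evaluators apply the same rule to the same question file, a gap in the scores points at the trained vectors rather than at the evaluation step.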

The result is really surprising.
Your Java implementation:

CAPITAL-COMMON-COUNTRIES:
ACCURACY TOP1: 71.15 %  (360 / 506)
Total accuracy: 71.15 %   Semantic accuracy: 71.15 %   Syntactic accuracy: NaN % 
CAPITAL-WORLD:
ACCURACY TOP1: 46.42 %  (674 / 1452)
Total accuracy: 52.81 %   Semantic accuracy: 52.81 %   Syntactic accuracy: NaN % 
CURRENCY:
ACCURACY TOP1: 4.48 %  (12 / 268)
Total accuracy: 46.99 %   Semantic accuracy: 46.99 %   Syntactic accuracy: NaN % 
CITY-IN-STATE:
ACCURACY TOP1: 41.37 %  (650 / 1571)
Total accuracy: 44.67 %   Semantic accuracy: 44.67 %   Syntactic accuracy: NaN % 
FAMILY:
ACCURACY TOP1: 45.42 %  (139 / 306)
Total accuracy: 44.72 %   Semantic accuracy: 44.72 %   Syntactic accuracy: NaN % 
GRAM1-ADJECTIVE-TO-ADVERB:
ACCURACY TOP1: 10.32 %  (78 / 756)
Total accuracy: 39.37 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 10.32 % 
GRAM2-OPPOSITE:
ACCURACY TOP1: 13.40 %  (41 / 306)
Total accuracy: 37.83 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 11.21 % 
GRAM3-COMPARATIVE:
ACCURACY TOP1: 42.46 %  (535 / 1260)
Total accuracy: 38.74 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 28.17 % 
GRAM4-SUPERLATIVE:
ACCURACY TOP1: 18.38 %  (93 / 506)
Total accuracy: 37.25 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 26.41 % 
GRAM5-PRESENT-PARTICIPLE:
ACCURACY TOP1: 26.31 %  (261 / 992)
Total accuracy: 35.88 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 26.39 % 
GRAM6-NATIONALITY-ADJECTIVE:
ACCURACY TOP1: 75.13 %  (1030 / 1371)
Total accuracy: 41.67 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 39.26 % 
GRAM7-PAST-TENSE:
ACCURACY TOP1: 31.53 %  (420 / 1332)
Total accuracy: 40.40 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 37.68 % 
GRAM8-PLURAL:
ACCURACY TOP1: 61.09 %  (606 / 992)
Total accuracy: 42.17 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 40.77 % 
GRAM9-PLURAL-VERBS:
ACCURACY TOP1: 20.62 %  (134 / 650)
Total accuracy: 41.03 %   Semantic accuracy: 44.72 %   Syntactic accuracy: 39.17 % 
Questions seen / total: 12268 19544   62.77 % 

Google's C implementation:

capital-common-countries:
ACCURACY TOP1: 82.81 %  (419 / 506)
Total accuracy: 82.81 %   Semantic accuracy: 82.81 %   Syntactic accuracy: nan % 
capital-world:
ACCURACY TOP1: 62.26 %  (904 / 1452)
Total accuracy: 67.57 %   Semantic accuracy: 67.57 %   Syntactic accuracy: nan % 
currency:
ACCURACY TOP1: 23.13 %  (62 / 268)
Total accuracy: 62.22 %   Semantic accuracy: 62.22 %   Syntactic accuracy: nan % 
city-in-state:
ACCURACY TOP1: 44.68 %  (702 / 1571)
Total accuracy: 54.96 %   Semantic accuracy: 54.96 %   Syntactic accuracy: nan % 
family:
ACCURACY TOP1: 75.82 %  (232 / 306)
Total accuracy: 56.52 %   Semantic accuracy: 56.52 %   Syntactic accuracy: nan % 
gram1-adjective-to-adverb:
ACCURACY TOP1: 17.20 %  (130 / 756)
Total accuracy: 50.40 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 17.20 % 
gram2-opposite:
ACCURACY TOP1: 21.90 %  (67 / 306)
Total accuracy: 48.71 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 18.55 % 
gram3-comparative:
ACCURACY TOP1: 64.60 %  (814 / 1260)
Total accuracy: 51.83 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 43.54 % 
gram4-superlative:
ACCURACY TOP1: 39.72 %  (201 / 506)
Total accuracy: 50.95 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 42.86 % 
gram5-present-participle:
ACCURACY TOP1: 39.52 %  (392 / 992)
Total accuracy: 49.51 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 41.99 % 
gram6-nationality-adjective:
ACCURACY TOP1: 87.24 %  (1196 / 1371)
Total accuracy: 55.08 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 53.94 % 
gram7-past-tense:
ACCURACY TOP1: 38.21 %  (509 / 1332)
Total accuracy: 52.96 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 50.73 % 
gram8-plural:
ACCURACY TOP1: 67.54 %  (670 / 992)
Total accuracy: 54.21 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 52.95 % 
gram9-plural-verbs:
ACCURACY TOP1: 37.38 %  (243 / 650)
Total accuracy: 53.32 %   Semantic accuracy: 56.52 %   Syntactic accuracy: 51.71 % 
Questions seen / total: 12268 19544   62.77 %
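A note on reading the two listings: the "Total accuracy" lines are running totals over all categories seen so far, not per-category values, which is why they differ from the "ACCURACY TOP1" lines. The short Java sketch below (class and method names are hypothetical) recomputes the running totals from the correct/seen counts printed above. The final values, 41.03 % for the Java port and 53.32 % for the C version, are the overall scores being compared.

```java
/** Hypothetical helper that recomputes the cumulative "Total accuracy" column. */
final class RunningAccuracySketch {

    /** Returns the cumulative accuracy in percent after each category. */
    static double[] cumulative(int[] correct, int[] seen) {
        double[] out = new double[correct.length];
        long correctSoFar = 0, seenSoFar = 0;
        for (int i = 0; i < correct.length; i++) {
            correctSoFar += correct[i];
            seenSoFar += seen[i];
            out[i] = 100.0 * correctSoFar / seenSoFar;
        }
        return out;
    }

    public static void main(String[] args) {
        // Per-category counts copied from the two listings above, in order.
        int[] seen = {506, 1452, 268, 1571, 306, 756, 306, 1260, 506, 992, 1371, 1332, 992, 650};
        int[] java = {360, 674, 12, 650, 139, 78, 41, 535, 93, 261, 1030, 420, 606, 134};
        int[] c    = {419, 904, 62, 702, 232, 130, 67, 814, 201, 392, 1196, 509, 670, 243};

        double[] j = cumulative(java, seen);
        double[] g = cumulative(c, seen);
        System.out.printf("Java port: %.2f %%%n", j[j.length - 1]);
        System.out.printf("C version: %.2f %%%n", g[g.length - 1]);
    }
}
```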

Can you give me any suggestions or ideas about this? I am ready to help you if needed. I think we both want to make this the best Java word2vec!

You can re-run this test after merging my pull request #20.

Thank you!

@kojisekig
Owner

Hi hankcs! Thank you for your feedback.

I implemented this almost two years ago and have forgotten the details. I used Lucene in some places, and doing so required some compromises, so the results cannot be exactly the same. But as you kindly reported, the results are not good.

@hankcs
Contributor Author

hankcs commented Jul 22, 2016

Thank you for your reply. This implementation is the best one in Java, since the others yield worse accuracy rates.

I will look into your early commits and compare them carefully with the original C code. There must be some difference.

@kojisekig
Owner

Thanks for your comment again.

I don't think I can work on this issue proactively, because I'm busy with my current project and don't have time, but I'm happy to help if you find some differences and ask me about them.

I'll do my best to remember why I implemented things differently, and I will improve them if possible.

Keep in touch!

@hankcs
Contributor Author

hankcs commented Dec 9, 2017

Hi Mr. Sekiguchi,

After a long time, my friend @tiandiweizun and I have finally found the difference between this version and Google's. The scores differ because the two versions parse the command line differently.

When passing -hs 0, users want to turn hierarchical softmax off, but your code actually activates it regardless of whether 0 or 1 follows -hs. This logic differs from Google's. After fixing it, we found that the two versions reach similar accuracy.
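To illustrate the kind of parsing difference described here, the following is a minimal Java sketch. The class and method names (`HsFlagSketch`, `parseHsByPresence`, `parseHsByValue`) are hypothetical and are not taken from this repository or from Google's code. The first method treats the mere presence of -hs as "on"; the second reads the integer that follows the flag, so that -hs 0 disables hierarchical softmax.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of two ways to interpret "-hs"; names are illustrative only. */
final class HsFlagSketch {

    /** Presence style: enabled whenever "-hs" appears at all, so "-hs 0" still enables it. */
    static boolean parseHsByPresence(String[] args) {
        return Arrays.asList(args).contains("-hs");
    }

    /** Value style: read the integer after "-hs"; 0 disables, any other value enables. */
    static boolean parseHsByValue(String[] args, boolean defaultValue) {
        List<String> list = Arrays.asList(args);
        int i = list.indexOf("-hs");
        if (i < 0 || i + 1 >= args.length) {
            return defaultValue; // flag absent or value missing: keep the default
        }
        return Integer.parseInt(args[i + 1]) != 0;
    }

    public static void main(String[] unused) {
        String[] args = {"-cbow", "1", "-negative", "25", "-hs", "0"};
        System.out.println("by presence: " + parseHsByPresence(args));   // true
        System.out.println("by value:    " + parseHsByValue(args, false)); // false
    }
}
```

With the presence style, a run that passes both -negative 25 and -hs 0 would train with negative sampling and hierarchical softmax together, which is a different configuration from the one used for the C baseline.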

I've submitted a pull request; please consider merging it at your convenience.

Thank you.

@hankcs hankcs closed this as completed Dec 12, 2017