Accuracy rate seems to be 10% lower than the original version #21
Comments
Hi Hancks! Thank you for your feedback. I implemented this almost two years ago and have forgotten the details. I used Lucene in some places, and doing so required some compromises, so the results cannot be identical. But as you kindly reported, the results were not good.
Thank you for your reply. This implementation is the best one in Java, since the others yield worse accuracy rates. I will look into your early commits and compare them carefully with the original C code. There must be some difference.
Thanks for your comment again. I don't think I can take proactive action on this issue because I'm busy with my current project and don't have time, but I'm happy to help if you find a difference and ask me about it. I'll do my best to remember why I implemented things differently and will improve them if possible. Keep in touch!
Hi Mr. Sekiguchi, After a long time, my friend @tiandiweizun and I finally found the difference between this version and Google's. The scores differ because the two versions parse the command line differently. When performing I've submitted a pull request, which you may consider merging at your convenience. Thank you.
Hello, kojisekig.
Thank you for your nice Java code. This is the closest version to Google's original C version.
But I computed the accuracy rate, and it is 10% lower than the original version.
I trained on text8 with exactly the same parameters, which are:
Note that I used your com.rondhuit.w2v.Text8Splitter to split text8 into multiple lines. I don't think this affects the result, since MAX_WORDS is 1000 in both implementations.
Then I translated compute-accuracy.c from Google's C code into Java and ran the test with the same parameters:
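For readers unfamiliar with this step: text8 ships as one very long line, so it has to be cut into shorter lines before line-oriented training. The following is only a minimal sketch of that idea, not the actual com.rondhuit.w2v.Text8Splitter; the class name `CorpusSplitSketch`, the method `splitIntoLines`, and the `maxWords` parameter are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class CorpusSplitSketch {

    /**
     * Splits a whitespace-separated corpus into lines holding at most
     * maxWords words each. Hypothetical helper, for illustration only.
     */
    public static List<String> splitIntoLines(String corpus, int maxWords) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int count = 0;
        for (String word : corpus.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only when the corpus is empty
            }
            if (count == maxWords) {
                // The current line is full: emit it and start a new one.
                lines.add(current.toString());
                current.setLength(0);
                count = 0;
            }
            if (count > 0) {
                current.append(' ');
            }
            current.append(word);
            count++;
        }
        if (count > 0) {
            lines.add(current.toString()); // last, possibly shorter, line
        }
        return lines;
    }

    public static void main(String[] args) {
        // Tiny demonstration with maxWords = 2 instead of 1000.
        for (String line : splitIntoLines("a b c d e", 2)) {
            System.out.println(line);
        }
    }
}
```

Whether such a split is truly neutral depends on how each trainer treats line boundaries, so it may be worth ruling out when comparing the two implementations.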
The result is really surprising.
Your Java implementation:
Google's C implementation:
Can you give me any suggestions or ideas about this? I am ready to help you if needed. I think we both want to make this the best Java word2vec!
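To make the comparison concrete, here is a rough sketch of the analogy test as I understand it from compute-accuracy.c: for a question "a b c d", it forms the vector b - a + c over length-normalized vectors, picks the nearest word other than a, b and c, and counts a hit when that word equals d. This is not the code from my pull request; the class `AnalogyAccuracySketch`, its method names, and the toy vectors are assumptions made for illustration, and details such as vocabulary thresholds are left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AnalogyAccuracySketch {

    /** Scales v to unit length in place, so a dot product acts as cosine similarity. */
    public static void normalize(float[] v) {
        double norm = 0;
        for (float x : v) norm += x * x;
        norm = Math.sqrt(norm);
        if (norm == 0) return;
        for (int i = 0; i < v.length; i++) v[i] /= norm;
    }

    /** Returns the word nearest to (b - a + c), excluding a, b and c; null if any is unknown. */
    public static String predict(Map<String, float[]> vectors, String a, String b, String c) {
        float[] va = vectors.get(a), vb = vectors.get(b), vc = vectors.get(c);
        if (va == null || vb == null || vc == null) return null;
        float[] target = new float[va.length];
        for (int i = 0; i < target.length; i++) target[i] = vb[i] - va[i] + vc[i];
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, float[]> e : vectors.entrySet()) {
            String w = e.getKey();
            if (w.equals(a) || w.equals(b) || w.equals(c)) continue;
            float[] v = e.getValue();
            double score = 0;
            for (int i = 0; i < v.length; i++) score += target[i] * v[i];
            if (score > bestScore) {
                bestScore = score;
                best = w;
            }
        }
        return best;
    }

    /** Fraction of questions {a, b, c, d} answered correctly; out-of-vocabulary questions are skipped. */
    public static double accuracy(Map<String, float[]> vectors, String[][] questions) {
        int seen = 0, correct = 0;
        for (String[] q : questions) {
            if (!vectors.containsKey(q[3])) continue;
            String guess = predict(vectors, q[0], q[1], q[2]);
            if (guess == null) continue;
            seen++;
            if (guess.equals(q[3])) correct++;
        }
        return seen == 0 ? 0.0 : (double) correct / seen;
    }

    public static void main(String[] args) {
        // Toy vectors, made up for this example only.
        Map<String, float[]> vectors = new LinkedHashMap<>();
        vectors.put("man", new float[] {1f, 0f, 0f});
        vectors.put("woman", new float[] {1f, 1f, 0f});
        vectors.put("king", new float[] {1f, 0f, 1f});
        vectors.put("queen", new float[] {1f, 1f, 1f});
        vectors.put("apple", new float[] {-1f, 0f, 0f});
        for (float[] v : vectors.values()) normalize(v);
        String[][] questions = {{"man", "woman", "king", "queen"}};
        System.out.println("accuracy = " + accuracy(vectors, questions));
    }
}
```

Because the evaluation is this simple, a gap of this size is more likely to come from training than from the test itself, but checking both sides against the C code seems worthwhile.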
You can re-run this test after merging my pull request #20.
Thank you!