Identity similarity is often 0.0. #1191
Comments
It also happens in cases like
As you will see, 0.0 is especially common for city names and shorter phrases like job titles.
I think this is happening for OOV items, which are very common with the default small model in 1.x: only the 5000 most common entries have a vector. OOV items get the origin (all-zero) vector, and similarity involving it is reported as 0.0, because that seemed better than giving 1.0. I agree that identity should produce 1.0. In v2 this will come up less, because the vectors are assigned by a convolutional neural network, so they take context into account for rare words. So if you use 2.0 the problem should be solved. The quickest fix for you would be to overwrite the similarity function. Something like this should work:
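A minimal sketch of such an override, not the snippet originally posted in this comment. It works on plain sequences of floats rather than spaCy objects, and the name `safe_similarity` is hypothetical. The assumption is that a thin wrapper passing `obj.vector` and `other.vector` to it could be registered through spaCy's `doc.user_hooks["similarity"]`; treat that wiring as something to verify against your spaCy version.

```python
import math


def safe_similarity(vec_a, vec_b):
    """Cosine similarity where identical inputs give 1.0, even if all-zero.

    `safe_similarity` is a hypothetical helper, not part of spaCy.
    """
    a = list(vec_a)
    b = list(vec_b)
    # Identity case: covers OOV items whose vector is the origin.
    if a == b:
        return 1.0
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # One side has no vector: keep the existing 0.0 convention.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)
```

With this, two zero vectors compare as identical, while a zero vector compared with a real one still yields 0.0.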
Yes and no.
The content in mix.en.txt is not that arcane: world capitals and so on. I will install the nightly build and try again. What about single words or phrases where there is no context? In that case, the fastText approach is to use char-level n-grams.
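To make the char-level n-gram idea concrete, here is a small sketch of how a word can be split into character n-grams with boundary markers, in the style fastText describes. The function name `char_ngrams` is hypothetical, and the default range of 3 to 6 characters is assumed to match fastText's default `minn`/`maxn` settings. An out-of-vocabulary word can then be given a vector by combining the vectors of its n-grams, which is not shown here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in < and > markers.

    Hypothetical helper for illustration; not the fastText implementation.
    """
    marked = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the marked word.
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


print(char_ngrams("where", 3, 3))
```

Because even a rare word shares n-grams with known words, it no longer has to fall back to an all-zero vector.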
Result:
The 0.0 problem is fixed, but there are still a few above 1.0.
The result:
I know it's a rounding error, but it could improve performance to short-circuit if the strings are equal.
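A sketch of what that short-circuit could look like, combined with clamping so that floating-point rounding cannot push a result outside [-1.0, 1.0]. It uses plain strings and lists of floats; the names `cosine` and `similarity` are hypothetical and this is not spaCy's implementation.

```python
import math


def cosine(a, b):
    """Plain cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity(text_a, vec_a, text_b, vec_b):
    """Similarity with an equal-text short-circuit and a clamped result."""
    # Equal strings: skip the arithmetic entirely and return exactly 1.0.
    if text_a == text_b:
        return 1.0
    # Clamp so rounding error never yields values such as 1.0000001.
    return max(-1.0, min(1.0, cosine(vec_a, vec_b)))
```

One design caveat: if vectors depend on context, equal text does not guarantee equal vectors, so the short-circuit is a policy choice rather than a pure optimization. The clamp is safe in either case.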
Should be improved in v2.0 and the new models! Details:
In my naive view it would be easy to avoid this with a line or two of code:
And I assume it would be desirable, but maybe there is a counterargument?
expected: A word or doc has the same vector as itself, so x.similarity(x) is always 1.0.
actual: On a test of about 6K sentences, phrases and entities, about 2K were slightly > 1.0, and about 2K == 0.0.
I ran it like this:
The result:
Here is mix.en.txt.
My env: