
Identity similarity is often 0.0. #1191

Closed
bittlingmayer opened this issue Jul 13, 2017 · 9 comments
@bittlingmayer
Contributor

bittlingmayer commented Jul 13, 2017

expected: A word or doc has the same vector as itself, so x.similarity(x) is always 1.0.

actual: On a test of about 6K sentences, phrases and entities, about 2K were slightly > 1.0, and about 2K == 0.0.

import spacy
from sys import argv

en = spacy.load('en')

path, threshold = argv[1], float(argv[2])
with open(path) as f:
    for line in f:
        d = en(line.strip())
        # Print any doc whose self-similarity falls at or below the threshold.
        if d.similarity(d) <= threshold:
            print(d)

I ran it like this:

python test.py mix.en.txt 0.0 > test.txt.0.0
python test.py mix.en.txt 0.99 > test.txt.0.99
python test.py mix.en.txt 1.0 > test.txt.1.0
wc -l mix.en.txt
wc -l test.txt*

The result:

    6375 mix.en.txt
    1794 test.txt.0.0
    1794 test.txt.0.99
    4008 test.txt.1.0

Here is mix.en.txt.

My env:

  • Installed models: en
  • Platform: Darwin-16.6.0-x86_64-i386-64bit
  • Python version: 3.5.2
  • spaCy version: 1.8.2
@bittlingmayer
Contributor Author

It also happens in cases like en("Welder").similarity(en("welder")).

@bittlingmayer
Contributor Author

As you will see, 0.0 is especially common for the city names and shorter phrases like job titles.

@honnibal
Member

honnibal commented Jul 13, 2017

I think this is happening for OOV items, which are very common in the default small model in 1.x -- only the 5000 most common entries have a vector. The origin vector is given similarity 0.0 in this case, because it seems better than giving 1.0. I agree that identity should produce 1.0.

In v2 this will come up less, because the vectors are assigned by a convolutional neural network, so they take context into account for rare words. So if you use 2.0 the problem should be solved.

The quickest fix for you would be to overwrite the similarity function. Something like this should work:

def custom_similarity(x, y):
    # Identical strings (ignoring case) are maximally similar by definition.
    if x.text.lower() == y.text.lower():
        return 1.0
    else:
        return cosine(x.vector, y.vector)

def cosine(x, y):
    return x.dot(y) / ((x ** 2).sum() * (y ** 2).sum()) ** 0.5

doc.user_hooks['similarity'] = custom_similarity
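For what it's worth, the plain cosine above divides by zero when either vector is all zeros, which is exactly the OOV case that triggers this issue. A minimal guarded version, assuming plain NumPy arrays (this is a sketch of mine, not spaCy's internal code):

```python
import numpy as np

def safe_cosine(x, y):
    # Product of squared norms; zero if either word has no vector (OOV).
    norm = (np.dot(x, x) * np.dot(y, y)) ** 0.5
    if norm == 0.0:
        return 0.0  # mirrors spaCy's choice of 0.0 for missing vectors
    return float(np.dot(x, y) / norm)

a = np.array([1.0, 2.0, 3.0])
zero = np.zeros(3)
print(safe_cosine(a, a))     # 1.0 (up to float rounding)
print(safe_cosine(a, zero))  # 0.0 rather than NaN
```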

@honnibal honnibal added enhancement Feature requests and improvements performance and removed enhancement Feature requests and improvements labels Jul 13, 2017
@bittlingmayer
Contributor Author

bittlingmayer commented Jul 13, 2017

Yes and no.

>>> en("welder").vector
array([ 0.,  0.,  0., ...,  0.,  0.,  0.], dtype=float32)
(a 300-dimensional all-zero vector, elided here for brevity)
>>> en("welder")[0].is_oov
False

The content in mix.en.txt is not that arcane -- world capitals and so on.

Will install nightly and try again.

What about single words or phrases where there is no context? In that case, the fastText approach is to use char-level n-grams.
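For illustration, a rough sketch of that fastText-style idea (the `char_ngrams` helper below is hypothetical, not part of spaCy's or fastText's API):

```python
# Represent a word by its character n-grams, with boundary markers, so an
# unseen word still shares subword features with known words.
def char_ngrams(word, n_min=3, n_max=6):
    padded = "<" + word + ">"  # fastText-style boundary markers
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    return grams

# "Welder" lowercased and "welder" share all their n-grams, so even an
# OOV surface form could be mapped near its known variants.
shared = char_ngrams("Welder".lower()) & char_ngrams("welder")
print(len(shared) > 0)  # True
```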

@bittlingmayer
Contributor Author

en = spacy.load('en_core_web_sm')
en("welder").similarity(en("welder"))
en("netimvortshatts").similarity(en("netimvortshatts"))

Result:

0.99999999919917582
0.99999999741823353

The 0.0 problem is fixed, but there are still a few above 1.0.

python test.py mix.en.txt 0.0 > test.txt.0.0
python test.py mix.en.txt 1.0 > test.txt.1.0
wc -l mix.en.txt
wc -l test.txt*

The result:

    6375 mix.en.txt
    0 test.txt.0.0
    3158 test.txt.1.0

@bittlingmayer
Contributor Author

I know it's a rounding error, but it could improve performance to short-circuit if the strings are equal.

@ines
Member

ines commented Nov 9, 2017

Should be improved in v2.0 and the new models! Details:

@ines ines closed this as completed Nov 9, 2017
@bittlingmayer
Contributor Author

In my naive view it would be easy to avoid this with a line or two of code:

Identical tokens are obviously 100% similar to each other (just not always exactly 1.0, because of vector math and floating point imprecisions).

And I assume it would be desirable, but maybe there is a counterargument?
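A rough sketch of what such a line or two could look like (hypothetical, not spaCy's actual code): short-circuit on identical text, and clamp the cosine into [-1, 1] so identity never comes out above 1.0.

```python
import numpy as np

def similarity(x_text, x_vec, y_text, y_vec):
    if x_text == y_text:
        return 1.0  # identical strings are maximally similar by definition
    norm = (np.dot(x_vec, x_vec) * np.dot(y_vec, y_vec)) ** 0.5
    if norm == 0.0:
        return 0.0  # OOV case: no vector on either side
    # Clamp to remove floating-point overshoot like 1.0000000008.
    return float(np.clip(np.dot(x_vec, y_vec) / norm, -1.0, 1.0))

v = np.array([0.1, 0.2, 0.3])
print(similarity("welder", v, "welder", v * 2))  # 1.0 via the short-circuit
```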

@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018