
Identity similarity is often 0.0. #1191

Closed
bittlingmayer opened this issue Jul 13, 2017 · 9 comments
@bittlingmayer
Contributor

bittlingmayer commented Jul 13, 2017

expected: A word or doc has the same vector as itself, so x.similarity(x) is always 1.0.

actual: On a test of about 6K sentences, phrases and entities, about 2K were slightly > 1.0, and about 2K == 0.0.

import spacy
from sys import argv

en = spacy.load('en')

path, threshold = argv[1], float(argv[2])
with open(path) as f:
    for line in f:
        d = en(line.strip())
        # Print any doc whose self-similarity falls at or below the threshold.
        if d.similarity(d) <= threshold:
            print(d)

I ran it like this:

python test.py mix.en.txt 0.0 > test.txt.0.0
python test.py mix.en.txt 0.99 > test.txt.0.99
python test.py mix.en.txt 1.0 > test.txt.1.0
wc -l mix.en.txt
wc -l test.txt*

The result:

    6375 mix.en.txt
    1794 test.txt.0.0
    1794 test.txt.0.99
    4008 test.txt.1.0

Here is mix.en.txt.

My env:

  • Installed models: en
  • Platform: Darwin-16.6.0-x86_64-i386-64bit
  • Python version: 3.5.2
  • spaCy version: 1.8.2
@bittlingmayer
Contributor Author

It also happens in cases like en("Welder").similarity(en("welder")).

@bittlingmayer
Contributor Author

As you will see, 0.0 is especially common for the city names and shorter phrases like job titles.

@honnibal
Member

honnibal commented Jul 13, 2017

I think this is happening for OOV items, which are very common in the default small model in 1.x -- only the 5000 most common entries have a vector. The origin vector is given similarity 0.0 in this case, because it seems better than giving 1.0. I agree that identity should produce 1.0.

In v2 this will come up less, because the vectors are assigned by a convolutional neural network, so they take context into account for rare words. So if you use 2.0 the problem should be solved.

The quickest fix for you would be to overwrite the similarity function. Something like this should work:

def custom_similarity(x, y):
    # Identical strings (ignoring case) are maximally similar by definition.
    if x.text.lower() == y.text.lower():
        return 1.0
    else:
        return cosine(x.vector, y.vector)

def cosine(x, y):
    return x.dot(y) / ((x ** 2).sum() * (y ** 2).sum()) ** 0.5

doc.user_hooks['similarity'] = custom_similarity
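For what it's worth, the plain cosine above divides by zero when either vector is all zeros, which is exactly the OOV case that triggers this issue. A minimal guarded version, assuming plain NumPy arrays (this is a sketch of mine, not spaCy's internal code):

```python
import numpy as np

def safe_cosine(x, y):
    # Product of squared norms; zero if either word has no vector (OOV).
    norm = (np.dot(x, x) * np.dot(y, y)) ** 0.5
    if norm == 0.0:
        return 0.0  # mirrors spaCy's choice of 0.0 for missing vectors
    return float(np.dot(x, y) / norm)

a = np.array([1.0, 2.0, 3.0])
zero = np.zeros(3)
print(safe_cosine(a, a))     # 1.0 (up to float rounding)
print(safe_cosine(a, zero))  # 0.0 rather than NaN
```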

@honnibal honnibal added enhancement Feature requests and improvements performance and removed enhancement Feature requests and improvements labels Jul 13, 2017
@bittlingmayer
Contributor Author

bittlingmayer commented Jul 13, 2017

Yes and no.

>>> en("welder").vector
array([ 0.,  0.,  0., ...,  0.,  0.,  0.], dtype=float32)
(a 300-dimensional all-zero vector, elided here for brevity)
>>> en("welder")[0].is_oov
False

The content in mix.en.txt is not that arcane -- world capitals and so on.

Will install nightly and try again.

What about single words or phrases where there is no context? In that case, the fastText approach is to use char-level n-grams.
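For illustration, a rough sketch of that fastText-style idea (the `char_ngrams` helper below is hypothetical, not part of spaCy's or fastText's API):

```python
# Represent a word by its character n-grams, with boundary markers, so an
# unseen word still shares subword features with known words.
def char_ngrams(word, n_min=3, n_max=6):
    padded = "<" + word + ">"  # fastText-style boundary markers
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    return grams

# "Welder" lowercased and "welder" share all their n-grams, so even an
# OOV surface form could be mapped near its known variants.
shared = char_ngrams("Welder".lower()) & char_ngrams("welder")
print(len(shared) > 0)  # True
```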

@bittlingmayer
Contributor Author

en = spacy.load('en_core_web_sm')
en("welder").similarity(en("welder"))
en("netimvortshatts").similarity(en("netimvortshatts"))

Result:

0.99999999919917582
0.99999999741823353

The 0.0 problem is fixed, but there are still a few above 1.0.

python test.py mix.en.txt 0.0 > test.txt.0.0
python test.py mix.en.txt 1.0 > test.txt.1.0
wc -l mix.en.txt
wc -l test.txt*

The result:

    6375 mix.en.txt
    0 test.txt.0.0
    3158 test.txt.1.0

@bittlingmayer
Contributor Author

I know it's a rounding error, but it could improve performance to short-circuit if the strings are equal.

@ines
Member

ines commented Nov 9, 2017

Should be improved in v2.0 and the new models! Details:

@ines ines closed this as completed Nov 9, 2017
@bittlingmayer
Contributor Author

In my naive view it would be easy to avoid this with a line or two of code:

Identical tokens are obviously 100% similar to each other (just not always exactly 1.0, because of vector math and floating point imprecisions).

And I assume it would be desirable, but maybe there is a counterargument?
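A rough sketch of what such a line or two could look like (hypothetical, not spaCy's actual code): short-circuit on identical text, and clamp the cosine into [-1, 1] so identity never comes out above 1.0.

```python
import numpy as np

def similarity(x_text, x_vec, y_text, y_vec):
    if x_text == y_text:
        return 1.0  # identical strings are maximally similar by definition
    norm = (np.dot(x_vec, x_vec) * np.dot(y_vec, y_vec)) ** 0.5
    if norm == 0.0:
        return 0.0  # OOV case: no vector on either side
    # Clamp to remove floating-point overshoot like 1.0000000008.
    return float(np.clip(np.dot(x_vec, y_vec) / norm, -1.0, 1.0))

v = np.array([0.1, 0.2, 0.3])
print(similarity("welder", v, "welder", v * 2))  # 1.0 via the short-circuit
```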

@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018