
Missing vectors in Spacy models #1341

Closed
znat opened this issue Sep 20, 2017 · 4 comments
Labels
lang / en (English language data and models), models (Issues related to the statistical models)

Comments

znat commented Sep 20, 2017

The default English model installs vectors for one million vocabulary entries.

But vectors are missing for words I would expect to be in that million:

>>> nlp(u"insurer")[0].has_vector
False
>>> nlp(u"indian")[0].has_vector
False
>>> nlp(u"plumber")[0].has_vector
False
>>> nlp(u"pharmacy")[0].has_vector
False
>>> nlp(u"electrician")[0].has_vector
False
>>> nlp(u"queens")[0].has_vector
False
>>> nlp(u"queen")[0].has_vector
True

Model info:

lang: en
name: core_web_md
license: CC BY-SA 3.0
author: Explosion AI
url: https://explosion.ai
source: /Users/nzylber1/anaconda/envs/rasa/lib/python2.7/site-packages/en_core_web_md/en_core_web_md-1.2.1
version: 1.2.1
spacy_version: >=1.7.0,<2.0.0
email: contact@explosion.ai

honnibal (Member) commented

That's interesting. You can always add more vectors by assigning to nlp.vocab[word].vector.
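A minimal sketch of that suggestion, using `Vocab.set_vector` (the spaCy 2 equivalent of assigning to the lexeme's vector attribute). The word and the all-ones 300-dimensional vector are purely illustrative, and a blank pipeline stands in for the model so no download is needed:

```python
import numpy as np
import spacy

# A blank English pipeline stands in for en_core_web_md here.
nlp = spacy.blank("en")

# Looking up "insurer" creates a lexeme, but a blank pipeline has no vectors.
assert not nlp.vocab["insurer"].has_vector

# Assign a vector for it. All-ones is a placeholder; in practice you would
# copy the values from an external embedding file or a related word.
nlp.vocab.set_vector("insurer", np.ones(300, dtype="float32"))

assert nlp.vocab["insurer"].has_vector
```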

I'm not sure whether there was a problem with the way the vector data was pruned. This was done quite some time ago, so it's possible there was a mistake. You might check whether the en_vectors_glove_md model has the same problem.

In spaCy 2 (installable via spacy-nightly) if you download en_vectors_web_lg you'll get the unpruned GloVe common crawl vectors.

ines added the labels models (Issues related to the statistical models) and lang / en (English language data and models) on Oct 4, 2017

nsecord commented Oct 20, 2017

I've done some experiments with en_core_web_sm in spaCy 1.9 and spaCy 2, and I have similar observations.

I can confirm that all of the words @znat lists are in the model's vocabulary (there is a lexeme for each word), but none of them have vectors. In en_core_web_sm-1.2.0 the vocabulary contains 742,470 entries, but a simple list comprehension, [lexeme.text for lexeme in nlp.vocab if lexeme.has_vector], yields a list of only 22,852 words. This seems about right, because the vec.bin file in the model package's vocab folder is only 12.1 MB.
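The same counting approach can be tried on any pipeline. A small self-contained sketch using a blank pipeline with two hand-set vectors (the words and the 50-dimensional random values are made up for illustration):

```python
import numpy as np
import spacy

nlp = spacy.blank("en")  # stands in for en_core_web_sm

# Give two words vectors; touch a third so it has a lexeme but no vector.
for word in ("queen", "pharmacy"):
    nlp.vocab.set_vector(word, np.random.rand(50).astype("float32"))
nlp.vocab["plumber"]

words = ["queen", "pharmacy", "plumber"]
with_vectors = [w for w in words if nlp.vocab[w].has_vector]
print(with_vectors)  # ['queen', 'pharmacy'] -- "plumber" is in the vocab but vector-less
```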

It's not catastrophic, but as @znat points out, the spaCy 1.x documentation does say that the default model has vectors for a vocabulary of 1 million words, and en_core_web_sm is the default model pointed to by shortcuts.json in the spacy-models repository.

In spaCy 2, the situation is even more confusing: the en_core_web_sm-2.0.0a7 package does not appear to contain any vectors at all, even though the documentation indicates that it should. Performing the list comprehension above returns an empty list, and the vectors file in the package's vocab folder is only 80 bytes, which is a good indication that the table is empty. The vocabulary for the 2.0 model also appears to have only 57,392 entries.

In general, the documentation is very ambiguous about what is included in each model (global vocabulary size, how many words have vectors, etc.). The release notes and documentation for both en_core_web_sm and en_core_web_lg say that their sources are OntoNotes 5 and Common Crawl, but there is no indication of how the vector data was trimmed down to make en_core_web_sm smaller.


ines commented Oct 24, 2017

@nsecord Thanks a lot for your detailed feedback – I definitely agree. I've opened up a v2.0 issue on this subject in #1457 and will merge these two issues so we can keep a better overview of what's still left to do before we can retrain the stable v2.0 models.

#1457 also includes some suggestions – e.g. reading the vector specs off the model automatically and including them in the model's meta data. In the stable v2.0, the tensorizer will also be wired up properly, giving you access to context-sensitive token vectors.

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked the thread as resolved and limited conversation to collaborators on May 8, 2018