
Missing vectors in Spacy models #1341

Closed
znat opened this issue Sep 20, 2017 · 4 comments
Labels
lang / en (English language data and models), models (Issues related to the statistical models)

Comments

znat commented Sep 20, 2017

The default English model installs vectors for one million vocabulary entries.

But vectors are missing for words I would expect to be in that million:

>>> nlp(u"insurer")[0].has_vector
False
>>> nlp(u"indian")[0].has_vector
False
>>> nlp(u"plumber")[0].has_vector
False
>>> nlp(u"pharmacy")[0].has_vector
False
>>> nlp(u"electrician")[0].has_vector
False
>>> nlp(u"queens")[0].has_vector
False
>>> nlp(u"queen")[0].has_vector
True

Model info:

lang: en
name: core_web_md
license: CC BY-SA 3.0
author: Explosion AI
url: https://explosion.ai
source: /Users/nzylber1/anaconda/envs/rasa/lib/python2.7/site-packages/en_core_web_md/en_core_web_md-1.2.1
version: 1.2.1
spacy_version: >=1.7.0,<2.0.0
email: contact@explosion.ai

honnibal (Member) commented

That's interesting. You can always add more vectors by assigning to nlp.vocab[word].vector.
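A minimal sketch of that suggestion, using `Vocab.set_vector` (the spaCy 2 equivalent of assigning to the lexeme's vector attribute). The word and the all-ones 300-dimensional vector are purely illustrative, and a blank pipeline stands in for the model so no download is needed:

```python
import numpy as np
import spacy

# A blank English pipeline stands in for en_core_web_md here.
nlp = spacy.blank("en")

# Looking up "insurer" creates a lexeme, but a blank pipeline has no vectors.
assert not nlp.vocab["insurer"].has_vector

# Assign a vector for it. All-ones is a placeholder; in practice you would
# copy the values from an external embedding file or a related word.
nlp.vocab.set_vector("insurer", np.ones(300, dtype="float32"))

assert nlp.vocab["insurer"].has_vector
```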

I'm not sure whether there was a problem with the way the vector data was pruned. This was done quite some time ago, so it's possible there was a mistake. You might check whether the en_vectors_glove_md model has the same problem.

In spaCy 2 (installable via spacy-nightly) if you download en_vectors_web_lg you'll get the unpruned GloVe common crawl vectors.

ines added the labels models (Issues related to the statistical models) and lang / en (English language data and models) on Oct 4, 2017

nsecord commented Oct 20, 2017

I've done some experiments with en_core_web_sm in spaCy 1.9 and spaCy 2, and I have similar observations.

I can confirm that all of the words @znat lists are in the model's vocabulary (there is a lexeme for each word), but none of them have vectors. In en_core_web_sm-1.2.0 the vocabulary contains 742,470 entries, but a simple list comprehension, [lexeme.text for lexeme in nlp.vocab if lexeme.has_vector], yields a list of only 22,852 words. This seems about right, because the vec.bin file in the model package's vocab folder is only 12.1 MB.
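The same counting approach can be tried on any pipeline. A small self-contained sketch using a blank pipeline with two hand-set vectors (the words and the 50-dimensional random values are made up for illustration):

```python
import numpy as np
import spacy

nlp = spacy.blank("en")  # stands in for en_core_web_sm

# Give two words vectors; touch a third so it has a lexeme but no vector.
for word in ("queen", "pharmacy"):
    nlp.vocab.set_vector(word, np.random.rand(50).astype("float32"))
nlp.vocab["plumber"]

words = ["queen", "pharmacy", "plumber"]
with_vectors = [w for w in words if nlp.vocab[w].has_vector]
print(with_vectors)  # ['queen', 'pharmacy'] -- "plumber" is in the vocab but vector-less
```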

It's not catastrophic, but as @znat points out, the spaCy 1.x documentation does say that the default model has vectors for a vocabulary of 1 million words, and en_core_web_sm is the default model pointed to by shortcuts.json in the spacy-models repository.

In spaCy 2, the situation is even more confusing: the en_core_web_sm-2.0.0a7 package does not appear to contain any vectors at all, even though the documentation indicates that it should. Performing the list comprehension above returns an empty list, and the vectors file in the package's vocab folder is only 80 bytes, which is a good indication that the table is empty. The vocabulary for the 2.0 model also appears to have only 57,392 entries.

In general, the documentation is very ambiguous about what is included in each model (global vocabulary size, how many words have vectors, etc.). The release notes and documentation for both en_core_web_sm and en_core_web_lg say that their sources are OntoNotes 5 and Common Crawl, but there is no indication of how the vector data was trimmed down to make en_core_web_sm smaller.


ines commented Oct 24, 2017

@nsecord Thanks a lot for your detailed feedback – I definitely agree. I've opened up a v2.0 issue on this subject in #1457 and will merge these two issues so we can keep a better overview of what's still left to do before we can retrain the stable v2.0 models.

#1457 also includes some suggestions – e.g. reading the vector specs off the model automatically and including them in the model's meta data. In the stable v2.0, the tensorizer will also be wired up properly, giving you access to context-sensitive token vectors.

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked the thread as resolved and limited conversation to collaborators on May 8, 2018