All tokens are out of vocabulary in 2.0 #1204

Closed
bittlingmayer opened this issue Jul 20, 2017 · 6 comments
Labels
models Issues related to the statistical models 🌙 nightly Discussion and contributions related to nightly builds

Comments

@bittlingmayer
Contributor

The vector is fine, but the is_oov flag is wrong: every token is reported as out of vocabulary.

>>> import spacy
>>> en = spacy.load('en_core_web_sm')
>>> x = en('This is a tessssst')
>>> [w.is_oov for w in x]
[True, True, True, True]
>>> en("welder")[0].is_oov
True
  • Python version: 3.5.2
  • Platform: Darwin-16.6.0-x86_64-i386-64bit
  • spaCy version: 2.0.0a0
@bittlingmayer
Contributor Author

Related:

#315 - Some clients use is_oov as a spellchecker.

#1191 - 1.0 is the opposite, is_oov is False but vector is all 0s.

@honnibal honnibal added the bug Bugs and behaviour differing from documentation label Jul 21, 2017
@honnibal
Member

Thanks. I didn't update the vocab data before exporting the model. I think .prob is broken too.

You can fix these by writing to the attributes:

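import spacy

nlp = spacy.load('en_core_web_sm')  # the model from the report above
lex = nlp.vocab[u'dog']
# Overwrite the stale lexeme attributes by hand
lex.prob = -3.4
lex.is_oov = False
doc = nlp(u'dog')
print(doc[0].prob, doc[0].is_oov)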

@honnibal honnibal added the 🌙 nightly Discussion and contributions related to nightly builds label Jul 21, 2017
@bittlingmayer
Contributor Author

If I understand the suggestion correctly, it won't work in this case: I can't write to the attributes, because their correct values are precisely what I need.

(For now I am just using Gensim with fastText-pretrained models for this functionality.)

@honnibal
Member

@bittlingmayer Sorry I missed this reply. I meant that you might get the values from the v1 model, and import them. Once you've saved out the model, the correct values will be there. The stable models will have the correct vocab data.

This is an issue with the model files rather than a bug in the codebase, so I'll relabel this.
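
For anyone wanting to try the import route described above, here is a minimal sketch, not an official spaCy recipe. It assumes you have already dumped a word-to-log-probability mapping from the v1 model by some means; the `exported` dict, the `import_lexeme_values` helper, and the `FakeVocab` stand-in are all hypothetical names made up for illustration. The only behaviour borrowed from the thread is that `vocab[word]` returns a lexeme with writable `.prob` and `.is_oov` attributes.

```python
from types import SimpleNamespace


def import_lexeme_values(vocab, exported):
    """Write exported log-probabilities onto lexemes and mark them in-vocabulary.

    vocab    -- anything indexable by string that returns an object with
                writable .prob and .is_oov (nlp.vocab behaves this way in the
                snippet earlier in the thread).
    exported -- hypothetical dict of word -> log-probability taken from v1.
    Returns the number of lexemes updated.
    """
    for word, prob in exported.items():
        lex = vocab[word]
        lex.prob = prob
        lex.is_oov = False
    return len(exported)


class FakeVocab(dict):
    """Stand-in for nlp.vocab so the sketch runs without a model installed.

    Unknown words get a default lexeme that is flagged as out of vocabulary,
    mimicking the broken state reported in this issue.
    """

    def __missing__(self, key):
        lex = SimpleNamespace(prob=-20.0, is_oov=True)
        self[key] = lex
        return lex


if __name__ == "__main__":
    vocab = FakeVocab()
    updated = import_lexeme_values(vocab, {"dog": -3.4, "welder": -12.1})
    print(updated, vocab["dog"].is_oov, vocab["tessssst"].is_oov)
```

With a real pipeline you would pass `nlp.vocab` instead of `FakeVocab()` and then save the model (for example with `nlp.to_disk`), so the patched values persist. Words absent from the exported mapping keep whatever `is_oov` value the model shipped with.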

@honnibal honnibal added models Issues related to the statistical models performance and removed bug Bugs and behaviour differing from documentation labels Oct 20, 2017
@ines
Member

ines commented Oct 27, 2017

Merging this with #1457, as it'll be part of the same fix!

@ines ines closed this as completed Oct 27, 2017
@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018