All tokens are out of vocabulary in 2.0 #1204

bittlingmayer · 2017-07-20T13:58:53Z

The vector is fine, but the is_oov flag is wrong.

>>> import spacy
>>> en = spacy.load('en_core_web_sm')
>>> x = en('This is a tessssst')
>>> [w.is_oov for w in x]
[True, True, True, True]
>>> en("welder")[0].is_oov
True

Python version: 3.5.2
Platform: Darwin-16.6.0-x86_64-i386-64bit
spaCy version: 2.0.0a0


bittlingmayer · 2017-07-20T14:00:26Z

#1191: 1.0 is the opposite. There, is_oov is False but the vector is all 0s.

honnibal · 2017-07-21T23:04:09Z

Thanks. I didn't update the vocab data before exporting the model. .prob is broken too I think.

You can fix these by writing to the attributes:

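import spacy
nlp = spacy.load('en_core_web_sm')
# Look up the lexeme and overwrite the broken attributes by hand
lex = nlp.vocab[u'dog']
lex.prob = -3.4
lex.is_oov = False
# Tokens read these attributes from the lexeme, so the doc sees the new values
doc = nlp(u'dog')
print(doc[0].prob, doc[0].is_oov)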

bittlingmayer · 2017-08-14T12:09:17Z

If I understand the suggestion correctly, it won't work in this case: I can't write to the attributes, because their correct values are precisely what I need.

(For now I am just using Gensim with fastText-pretrained models for this functionality.)

honnibal · 2017-10-20T12:23:32Z

@bittlingmayer Sorry I missed this reply. I meant that you might get the values from the v1 model, and import them. Once you've saved out the model, the correct values will be there. The stable models will have the correct vocab data.
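As a rough sketch of that round trip (not code from spaCy itself): dump prob and is_oov for each lexeme in the v1 environment, then write them back onto the 2.0 vocab. The helper names `export_lexeme_attrs` and `import_lexeme_attrs` and the JSON file layout are made up for illustration. The sketch assumes the vocab can be iterated to yield lexemes with `orth_`, `prob` and `is_oov`, and that `prob` and `is_oov` are writable as in the snippet earlier in this thread; later releases may not allow that.

```python
import json


def export_lexeme_attrs(vocab, path):
    """Dump prob and is_oov for every lexeme in `vocab` to a JSON file.

    Hypothetical helper: run this where the values are correct (e.g. v1).
    """
    table = {
        lex.orth_: {"prob": lex.prob, "is_oov": lex.is_oov}
        for lex in vocab
    }
    with open(path, "w", encoding="utf8") as f:
        json.dump(table, f)


def import_lexeme_attrs(vocab, path):
    """Write the saved values onto the lexemes of `vocab`; return the count.

    Hypothetical helper: assumes lex.prob and lex.is_oov can be assigned.
    """
    with open(path, encoding="utf8") as f:
        table = json.load(f)
    for word, attrs in table.items():
        lex = vocab[word]  # same lookup as nlp.vocab[u'dog']
        lex.prob = attrs["prob"]
        lex.is_oov = attrs["is_oov"]
    return len(table)
```

Usage would be something like `export_lexeme_attrs(nlp.vocab, 'lex_attrs.json')` under v1, then `import_lexeme_attrs(nlp.vocab, 'lex_attrs.json')` under 2.0, followed by saving the model (for example with `nlp.to_disk`) so the values persist.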

This is an issue with the model files rather than a bug in the codebase, so I'll relabel this.

ines · 2017-10-27T19:49:51Z

Merging this with #1457, as it'll be part of the same fix!

lock · 2018-05-08T12:27:35Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

honnibal added the bug Bugs and behaviour differing from documentation label Jul 21, 2017

honnibal added the 🌙 nightly Discussion and contributions related to nightly builds label Jul 21, 2017

honnibal added models Issues related to the statistical models performance and removed bug Bugs and behaviour differing from documentation labels Oct 20, 2017

ines mentioned this issue Oct 27, 2017

💫 Finalise vector support and add vector specs to model meta #1457

Closed

ines closed this as completed Oct 27, 2017

lock bot locked as resolved and limited conversation to collaborators May 8, 2018
