
POS-Tagging long sentences #135

Closed
yascho opened this issue Sep 17, 2019 · 2 comments
yascho commented Sep 17, 2019

Hi, I found a bug in the POS-tagging data code [1]. Currently, the tagger crashes if you tag a sentence longer than the batch size pos_batch_size:

nlp = stanfordnlp.Pipeline(processors='tokenize,pos', pos_batch_size=2, 
                           models_dir="/stanfordnlp_resources/", treebank='en_ewt')

nlp("Prof. Manning teaches NLP courses.")

---------------------------------------------------------------------------
AssertionError  

If you tag a sentence longer than pos_batch_size, the method chunk_batches [2] creates an empty batch, which subsequently triggers an assertion [3]. The following standalone snippet reproduces the chunking logic:

batch_size = 2
x1 = [["Prof.", "Manning", "teaches", "NLP", "courses", "."]]
data = [x1]

res = []
current = []
currentlen = 0

for x in data:
    if len(x[0]) + currentlen > batch_size:
        res.append(current)
        current = []
        currentlen = 0
    current.append(x)
    currentlen += len(x[0])

if currentlen > 0:
    res.append(current)
    
print(res)

Output:

[[], [[['Prof.', 'Manning', 'teaches', 'NLP', 'courses', '.']]]]

Note that the first batch is empty. This happens because in the first iteration currentlen is 0, so the condition len(x[0]) + currentlen > batch_size is already true (the sentence has more tokens than the batch size), and the empty current list is appended.
To fix this, the flush needs the additional condition currentlen > 0:

batch_size = 2
x1 = [["Prof.", "Manning", "teaches", "NLP", "courses", "."]] # x1[0] is token list
data = [x1]

res = []
current = []
currentlen = 0

for x in data:
    if len(x[0]) + currentlen > batch_size and currentlen > 0:
        res.append(current)
        current = []
        currentlen = 0
    current.append(x)
    currentlen += len(x[0])

if currentlen > 0:
    res.append(current)
    
print(res)

Output:

[[[['Prof.', 'Manning', 'teaches', 'NLP', 'courses', '.']]]]
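For completeness, the corrected logic can be wrapped in a small standalone helper. This is only a sketch of the fix (the actual chunk_batches in the library does additional bookkeeping); it shows that the extra guard currentlen > 0 both avoids empty batches for oversized sentences and still splits normal input at the batch boundary:

```python
def chunk(data, batch_size):
    """Group items into batches of roughly batch_size tokens.

    Each item is a list whose first element is the token list.
    An item longer than batch_size gets a batch of its own instead
    of producing an empty batch first.
    """
    res = []
    current = []
    currentlen = 0
    for x in data:
        # Only flush when the current batch is non-empty; otherwise an
        # oversized item would push an empty batch into the result.
        if len(x[0]) + currentlen > batch_size and currentlen > 0:
            res.append(current)
            current = []
            currentlen = 0
        current.append(x)
        currentlen += len(x[0])
    if currentlen > 0:
        res.append(current)
    return res

long_sent = [["Prof.", "Manning", "teaches", "NLP", "courses", "."]]
short_sent = [["Hello", "!"]]

# Oversized sentence: a single batch, no empty batch in front.
print(chunk([long_sent], batch_size=2))
# Mixed input: the short sentence starts a batch, the long one flushes it.
print(chunk([short_sent, long_sent], batch_size=4))
```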

The small batch size here is exaggerated for illustration; however, sentences are of arbitrary length in practice and can exceed even large batch sizes (e.g. in web pages).

I'd like to create a pull request to add the condition, if you agree.

[1] https://github.com/stanfordnlp/stanfordnlp/blob/master/stanfordnlp/models/pos/data.py
[2] https://github.com/stanfordnlp/stanfordnlp/blob/master/stanfordnlp/models/pos/data.py#L136
[3] https://github.com/stanfordnlp/stanfordnlp/blob/master/stanfordnlp/models/pos/data.py#L91

yascho changed the title POS-Tagging of long sentences → POS-Tagging long sentences Sep 17, 2019
yuhaozhang (Member) commented:

Hi @yascho, thanks for catching this for us! This was indeed a bug. In practice, pos_batch_size is usually set to a very large value (~1000), so the bug may rarely be triggered. However, I agree that we should fix it. Please make the PR and I'll merge it into the dev branch.


yascho commented Sep 17, 2019

I agree that pos_batch_size is usually very large, but I encountered this problem while parsing a large-scale dataset (and the assertion [1] is not very helpful for debugging), so I thought it was worth sharing with you. Thanks!

[1] https://github.com/stanfordnlp/stanfordnlp/blob/master/stanfordnlp/models/pos/data.py#L91

yascho closed this as completed Sep 17, 2019
yuhaozhang added a commit that referenced this issue Sep 17, 2019
Fixing Issue #135 (POS-Tagging long sentences)