
POS-Tagging long sentences #135

Closed
yascho opened this issue Sep 17, 2019 · 2 comments
yascho commented Sep 17, 2019

Hi, I found a bug in the POS-tagging data code [1]. Currently, the tagger crashes if you tag a sentence longer than the batch size pos_batch_size:

nlp = stanfordnlp.Pipeline(processors='tokenize,pos', pos_batch_size=2, 
                           models_dir="/stanfordnlp_resources/", treebank='en_ewt')

nlp("Prof. Manning teaches NLP courses.")

---------------------------------------------------------------------------
AssertionError  

If you tag a sentence longer than pos_batch_size, the method chunk_batches [2] creates an empty batch, which subsequently triggers an assertion [3]. The following standalone snippet reproduces the chunking logic:

batch_size = 2
x1 = [["Prof.", "Manning", "teaches", "NLP", "courses", "."]]
data = [x1]

res = []
current = []
currentlen = 0

for x in data:
    if len(x[0]) + currentlen > batch_size:
        res.append(current)
        current = []
        currentlen = 0
    current.append(x)
    currentlen += len(x[0])

if currentlen > 0:
    res.append(current)
    
print(res)

Output:

[[], [[['Prof.', 'Manning', 'teaches', 'NLP', 'courses', '.']]]]

Note that the first batch is empty. This happens because in the first iteration currentlen is 0, so the condition len(x[0]) + currentlen > batch_size is already true (the sentence has more tokens than the batch size), and the empty current list is appended.
To fix this, the flush needs the additional condition currentlen > 0:

batch_size = 2
x1 = [["Prof.", "Manning", "teaches", "NLP", "courses", "."]] # x1[0] is token list
data = [x1]

res = []
current = []
currentlen = 0

for x in data:
    if len(x[0]) + currentlen > batch_size and currentlen > 0:
        res.append(current)
        current = []
        currentlen = 0
    current.append(x)
    currentlen += len(x[0])

if currentlen > 0:
    res.append(current)
    
print(res)

Output:

[[[['Prof.', 'Manning', 'teaches', 'NLP', 'courses', '.']]]]
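For completeness, the corrected logic can be wrapped in a small standalone helper. This is only a sketch of the fix (the actual chunk_batches in the library does additional bookkeeping); it shows that the extra guard currentlen > 0 both avoids empty batches for oversized sentences and still splits normal input at the batch boundary:

```python
def chunk(data, batch_size):
    """Group items into batches of roughly batch_size tokens.

    Each item is a list whose first element is the token list.
    An item longer than batch_size gets a batch of its own instead
    of producing an empty batch first.
    """
    res = []
    current = []
    currentlen = 0
    for x in data:
        # Only flush when the current batch is non-empty; otherwise an
        # oversized item would push an empty batch into the result.
        if len(x[0]) + currentlen > batch_size and currentlen > 0:
            res.append(current)
            current = []
            currentlen = 0
        current.append(x)
        currentlen += len(x[0])
    if currentlen > 0:
        res.append(current)
    return res

long_sent = [["Prof.", "Manning", "teaches", "NLP", "courses", "."]]
short_sent = [["Hello", "!"]]

# Oversized sentence: a single batch, no empty batch in front.
print(chunk([long_sent], batch_size=2))
# Mixed input: the short sentence starts a batch, the long one flushes it.
print(chunk([short_sent, long_sent], batch_size=4))
```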

The small batch size here is exaggerated for illustration; however, sentences are of arbitrary length in practice and can exceed even large batch sizes (e.g. in web pages).

I'd like to create a pull request to add the condition, if you agree.

[1] https://github.com/stanfordnlp/stanfordnlp/blob/master/stanfordnlp/models/pos/data.py
[2] https://github.com/stanfordnlp/stanfordnlp/blob/master/stanfordnlp/models/pos/data.py#L136
[3] https://github.com/stanfordnlp/stanfordnlp/blob/master/stanfordnlp/models/pos/data.py#L91

yascho changed the title POS-Tagging of long sentences → POS-Tagging long sentences Sep 17, 2019
yuhaozhang (Member) commented:

Hi @yascho, thanks for catching this for us! This was indeed a bug. In practice, pos_batch_size is usually set to a very large value (~1000), so the bug may rarely be triggered. However, I agree that we should fix it. Please make the PR and I'll merge it into the dev branch.


yascho commented Sep 17, 2019

I agree that pos_batch_size is usually very large, but I encountered this problem while parsing a large-scale dataset (and the assertion [1] is not very helpful for debugging), so I thought it was worth sharing with you. Thanks!

[1] https://github.com/stanfordnlp/stanfordnlp/blob/master/stanfordnlp/models/pos/data.py#L91

yascho closed this as completed Sep 17, 2019
yuhaozhang added a commit that referenced this issue Sep 17, 2019
Fixing Issue #135 (POS-Tagging long sentences)