Hi, I found a bug in the POS-Tagging data code [1]. Currently, the tagger crashes if you tag sentences that are longer than the batch size pos_batch_size: the method chunk_batches [2] then creates empty batches that subsequently trigger an assertion [3]. The first batch comes out empty because, in the first iteration, currentlen is 0 and the expression len(x[0]) + currentlen > batch_size is already true (the sentence has more tokens than the batch size), so the still-empty current list is appended to the result. To fix this, you need the additional condition currentlen > 0:
batch_size = 2
x1 = [["Prof.", "Manning", "teaches", "NLP", "courses", "."]]  # x1[0] is the token list (6 tokens > batch_size)
data = [x1]

res = []
current = []
currentlen = 0
for x in data:
    # only flush the running batch if it already holds something (the added condition)
    if len(x[0]) + currentlen > batch_size and currentlen > 0:
        res.append(current)
        current = []
        currentlen = 0
    current.append(x)
    currentlen += len(x[0])
if currentlen > 0:
    res.append(current)
print(res)
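Output:

[[[['Prof.', 'Manning', 'teaches', 'NLP', 'courses', '.']]]]

For comparison, here is a minimal sketch of the same loop with the original condition, i.e. without "and currentlen > 0" (an illustrative reproduction of the chunking logic, not the library code verbatim). It produces the empty first batch that later trips the assertion:

batch_size = 2
x1 = [["Prof.", "Manning", "teaches", "NLP", "courses", "."]]
data = [x1]

res = []
current = []
currentlen = 0
for x in data:
    # original condition: fires on the very first iteration because the
    # sentence alone exceeds batch_size, even though current is still empty
    if len(x[0]) + currentlen > batch_size:
        res.append(current)  # appends an empty batch
        current = []
        currentlen = 0
    current.append(x)
    currentlen += len(x[0])
if currentlen > 0:
    res.append(current)
print(res)  # [[], [[['Prof.', 'Manning', 'teaches', 'NLP', 'courses', '.']]]]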
The small batch size here might be exaggerated; however, sentences are of arbitrary length in practice and can exceed even large batch sizes (e.g. on webpages). I'd like to create a pull request to add the condition, if you agree.
[1] https://github.com/stanfordnlp/stanfordnlp/blob/master/stanfordnlp/models/pos/data.py
[2] https://github.com/stanfordnlp/stanfordnlp/blob/master/stanfordnlp/models/pos/data.py#L136
[3] https://github.com/stanfordnlp/stanfordnlp/blob/master/stanfordnlp/models/pos/data.py#L91
Hi @yascho, thanks for catching this for us! This was indeed a bug. In practical cases, pos_batch_size is usually set to a very large value (~1000), so it is rarely triggered. However, I agree that we should fix this. Please make the PR and I'll merge it into the dev branch.

I agree that pos_batch_size is usually very large, but I ran into this problem while parsing a large-scale dataset (and the assertion [3] is not very helpful for debugging), so I thought it was worth sharing with you. Thanks!
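For readers who hit this before a fix lands, a rough sketch of the workaround implied above is to set pos_batch_size larger than the longest sentence (in tokens) you expect to tag. This assumes the documented pos_batch_size pipeline option and a standard English model download; the exact option names may differ across stanfordnlp versions, so check the configuration docs:

import stanfordnlp

# one-time model download (assumption: default English models)
# stanfordnlp.download('en')

# raise pos_batch_size above the longest expected sentence; ~1000 is the
# ballpark value mentioned in the discussion above
nlp = stanfordnlp.Pipeline(processors='tokenize,pos', lang='en', pos_batch_size=1000)
doc = nlp("Prof. Manning teaches NLP courses.")
for word in doc.sentences[0].words:
    print(word.text, word.upos)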