Bart now enforces maximum sequence length in Summarization Pipeline #4224
Comments
* Rewritten batch support in pipelines. (Morgan Funtowicz <morgan@huggingface.co>)
* Fix imports sorting 🔧
* Set pad_to_max_length=True by default on Pipeline.
* Set pad_to_max_length=False for generation pipelines. Most generation models don't have a padding token.
* Address @joeddav review comment: Uniformized *args.
* Address @joeddav review comment: Uniformized *args (second).
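For context, a rough sketch of what the pad_to_max_length flag controlled at the tokenizer level in the 2.x-era API (the model name and texts here are placeholders, not from the PR itself):

```python
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
texts = ["a short document", "a much longer document ..."]

# pad_to_max_length=True pads every example up to max_length so a rectangular
# batch tensor can be built; generation pipelines left it False because most
# generation models at the time had no padding token.
batch = tokenizer.batch_encode_plus(
    texts, max_length=1024, pad_to_max_length=True, return_tensors='pt'
)
print(batch['input_ids'].shape)  # (2, 1024)
```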
#3857 might also be a culprit.
@pwschaedler This is a change in pipelines that we may or may not undo. Previously, the tokenizer truncated your long documents to their beginnings:

```python
from transformers import BartForConditionalGeneration, BartTokenizer
from typing import List

def old_summarization_pipeline(text: List[str]) -> List[str]:
    tokenizer = BartTokenizer.from_pretrained('bart-large-cnn')
    model = BartForConditionalGeneration.from_pretrained('bart-large-cnn')
    input_ids = tokenizer.batch_encode_plus(text, return_tensors='pt', max_length=1024)['input_ids']
    summary_ids = model.generate(input_ids)
    summaries = [tokenizer.decode(s) for s in summary_ids]
    return summaries

text = '=' * 10257
old_summarization_pipeline(text)
```
Great, thanks for the replacement code. The token limit (whether it's enforced or implied) might be worth mentioning in the pipeline docs.
Agreed! Would you be interested in sending a PR? The SummarizationPipeline docs live in
Issue still exists when using the summarisation pipeline.
I am curious why the token limit in the summarization pipeline stops the process for the default model and for BART, but not for T5. When running "t5-large" in the pipeline it says "Token indices sequence length is longer than the specified maximum sequence length for this model (1069 > 512)", but it still produces a summary. With the default model or "facebook/bart-large-cnn" it gives a similar message, "Token indices sequence length is longer than the specified maximum sequence length for this model (1034 > 1024)", but then fails to produce a summary (raising "index out of range in self"). Thanks!
Great Q (probably belongs on discuss.huggingface.co in the future :)). T5 uses a technique called relative position bucketing, whereas BART stores 1024 positional embeddings and then looks up each position in them. The relevant T5 code is in transformers/src/transformers/modeling_t5.py, line 242 at commit c67d1a0.
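A rough illustration of the difference, not the actual model code: the embedding sizes are made up, and the bucketing below uses uniform buckets as a crude stand-in for T5's log-spaced ones. The point is that a learned position table has a hard size limit, while relative offsets can be bucketed for any length:

```python
import torch
import torch.nn as nn

# BART-style: a learned lookup table with a hard size limit.
max_positions = 1024
pos_emb = nn.Embedding(max_positions, 16)
positions = torch.arange(1034)  # 1034 tokens, as in the error message above
# pos_emb(positions) would raise IndexError: "index out of range in self"
# because positions >= 1024 have no row in the table.

# T5-style: pairwise relative offsets are mapped into a fixed number of buckets,
# so any sequence length produces valid embedding indices.
num_buckets = 32
rel_pos = positions[None, :] - positions[:, None]       # pairwise offsets
buckets = torch.clamp(rel_pos.abs(), max=num_buckets - 1)
rel_emb = nn.Embedding(num_buckets, 16)
bias = rel_emb(buckets)                                  # works for any length
print(bias.shape)                                        # torch.Size([1034, 1034, 16])
```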
@sshleifer what's the typical recommendation for summarization on larger documents? Chunk them and generate summaries, or any other tips?

EDIT: Cross-posted here, I think this is a much better place for this. This is what I use currently but open to better recommendations:

```python
# generate chunks of text \ sentences <= 1024 tokens
def nest_sentences(document):
    nested = []
    sent = []
    length = 0
    for sentence in nltk.sent_tokenize(document):
        length += len(sentence)
        if length < 1024:
            sent.append(sentence)
        else:
            nested.append(sent)
            sent = []
            length = 0
    if sent:
        nested.append(sent)
    return nested

# generate summary on text with <= 1024 tokens
def generate_summary(nested_sentences):
    device = 'cuda'
    summaries = []
    for nested in nested_sentences:
        input_tokenized = bart_tokenizer.encode(' '.join(nested), truncation=True, return_tensors='pt')
        input_tokenized = input_tokenized.to(device)
        summary_ids = bart_model.to(device).generate(input_tokenized,
                                                     length_penalty=3.0,
                                                     min_length=30,
                                                     max_length=100)
        output = [bart_tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids]
        summaries.append(output)
    summaries = [sentence for sublist in summaries for sentence in sublist]
    return summaries
```
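A usage sketch for the two helpers above (it assumes nltk plus the `bart_tokenizer` and `bart_model` objects are already set up; the file path is a placeholder):

```python
# Hypothetical end-to-end call; replace the path with a real document.
with open("long_article.txt") as f:
    document = f.read()

chunks = nest_sentences(document)            # lists of sentences, each chunk under the length budget
chunk_summaries = generate_summary(chunks)   # one summary string per chunk
print(" ".join(chunk_summaries))
```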
Hi! nest_sentences() has a bug: whenever a chunk is ready to be saved in `nested`, the current sentence is dropped.
Yes, my bad, one sentence is skipped; it can be fixed as follows. Effects of implementing it in the late hours ;) Good catch @echatzikyriakidis, thanks!

```python
# generate chunks of text \ sentences <= 1024 tokens
def nest_sentences(document):
    nested = []
    sent = []
    length = 0
    for sentence in nltk.sent_tokenize(document):
        length += len(sentence)
        if length < 1024:
            sent.append(sentence)
        else:
            nested.append(sent)
            sent = [sentence]
            length = len(sentence)
    if sent:
        nested.append(sent)
    return nested
```
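A quick way to sanity-check the fix (a sketch, assuming nltk's punkt models can be downloaded; the test document is synthetic):

```python
import nltk
nltk.download('punkt', quiet=True)

# With the fix, every sentence ends up in exactly one chunk.
doc = " ".join(f"This is sentence number {i}." for i in range(200))
chunks = nest_sentences(doc)
assert sum(len(chunk) for chunk in chunks) == len(nltk.sent_tokenize(doc))
```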
Hi @dipanjanS! Thank you! This is exactly the way I did it as well. I think there is another catch: what if a sentence is > 512 tokens in the case of T5 models, or > 1024 in the case of BART (a rare scenario)? I think there will be no problem because of truncation=True, right? Or is it going to fail? Maybe we need to skip it or split it in half.
Great. I think in those cases 1024 is a hard-coded magic number which could be made configurable and replaced with the max length allowed by that specific model, maybe as a function parameter.
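A sketch of that parameterization (the `max_length` parameter name is made up here; note that, like the original snippet, this counts characters rather than tokens, so it is only a rough proxy for the model's token limit):

```python
import nltk
from transformers import BartTokenizer

bart_tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

def nest_sentences(document, max_length=None):
    # Default the chunk budget to the model's own limit (1024 for BART, 512 for T5).
    if max_length is None:
        max_length = bart_tokenizer.model_max_length
    nested, sent, length = [], [], 0
    for sentence in nltk.sent_tokenize(document):
        length += len(sentence)
        if length < max_length:
            sent.append(sentence)
        else:
            nested.append(sent)
            sent = [sentence]
            length = len(sentence)
    if sent:
        nested.append(sent)
    return nested
```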
Hi @dipanjanS, this is the way I have done it. But again, what if a sentence is greater than the model's max input length? What will happen then?
I think if we enforce the truncation parameter it should take care of it. This was done by default in previous releases of transformers, I think, but now we might have to set it ourselves. But do check it out once.
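A small sketch of what that looks like at the tokenizer level (assuming the facebook/bart-large-cnn checkpoint; the over-long "sentence" here is synthetic):

```python
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
very_long_sentence = "word " * 5000  # far beyond BART's 1024-token limit

ids = tokenizer.encode(very_long_sentence, truncation=True, max_length=1024)
print(len(ids))  # 1024: the input is cut off instead of failing inside the model later
```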
Hi @dipanjanS, exactly, I have tested it.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi @sshleifer, first of all thanks for creating and maintaining this repo! I'm exploring the pipelines and sadly the replacement code you shared no longer works. I saw in the discussion above that you were considering undoing this hard limit on the pipelines; perhaps the limit can be exposed in a configuration file or as a parameter? Could you please suggest how to overcome the hard limit? This is my current config:

Thanks!
Hi @ig-perez,

```python
from typing import List
from transformers import BartForConditionalGeneration, BartTokenizer

def old_summarization_pipeline(text: List[str]) -> List[str]:
    tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
    model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
    input_ids = tokenizer.batch_encode_plus(text, truncation=True, padding=True, return_tensors='pt', max_length=1024)['input_ids']
    summary_ids = model.generate(input_ids)
    summaries = [tokenizer.decode(s, skip_special_tokens=True, clean_up_tokenization_spaces=False) for s in summary_ids]
    return summaries

print(old_summarization_pipeline([ARTICLE_TO_SUMMARIZE, ARTICLE_TO_SUMMARIZE_2, ARTICLE_TO_SUMMARIZE2*400]))
```

I tried it with:
Unfortunately, this problem also manifests when deploying BART on SageMaker. For now, we're using an encode-truncate-decode workaround like the one below, but there clearly has to be a better way:
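The poster's snippet isn't preserved above; a sketch of the general idea described (tokenize, keep at most the model's limit, decode back to plain text before sending it to the endpoint) might look like this:

```python
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

def truncate_for_model(text: str, max_tokens: int = 1024) -> str:
    # Encode, keep at most max_tokens tokens, and decode back to a plain string
    # so the payload sent to the deployed endpoint fits within the model's limit.
    ids = tokenizer.encode(text, truncation=True, max_length=max_tokens)
    return tokenizer.decode(ids, skip_special_tokens=True)
```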
@dipanjanS can you share the full code? A lot of parts are missing, e.g. the nltk import.
@dipanjanS Thanks for sharing your take on how to chunk large texts for summarization. Following up on @FurkanGozukara's request: could you possibly provide the parts that are missing?
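For later readers, a minimal sketch of the pieces the chunking snippet leaves implicit; the names `bart_tokenizer` and `bart_model` follow that snippet, and the checkpoint choice is an assumption:

```python
import nltk
from transformers import BartForConditionalGeneration, BartTokenizer

nltk.download('punkt')  # sentence-splitting models used by nltk.sent_tokenize

bart_tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
bart_model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
# Note: generate_summary() above moves everything to 'cuda', so it assumes a GPU is available.
```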
🐛 Bug
Information
Model I am using (Bert, XLNet ...): Bart (bart-large-cnn)
Language I am using the model on (English, Chinese ...): English
The problem arises when using: code based on the example in the docs.
The task I am working on is: summarization.
To reproduce
Steps to reproduce the behavior:
Example code:
Output:
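The original snippet and its output aren't shown above; based on the discussion in this thread, the reproduction is roughly along these lines (a sketch, not the reporter's exact code):

```python
from transformers import pipeline

summarizer = pipeline("summarization")  # the default summarization model at the time was a BART CNN/DailyMail checkpoint
long_document = "=" * 10257             # anything that tokenizes to more than 1024 tokens
print(summarizer(long_document))        # fails with IndexError: "index out of range in self"
```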
Expected behavior
As of last week (week of 4/26/2020) this caused no issue. Today (5/7/2020) I tried to run the exact same code, a new model was downloaded (no change in transformers module, just the model itself), and now it enforces a token limit.
Expected behavior is to summarize document regardless of size.
Environment info
transformers version: 2.8.0 (also occurs in 2.9.0)
Model file: https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-cnn/pytorch_model.bin (etag "6eeacfe81d9304a6c5015424912f8df8")
EDIT: Tagging @sshleifer as recommended by docs