Bart now enforces maximum sequence length in Summarization Pipeline #4224
Comments
* Rewritten batch support in pipelines. (Morgan Funtowicz <morgan@huggingface.co>)
* Fix imports sorting 🔧
* Set pad_to_max_length=True by default on Pipeline.
* Set pad_to_max_length=False for generation pipelines. Most generation models don't have a padding token.
* Address @joeddav review comment: Uniformized *args.
* Address @joeddav review comment: Uniformized *args (second).
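For context, a rough sketch of what the pad_to_max_length flag controlled at the tokenizer level in the 2.x-era API (the model name and texts here are placeholders, not from the PR itself):

```python
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
texts = ["a short document", "a much longer document ..."]

# pad_to_max_length=True pads every example up to max_length so a rectangular
# batch tensor can be built; generation pipelines left it False because most
# generation models at the time had no padding token.
batch = tokenizer.batch_encode_plus(
    texts, max_length=1024, pad_to_max_length=True, return_tensors='pt'
)
print(batch['input_ids'].shape)  # (2, 1024)
```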
#3857 might also be a culprit.
@pwschaedler This is a change in pipelines that we may or may not undo. Previously, the tokenizer truncated your long documents to their beginnings:

```python
from transformers import BartForConditionalGeneration, BartTokenizer
from typing import List

def old_summarization_pipeline(text: List[str]) -> List[str]:
    tokenizer = BartTokenizer.from_pretrained('bart-large-cnn')
    model = BartForConditionalGeneration.from_pretrained('bart-large-cnn')
    input_ids = tokenizer.batch_encode_plus(text, return_tensors='pt', max_length=1024)['input_ids']
    summary_ids = model.generate(input_ids)
    summaries = [tokenizer.decode(s) for s in summary_ids]
    return summaries

text = '=' * 10257
old_summarization_pipeline(text)
```
Great, thanks for the replacement code. The token limit (whether it's enforced or implied) might be worth mentioning in the pipeline docs.
Agreed! Would you be interested in sending a PR? The SummarizationPipeline docs live in
Issue still exists when using the summarisation pipeline.
I am curious why the token limit in the summarization pipeline stops the process for the default model and for BART, but not for T5. When running "t5-large" in the pipeline it says "Token indices sequence length is longer than the specified maximum sequence length for this model (1069 > 512)", but it still produces a summary. With the default model or "facebook/bart-large-cnn" it gives a similar message, "Token indices sequence length is longer than the specified maximum sequence length for this model (1034 > 1024)", but then fails to produce a summary (raising "index out of range in self"). Thanks!
Great Q (probably belongs on discuss.huggingface.co in the future :)). T5 uses a technique called relative position bucketing, whereas BART stores 1024 positional embeddings and then looks up each position in them. The relevant T5 code is in transformers/src/transformers/modeling_t5.py, line 242 at commit c67d1a0.
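A rough illustration of the difference, not the actual model code: the embedding sizes are made up, and the bucketing below uses uniform buckets as a crude stand-in for T5's log-spaced ones. The point is that a learned position table has a hard size limit, while relative offsets can be bucketed for any length:

```python
import torch
import torch.nn as nn

# BART-style: a learned lookup table with a hard size limit.
max_positions = 1024
pos_emb = nn.Embedding(max_positions, 16)
positions = torch.arange(1034)  # 1034 tokens, as in the error message above
# pos_emb(positions) would raise IndexError: "index out of range in self"
# because positions >= 1024 have no row in the table.

# T5-style: pairwise relative offsets are mapped into a fixed number of buckets,
# so any sequence length produces valid embedding indices.
num_buckets = 32
rel_pos = positions[None, :] - positions[:, None]       # pairwise offsets
buckets = torch.clamp(rel_pos.abs(), max=num_buckets - 1)
rel_emb = nn.Embedding(num_buckets, 16)
bias = rel_emb(buckets)                                  # works for any length
print(bias.shape)                                        # torch.Size([1034, 1034, 16])
```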
@sshleifer what's the typical recommendation for summarization on larger documents? Chunk them and generate summaries, or any other tips?

EDIT: Cross-posted here, I think this is a much better place for this. This is what I use currently but open to better recommendations:

```python
# generate chunks of text \ sentences <= 1024 tokens
def nest_sentences(document):
    nested = []
    sent = []
    length = 0
    for sentence in nltk.sent_tokenize(document):
        length += len(sentence)
        if length < 1024:
            sent.append(sentence)
        else:
            nested.append(sent)
            sent = []
            length = 0
    if sent:
        nested.append(sent)
    return nested

# generate summary on text with <= 1024 tokens
def generate_summary(nested_sentences):
    device = 'cuda'
    summaries = []
    for nested in nested_sentences:
        input_tokenized = bart_tokenizer.encode(' '.join(nested), truncation=True, return_tensors='pt')
        input_tokenized = input_tokenized.to(device)
        summary_ids = bart_model.to(device).generate(input_tokenized,
                                                     length_penalty=3.0,
                                                     min_length=30,
                                                     max_length=100)
        output = [bart_tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids]
        summaries.append(output)
    summaries = [sentence for sublist in summaries for sentence in sublist]
    return summaries
```
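A usage sketch for the two helpers above (it assumes nltk plus the `bart_tokenizer` and `bart_model` objects are already set up; the file path is a placeholder):

```python
# Hypothetical end-to-end call; replace the path with a real document.
with open("long_article.txt") as f:
    document = f.read()

chunks = nest_sentences(document)            # lists of sentences, each chunk under the length budget
chunk_summaries = generate_summary(chunks)   # one summary string per chunk
print(" ".join(chunk_summaries))
```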
Hi! nest_sentences() has a bug: whenever a chunk is ready to be saved in `nested`, the current sentence is dropped.
Yes, my bad, one sentence is skipped; it can be fixed as follows. Effects of implementing it in the late hours ;) Good catch @echatzikyriakidis, thanks!

```python
# generate chunks of text \ sentences <= 1024 tokens
def nest_sentences(document):
    nested = []
    sent = []
    length = 0
    for sentence in nltk.sent_tokenize(document):
        length += len(sentence)
        if length < 1024:
            sent.append(sentence)
        else:
            nested.append(sent)
            sent = [sentence]
            length = len(sentence)
    if sent:
        nested.append(sent)
    return nested
```
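A quick way to sanity-check the fix (a sketch, assuming nltk's punkt models can be downloaded; the test document is synthetic):

```python
import nltk
nltk.download('punkt', quiet=True)

# With the fix, every sentence ends up in exactly one chunk.
doc = " ".join(f"This is sentence number {i}." for i in range(200))
chunks = nest_sentences(doc)
assert sum(len(chunk) for chunk in chunks) == len(nltk.sent_tokenize(doc))
```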
Hi @dipanjanS! Thank you! This is exactly the way I did it as well. I think there is another catch: what if a sentence is > 512 tokens in the case of T5 models, or > 1024 in the case of BART (a rare scenario)? I think there will be no problem because of truncation=True, right? Or is it going to fail? Maybe we need to skip it or split it in half.
Great. I think in those cases 1024 is a hard-coded magic number which could be made configurable and replaced with the max length allowed by that specific model, maybe as a function parameter.
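A sketch of that parameterization (the `max_length` parameter name is made up here; note that, like the original snippet, this counts characters rather than tokens, so it is only a rough proxy for the model's token limit):

```python
import nltk
from transformers import BartTokenizer

bart_tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

def nest_sentences(document, max_length=None):
    # Default the chunk budget to the model's own limit (1024 for BART, 512 for T5).
    if max_length is None:
        max_length = bart_tokenizer.model_max_length
    nested, sent, length = [], [], 0
    for sentence in nltk.sent_tokenize(document):
        length += len(sentence)
        if length < max_length:
            sent.append(sentence)
        else:
            nested.append(sent)
            sent = [sentence]
            length = len(sentence)
    if sent:
        nested.append(sent)
    return nested
```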
Hi @dipanjanS, this is the way I have done it. But again, what if a sentence is greater than the model's max input length? What will happen then?
I think if we enforce the truncation parameter it should take care of it. This was done by default in previous releases of transformers, I think, but now we might have to set it ourselves. But do check it out once.
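A small sketch of what that looks like at the tokenizer level (assuming the facebook/bart-large-cnn checkpoint; the over-long "sentence" here is synthetic):

```python
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
very_long_sentence = "word " * 5000  # far beyond BART's 1024-token limit

ids = tokenizer.encode(very_long_sentence, truncation=True, max_length=1024)
print(len(ids))  # 1024: the input is cut off instead of failing inside the model later
```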
Hi @dipanjanS, exactly, I have tested it.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi @sshleifer, first of all thanks for creating and maintaining this repo! I'm exploring the pipelines and sadly the replacement code you shared no longer works. I saw in the discussion above that you were considering undoing this hard limit on the pipelines; perhaps the limit can be exposed in a configuration file or as a parameter? Could you please suggest how to overcome the hard limit? This is my current config:

Thanks!
Hi @ig-perez,

```python
from typing import List
from transformers import BartForConditionalGeneration, BartTokenizer

def old_summarization_pipeline(text: List[str]) -> List[str]:
    tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
    model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
    input_ids = tokenizer.batch_encode_plus(text, truncation=True, padding=True, return_tensors='pt', max_length=1024)['input_ids']
    summary_ids = model.generate(input_ids)
    summaries = [tokenizer.decode(s, skip_special_tokens=True, clean_up_tokenization_spaces=False) for s in summary_ids]
    return summaries

print(old_summarization_pipeline([ARTICLE_TO_SUMMARIZE, ARTICLE_TO_SUMMARIZE_2, ARTICLE_TO_SUMMARIZE2*400]))
```

I tried it with:
Unfortunately, this problem also manifests when deploying BART on SageMaker. For now, we're using an encode-truncate-decode workaround like the one below, but there clearly has to be a better way:
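The poster's snippet isn't preserved above; a sketch of the general idea described (tokenize, keep at most the model's limit, decode back to plain text before sending it to the endpoint) might look like this:

```python
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

def truncate_for_model(text: str, max_tokens: int = 1024) -> str:
    # Encode, keep at most max_tokens tokens, and decode back to a plain string
    # so the payload sent to the deployed endpoint fits within the model's limit.
    ids = tokenizer.encode(text, truncation=True, max_length=max_tokens)
    return tokenizer.decode(ids, skip_special_tokens=True)
```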
@dipanjanS can you share the full code? A lot of parts are missing, e.g. the nltk import.
@dipanjanS Thanks for sharing your take on how to chunk large texts for summarization. Following up on @FurkanGozukara's request: could you possibly provide the parts that are missing?
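For later readers, a minimal sketch of the pieces the chunking snippet leaves implicit; the names `bart_tokenizer` and `bart_model` follow that snippet, and the checkpoint choice is an assumption:

```python
import nltk
from transformers import BartForConditionalGeneration, BartTokenizer

nltk.download('punkt')  # sentence-splitting models used by nltk.sent_tokenize

bart_tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
bart_model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
# Note: generate_summary() above moves everything to 'cuda', so it assumes a GPU is available.
```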
🐛 Bug
Information
Model I am using (Bert, XLNet ...): Bart (bart-large-cnn)
Language I am using the model on (English, Chinese ...): English
The problem arises when using: code based on the example in the docs.
The task I am working on is: summarization.
To reproduce
Steps to reproduce the behavior:
Example code:
Output:
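The original snippet and its output aren't shown above; based on the discussion in this thread, the reproduction is roughly along these lines (a sketch, not the reporter's exact code):

```python
from transformers import pipeline

summarizer = pipeline("summarization")  # the default summarization model at the time was a BART CNN/DailyMail checkpoint
long_document = "=" * 10257             # anything that tokenizes to more than 1024 tokens
print(summarizer(long_document))        # fails with IndexError: "index out of range in self"
```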
Expected behavior
As of last week (week of 4/26/2020) this caused no issue. Today (5/7/2020) I tried to run the exact same code, a new model was downloaded (no change in transformers module, just the model itself), and now it enforces a token limit.
Expected behavior is to summarize document regardless of size.
Environment info
transformers version: 2.8.0 (also occurs in 2.9.0)
Model file: https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-cnn/pytorch_model.bin (etag "6eeacfe81d9304a6c5015424912f8df8")
EDIT: Tagging @sshleifer as recommended by docs