
Range Error for BERT Masked Language Modeling on IMDB #16846

Closed
Jadiker opened this issue Apr 20, 2022 · 7 comments

Jadiker commented Apr 20, 2022

System Info

- `transformers` version: 4.18.0
- Platform: Linux-5.4.144+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.13
- Huggingface_hub version: 0.5.1
- PyTorch version (GPU?): 1.10.0+cu111 (False)
- Tensorflow version (GPU?): 2.8.0 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No

Who can help?

@LysandreJik

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

https://colab.research.google.com/drive/1ZpYRkJVMF5r3MukUheEFtgDvqax4YCxM?usp=sharing

Expected behavior

Evaluation to complete and give me a perplexity score, as it does [here](https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/chapter7/section3_tf.ipynb)
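
(For reference, the perplexity in the linked notebook is derived from the evaluation cross-entropy loss; a minimal sketch of that computation, with a placeholder loss value, is:)

```python
import math

eval_loss = 3.21  # placeholder: mean cross-entropy loss returned by evaluation
perplexity = math.exp(eval_loss)
print(f"Perplexity: {perplexity:.2f}")
```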
@Jadiker Jadiker added the bug label Apr 20, 2022
@gante gante assigned gante and unassigned gante Apr 21, 2022
gante (Member) commented Apr 21, 2022

Hi @Jadiker 👋 In your notebook, after the second `tokenize_and_chunk` cell (there are two), we can see a warning that explains the error: `Token indices sequence length is longer than the specified maximum sequence length for this model (521 > 512). Running this sequence through the model will result in indexing errors.`

If you add `truncation=True` to the tokenizer call in that cell, you should be able to solve the problem. Let me know if it works :)
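
For illustration, a minimal sketch of that change (the checkpoint, function name, and `text` column here are assumptions, not necessarily what the notebook uses):

```python
from transformers import AutoTokenizer

# bert-base-cased is assumed here; the notebook may use a different checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    # truncation=True caps every sequence at the model's maximum length
    # (512 for BERT), so no example exceeds the position-embedding range.
    return tokenizer(examples["text"], truncation=True)

# tokenized_dataset = raw_dataset.map(tokenize_function, batched=True)
```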

Jadiker (Author) commented Apr 26, 2022

Nope, same error: `indices[15,32] = -9223372036854775808 is not in [0, 28996)`. (I've edited the notebook with the change.)

gante (Member) commented Apr 26, 2022

@Jadiker Thank you for the update :) The problem seems to arise from your custom tokenization function, which is likely not returning the correct data format. See this notebook, which successfully runs your code if we skip `tokenize_and_chunk`. Inside `tokenize_and_chunk`, you call `chunks.append(all_input_ids[idx: idx + context_length])`, which would explain the indexing errors.
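
For context, a sketch of the chunking pattern being described; this is a reconstruction based on the quoted line, not the exact notebook code:

```python
context_length = 128  # assumed chunk size; the notebook may use a different value

def chunk_input_ids(all_input_ids):
    """Split a flat list of token ids into fixed-size chunks."""
    chunks = []
    for idx in range(0, len(all_input_ids), context_length):
        chunks.append(all_input_ids[idx: idx + context_length])
    # The final chunk is usually shorter than context_length, so the resulting
    # column is ragged unless that chunk is dropped or padded downstream.
    return chunks
```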

We also reserve these GitHub issues for bugs in the repository and/or feature requests. For any other requests, like issues in your custom code, we'd like to invite you to use our forum 🤗 I'm closing this issue, but feel free to reopen with queries that fit the criteria I described.

@gante gante closed this as completed Apr 26, 2022
Jadiker (Author) commented Apr 27, 2022

@gante Thanks for your time and for the information! I really appreciate it.

Two comments:

  1. All the code (including the `tokenize_and_chunk` function) in the notebook is directly from Hugging Face. It comes from this notebook, which is linked in this tutorial on data processing. The only thing I have done is add code after the data processing in order to actually train a model on the processed data. (And the code for training the model comes from this Hugging Face tutorial.)

Given that, should I still have posted on the forum first? If the tutorials for data processing and model training can't be combined, how is one supposed to train a model on the processed data? It seemed like something that should be fixed in the code, rather than just discussed on the forum.

  2. I don't believe the notebook you linked to is shared with me.

Thanks again for engaging with this!

gante (Member) commented Apr 27, 2022

> I don't believe the notebook you linked to is shared with me.

Oops, forgot to change the permissions. Should be okay now

Jadiker (Author) commented Apr 27, 2022

After looking at the notebook you linked, it seems like the issue is that the tutorial notebook gives two different options for tokenizing text; by using both of them, rather than just the first one, I introduced a bug into the code.

Does that sound accurate?

gante (Member) commented Apr 27, 2022

@Jadiker Yeah, the problem seems to be at the dataset preparation stage. To be candid, I also can't find the issue from a quick glance -- I've double-checked the `input_ids`, and they are all within the vocabulary size, so `gather` shouldn't complain 🤔 Can you have a look at the script example here, which was working as of a few weeks ago (and should still be working), and see if you can find the issue? My number 1 suspect is the lack of a `labels` column, but the thrown error does not point at that.
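
For reference, a minimal sketch of how one might check both suspects (the checkpoint, dataset, and column names are assumptions for illustration):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Suspect 1: any id outside [0, vocab_size) would break the embedding lookup.
def check_ids(example):
    assert all(0 <= i < tokenizer.vocab_size for i in example["input_ids"]), example["input_ids"]
    return example
# tokenized_dataset.map(check_ids)

# Suspect 2: masked-LM training needs labels; DataCollatorForLanguageModeling
# creates them on the fly by masking tokens and copying the originals to `labels`.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
```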

As I mentioned above, we don't have the resources to do proper support in situations like this, but I'd be curious to find the root cause. Perhaps we could improve documentation with the findings :) If you get stuck, I might have capacity to pick it up in a few weeks.
