
Range Error for BERT Masked Language Modeling on IMDB #16846

Closed
Jadiker opened this issue Apr 20, 2022 · 7 comments

Jadiker commented Apr 20, 2022

System Info

- `transformers` version: 4.18.0
- Platform: Linux-5.4.144+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.13
- Huggingface_hub version: 0.5.1
- PyTorch version (GPU?): 1.10.0+cu111 (False)
- Tensorflow version (GPU?): 2.8.0 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No

Who can help?

@LysandreJik

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

https://colab.research.google.com/drive/1ZpYRkJVMF5r3MukUheEFtgDvqax4YCxM?usp=sharing

Expected behavior

Evaluation to complete and give me a perplexity score, as it does [here](https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/chapter7/section3_tf.ipynb)
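
(For reference, the perplexity in the linked notebook is derived from the evaluation cross-entropy loss; a minimal sketch of that computation, with a placeholder loss value, is:)

```python
import math

eval_loss = 3.21  # placeholder: mean cross-entropy loss returned by evaluation
perplexity = math.exp(eval_loss)
print(f"Perplexity: {perplexity:.2f}")
```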
@Jadiker Jadiker added the bug label Apr 20, 2022
@gante gante assigned gante and unassigned gante Apr 21, 2022
gante (Member) commented Apr 21, 2022

Hi @Jadiker 👋 In your notebook, after the second `tokenize_and_chunk` cell (there are two), we can see a warning that explains the error: `Token indices sequence length is longer than the specified maximum sequence length for this model (521 > 512). Running this sequence through the model will result in indexing errors.`

If you add `truncation=True` to the tokenizer call in that cell, you should be able to solve the problem. Let me know if it works :)
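
For illustration, a minimal sketch of that change (the checkpoint, function name, and `text` column here are assumptions, not necessarily what the notebook uses):

```python
from transformers import AutoTokenizer

# bert-base-cased is assumed here; the notebook may use a different checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    # truncation=True caps every sequence at the model's maximum length
    # (512 for BERT), so no example exceeds the position-embedding range.
    return tokenizer(examples["text"], truncation=True)

# tokenized_dataset = raw_dataset.map(tokenize_function, batched=True)
```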

Jadiker (Author) commented Apr 26, 2022

Nope, same error: `indices[15,32] = -9223372036854775808 is not in [0, 28996)`. (I've edited the notebook with the change.)

gante (Member) commented Apr 26, 2022

@Jadiker Thank you for the update :) The problem seems to arise from your custom tokenization function, which is likely not returning the correct data format. See this notebook, which successfully runs your code if we skip `tokenize_and_chunk`. Inside `tokenize_and_chunk`, you call `chunks.append(all_input_ids[idx: idx + context_length])`, which would explain the indexing errors.
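
For context, a sketch of the chunking pattern being described; this is a reconstruction based on the quoted line, not the exact notebook code:

```python
context_length = 128  # assumed chunk size; the notebook may use a different value

def chunk_input_ids(all_input_ids):
    """Split a flat list of token ids into fixed-size chunks."""
    chunks = []
    for idx in range(0, len(all_input_ids), context_length):
        chunks.append(all_input_ids[idx: idx + context_length])
    # The final chunk is usually shorter than context_length, so the resulting
    # column is ragged unless that chunk is dropped or padded downstream.
    return chunks
```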

We also reserve these GitHub issues for bugs in the repository and/or feature requests. For any other requests, like issues in your custom code, we'd like to invite you to use our forum 🤗 I'm closing this issue, but feel free to reopen with queries that fit the criteria I described.

@gante gante closed this as completed Apr 26, 2022
Jadiker (Author) commented Apr 27, 2022

@gante Thanks for your time and for the information! I really appreciate it.

Two comments:

  1. All the code (including the `tokenize_and_chunk` function) in the notebook is directly from Hugging Face. It comes from this notebook, which is linked in this tutorial on data processing. The only thing I have done is add code after the data processing in order to actually train a model on the processed data. (And the code for training the model comes from this Hugging Face tutorial.)

Given that, should I still have posted on the forum first? If the tutorials for data processing and model training can't be combined, how is one supposed to train a model on the processed data? It seemed like something that should be fixed in the code, rather than just discussed on the forum.

  2. I don't believe the notebook you linked to is shared with me.

Thanks again for engaging with this!

gante (Member) commented Apr 27, 2022

> I don't believe the notebook you linked to is shared with me.

Oops, forgot to change the permissions. Should be okay now

Jadiker (Author) commented Apr 27, 2022

After looking at the notebook you linked, it seems like the issue is that the tutorial notebook gives two different options for tokenizing text; by using both of them, rather than just the first one, I introduced a bug into the code.

Does that sound accurate?

gante (Member) commented Apr 27, 2022

@Jadiker Yeah, the problem seems to be at the dataset preparation stage. To be candid, I also can't find the issue from a quick glance -- I've double-checked the `input_ids`, and they are all within the vocabulary size, so `gather` shouldn't complain 🤔 Can you have a look at the script example here, which was working as of a few weeks ago (and should still be working), and see if you can find the issue? My number 1 suspect is the lack of a `labels` column, but the thrown error does not point at that.
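
For reference, a minimal sketch of how one might check both suspects (the checkpoint, dataset, and column names are assumptions for illustration):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Suspect 1: any id outside [0, vocab_size) would break the embedding lookup.
def check_ids(example):
    assert all(0 <= i < tokenizer.vocab_size for i in example["input_ids"]), example["input_ids"]
    return example
# tokenized_dataset.map(check_ids)

# Suspect 2: masked-LM training needs labels; DataCollatorForLanguageModeling
# creates them on the fly by masking tokens and copying the originals to `labels`.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
```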

As I mentioned above, we don't have the resources to do proper support in situations like this, but I'd be curious to find the root cause. Perhaps we could improve documentation with the findings :) If you get stuck, I might have capacity to pick it up in a few weeks.
