IndexError during training with Squad dataset and T5-small model #6973
Comments
add remove_unused_columns=False to training_args
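A minimal sketch of that workaround (keeping the rest of the reproduction script below unchanged); note that the raw SQuAD examples would still need to be tokenized before the collator can batch them:

from transformers import TrainingArguments

# Keep the raw dataset columns instead of letting Trainer drop every column
# that does not match the model's forward() signature (that drop is what
# leaves the dataset with size 0 and triggers the IndexError).
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    remove_unused_columns=False,  # workaround suggested in this thread
)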
arthasking123 added a commit to arthasking123/transformers that referenced this issue on Jun 18, 2024
amyeroberts pushed a commit to huggingface/transformers that referenced this issue on Jun 19, 2024
* Add valid columns checking in _remove_unused_columns method (huggingface/datasets#6973 (comment), huggingface/datasets#6535, https://discuss.huggingface.co/t/indexerror-invalid-key-16-is-out-of-bounds-for-size-0/14298/25)
* Update modeling_mixtral.py
* Update modeling_mixtral.py
* Update modeling_mixtral.py
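The actual patch lives in transformers; the snippet below is only a rough, hypothetical illustration of the idea behind such a check (the helper name and error message are made up, not the merged code):

import inspect

def check_remaining_columns(dataset, model):
    # Hypothetical helper: fail loudly if dropping the columns that do not
    # match model.forward() would leave the dataset empty, instead of letting
    # the IndexError surface later during batch sampling.
    signature_columns = set(inspect.signature(model.forward).parameters)
    kept = [c for c in dataset.column_names if c in signature_columns]
    if not kept:
        raise ValueError(
            f"No column in {dataset.column_names} matches the model's forward() "
            "arguments; tokenize the dataset or set remove_unused_columns=False."
        )
    return kept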
Closing this issue because it was reported and fixed in transformers.
Describe the bug
I am encountering an IndexError while training a T5-small model on the Squad dataset using the transformers and datasets libraries. The error occurs even with a minimal reproducible example, suggesting a potential bug or incompatibility.
Steps to reproduce the bug
1. Install the required libraries: !pip install transformers datasets
2. Run the following code:
!pip install transformers datasets

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, TrainingArguments, Trainer, DataCollatorWithPadding

# Load a small, publicly available dataset
from datasets import load_dataset
dataset = load_dataset("squad", split="train[:100]")  # Use a small subset for testing

# Load a pre-trained model and tokenizer
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Define a basic data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

# Create a trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator,
)

# Train the model
trainer.train()
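For context: the raw SQuAD columns (id, title, context, question, answers) do not match T5's forward() arguments, so Trainer's column removal leaves an empty formatted dataset. A hedged sketch of preprocessing that would give Trainer usable input_ids/labels columns (the prompt template and max lengths are assumptions, not from the original report):

from transformers import DataCollatorForSeq2Seq

def preprocess(examples):
    # Build "question: ... context: ..." prompts and tokenize the answers as labels.
    inputs = [
        f"question: {q}  context: {c}"
        for q, c in zip(examples["question"], examples["context"])
    ]
    targets = [a["text"][0] if a["text"] else "" for a in examples["answers"]]
    model_inputs = tokenizer(inputs, max_length=384, truncation=True)
    labels = tokenizer(text_target=targets, max_length=32, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

# A seq2seq collator also pads the labels, which DataCollatorWithPadding does not.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

Passing tokenized (instead of the raw dataset) as train_dataset would then give Trainer columns it can keep.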
Expected behavior
IndexError Traceback (most recent call last)
in <cell line: 34>()
32
33 # Train the model
---> 34 trainer.train()
10 frames
/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py in _check_valid_index_key(key, size)
427 if isinstance(key, int):
428 if (key < 0 and key + size < 0) or (key >= size):
--> 429 raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
430 return
431 elif isinstance(key, slice):
IndexError: Invalid key: 42 is out of bounds for size 0
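The "size 0" in the message comes from Trainer having removed every dataset column before sampling. A quick, illustrative check (not part of the original report) that shows the empty overlap:

import inspect

# Trainer keeps only the columns whose names match model.forward() arguments;
# for raw SQuAD there is no overlap, so every column gets dropped.
forward_params = set(inspect.signature(model.forward).parameters)
print(dataset.column_names)                        # ['id', 'title', 'context', 'question', 'answers']
print(forward_params & set(dataset.column_names))  # set()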
Environment info
transformers version: 4.41.2
datasets version: 1.18.4
Python version: 3.10.12