
IndexError: Invalid key: 47682 is out of bounds for size 0 while using PEFT #6535

Open
MahavirDabas18 opened this issue Dec 26, 2023 · 3 comments

Comments

@MahavirDabas18

Describe the bug

I am trying to fine-tune the t5 model on a paraphrasing task. When I run the same code without

model = get_peft_model(model, config)

the model trains without any issues. However, training with the model returned by get_peft_model raises the following error from datasets:

IndexError: Invalid key: 47682 is out of bounds for size 0.

I had raised this in huggingface/peft#1299 (comment) and they suggested that I raise it here.

Here is the complete error:

IndexError Traceback (most recent call last)
in <cell line: 1>()
----> 1 trainer.train()

11 frames
/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1553 hf_hub_utils.enable_progress_bars()
1554 else:
-> 1555 return inner_training_loop(
1556 args=args,
1557 resume_from_checkpoint=resume_from_checkpoint,

/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in _inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1836
1837 step = -1
-> 1838 for step, inputs in enumerate(epoch_iterator):
1839 total_batched_samples += 1
1840 if rng_to_sync:

/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py in __iter__(self)
446 # We iterate one batch ahead to check when we are at the end
447 try:
--> 448 current_batch = next(dataloader_iter)
449 except StopIteration:
450 yield

/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py in __next__(self)
628 # TODO(pytorch/pytorch#76750)
629 self._reset() # type: ignore[call-arg]
--> 630 data = self._next_data()
631 self._num_yielded += 1
632 if self._dataset_kind == _DatasetKind.Iterable and \

/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
672 def _next_data(self):
673 index = self._next_index() # may raise StopIteration
--> 674 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
675 if self._pin_memory:
676 data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)

/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
47 if self.auto_collation:
48 if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
---> 49 data = self.dataset.__getitems__(possibly_batched_index)
50 else:
51 data = [self.dataset[idx] for idx in possibly_batched_index]

/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py in __getitems__(self, keys)
2802 def __getitems__(self, keys: List) -> List:
2803 """Can be used to get a batch using a list of integers indices."""
-> 2804 batch = self.__getitem__(keys)
2805 n_examples = len(batch[next(iter(batch))])
2806 return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]

/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py in __getitem__(self, key)
2798 def __getitem__(self, key): # noqa: F811
2799 """Can be used to index columns (by string names) or rows (by integer index or iterable of indices or bools)."""
-> 2800 return self._getitem(key)
2801
2802 def __getitems__(self, keys: List) -> List:

/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py in _getitem(self, key, **kwargs)
2782 format_kwargs = format_kwargs if format_kwargs is not None else {}
2783 formatter = get_formatter(format_type, features=self._info.features, **format_kwargs)
-> 2784 pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
2785 formatted_output = format_table(
2786 pa_subtable, key, formatter=formatter, format_columns=format_columns, output_all_columns=output_all_columns

/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py in query_table(table, key, indices)
581 else:
582 size = indices.num_rows if indices is not None else table.num_rows
--> 583 _check_valid_index_key(key, size)
584 # Query the main table
585 if indices is None:

/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py in _check_valid_index_key(key, size)
534 elif isinstance(key, Iterable):
535 if len(key) > 0:
--> 536 _check_valid_index_key(int(max(key)), size=size)
537 _check_valid_index_key(int(min(key)), size=size)
538 else:

/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py in _check_valid_index_key(key, size)
524 if isinstance(key, int):
525 if (key < 0 and key + size < 0) or (key >= size):
--> 526 raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
527 return
528 elif isinstance(key, slice):

IndexError: Invalid key: 47682 is out of bounds for size 0

Steps to reproduce the bug

device = "cuda:0" if torch.cuda.is_available() else "cpu"

#defining model name for tokenizer and model loading
model_name= "t5-small"
#loading the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

def preprocess_function(data, tokenizer):
inputs = [f"Paraphrase this sentence: {doc}" for doc in data["text"]]
model_inputs = tokenizer(inputs, max_length=150, truncation=True)
labels = [ast.literal_eval(i)[0] for i in data['paraphrases']]
labels = tokenizer(labels, max_length=150, truncation=True)
model_inputs["labels"] = labels["input_ids"]
return model_inputs

train_dataset = load_dataset("humarin/chatgpt-paraphrases", split="train").shuffle(seed=42).select(range(50000))
val_dataset = load_dataset("humarin/chatgpt-paraphrases", split="train").shuffle(seed=42).select(range(50000,55000))

tokenized_train = train_dataset.map(lambda batch: preprocess_function(batch, tokenizer), batched=True)
tokenized_val = val_dataset.map(lambda batch: preprocess_function(batch, tokenizer), batched=True)

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

config = LoraConfig(
    r=16,  # LoRA rank
    lora_alpha=32,  # alpha scaling
    lora_dropout=0.05,
    bias="none",
    task_type="Seq2Seq"
)

# loading the model
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

model = get_peft_model(model, config)
print_trainable_parameters(model)

# loading the data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    label_pad_token_id=-100,
    padding="longest"
)

# defining the training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir=os.getcwd(),
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=1e-3,
    save_total_limit=3,
    load_best_model_at_end=True,
    num_train_epochs=1,
    predict_with_generate=True
)

def compute_metric_with_extra(tokenizer):
    def compute_metrics(eval_preds):
        metric = evaluate.load("rouge")
        preds, labels = eval_preds

        # decode preds and labels
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

        # rougeLSum expects newline after each sentence
        decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
        decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

        result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
        return result

    return compute_metrics

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metric_with_extra(tokenizer)
)

trainer.train()

Expected behavior

I would expect the trainer to train normally, as it did before I added

model = get_peft_model(model, config)

Environment info

datasets version: 2.16.0
peft version: 0.7.1
transformers version: 4.35.2
accelerate version: 0.25.0
Python version: 3.10.12
environment: Google Colab

@MahavirDabas18
Author

@sabman @pvl @kashif @vigsterkr

@lhoestq
Member

lhoestq commented Jan 2, 2024

This is surely the same issue as https://discuss.huggingface.co/t/indexerror-invalid-key-16-is-out-of-bounds-for-size-0/14298/25 that comes from the transformers Trainer. You should add remove_unused_columns=False to your TrainingArguments.

Also check your logs: the Trainer should log the length of your dataset before training starts and it surely showed length=0.
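
A minimal sketch of the suggested change, applied to the Seq2SeqTrainingArguments from the report above (all other arguments copied unchanged; remove_unused_columns is a standard transformers TrainingArguments flag):

training_args = Seq2SeqTrainingArguments(
    output_dir=os.getcwd(),
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=1e-3,
    save_total_limit=3,
    load_best_model_at_end=True,
    num_train_epochs=1,
    predict_with_generate=True,
    remove_unused_columns=False,  # keep input_ids / attention_mask / labels in the dataset
)

With remove_unused_columns=False, the Trainer skips the step that drops dataset columns it cannot match to the model's forward signature; with a PEFT-wrapped model that step can remove every column, which is what leaves the dataset with size 0.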

@vip-china

I am getting the same error:
IndexError: Invalid key: 22330 is out of bounds for size 0
