
IndexError: Invalid key: 47682 is out of bounds for size 0 while using PEFT #6535

Open
MahavirDabas18 opened this issue Dec 26, 2023 · 3 comments

Comments

@MahavirDabas18

Describe the bug

I am trying to fine-tune the t5 model on a paraphrasing task. When I run the same code without

model = get_peft_model(model, config)

the model trains without any issues. However, training with the model returned by get_peft_model raises the following error from datasets:

IndexError: Invalid key: 47682 is out of bounds for size 0.

I had raised this in huggingface/peft#1299 (comment) and they suggested that I raise it here.

Here is the complete error:

IndexError Traceback (most recent call last)
in <cell line: 1>()
----> 1 trainer.train()

11 frames
/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1553 hf_hub_utils.enable_progress_bars()
1554 else:
-> 1555 return inner_training_loop(
1556 args=args,
1557 resume_from_checkpoint=resume_from_checkpoint,

/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in _inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1836
1837 step = -1
-> 1838 for step, inputs in enumerate(epoch_iterator):
1839 total_batched_samples += 1
1840 if rng_to_sync:

/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py in __iter__(self)
446 # We iterate one batch ahead to check when we are at the end
447 try:
--> 448 current_batch = next(dataloader_iter)
449 except StopIteration:
450 yield

/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py in __next__(self)
628 # TODO(pytorch/pytorch#76750)
629 self._reset() # type: ignore[call-arg]
--> 630 data = self._next_data()
631 self._num_yielded += 1
632 if self._dataset_kind == _DatasetKind.Iterable and \

/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
672 def _next_data(self):
673 index = self._next_index() # may raise StopIteration
--> 674 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
675 if self._pin_memory:
676 data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)

/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
47 if self.auto_collation:
48 if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
---> 49 data = self.dataset.__getitems__(possibly_batched_index)
50 else:
51 data = [self.dataset[idx] for idx in possibly_batched_index]

/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py in __getitems__(self, keys)
2802 def __getitems__(self, keys: List) -> List:
2803 """Can be used to get a batch using a list of integers indices."""
-> 2804 batch = self.__getitem__(keys)
2805 n_examples = len(batch[next(iter(batch))])
2806 return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]

/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py in __getitem__(self, key)
2798 def __getitem__(self, key): # noqa: F811
2799 """Can be used to index columns (by string names) or rows (by integer index or iterable of indices or bools)."""
-> 2800 return self._getitem(key)
2801
2802 def __getitems__(self, keys: List) -> List:

/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py in _getitem(self, key, **kwargs)
2782 format_kwargs = format_kwargs if format_kwargs is not None else {}
2783 formatter = get_formatter(format_type, features=self._info.features, **format_kwargs)
-> 2784 pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
2785 formatted_output = format_table(
2786 pa_subtable, key, formatter=formatter, format_columns=format_columns, output_all_columns=output_all_columns

/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py in query_table(table, key, indices)
581 else:
582 size = indices.num_rows if indices is not None else table.num_rows
--> 583 _check_valid_index_key(key, size)
584 # Query the main table
585 if indices is None:

/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py in _check_valid_index_key(key, size)
534 elif isinstance(key, Iterable):
535 if len(key) > 0:
--> 536 _check_valid_index_key(int(max(key)), size=size)
537 _check_valid_index_key(int(min(key)), size=size)
538 else:

/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py in _check_valid_index_key(key, size)
524 if isinstance(key, int):
525 if (key < 0 and key + size < 0) or (key >= size):
--> 526 raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
527 return
528 elif isinstance(key, slice):

IndexError: Invalid key: 47682 is out of bounds for size 0

Steps to reproduce the bug

device = "cuda:0" if torch.cuda.is_available() else "cpu"

#defining model name for tokenizer and model loading
model_name= "t5-small"
#loading the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

def preprocess_function(data, tokenizer):
inputs = [f"Paraphrase this sentence: {doc}" for doc in data["text"]]
model_inputs = tokenizer(inputs, max_length=150, truncation=True)
labels = [ast.literal_eval(i)[0] for i in data['paraphrases']]
labels = tokenizer(labels, max_length=150, truncation=True)
model_inputs["labels"] = labels["input_ids"]
return model_inputs

train_dataset = load_dataset("humarin/chatgpt-paraphrases", split="train").shuffle(seed=42).select(range(50000))
val_dataset = load_dataset("humarin/chatgpt-paraphrases", split="train").shuffle(seed=42).select(range(50000,55000))

tokenized_train = train_dataset.map(lambda batch: preprocess_function(batch, tokenizer), batched=True)
tokenized_val = val_dataset.map(lambda batch: preprocess_function(batch, tokenizer), batched=True)

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

config = LoraConfig(
    r=16,  # LoRA rank
    lora_alpha=32,  # alpha scaling
    lora_dropout=0.05,
    bias="none",
    task_type="Seq2Seq"
)

# loading the model
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

model = get_peft_model(model, config)
print_trainable_parameters(model)

# loading the data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    label_pad_token_id=-100,
    padding="longest"
)

# defining the training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir=os.getcwd(),
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=1e-3,
    save_total_limit=3,
    load_best_model_at_end=True,
    num_train_epochs=1,
    predict_with_generate=True
)

def compute_metric_with_extra(tokenizer):
    def compute_metrics(eval_preds):
        metric = evaluate.load("rouge")
        preds, labels = eval_preds

        # decode preds and labels
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

        # rougeLSum expects newline after each sentence
        decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
        decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

        result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
        return result

    return compute_metrics

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metric_with_extra(tokenizer)
)

trainer.train()

Expected behavior

I would expect the trainer to train normally, as it did before I added

model = get_peft_model(model, config)

Environment info

datasets version: 2.16.0
peft version: 0.7.1
transformers version: 4.35.2
accelerate version: 0.25.0
Python version: 3.10.12
environment: Google Colab

@MahavirDabas18
Author

@sabman @pvl @kashif @vigsterkr

@lhoestq
Member

lhoestq commented Jan 2, 2024

This is surely the same issue as https://discuss.huggingface.co/t/indexerror-invalid-key-16-is-out-of-bounds-for-size-0/14298/25 that comes from the transformers Trainer. You should add remove_unused_columns=False to your TrainingArguments.

Also check your logs: the Trainer should log the length of your dataset before training starts and it surely showed length=0.
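
A minimal sketch of the suggested change, applied to the Seq2SeqTrainingArguments from the report above (all other arguments copied unchanged; remove_unused_columns is a standard transformers TrainingArguments flag):

training_args = Seq2SeqTrainingArguments(
    output_dir=os.getcwd(),
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=1e-3,
    save_total_limit=3,
    load_best_model_at_end=True,
    num_train_epochs=1,
    predict_with_generate=True,
    remove_unused_columns=False,  # keep input_ids / attention_mask / labels in the dataset
)

With remove_unused_columns=False, the Trainer skips the step that drops dataset columns it cannot match to the model's forward signature; with a PEFT-wrapped model that step can remove every column, which is what leaves the dataset with size 0.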

@vip-china

I am getting the same error:
IndexError: Invalid key: 22330 is out of bounds for size 0
