IndexError: Invalid key: 47682 is out of bounds for size 0 while using PEFT #6535
Comments
This is surely the same issue as https://discuss.huggingface.co/t/indexerror-invalid-key-16-is-out-of-bounds-for-size-0/14298/25, which comes from the Trainer removing the dataset columns it considers unused (its _remove_unused_columns step). Also check your logs: the Trainer reports which columns it removed.
I'm running into the same error.
Referenced in: Add valid columns checking in _remove_unused_columns method (huggingface/datasets#6973 (comment), huggingface/datasets#6535, https://discuss.huggingface.co/t/indexerror-invalid-key-16-is-out-of-bounds-for-size-0/14298/25)
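For context, that column filtering is signature-based: the Trainer keeps only the dataset columns whose names match a parameter of the model's forward method. The snippet below is a simplified, illustrative stand-in (not the actual Trainer code; keep_columns is a made-up helper) showing why a wrapper whose forward is just (*args, **kwargs) makes every column look unused:

import inspect

def keep_columns(dataset_columns, model_forward):
    # Simplified stand-in for the signature check performed by
    # Trainer._remove_unused_columns: keep a column only if its name
    # is a parameter of the model's forward method.
    signature_columns = set(inspect.signature(model_forward).parameters)
    return [c for c in dataset_columns if c in signature_columns]

def base_forward(input_ids=None, attention_mask=None, labels=None):
    pass

def wrapped_forward(*args, **kwargs):  # what a generic wrapper's forward can look like
    pass

cols = ["input_ids", "attention_mask", "labels"]
print(keep_columns(cols, base_forward))     # ['input_ids', 'attention_mask', 'labels']
print(keep_columns(cols, wrapped_forward))  # [] -- every column gets removed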
Describe the bug
I am trying to fine-tune the T5 model on a paraphrasing task. When I run the same code without
model = get_peft_model(model, config)
the model trains without any issues. However, using the model returned by get_peft_model raises the following error from datasets:
IndexError: Invalid key: 47682 is out of bounds for size 0
I originally raised this in huggingface/peft#1299 (comment) and was advised to raise it here.
Here is the complete traceback:
IndexError Traceback (most recent call last)
in <cell line: 1>()
----> 1 trainer.train()
11 frames
/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1553 hf_hub_utils.enable_progress_bars()
1554 else:
-> 1555 return inner_training_loop(
1556 args=args,
1557 resume_from_checkpoint=resume_from_checkpoint,
/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in _inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1836
1837 step = -1
-> 1838 for step, inputs in enumerate(epoch_iterator):
1839 total_batched_samples += 1
1840 if rng_to_sync:
/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py in __iter__(self)
446 # We iterate one batch ahead to check when we are at the end
447 try:
--> 448 current_batch = next(dataloader_iter)
449 except StopIteration:
450 yield
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py in __next__(self)
628 # TODO(pytorch/pytorch#76750)
629 self._reset() # type: ignore[call-arg]
--> 630 data = self._next_data()
631 self._num_yielded += 1
632 if self._dataset_kind == _DatasetKind.Iterable and \
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
672 def _next_data(self):
673 index = self._next_index() # may raise StopIteration
--> 674 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
675 if self._pin_memory:
676 data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)
/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
47 if self.auto_collation:
48 if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
---> 49 data = self.dataset.__getitems__(possibly_batched_index)
50 else:
51 data = [self.dataset[idx] for idx in possibly_batched_index]
/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py in __getitems__(self, keys)
2802 def __getitems__(self, keys: List) -> List:
2803 """Can be used to get a batch using a list of integers indices."""
-> 2804 batch = self.__getitem__(keys)
2805 n_examples = len(batch[next(iter(batch))])
2806 return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py in __getitem__(self, key)
2798 def __getitem__(self, key): # noqa: F811
2799 """Can be used to index columns (by string names) or rows (by integer index or iterable of indices or bools)."""
-> 2800 return self._getitem(key)
2801
2802 def __getitems__(self, keys: List) -> List:
/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py in _getitem(self, key, **kwargs)
2782 format_kwargs = format_kwargs if format_kwargs is not None else {}
2783 formatter = get_formatter(format_type, features=self._info.features, **format_kwargs)
-> 2784 pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
2785 formatted_output = format_table(
2786 pa_subtable, key, formatter=formatter, format_columns=format_columns, output_all_columns=output_all_columns
/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py in query_table(table, key, indices)
581 else:
582 size = indices.num_rows if indices is not None else table.num_rows
--> 583 _check_valid_index_key(key, size)
584 # Query the main table
585 if indices is None:
/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py in _check_valid_index_key(key, size)
534 elif isinstance(key, Iterable):
535 if len(key) > 0:
--> 536 _check_valid_index_key(int(max(key)), size=size)
537 _check_valid_index_key(int(min(key)), size=size)
538 else:
/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py in _check_valid_index_key(key, size)
524 if isinstance(key, int):
525 if (key < 0 and key + size < 0) or (key >= size):
--> 526 raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
527 return
528 elif isinstance(key, slice):
IndexError: Invalid key: 47682 is out of bounds for size 0
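The size 0 in the last frame is the telling part: by the time the DataLoader asks for row 47682, the training dataset no longer has any columns, and a datasets table with zero columns also reports zero rows. Here is a minimal sketch of just the datasets side of this (independent of PEFT and transformers; the variable names are mine):

from datasets import Dataset

ds = Dataset.from_dict({"input_ids": [[1, 2], [3, 4], [5, 6]]})
empty = ds.remove_columns(ds.column_names)  # drop every column, roughly what the Trainer's column removal ends up doing here
print(empty.num_rows)  # 0 -- with no columns left, the table reports no rows either
empty[2]               # raises IndexError: Invalid key: 2 is out of bounds for size 0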
Steps to reproduce the bug
import ast
import os

import evaluate
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# defining the model name for tokenizer and model loading
model_name = "t5-small"

# loading the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)


def preprocess_function(data, tokenizer):
    inputs = [f"Paraphrase this sentence: {doc}" for doc in data["text"]]
    model_inputs = tokenizer(inputs, max_length=150, truncation=True)
    labels = [ast.literal_eval(i)[0] for i in data["paraphrases"]]
    labels = tokenizer(labels, max_length=150, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


train_dataset = load_dataset("humarin/chatgpt-paraphrases", split="train").shuffle(seed=42).select(range(50000))
val_dataset = load_dataset("humarin/chatgpt-paraphrases", split="train").shuffle(seed=42).select(range(50000, 55000))

tokenized_train = train_dataset.map(lambda batch: preprocess_function(batch, tokenizer), batched=True)
tokenized_val = val_dataset.map(lambda batch: preprocess_function(batch, tokenizer), batched=True)


def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )


config = LoraConfig(
    r=16,  # LoRA rank
    lora_alpha=32,  # alpha scaling
    lora_dropout=0.05,
    bias="none",
    task_type="Seq2Seq",
)

# loading the model
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
model = get_peft_model(model, config)
print_trainable_parameters(model)

# loading the data collator
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    label_pad_token_id=-100,
    padding="longest",
)

# defining the training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir=os.getcwd(),
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=1e-3,
    save_total_limit=3,
    load_best_model_at_end=True,
    num_train_epochs=1,
    predict_with_generate=True,
)


def compute_metric_with_extra(tokenizer):
    def compute_metrics(eval_preds):
        metric = evaluate.load("rouge")
        preds, labels = eval_preds
        # (the rest of the metric computation was not included in the report)

    return compute_metrics


trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metric_with_extra(tokenizer),
)

trainer.train()
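Two mitigation sketches based on the linked discussion, reusing the variable names from the script above. The first keeps only the tokenized columns so there is nothing left for the Trainer's column removal to break; the second swaps the task_type string for TaskType.SEQ_2_SEQ_LM, the value PEFT documents for encoder-decoder models such as T5 (an assumption about the intended configuration, not a confirmed fix):

from peft import LoraConfig, TaskType

# 1) Drop the raw text columns during .map() so only model inputs remain.
tokenized_train = train_dataset.map(
    lambda batch: preprocess_function(batch, tokenizer),
    batched=True,
    remove_columns=train_dataset.column_names,
)

# 2) Use PEFT's seq2seq task type so get_peft_model returns its seq2seq wrapper,
#    whose forward exposes the usual encoder-decoder arguments.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)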
Expected behavior
I would expect the trainer to train normally, as it did before I added:
model = get_peft_model(model, config)
Environment info
datasets version: 2.16.0
peft version: 0.7.1
transformers version: 4.35.2
accelerate version: 0.25.0
Python: 3.10.12
Environment: Google Colab
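For reference, the versions above can be collected with a short snippet like this (each of these packages exposes a __version__ attribute):

import accelerate
import datasets
import peft
import transformers

print("datasets", datasets.__version__)
print("peft", peft.__version__)
print("transformers", transformers.__version__)
print("accelerate", accelerate.__version__)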