LoRA results in 4-6% lower performance compared to full fine-tuning #622
Comments
Thanks for bringing up this discussion. What I would try first is probably increasing the |
@younesbelkada - Thank you for the suggestion above. I did try experimenting with higher rank ( Although, based on the above suggestion, I also tried adapting all other linear layers (including FFN layers). The performance improved by 1%; however, there is still a gap of ~3% compared to full fine-tuning. |
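For a rough sense of how much capacity each choice adds, the sketch below counts LoRA parameters when adapting only the attention query/value projections versus every linear layer. The layer shapes (d_model = 4096, d_ff = 10240, 24 encoder and 24 decoder blocks, a gated FFN with wi_0, wi_1 and wo) are assumptions about the flan-t5-xxl architecture rather than figures from this thread; r = 8 is taken from the LoraConfig in the issue description.

```python
# Rough LoRA parameter count. All shapes are ASSUMED values for flan-t5-xxl.
D_MODEL, D_FF = 4096, 10240
N_ENC, N_DEC = 24, 24

def lora_params(d_in, d_out, r):
    # LoRA adds A (r x d_in) and B (d_out x r) for each adapted linear layer.
    return r * (d_in + d_out)

def count(r, attn_targets, ffn_targets):
    # Attention projections are assumed to be D_MODEL x D_MODEL;
    # every FFN projection connects D_MODEL and D_FF in one direction or the other.
    attn = len(attn_targets) * lora_params(D_MODEL, D_MODEL, r)
    ffn = len(ffn_targets) * lora_params(D_MODEL, D_FF, r)
    # Encoder block: self-attention + FFN.
    # Decoder block: self-attention + cross-attention + FFN.
    return N_ENC * (attn + ffn) + N_DEC * (2 * attn + ffn)

attention_only = count(8, ["q", "v"], [])
all_linear = count(8, ["q", "k", "v", "o"], ["wi_0", "wi_1", "wo"])

print(f"q/v only:   {attention_only:,}")
print(f"all linear: {all_linear:,}")
print(f"ratio:      {all_linear / attention_only:.2f}x")
```

Under these assumptions, adapting every linear layer trains several times more parameters than the query/value-only setup, yet both remain a very small fraction of the full model.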
I see, thanks for the experiments. Can you also double-check that the LoRA weights are set on the encoder as well? |
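One way to run such a check is to group the adapted module names by stack. The helper below is a minimal sketch: `lora_coverage` is a hypothetical name, the example module names are made up for illustration, and it assumes LoRA submodules contain `lora_` in their names, as PEFT's LoRA layers do.

```python
from collections import Counter

def lora_coverage(module_names):
    """Count LoRA submodules per stack, given names from model.named_modules()."""
    counts = Counter()
    for name in module_names:
        if "lora_" not in name:
            continue  # skip base-model modules
        parts = name.split(".")
        if "encoder" in parts:
            counts["encoder"] += 1
        elif "decoder" in parts:
            counts["decoder"] += 1
        else:
            counts["other"] += 1
    return counts

# Hypothetical names for illustration only. With a real PEFT model, use:
#   lora_coverage(name for name, _ in model.named_modules())
example = [
    "base_model.model.encoder.block.0.layer.0.SelfAttention.q.lora_A",
    "base_model.model.encoder.block.0.layer.0.SelfAttention.q.lora_B",
    "base_model.model.decoder.block.0.layer.1.EncDecAttention.v.lora_A",
    "base_model.model.decoder.block.0.layer.0.SelfAttention.k",
]
coverage = lora_coverage(example)
print(dict(coverage))
```

If the encoder count comes back as zero, the adapters were attached to the decoder only.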
Yes, the LoRA weights are set on both the encoder and the decoder. Here is the list of modules for first block in
Also, I am not quantizing the base model to 8-bit; instead, I am using bf16 precision for fine-tuning. Additionally, the training dataset consists of instructions for ~20 tasks (a mixture of summarization, classification, paraphrasing, etc.). I am wondering if the setup is sensitive to hyperparameters. Specifically,
|
Hello @digvijayingle016, have you also tried making the biases trainable via |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. |
Closing the issue. Feel free to reopen it if you have more questions. |
I am working on fine-tuning LLMs (6B to 40B parameters) using the LoRA framework on an instruction-tuning dataset comprising instructions for ~20 tasks (a mix of factual and open-ended tasks). The input to the model consists of a conversation snippet between two individuals along with a task-specific prompt. The results I am observing do not align with the performance improvements reported in the paper. Specifically, the paper reports that fine-tuning with LoRA generally results in performance on par with or better than full fine-tuning of the model. However, throughout our experiments I observe performance lower than full fine-tuning by an absolute margin of ~4-6% in RougeL score.
Sharing some of the training details below:
[Framework versions]
Python: 3.8
PyTorch: 1.13.1
Transformers: 4.27.4
PEFT: 0.3.0
[Infrastructure]
8 X A100 40 GB GPUs
[Hyper-parameter Range]
Learning rate: 5e-5 to 3e-3
Learning rate scheduler: [Constant, Linear]
Epochs: [1, 2]
Batch size: [2, 4, 8]
Weight decay: 0.0
Precision: bf16
Specifically, I tried fine-tuning the google/flan-t5-xxl model in the following two scenarios:
Scenario 1
Full fine-tuning with constant learning rate = 5e-5, batch size = 8, epochs = 1
Scenario 2
Fine-tuning using LoRA with constant learning rate = 1e-3, batch size = 8, epochs = 1, and LoraConfig as follows:
LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, bias='none', task_type="SEQ_2_SEQ_LM")
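To make the roles of r and lora_alpha concrete, here is a minimal pure-Python sketch of the LoRA forward pass for a single linear layer. It is a toy illustration, not PEFT's implementation: the sizes are tiny, dropout is omitted, and the helper names `matvec` and `lora_forward` are hypothetical. The low-rank update is scaled by lora_alpha / r, which is 16 / 8 = 2 for the config above.

```python
import random

def matvec(M, x):
    # Plain matrix-vector product on nested lists.
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, r, lora_alpha):
    # y = W x + (lora_alpha / r) * B (A x); W stays frozen, only A and B train.
    scaling = lora_alpha / r
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scaling * d for b, d in zip(base, delta)]

random.seed(0)
d_in, d_out, r, lora_alpha = 6, 4, 2, 4  # toy sizes; the issue uses r=8, lora_alpha=16
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # B starts at zero, so the adapter is a no-op at init
x = [1.0] * d_in

y_init = lora_forward(W, A, B, x, r, lora_alpha)
print(y_init == matvec(W, x))  # the adapted layer matches the frozen layer before training
```

Because the scaling is lora_alpha / r, raising r while leaving lora_alpha fixed shrinks the scale applied to the update, so the two are often adjusted together.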
Observation: Scenario 2 resulted in a 4% lower RougeL score than Scenario 1. I have also tried tuning the hyper-parameters in Scenario 2 within the range specified above; however, the smallest gap I could reach is ~4% RougeL.
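For reference, RougeL is based on the longest common subsequence (LCS) between a reference and a generated text. The sketch below is a simplified version that assumes lowercased whitespace tokenization and the F1 variant; the thread does not say which ROUGE implementation was used, and real packages typically add stemming and other normalization, so absolute numbers will differ.

```python
def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence length.
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference, candidate):
    # Simplified ROUGE-L F1 over lowercased whitespace tokens.
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
print(f"{score:.4f}")
```

An absolute gap of 4% then corresponds to a difference of 0.04 in this score, averaged over the evaluation set.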
Thank you very much for your time and consideration. Looking forward to any relevant insights here.