Negative KL even after using recommended generation kwargs #1005
Comments
I noticed one more thing: the logits from the reward model are inconsistent, and I suspect this could be the reason for the negative KL. Please check this issue.
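A quick way to check this is to run the same input through the reward model twice in eval mode; if the logits differ even then, something in the score head or the loading path is off. A minimal sketch (the model name is a placeholder):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

reward_name = "my-org/my-reward-model"  # placeholder for your reward model
tokenizer = AutoTokenizer.from_pretrained(reward_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_name)
reward_model.eval()  # disable dropout so repeated forward passes are deterministic

inputs = tokenizer("Question: ...\n\nAnswer: ...", return_tensors="pt")
with torch.no_grad():
    logits_a = reward_model(**inputs).logits
    logits_b = reward_model(**inputs).logits

# Should print True; if not, the inconsistency is in the model itself,
# e.g. a score head that was never trained or saved.
print(torch.allclose(logits_a, logits_b))
```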
I solved the negative KL issue by fully fine-tuning the SFT model instead of LoRA fine-tuning it. I am unsure why this worked, and I do not have conclusive proof for the claim, but for now the KL is much more stable.
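To make the difference concrete, here is a rough sketch of the two setups I am comparing (paths and hyperparameters are placeholders, and the exact `PPOTrainer` signature may differ slightly between TRL versions):

```python
from peft import LoraConfig
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer, create_reference_model

sft_path = "path/to/sft-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(sft_path)
config = PPOConfig(batch_size=8, mini_batch_size=2)

# Variant A: LoRA policy. With a peft model and ref_model=None, TRL computes
# reference log-probs from the same weights with the adapter disabled.
lora_cfg = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
lora_policy = AutoModelForCausalLMWithValueHead.from_pretrained(
    sft_path, peft_config=lora_cfg, load_in_4bit=True
)
ppo_trainer_lora = PPOTrainer(config, lora_policy, ref_model=None, tokenizer=tokenizer)

# Variant B: fully fine-tuned SFT policy with an explicit frozen reference copy.
full_policy = AutoModelForCausalLMWithValueHead.from_pretrained(sft_path)
ref_model = create_reference_model(full_policy)
ppo_trainer_full = PPOTrainer(config, full_policy, ref_model, tokenizer=tokenizer)
```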
That is very odd, but closing for now since the issue seems resolved. @younesbelkada for visibility.
I think I am also encountering this problem while trying to do PPO on a LoRA-fine-tuned Llama-2-7b. By 80 or so steps, the model seems to have learned to output gibberish for the best rewards, even though the env/rewards curve looks good and was increasing. I had set the KL penalty to "abs" because I thought it could counteract the negative-KL issue. I don't have access to a fully fine-tuned Llama-2-7b, so I am wondering whether you have any experience with this case.
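For reference, the setting I mean lives in `PPOConfig`; a minimal sketch (model name and hyperparameters are illustrative only):

```python
from trl import PPOConfig

# kl_penalty="abs" takes the absolute value of the per-token log-ratio,
# so individual token penalties cannot go negative; the default is "kl".
config = PPOConfig(
    model_name="meta-llama/Llama-2-7b-hf",  # placeholder
    learning_rate=1.41e-5,
    batch_size=16,
    mini_batch_size=4,
    kl_penalty="abs",
    init_kl_coef=0.2,
    target=6.0,
)
```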
Hi, I am fine-tuning with the PPOTrainer on the open-assistant dataset using 4-bit QLoRA, with the stack-llama example as a reference. However, I am getting negative KL from the very first step, even after using the recommended generation kwargs.
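For context, these are the sampling settings I mean by "recommended generation kwargs", roughly as in the stack-llama script (a sketch; `tokenizer`, `ppo_trainer`, and `query_tensors` are assumed to be set up as in that example):

```python
generation_kwargs = {
    "min_length": -1,      # don't force a minimum length
    "top_k": 0.0,          # no top-k filtering
    "top_p": 1.0,          # no nucleus filtering
    "do_sample": True,     # sample instead of greedy decoding
    "pad_token_id": tokenizer.eos_token_id,
}

# Inside the training loop (the exact generate call can vary across TRL versions):
response_tensors = ppo_trainer.generate(
    query_tensors, return_prompt=False, **generation_kwargs
)
```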
So far I have tried the following things; however, all of these configurations still give me negative KL.
A few things I have noticed:
Any help is much appreciated. Here are my wandb logs for more info.