
Negative KL even after using recommended generation kwargs #1005

Closed
Ali1858 opened this issue Nov 16, 2023 · 4 comments

Comments

Ali1858 commented Nov 16, 2023

Hi, I am fine-tuning a model with PPOTrainer on the open-assistant dataset using 4-bit QLoRA, with the stack-llama example as a reference. However, I am getting negative KL from the very first step, even though I am using the recommended generation kwargs:

generation_kwargs = {
    # "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,  # sample instead of greedy decoding
    "pad_token_id": tokenizer.pad_token_id,
    "eos_token_id": 100_000,  # out-of-vocabulary id, so generation is not cut off at EOS
    "max_new_tokens": max_new_tokens,
}
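
For context, this is roughly how those kwargs are consumed in the rollout loop (a minimal sketch in the style of the stack-llama example, not my exact script; ppo_trainer, tokenizer, reward_fn, and the batch fields are placeholders from the surrounding setup):

# Minimal rollout/step loop in the style of the stack-llama example;
# `ppo_trainer`, `tokenizer`, and `reward_fn` stand in for the real setup.
import torch

for batch in ppo_trainer.dataloader:
    query_tensors = batch["input_ids"]

    # Sample responses with the exact kwargs above, so the sampled tokens match
    # the distribution used later for the policy log-probs.
    response_tensors = ppo_trainer.generate(
        query_tensors, return_prompt=False, **generation_kwargs
    )
    batch["response"] = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)

    # Score the (query, response) pairs with the reward model (placeholder helper).
    rewards = [torch.tensor(r) for r in reward_fn(batch["query"], batch["response"])]

    # One PPO update; `stats` contains objective/kl among other metrics.
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)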

So far I have tried the following things; however, all of these configurations still give me negative KL (see the PPOConfig sketch after the list for where these knobs live):

  1. learning rate [1.41e-5, 1e-5, 1e-4, 1.41e-5 (with cosine scheduler)]
  2. init_kl_coef of 0.2 and 0.02
  3. clip range (cliprange) and value clipping (cliprange_value) of [default, 0.15]
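
For reference, this is roughly where those knobs sit in PPOConfig; the values shown are just one of the combinations from the list above, and the model name and batch sizes are placeholders:

from trl import PPOConfig

config = PPOConfig(
    model_name="path/to/sft-model",  # placeholder
    learning_rate=1.41e-5,           # also tried 1e-5, 1e-4, and a cosine schedule
    init_kl_coef=0.2,                # also tried 0.02
    cliprange=0.15,                  # PPO policy clipping (default 0.2)
    cliprange_value=0.15,            # value-function clipping (default 0.2)
    batch_size=8,                    # placeholder batch sizes
    mini_batch_size=1,
)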

A few things I have noticed:

  1. objective/logprob and objective/ref_logprob are not as widely distributed as in the stack-llama example.
  2. The mean reward is still going up, but the KL keeps going down.
  3. The generated responses are still coherent and make sense.

Any help is much appreciated. Here are my wandb logs for more info.

Ali1858 commented Nov 16, 2023

I noticed one more thing: the logits from the reward model are inconsistent, and I suspect this could be the reason for the negative KL. Please check this issue.
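
A rough sketch of the kind of check I mean: score the same text twice with the reward model in eval mode and compare the logits. The model path and 4-bit loading below are placeholders, not my exact code:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_path = "path/to/reward-model"  # placeholder
rm_tokenizer = AutoTokenizer.from_pretrained(rm_path)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    rm_path, load_in_4bit=True, device_map="auto"  # assuming the same QLoRA-style loading
)
reward_model.eval()  # dropout off, so repeated passes should give identical logits

inputs = rm_tokenizer("Question: ...\n\nAnswer: ...", return_tensors="pt").to(reward_model.device)

with torch.no_grad():
    first = reward_model(**inputs).logits
    second = reward_model(**inputs).logits

# If these differ, the reward signal itself is noisy, independent of the PPO loop.
print(first, second, torch.allclose(first, second))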

Ali1858 commented Dec 8, 2023

I solved the negative KL issue by fully fine-tuning the SFT model instead of LoRA fine-tuning it. I am unsure why this worked, nor do I have conclusive proof to back up the statement, but for now the KL is much more stable.
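
For concreteness, the two setups I am contrasting look roughly like this (a sketch with placeholder names; config, tokenizer, and dataset come from the rest of the script, and none of this is meant as proof of the cause):

from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead, PPOTrainer, create_reference_model

sft_path = "path/to/sft-model"  # placeholder

# (a) Fully fine-tuned SFT model: policy and frozen reference start from
#     identical weights, so objective/kl should start near zero.
policy = AutoModelForCausalLMWithValueHead.from_pretrained(sft_path)
ref_model = create_reference_model(policy)
ppo_trainer = PPOTrainer(config, policy, ref_model, tokenizer, dataset=dataset)

# (b) 4-bit QLoRA: only the adapters train; TRL gets reference log-probs by
#     disabling the adapters, so ref_model is left as None.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
policy_lora = AutoModelForCausalLMWithValueHead.from_pretrained(
    sft_path, peft_config=lora_config, load_in_4bit=True, device_map="auto"
)
ppo_trainer_lora = PPOTrainer(config, policy_lora, ref_model=None, tokenizer=tokenizer, dataset=dataset)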

lvwerra commented Dec 21, 2023

That is very odd, but closing for now since the issue seems resolved. @younesbelkada for visibility.

lindseyfeng commented Jan 22, 2024

I think I also encountered this problem while trying to run PPO on a LoRA-finetuned Llama-2-7b. By 80 or so steps, the model seems to have learned to output gibberish for the best rewards, even though the env/reward graph looked good and was increasing (?). I had set the KL penalty to "abs" because I thought it could potentially downplay the negative-KL issue. I don't have access to a fully fine-tuned Llama-2-7b and am wondering if you have any experience with this case.
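
For anyone comparing notes, the setting I am referring to is the kl_penalty option in PPOConfig; a minimal sketch with placeholder values:

from trl import PPOConfig

config = PPOConfig(
    model_name="llama-2-7b-lora-sft",  # placeholder
    kl_penalty="abs",                  # use |logp - ref_logp| instead of the signed estimate
    init_kl_coef=0.2,
)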
