
Negative KL even after using recommended generation kwargs #1005

Closed
Ali1858 opened this issue Nov 16, 2023 · 4 comments

Comments

Ali1858 commented Nov 16, 2023

Hi, I am fine-tuning a model with PPOTrainer on the open-assistant dataset using 4-bit QLoRA, with the stack-llama example as a reference. However, I am getting negative KL from the very first step, even though I am using the recommended generation kwargs:

generation_kwargs = {
    # "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,  # sample instead of greedy decoding
    "pad_token_id": tokenizer.pad_token_id,
    "eos_token_id": 100_000,  # out-of-vocabulary id, so generation is not cut off at EOS
    "max_new_tokens": max_new_tokens,
}
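
For context, this is roughly how those kwargs are consumed in the rollout loop (a minimal sketch in the style of the stack-llama example, not my exact script; ppo_trainer, tokenizer, reward_fn, and the batch fields are placeholders from the surrounding setup):

# Minimal rollout/step loop in the style of the stack-llama example;
# `ppo_trainer`, `tokenizer`, and `reward_fn` stand in for the real setup.
import torch

for batch in ppo_trainer.dataloader:
    query_tensors = batch["input_ids"]

    # Sample responses with the exact kwargs above, so the sampled tokens match
    # the distribution used later for the policy log-probs.
    response_tensors = ppo_trainer.generate(
        query_tensors, return_prompt=False, **generation_kwargs
    )
    batch["response"] = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)

    # Score the (query, response) pairs with the reward model (placeholder helper).
    rewards = [torch.tensor(r) for r in reward_fn(batch["query"], batch["response"])]

    # One PPO update; `stats` contains objective/kl among other metrics.
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)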

So far I have tried the following things; however, all of these configurations still give me negative KL (see the PPOConfig sketch after the list for where these knobs live):

  1. learning rate [1.41e-5, 1e-5, 1e-4, 1.41e-5 (with cosine scheduler)]
  2. init_kl_coef of 0.2 and 0.02
  3. clip range (cliprange) and value clipping (cliprange_value) of [default, 0.15]
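
For reference, this is roughly where those knobs sit in PPOConfig; the values shown are just one of the combinations from the list above, and the model name and batch sizes are placeholders:

from trl import PPOConfig

config = PPOConfig(
    model_name="path/to/sft-model",  # placeholder
    learning_rate=1.41e-5,           # also tried 1e-5, 1e-4, and a cosine schedule
    init_kl_coef=0.2,                # also tried 0.02
    cliprange=0.15,                  # PPO policy clipping (default 0.2)
    cliprange_value=0.15,            # value-function clipping (default 0.2)
    batch_size=8,                    # placeholder batch sizes
    mini_batch_size=1,
)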

A few things I have noticed:

  1. objective/logprob and objective/ref_logprob are not as widely distributed as in the stack-llama example.
  2. The mean reward is still going up, but the KL keeps going down.
  3. The generated responses are still coherent and make sense.

Any help is much appreciated. Here are my wandb logs for more info.

Ali1858 commented Nov 16, 2023

I noticed one more thing: the logits from the reward model are inconsistent, and I suspect this could be the reason for the negative KL. Please check this issue.
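
A rough sketch of the kind of check I mean: score the same text twice with the reward model in eval mode and compare the logits. The model path and 4-bit loading below are placeholders, not my exact code:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_path = "path/to/reward-model"  # placeholder
rm_tokenizer = AutoTokenizer.from_pretrained(rm_path)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    rm_path, load_in_4bit=True, device_map="auto"  # assuming the same QLoRA-style loading
)
reward_model.eval()  # dropout off, so repeated passes should give identical logits

inputs = rm_tokenizer("Question: ...\n\nAnswer: ...", return_tensors="pt").to(reward_model.device)

with torch.no_grad():
    first = reward_model(**inputs).logits
    second = reward_model(**inputs).logits

# If these differ, the reward signal itself is noisy, independent of the PPO loop.
print(first, second, torch.allclose(first, second))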

Ali1858 commented Dec 8, 2023

I solved the negative KL issue by fully fine-tuning the SFT model instead of LoRA fine-tuning it. I am unsure why this worked, nor do I have conclusive proof to back up the statement, but for now the KL is much more stable.
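
For concreteness, the two setups I am contrasting look roughly like this (a sketch with placeholder names; config, tokenizer, and dataset come from the rest of the script, and none of this is meant as proof of the cause):

from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead, PPOTrainer, create_reference_model

sft_path = "path/to/sft-model"  # placeholder

# (a) Fully fine-tuned SFT model: policy and frozen reference start from
#     identical weights, so objective/kl should start near zero.
policy = AutoModelForCausalLMWithValueHead.from_pretrained(sft_path)
ref_model = create_reference_model(policy)
ppo_trainer = PPOTrainer(config, policy, ref_model, tokenizer, dataset=dataset)

# (b) 4-bit QLoRA: only the adapters train; TRL gets reference log-probs by
#     disabling the adapters, so ref_model is left as None.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
policy_lora = AutoModelForCausalLMWithValueHead.from_pretrained(
    sft_path, peft_config=lora_config, load_in_4bit=True, device_map="auto"
)
ppo_trainer_lora = PPOTrainer(config, policy_lora, ref_model=None, tokenizer=tokenizer, dataset=dataset)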

lvwerra commented Dec 21, 2023

That is very odd, but closing for now since the issue seems resolved. @younesbelkada for visibility.

lindseyfeng commented Jan 22, 2024

I think I also encountered this problem while trying to run PPO on a LoRA-finetuned Llama-2-7b. By 80 or so steps, the model seems to have learned to output gibberish for the best rewards, even though the env/reward graph looked good and was increasing (?). I had set the KL penalty to "abs" because I thought it could potentially downplay the negative-KL issue. I don't have access to a fully fine-tuned Llama-2-7b and am wondering if you have any experience with this case.
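
For anyone comparing notes, the setting I am referring to is the kl_penalty option in PPOConfig; a minimal sketch with placeholder values:

from trl import PPOConfig

config = PPOConfig(
    model_name="llama-2-7b-lora-sft",  # placeholder
    kl_penalty="abs",                  # use |logp - ref_logp| instead of the signed estimate
    init_kl_coef=0.2,
)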
