Refactor of models and trainers with base class for common methods #306
base: main
Conversation
PierpaoloSorbellini commented Mar 27, 2023 (edited)
- Refactor models and trainers to avoid code duplication.
- Added logging with the loguru package.
- Fixed logging with MultiGPU trainers.
- Added support for LoRA via the PEFT library (see the sketch after this list).
- Added support for the load_8bit option with HF models.
- Added the self-instruct dataset from HF.
- Added CerebrasGPT and Decapoda LLaMA models from HF.
- Added mixed-precision training to reduce GPU memory requirements.
- Fixed the RLHF KL divergence equation.
- Added support to keep only the last n checkpoints for all trainings.
- Added generation of negative examples when creating the reward dataset to improve the quality of the reward model.
- Improved stability of MultiGPU training with both Accelerate from HF and DeepSpeed.
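For context, a minimal sketch of what LoRA fine-tuning through PEFT on an 8-bit HF model typically looks like; the model name and LoRA hyperparameters below are illustrative assumptions, not the exact values used in this PR.

```python
# Illustrative sketch only: model name and LoRA hyperparameters are assumptions,
# not necessarily the ones used in this PR.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

model_name = "cerebras/Cerebras-GPT-1.3B"  # hypothetical choice of HF model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # the load_8bit option mentioned above
    device_map="auto",
)

# Wrap the base model with LoRA adapters so only a small set of
# low-rank matrices is trained.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # shows the reduced trainable parameter count
```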
This reverts commit 156fa19.
LGTM!
# pytorch mixed precision
with torch.autocast(
    device_type=self.config.device_type,
    dtype=torch.float16,
Do we need to auto-cast all the tensors to fp16? Shouldn't this be a config param?
Just following the documentation:
https://pytorch.org/docs/stable/notes/amp_examples.html
Compared to casting the tensors manually, this approach causes fewer problems with dtypes in the embeddings.
It is not a config param because if you don't use fp16 you would fall back to fp32, which is probably worse; I don't see the point of adding an option for fp32.
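For reference, the pattern from the linked AMP examples looks roughly like this (a sketch only; `model`, `loss_fn`, `optimizer`, and `dataloader` are placeholders, not the trainer's actual attributes):

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid fp16 gradient underflow

for batch, targets in dataloader:
    optimizer.zero_grad()

    # Forward pass in mixed precision: ops that are safe in fp16 are autocast,
    # numerically sensitive ops stay in fp32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        output = model(batch)
        loss = loss_fn(output, targets)

    # Backward pass on the scaled loss, then unscale inside the optimizer step.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```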
But what if I want to train the model in fp32 precision? (DeepSpeed for instance allows the user to select the precision)
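If configurability is wanted, one possible sketch (assuming a hypothetical `precision` field in the trainer config; this is not something the PR currently adds):

```python
import contextlib
import torch

def autocast_context(config):
    """Return an autocast context for fp16/bf16, or a no-op context for fp32.

    Assumes a hypothetical config.precision field ("fp16", "bf16", or "fp32").
    """
    if config.precision == "fp32":
        return contextlib.nullcontext()  # run everything in full precision
    dtype = torch.float16 if config.precision == "fp16" else torch.bfloat16
    return torch.autocast(device_type=config.device_type, dtype=dtype)
```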
@PierpaoloSorbellini please add a description of what this PR is adding in terms of features and which bugs it is fixing.