[Chatllama]: MultiGPU support for training #254

Open
TejaGollapudi opened this issue Mar 11, 2023 · 8 comments

@TejaGollapudi

I'm trying to train the actor model (BLOOM 1.5B) on a multi-GPU setup (3 V100s). When I observe GPU usage, only GPU 0 is utilized, and I run out of memory if I increase the batch_size.

Could you add multi-GPU support using Hugging Face's Accelerate to facilitate training larger models with a larger batch size?

Thank you
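
For reference, a minimal sketch of what Accelerate-style data parallelism looks like in a generic Hugging Face training loop (hypothetical code, not the ChatLLaMA trainer; the checkpoint name, dataset, and hyperparameters are placeholders):

```python
# Hedged sketch of Hugging Face Accelerate data parallelism; not ChatLLaMA code.
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForCausalLM, AutoTokenizer

accelerator = Accelerator()  # picks up the config written by `accelerate config`

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-1b7")  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b7")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Tiny placeholder dataset: a few prompts trained as next-token prediction.
enc = tokenizer(["Hello world"] * 8, return_tensors="pt", padding=True)
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"]), batch_size=2)

# prepare() moves everything to each process's GPU, wraps the model for DDP,
# and shards the dataloader so every rank sees a different slice of each batch.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

model.train()
for input_ids, attention_mask in loader:
    out = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)
    accelerator.backward(out.loss)  # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Launched with `accelerate launch script.py`, each of the three V100s runs its own process and works on its own shard of every batch, instead of everything sitting on GPU 0.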

@diegofiori
Collaborator

Hi @TejaGollapudi, thank you very much for reaching out. We are currently working on supporting the Accelerate library. You can follow the updates directly in PR #233.

@leonselina

I added Accelerate to the code as in #233, but got this error:
Traceback (most recent call last):
  File "/nvmessd0/nebullvm/apps/accelerate/chatllama/artifacts/main.py", line 3, in <module>
    from chatllama.rlhf.actor import ActorTrainer
  File "/home/spzq/.local/lib/python3.10/site-packages/chatllama/rlhf/actor.py", line 17, in <module>
    from chatllama.rlhf.config import ConfigActor
  File "/home/spzq/.local/lib/python3.10/site-packages/chatllama/rlhf/config.py", line 71, in <module>
    class ConfigActor:
  File "/usr/lib/python3.10/dataclasses.py", line 1187, in dataclass
    return wrap(cls)
  File "/usr/lib/python3.10/dataclasses.py", line 1178, in wrap
    return _process_class(cls, init, repr, eq, order, unsafe_hash,
  File "/usr/lib/python3.10/dataclasses.py", line 1027, in _process_class
    _init_fn(all_init_fields,
  File "/usr/lib/python3.10/dataclasses.py", line 548, in _init_fn
    raise TypeError(f'non-default argument {f.name!r} '
TypeError: non-default argument 'device' follows default argument
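
That TypeError comes from Python's dataclass rules rather than from Accelerate itself: once a field has a default value, every field declared after it must have one too, so a newly added `device` field without a default has to come before the defaulted fields (or be given a default). A minimal repro and fix, with illustrative field names rather than the real ConfigActor definition:

```python
from dataclasses import dataclass

# Reproduces the error: a non-default field declared after a defaulted one.
# @dataclass
# class ConfigActor:
#     batch_size: int = 8
#     device: str   # TypeError: non-default argument 'device' follows default argument

# Fix: put non-default fields first (or give 'device' a default value).
@dataclass
class ConfigActor:
    device: str
    batch_size: int = 8
```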

@PierpaoloSorbellini
Collaborator

@leonselina We will be releasing support for Accelerate very soon! We are currently testing the code and will keep you updated when we merge the code!

@PierpaoloSorbellini changed the title from "[chatllama]: MultiGPU support for training" to "[Chatllama]: MultiGPU support for training" on Mar 14, 2023
@balcklive

When will this multi-GPU support be available? Really looking forward to it.

@bin123apple

Also looking forward to it!

@PierpaoloSorbellini
Collaborator

PierpaoloSorbellini commented Apr 3, 2023

Hi everyone @bin123apple @balcklive @TejaGollapudi,
You can try PR #306, where DeepSpeed and Accelerate should be working fine.
Keep in mind to launch the training with "deepspeed artifacts/main.py .." or "accelerate launch" instead of "python".
If you have any other problems on the matter, let me know!

@leonselina

Hi @PierpaoloSorbellini, I trained Llama 7B with DeepSpeed but got the error: "MP=1 but world size is 2".
How can I train Llama 7B with multiple GPUs? Because of VRAM limits, maybe I should use model parallelism instead of data parallelism for multi-GPU training.
Thanks :)
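
The "MP=1 but world size is 2" message usually means the raw LLaMA checkpoint was saved with model-parallel size 1 while the job was launched with 2 processes. One alternative to checkpoint-level model parallelism is DeepSpeed ZeRO stage 3, which shards a single copy of the parameters across the GPUs. A hedged sketch of that configuration in a generic script (config values and the checkpoint path are placeholders, not ChatLLaMA's actual settings):

```python
# Hypothetical sketch: shard a 7B model across GPUs with DeepSpeed ZeRO stage 3,
# instead of relying on the checkpoint's model-parallel layout. Values are illustrative.
import deepspeed
from transformers import AutoModelForCausalLM

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                           # partition params, grads and optimizer states
        "offload_param": {"device": "cpu"},   # optional: spill parameters to CPU RAM
    },
}

model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b-hf")  # placeholder path
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
# Run with the launcher, e.g.: deepspeed --num_gpus=2 train_sketch.py
```

Whether this fits the ChatLLaMA trainer depends on how PR #306 wires things up, but it avoids requiring the checkpoint's MP size to match the GPU count.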

@laozhanghahaha

@PierpaoloSorbellini Hey, I tried Llama in HF format and ran deepspeed with --num_gpus=2. The model was loaded twice, and both copies ended up on the rank 0 GPU, which caused a CUDA OOM.

[screenshot omitted]

do you have ideas to fix this problem?
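
One common cause of that pattern (every rank loading onto cuda:0) is that each launcher-spawned process never pins itself to its own device, so everything lands on the default GPU 0. A hedged sketch of the usual fix, using the LOCAL_RANK variable set by the deepspeed launcher (the path and names are placeholders, not ChatLLaMA internals):

```python
# Hypothetical sketch: make each process spawned by `deepspeed --num_gpus=2 ...`
# use its own GPU instead of defaulting to cuda:0.
import os
import torch
from transformers import AutoModelForCausalLM

local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # set by the deepspeed launcher
torch.cuda.set_device(local_rank)                    # pin this process to its own device

model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b-hf")  # placeholder path
model = model.to(torch.device("cuda", local_rank))   # explicit device for this rank's copy
```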
