[Chatllama]: MultiGPU support for training #254

Open
TejaGollapudi opened this issue Mar 11, 2023 · 8 comments

@TejaGollapudi

I'm trying to train the actor model (BLOOM 1.5B) on a multi-GPU setup (3 V100s). When I observe GPU usage, only GPU 0 is utilized, and I run out of memory if I increase the batch_size.

Could you add multi-GPU support using Hugging Face's Accelerate to facilitate training larger models with a larger batch size?

Thank you
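
For reference, a minimal sketch of what Accelerate-style data parallelism looks like in a generic Hugging Face training loop (hypothetical code, not the ChatLLaMA trainer; the checkpoint name, dataset, and hyperparameters are placeholders):

```python
# Hedged sketch of Hugging Face Accelerate data parallelism; not ChatLLaMA code.
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForCausalLM, AutoTokenizer

accelerator = Accelerator()  # picks up the config written by `accelerate config`

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-1b7")  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b7")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Tiny placeholder dataset: a few prompts trained as next-token prediction.
enc = tokenizer(["Hello world"] * 8, return_tensors="pt", padding=True)
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"]), batch_size=2)

# prepare() moves everything to each process's GPU, wraps the model for DDP,
# and shards the dataloader so every rank sees a different slice of each batch.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

model.train()
for input_ids, attention_mask in loader:
    out = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)
    accelerator.backward(out.loss)  # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Launched with `accelerate launch script.py`, each of the three V100s runs its own process and works on its own shard of every batch, instead of everything sitting on GPU 0.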

@diegofiori
Collaborator

Hi @TejaGollapudi, thank you very much for reaching out. We are currently working on supporting the Accelerate library. You can follow the updates directly in PR #233.

@leonselina

I added Accelerate to the code as in #233, but got this error:
Traceback (most recent call last):
  File "/nvmessd0/nebullvm/apps/accelerate/chatllama/artifacts/main.py", line 3, in <module>
    from chatllama.rlhf.actor import ActorTrainer
  File "/home/spzq/.local/lib/python3.10/site-packages/chatllama/rlhf/actor.py", line 17, in <module>
    from chatllama.rlhf.config import ConfigActor
  File "/home/spzq/.local/lib/python3.10/site-packages/chatllama/rlhf/config.py", line 71, in <module>
    class ConfigActor:
  File "/usr/lib/python3.10/dataclasses.py", line 1187, in dataclass
    return wrap(cls)
  File "/usr/lib/python3.10/dataclasses.py", line 1178, in wrap
    return _process_class(cls, init, repr, eq, order, unsafe_hash,
  File "/usr/lib/python3.10/dataclasses.py", line 1027, in _process_class
    _init_fn(all_init_fields,
  File "/usr/lib/python3.10/dataclasses.py", line 548, in _init_fn
    raise TypeError(f'non-default argument {f.name!r} '
TypeError: non-default argument 'device' follows default argument
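
That TypeError comes from Python's dataclass rules rather than from Accelerate itself: once a field has a default value, every field declared after it must have one too, so a newly added `device` field without a default has to come before the defaulted fields (or be given a default). A minimal repro and fix, with illustrative field names rather than the real ConfigActor definition:

```python
from dataclasses import dataclass

# Reproduces the error: a non-default field declared after a defaulted one.
# @dataclass
# class ConfigActor:
#     batch_size: int = 8
#     device: str   # TypeError: non-default argument 'device' follows default argument

# Fix: put non-default fields first (or give 'device' a default value).
@dataclass
class ConfigActor:
    device: str
    batch_size: int = 8
```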

@PierpaoloSorbellini
Collaborator

@leonselina We will be releasing support for Accelerate very soon! We are currently testing the code and will keep you updated when we merge the code!

@PierpaoloSorbellini changed the title from "[chatllama]: MultiGPU support for training" to "[Chatllama]: MultiGPU support for training" on Mar 14, 2023
@balcklive

When will this multi-GPU support be available? Really looking forward to it.

@bin123apple

Also looking forward to it!

@PierpaoloSorbellini
Collaborator

PierpaoloSorbellini commented Apr 3, 2023

Hi everyone @bin123apple @balcklive @TejaGollapudi,
You can try PR #306, where DeepSpeed and Accelerate should be working fine.
Keep in mind to launch the training with "deepspeed artifacts/main.py .." or "accelerate launch" instead of "python".
If you have any other problems on the matter, let me know!

@leonselina

Hi @PierpaoloSorbellini, I trained Llama 7B with DeepSpeed but got the error: "MP=1 but world size is 2".
How can I train Llama 7B with multiple GPUs? Because of VRAM limits, maybe I should use model parallelism instead of data parallelism for multi-GPU training.
Thanks :)
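
The "MP=1 but world size is 2" message usually means the raw LLaMA checkpoint was saved with model-parallel size 1 while the job was launched with 2 processes. One alternative to checkpoint-level model parallelism is DeepSpeed ZeRO stage 3, which shards a single copy of the parameters across the GPUs. A hedged sketch of that configuration in a generic script (config values and the checkpoint path are placeholders, not ChatLLaMA's actual settings):

```python
# Hypothetical sketch: shard a 7B model across GPUs with DeepSpeed ZeRO stage 3,
# instead of relying on the checkpoint's model-parallel layout. Values are illustrative.
import deepspeed
from transformers import AutoModelForCausalLM

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                           # partition params, grads and optimizer states
        "offload_param": {"device": "cpu"},   # optional: spill parameters to CPU RAM
    },
}

model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b-hf")  # placeholder path
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
# Run with the launcher, e.g.: deepspeed --num_gpus=2 train_sketch.py
```

Whether this fits the ChatLLaMA trainer depends on how PR #306 wires things up, but it avoids requiring the checkpoint's MP size to match the GPU count.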

@laozhanghahaha

@PierpaoloSorbellini Hey, I tried Llama in HF format and ran deepspeed with --num_gpus=2. The model was loaded twice, and both copies ended up on the rank 0 GPU, which caused a CUDA OOM.

[screenshot omitted]

do you have ideas to fix this problem?
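
One common cause of that pattern (every rank loading onto cuda:0) is that each launcher-spawned process never pins itself to its own device, so everything lands on the default GPU 0. A hedged sketch of the usual fix, using the LOCAL_RANK variable set by the deepspeed launcher (the path and names are placeholders, not ChatLLaMA internals):

```python
# Hypothetical sketch: make each process spawned by `deepspeed --num_gpus=2 ...`
# use its own GPU instead of defaulting to cuda:0.
import os
import torch
from transformers import AutoModelForCausalLM

local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # set by the deepspeed launcher
torch.cuda.set_device(local_rank)                    # pin this process to its own device

model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b-hf")  # placeholder path
model = model.to(torch.device("cuda", local_rank))   # explicit device for this rank's copy
```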
