
[Chatllama] RLHF Training - RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method #262

Closed
shrinath-suresh opened this issue Mar 14, 2023 · 7 comments

Comments

@shrinath-suresh
Contributor

shrinath-suresh commented Mar 14, 2023

While training the RL model with the following command

python artifacts/main.py artifacts/config/config.yaml --type RL

we hit the torch dataset error. To fix it, we changed the following lines at trainer.py:427:

        # from
        # dataloader = DataLoader(
        #     ExperienceDataset(memories, device), batch_size=batch_size
        # )
        # to
        dataset = ExperienceDataset(memories, device)

and in trainer.py:455

# from
# training_data=dataloader
# to
training_data=dataset
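
If it helps anyone else reading along, the change just hands the Dataset to deepspeed.initialize's training_data argument instead of a pre-built DataLoader; DeepSpeed then builds its own loader internally. A rough, self-contained sketch of that pattern (the model and dataset below are placeholders, not the actual chatllama objects):

    import deepspeed
    import torch
    from torch.utils.data import TensorDataset

    # Placeholders standing in for the actor model and ExperienceDataset(memories, device)
    model = torch.nn.Linear(8, 1)
    dataset = TensorDataset(torch.zeros(4, 8), torch.zeros(4, 1))

    # deepspeed.initialize accepts a plain Dataset via training_data and wraps it
    # in its own DeepSpeedDataLoader, so the manually built DataLoader can be dropped.
    model_engine, optimizer, train_loader, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config="./artifacts/config/ds_config.json",  # path used later in this thread
        training_data=dataset,
    )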

Issue Reference - #229

After applying this fix, we are facing the following error:

Using /home/ubuntu/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00028014183044433594 seconds
Traceback (most recent call last):
  File "/home/ubuntu/artifacts/main.py", line 48, in <module>
    rlhf_trainer.train()
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/chatllama/rlhf/trainer.py", line 788, in train
    self.learn(memories)
  File "<@beartype(chatllama.rlhf.trainer.RLTrainer.learn) at 0x7f279c996820>", line 33, in learn
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/chatllama/rlhf/trainer.py", line 492, in learn
    for i, (
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/deepspeed/runtime/dataloader.py", line 125, in __next__
    return next(self.data)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/deepspeed/runtime/dataloader.py", line 158, in <genexpr>
    self.data = (x for x in self.dataloader)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/chatllama/rlhf/trainer.py", line 207, in __getitem__
    self.data[idx].states.to(self.device),
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/cuda/__init__.py", line 217, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
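
From the traceback, the error seems to come from ExperienceDataset.__getitem__ moving tensors to the GPU while the DataLoader runs forked worker processes. For reference, a rough sketch of the usual workarounds (illustrative placeholder code, not the actual chatllama implementation):

    import torch
    from torch.utils.data import Dataset, DataLoader

    # Illustrative stand-in for ExperienceDataset; not the chatllama class.
    class CpuExperienceDataset(Dataset):
        def __init__(self, memories):
            self.data = list(memories)

        def __len__(self):
            return len(self.data)

        def __getitem__(self, idx):
            # Return CPU tensors here; calling .to("cuda") inside __getitem__ is
            # what re-initializes CUDA in the forked worker and raises the error.
            return self.data[idx]

    # Option 1: avoid worker processes entirely.
    loader = DataLoader(CpuExperienceDataset([torch.zeros(3)]), batch_size=1, num_workers=0)

    # Option 2: if workers are needed, use the 'spawn' start method instead of 'fork'.
    # torch.multiprocessing.set_start_method("spawn", force=True)

    for batch in loader:
        # Move the batch to the device inside the training loop instead.
        batch = batch.to("cuda" if torch.cuda.is_available() else "cpu")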

Attaching the log, dataset, and config for reference:

Training log -
rl_train.log
Configurations -
test.zip
Dataset -
rlhf_training_data.zip

Training environment:
NVIDIA A10 (24 GB), AWS g5.4xlarge instance
Packages installed per the README instructions

@PierpaoloSorbellini
Collaborator

Hi @shrinath-suresh, this issue should be resolved in PR #233. The batch size can be changed in config.yaml if it doesn't fit your memory requirements.
Let me know if everything is working.
Thanks for your feedback!

PierpaoloSorbellini changed the title to "[Chatllama] RLHF Training - RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method" on Mar 17, 2023
@EthanChen1234

@PierpaoloSorbellini I had the same error, hope you solve it soon.
"RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method"

@PierpaoloSorbellini
Collaborator

Hi @EthanChen1234 @shrinath-suresh
The problem should have been resolved in #306; let me know if the issue persists!

@shrinath-suresh
Contributor Author

@PierpaoloSorbellini Thanks for the fix. The issue got resolved. However, in the next steps, I observed the following behaviour.

Even when the multi-GPU command is issued, RL training runs on only a single GPU. I found this open issue for the same problem - #288

Currently getting OOM even with the opt-125m.pt model. Attaching the log for reference:

OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 22.04 GiB total capacity; 2.39 GiB already allocated; 3.12 MiB free; 2.50 GiB reserved in total by PyTorch) If reserved memory is 
>> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Full log here -
chatllama-multi-gpu.log

Do you think it's because of a hardware restriction? We are training on a g5.12xlarge AWS instance (4 A10 GPUs with 24 GB each).
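
While debugging, I can also experiment with the allocator hint from the error message, e.g. (the 128 MiB value below is just an arbitrary first guess, not a recommended setting):

# hypothetical experiment following the hint in the OOM message
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
accelerate launch --multi_gpu artifacts/main.py artifacts/config/config.yaml --type RL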

@PierpaoloSorbellini
Collaborator

PierpaoloSorbellini commented Apr 3, 2023

Hi @shrinath-suresh, thanks for the quick response.
Did you enable DeepSpeed or accelerate in the config.yaml and use the proper command to launch the training (i.e. deepspeed or accelerate launch instead of python)?
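
For reference, the launch usually looks something like the following (exact flags may differ depending on your setup; the paths are the ones used in this thread):

# with DeepSpeed enabled in config.yaml
deepspeed artifacts/main.py artifacts/config/config.yaml --type RL
# with accelerate enabled in config.yaml
accelerate launch --multi_gpu artifacts/main.py artifacts/config/config.yaml --type RL
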
More info should already be available at the bottom of the readme.md in the PR.
We actually observed that all the GPUs were used during testing.
For the out-of-memory error, try playing with the batch size after checking whether the GPUs are actually being used.
We are introducing some optimization techniques for reducing memory requirements, like LoRA via the peft library. We still have to fine-tune the hyperparameters of these techniques. You can enable peft in the same config.yaml file, and in the same folder you can find the related configuration options, but you still have to play a bit with them once enabled.
More info is going to be added as the documentation develops.

Let me know if the problem persists, since we weren't able to replicate the out-of-memory error for smaller models like 125m, and we are interested in looking into it.

@shrinath-suresh
Contributor Author

shrinath-suresh commented Apr 4, 2023

Sure @PierpaoloSorbellini. I have already set the batch size to 1 and enabled fp16 in the DeepSpeed config.

config.yaml

---
trainer_config:
  # learning rates
  actor_lr: 0.000005
  critic_lr: 0.000009
  # PPO Hyperparameters
  actor_eps_clip: 0.2
  critic_eps_clip: 0.2
  beta_s: 0.02
  # coefficient for the discounted rewards
  gamma_discounted: 1 
  # path to examples to be sampled (training dataset) see rlhf_dataset.json
  examples_path: "./rlhf_training_data.json"
  # number of episodes and generation performed for each episode
  # in the train() method
  num_episodes: 1
  max_timesteps: 4
  # number of timesteps after which the learn() method is called 
  # (to update the weights)
  update_timesteps: 4
  # number of example sampled at each timestep
  num_examples: 1
  # batch and epochs for the training
  batch_size: 1
  epochs: 1
  # number of episodes after which update the checkpoints in RL training
  checkpoint_steps: 1000
  # here specify the name of the actor_rl checkpoint from which resume 
  # during actor RL training. If null load the last one.
  checkpoint_name: null

actor_config:
  model: "facebook/opt-125m" 
  model_folder: "./models"
  tokenizer_path: "path-to-tokenizer"
  train_dataset_path: "./actor_training_data.json"
  validation_dataset_path: null
  # froze model embedding during training
  froze_embeddings: True
  # use fairscale layers to build the model instead of vanilla pytorch
  # only for llama
  use_fairscale: False
  # max sequence length for the actor (i.e. prompt + completion) it depends on
  # the model used.
  max_sequence_length: 2048
  # max tokens generated by the actor (completion only)
  max_tokens: 2048
  # minimum number of tokens generated by the actor
  min_tokens: 100
  # additional prompt tokens to be used for template or as safety
  additonal_prompt_tokens: 20
  # temperature for the actor
  temperature: 0.1
  batch_size: 1
  # number iteration after print
  iteration_per_print: 1000
  lr: 0.000009
  epochs: 1
  # number of backpropagation after saving the checkpoints
  checkpoint_steps: 1000
  # number of checkpoints to keep while removing the older 
  # (keep memory consumption of checkpoints reasonable)
  n_checkpoints_to_keep: 1
  # here specify the name of the actor checkpoint from which resume 
  # during actor training. If null load the last one.
  checkpoint_name: null
  # deepspeed settings
  deepspeed_enable: True
  deepspeed_config_path: "./artifacts/config/ds_config.json"
  # accelerate settings
  accelerate_enable: False
  # use_peft - the parameters of PEFT can be modified in the peft_config.yaml
  peft_enable: True
  peft_config_path: "./artifacts/config/peft_config.yaml"

reward_config:
  # models to choose from are gpt2-large, bart-base, longformer-base-4096
  # more can be simply added in the reward.py __init__()
  model: "facebook/opt-125m"
  model_folder: "./models"
  # hidden size of the additional ffw head to produce the scores
  model_head_hidden_size: 2048
  max_sequence_length: 2048
  train_dataset_path: "./reward_training_data.json"
  validation_dataset_path: null
  batch_size: 1
  epochs: 1
  iteration_per_print: 1000
  # steps after which the checkpoint are saved
  checkpoint_steps: 10000
  # here specify the name of the reward checkpoint from which resume 
  # during reward training. If null load the last one.
  checkpoint_name: null
  lr: 0.000009
  # deepspeed settings
  deepspeed_enable: True
  deepspeed_config_path: "./artifacts/config/ds_config.json"
  # accelerate settings
  accelerate_enable: False

critic_config:
  # models to choose from are gpt2-large, bart-base, longformer-base-4096
  # more can be simply added in the reward.py __init__()
  model: "facebook/opt-125m"
  # hidden size of the additional ffw head to produce the scores
  model_head_hidden_size: 2048
  max_sequence_length: 2048
  model_folder: "./models"
  # here specify the name of the critic checkpoint from which resume 
  # during critic training. If null load the last one.
  checkpoint_name: null

ds_config.json

{
    "train_batch_size": 1,
    "gradient_accumulation_steps": 1,
    "optimizer": {
      "type": "Adam",
      "params": {
        "lr": 0.00015
      }
    },
    "fp16": {
      "enabled": true,
      "auto_cast": true,
      "loss_scale": 0,
      "initial_scale_power": 16,
      "loss_scale_window": 1000,
      "hysteresis": 2,
      "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients" : true,
    "offload_param": {
      "device": "cpu",
      "nvme_path": "/local_nvme",
      "pin_memory": true,
      "buffer_count": 5,
      "buffer_size": 1e8,
      "max_in_cpu": 1e9
    },
    "offload_optimizer": {
      "device": "cpu",
      "nvme_path": "/local_nvme",
      "pin_memory": true,
      "buffer_count": 4,
      "fast_init": false
    },
    "stage3_max_live_parameters" : 1e9,
    "stage3_max_reuse_distance" : 1e9,
    "stage3_prefetch_bucket_size" : 5e8,
    "stage3_param_persistence_threshold" : 1e6,
    "sub_group_size" : 1e12,
    "elastic_checkpoint" : true,
    "stage3_gather_16bit_weights_on_model_save": true,
    "ignore_unused_parameters": true,
    "round_robin_gradients": true
    }
  }

Command used to launch -

accelerate launch --multi_gpu artifacts/main.py artifacts/config/config.yaml --type RL

Haven't changed anything in peft.yaml yet; will try it out.

@shrinath-suresh
Contributor Author

@PierpaoloSorbellini Thank you very much. I created a fresh setup and was able to train with the dataset given in the instructions and my dataset as well.

There are a few small fixes needed. I created separate PRs for them:

  1. Readme fix - [ChatLLaMA] DocFix - RL training command in README.md missing filename #327
  2. RL training - [ChatLLaMA] RL Trainer - is_deepspeed_init variable initialization #329
  3. Reward training - [ChatLLaMA] Reward Dataset Score Calculation #330

Please review and let us know your comments
