[Chatllama] RLHF Training - RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method #262
Comments
Hi @shrinath-suresh, this issue should be resolved in PR #233. The batch size can be changed in config.yaml if it doesn't fit your memory requirements.
@PierpaoloSorbellini I had the same error; hope you solve it soon.
Hi @EthanChen1234 @shrinath-suresh
@PierpaoloSorbellini Thanks for the fix. The issue got resolved. However, in the next steps I observed the following behaviour: even if the multi-GPU command is issued, the RL training happens only on a single GPU. I found this open issue for the same - #288. Currently getting OOM even with
Full log here - Do you think it's because of a hardware restriction? We are training on a g5.12xlarge AWS instance (4 A10 GPUs with 24 GB each).
Hi @shrinath-suresh, thanks for the quick response. Let me know if the problem persists, since we weren't able to replicate the out-of-memory error for smaller models like 125M, and we are interested in looking into it.
Sure @PierpaoloSorbellini. I have already set the batch size to 1 and enabled fp16 in the DeepSpeed config.
config.yaml
ds_config.json
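(The attached ds_config.json is not reproduced above. As a rough sketch only, not the exact attachment, a DeepSpeed config matching what is described here — micro-batch size 1 with fp16 enabled — could look like the following; the ZeRO stage-2 section is an extra assumption sometimes used to ease memory pressure, not something stated in this thread.)

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2
  }
}
```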
Command used to launch -
Haven't changed anything in
@PierpaoloSorbellini Thank you very much. I created a fresh setup and was able to train with the dataset given in the instructions, as well as with my own dataset. There are a few small fixes needed; I created separate PRs for them.
Please review and let us know your comments.
While training the RL model with the following command
we get the torch dataset error. To fix the error, we changed the following lines in trainer.py:427
and in trainer.py:455
Issue Reference - #229
After this fix, we are facing the following error.
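(The traceback itself was attached rather than pasted; it is presumably the error quoted in the issue title, "RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method". As a generic illustration only — not the ChatLLaMA code and not necessarily the fix the maintainers adopted — the usual workarounds are to keep the DataLoader single-process or to switch the multiprocessing start method to 'spawn' before any CUDA work. The names MemoryDataset and build_dataloader below are hypothetical.)

```python
import torch
from torch.utils.data import DataLoader, Dataset


class MemoryDataset(Dataset):
    """Hypothetical stand-in for the RL experience memories."""

    def __init__(self, examples):
        self.examples = examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]


def build_dataloader(examples, batch_size=1):
    # Workaround 1: num_workers=0 keeps loading in the main process,
    # so no CUDA context has to be re-created in a forked worker.
    return DataLoader(MemoryDataset(examples), batch_size=batch_size, num_workers=0)


if __name__ == "__main__":
    # Workaround 2: if worker processes are needed, select the 'spawn'
    # start method before any CUDA call, as the error message suggests.
    torch.multiprocessing.set_start_method("spawn", force=True)
    for batch in build_dataloader(list(range(8))):
        print(batch)
```

DataLoader also accepts a multiprocessing_context="spawn" argument if workers are required without changing the global start method.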
Attaching the log, dataset, config for reference
Training log - rl_train.log
Configurations - test.zip
Dataset - rlhf_training_data.zip
Training Environment:
Nvidia A10 (24 GB) - AWS g5.4xlarge instance
Packages were installed following the README instructions.