Stuck in Ray Critic Model Initialization on second run #9
Comments
I found that adjusting the parameters works well on A100s. Setting some environment variables (CUDA-related ones, for example) and then initializing Ray before running the training script also helps. Training now runs successfully, so I will close this issue.
@JerryWu-code Hi Jerry, I ran into the same problem (the training script gets stuck). Could you say more about how you modified the parameters and variables? Many thanks!
@chenllliang Hey Dr. Chen, you can check my reproduced training report in this link. You could halve the batch size in all the parameters involved, then set the environment variables and restart Ray before running the training script:
export N_GPUS=2
export CUDA_VISIBLE_DEVICES=2,3
ray stop --force && ray start --head --include-dashboard=True
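The steps in this comment could be collected into one small pre-launch script. This is only a sketch under assumptions: the device-count check and the `visible_count` variable are additions for illustration and are not part of the repo, and the GPU indices are the ones from the comment above, so adjust them for your machine.

```shell
#!/bin/sh
# Sketch: pin the GPUs and restart Ray before launching the training script.
# Values are taken from the comment above; change them to match your setup.
export N_GPUS=2
export CUDA_VISIBLE_DEVICES=2,3

# Illustrative sanity check (not from the repo): the number of entries in
# CUDA_VISIBLE_DEVICES should match N_GPUS.
visible_count=$(printf '%s\n' "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .)
if [ "$visible_count" -ne "$N_GPUS" ]; then
  echo "warning: N_GPUS=$N_GPUS but $visible_count device(s) are visible" >&2
fi

# Restart Ray so no state from a previous run is left behind.
# Skipped with a message when the ray CLI is not on PATH.
if command -v ray >/dev/null 2>&1; then
  ray stop --force && ray start --head --include-dashboard=True
else
  echo "ray not found on PATH; skipping restart" >&2
fi
```

Running this with `.` (source) in the same shell that launches training keeps the exported variables visible to the training script.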
@chenllliang @JerryWu-code The stuck training script problem is really strange. I also hit it when I tried to run @JerryWu-code's fork on 4 x A100. The program gets stuck when I change
Thanks for your great contributions to this repo, and happy new year! When I try to reproduce the project, I encounter two problems:
Problem 1: Out of memory
As mentioned in issue #5, I hit the same problem when I try to run Qwen-2.5-3B-Base on two A100 GPUs: training runs for around 1 minute and then crashes with an out-of-memory error.
Problem 2: Stuck in Ray Critic Model Initialization on second run
To train on two A100 GPUs, I set all memory-related parameters to half of your defaults. On what was probably my second training run, the script got stuck at this step and could not proceed, unlike the first time, when training ran successfully. I run the following before every training run, and it still shows the error:
ray stop --force && ray start --head
Could you please give me some advice on this? Thanks~