
Stuck in Ray Critic Model Initialization on second run #9

Closed
JerryWu-code opened this issue Jan 25, 2025 · 4 comments

Comments

@JerryWu-code

Thanks for your great contribution to this repo, and happy new year! When I tried to reproduce the project, I ran into two problems:

Problem 1: Out-of-memory error

As mentioned in issue #5, I hit the same problem when I try to run Qwen-2.5-3B-Base on two A100 GPUs: training runs for about 1 minute and then crashes with an out-of-memory error.

Problem 2: Stuck in Ray critic model initialization on the second run

To fit training on two A100 GPUs, I halved all of the memory-related parameters from your defaults. On this second run, the training script gets stuck at this step and cannot proceed, unlike the first run, which I could train successfully.

After 5~10 minutes, it shows errors like:

I've already tried
ray stop --force && ray start --head
before every run of the training script, and it still shows the error. Would you please give me some advice on this? Thanks~
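
For reference, before each retry I also check whether anything from the previous run is still holding the GPUs. This is just a sketch with standard Ray and NVIDIA tooling:

ray status      # is a stale Ray cluster still alive?
nvidia-smi      # are leftover worker processes still holding GPU memory?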

@JerryWu-code
Author

I've found that modifying the parameters works well on A100s, and that setting some environment variables (CUDA-related ones, etc.) and then initializing Ray before running the training script also helps. Training now runs successfully, so I'll close this issue.

@chenllliang

@JerryWu-code hi Jerry, I met the same problem (training script stuck). Can you tell me more about how you modified the parameters and variables? Many thanks!

@JerryWu-code
Author


@chenllliang Hey Dr. Chen, you can check my reproduced training report at this link. Halve the batch size for all batch-related parameters in the training script, and remember to export the CUDA devices before you run ray stop --force && ray start --head, like this:

export N_GPUS=2
export CUDA_VISIBLE_DEVICES=2,3
ray stop --force && ray start --head --include-dashboard=True
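
Then, in the training script itself, halve the batch-related overrides. The Hydra keys and numbers below are only illustrative (they depend on which verl/TinyZero version you are on), so treat this as a sketch rather than a drop-in config:

# Sketch only: halve every batch-related override; e.g. if the defaults were
# train_batch_size=256, ppo_mini_batch_size=128, and micro batch sizes of 8:
python3 -m verl.trainer.main_ppo \
    data.train_batch_size=128 \
    actor_rollout_ref.model.path=$BASE_MODEL \
    actor_rollout_ref.actor.ppo_mini_batch_size=64 \
    actor_rollout_ref.actor.ppo_micro_batch_size=4 \
    critic.ppo_micro_batch_size=4 \
    trainer.n_gpus_per_node=$N_GPUS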

@HaoshengZou

@chenllliang @JerryWu-code The training-script-stuck problem is really strange. I also hit it when I tried to run @JerryWu-code's fork on 4 x A100.

The program gets stuck when I change N_GPUS=4 and ROLLOUT_TP_SIZE=4 and keep all other hparams the same as @JerryWu-code's.
However, the program runs smoothly when I further change all batch hparams back to @Jiayi-Pan's original config.
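
For reference, my 4-GPU setup just mirrored the 2-GPU commands above (the device IDs are only an example):

export N_GPUS=4
export ROLLOUT_TP_SIZE=4
export CUDA_VISIBLE_DEVICES=0,1,2,3
ray stop --force && ray start --head --include-dashboard=True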
