
Stuck in Ray Critic Model Initialization on second run #9

Closed
JerryWu-code opened this issue Jan 25, 2025 · 4 comments

Comments

@JerryWu-code

Thanks for your great contribution to this repo, and happy new year! When I tried to reproduce the project, I ran into two problems:

Problem 1: Out-of-memory error

As mentioned in issue #5, I hit the same problem when I try to run Qwen-2.5-3B-Base on two A100 GPUs: training runs for about 1 minute and then crashes with an out-of-memory error.

Problem 2: Stuck in Ray critic model initialization on the second run

To fit training on two A100 GPUs, I halved all of the memory-related parameters from your defaults. On this second run, the training script gets stuck at this step and cannot proceed, unlike the first run, which I could train successfully.

After 5~10 minutes, it shows errors like:

I've already tried
ray stop --force && ray start --head
before every run of the training script, and it still shows the error. Would you please give me some advice on this? Thanks~
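
For reference, before each retry I also check whether anything from the previous run is still holding the GPUs. This is just a sketch with standard Ray and NVIDIA tooling:

ray status      # is a stale Ray cluster still alive?
nvidia-smi      # are leftover worker processes still holding GPU memory?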

@JerryWu-code
Author

I've found that modifying the parameters works well on A100s, and that setting some environment variables (CUDA-related ones, etc.) and then initializing Ray before running the training script also helps. Training now runs successfully, so I'll close this issue.

@chenllliang

@JerryWu-code hi Jerry, I met the same problem (training script stuck). Can you tell me more about how you modified the parameters and variables? Many thanks!

@JerryWu-code
Author


@chenllliang Hey Dr. Chen, you can check my reproduced training report at this link. Halve the batch size for all batch-related parameters in the training script, and remember to export the CUDA devices before you run ray stop --force && ray start --head, like this:

export N_GPUS=2
export CUDA_VISIBLE_DEVICES=2,3
ray stop --force && ray start --head --include-dashboard=True
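
Then, in the training script itself, halve the batch-related overrides. The Hydra keys and numbers below are only illustrative (they depend on which verl/TinyZero version you are on), so treat this as a sketch rather than a drop-in config:

# Sketch only: halve every batch-related override; e.g. if the defaults were
# train_batch_size=256, ppo_mini_batch_size=128, and micro batch sizes of 8:
python3 -m verl.trainer.main_ppo \
    data.train_batch_size=128 \
    actor_rollout_ref.model.path=$BASE_MODEL \
    actor_rollout_ref.actor.ppo_mini_batch_size=64 \
    actor_rollout_ref.actor.ppo_micro_batch_size=4 \
    critic.ppo_micro_batch_size=4 \
    trainer.n_gpus_per_node=$N_GPUS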

@HaoshengZou

@chenllliang @JerryWu-code The training-script-stuck problem is really strange. I also hit it when I tried to run @JerryWu-code's fork on 4 x A100.

The program gets stuck when I change N_GPUS=4 and ROLLOUT_TP_SIZE=4 and keep all other hparams the same as @JerryWu-code's.
However, the program runs smoothly when I further change all batch hparams back to @Jiayi-Pan's original config.
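
For reference, my 4-GPU setup just mirrored the 2-GPU commands above (the device IDs are only an example):

export N_GPUS=4
export ROLLOUT_TP_SIZE=4
export CUDA_VISIBLE_DEVICES=0,1,2,3
ray stop --force && ray start --head --include-dashboard=True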
