
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task. #58

Closed
VincentXWD opened this issue Feb 10, 2025 · 4 comments

Comments

@VincentXWD

Hello developers,
Thanks for your great and impressive work! I'm trying to reproduce TinyZero by tuning Qwen2.5-3B on 4 NVIDIA A100 GPUs (80 GB of memory each) via Slurm. Here is my run script:

#!/bin/bash
# alias python='/home/weiji/anaconda3/envs/zero/bin/python'
# alias python3='/home/weiji/anaconda3/envs/zero/bin/python3'
# alias pip='/home/weiji/anaconda3/envs/zero/bin/pip'

export N_GPUS=4
export CUDA_VISIBLE_DEVICES=0,1,2,3
ray stop --force && ray start --head --include-dashboard=True  --object-store-memory=53687091200
export BASE_MODEL="model/Qwen2.5-3B"
export DATA_DIR="/mnt/Data/wdxu/github/TinyZero_A100/dataset"
export ROLLOUT_TP_SIZE=4
export EXPERIMENT_NAME=countdown-qwen2.5-3b
export VLLM_ATTENTION_BACKEND=XFORMERS

bash ./scripts/train_tiny_zero_a100_ppo.sh
HYDRA_FULL_ERROR=1 python -m verl.trainer.main_ppo \
    data.train_files=$DATA_DIR/train.parquet \
    data.val_files=$DATA_DIR/test.parquet \
    data.train_batch_size=128 \
    data.val_batch_size=640 \
    data.max_prompt_length=256 \
    data.max_response_length=1024 \
    actor_rollout_ref.model.path=$BASE_MODEL \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=64 \
    actor_rollout_ref.actor.ppo_micro_batch_size=4 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size=4 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=$ROLLOUT_TP_SIZE \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
    actor_rollout_ref.ref.log_prob_micro_batch_size=2 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    critic.model.enable_gradient_checkpointing=True \
    critic.optim.lr=1e-5 \
    critic.model.path=$BASE_MODEL \
    critic.ppo_micro_batch_size=4 \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.logger=['wandb'] \
    +trainer.val_before_train=False \
    trainer.default_hdfs_dir=null \
    trainer.n_gpus_per_node=$N_GPUS \
    trainer.nnodes=1 \
    trainer.save_freq=10 \
    trainer.test_freq=10 \
    trainer.project_name=TinyZero \
    trainer.experiment_name=$EXPERIMENT_NAME \
    trainer.total_epochs=15 2>&1 | tee verl_demo.log

And here are the errors I got:

2025-02-10 21:29:39,179 INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 147.8.181.248:6379...
2025-02-10 21:29:39,188 INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265
(main_task pid=1042684) {'actor_rollout_ref': {'actor': {'clip_ratio': 0.2,
(main_task pid=1042684)                                  'entropy_coeff': 0.001,
(main_task pid=1042684)                                  'fsdp_config': {'fsdp_size': -1,
(main_task pid=1042684)                                                  'grad_offload': False,
(main_task pid=1042684)                                                  'optimizer_offload': False,
(main_task pid=1042684)                                                  'param_offload': False,
(main_task pid=1042684)                                                  'wrap_policy': {'min_num_params': 0}},
(main_task pid=1042684)                                  'grad_clip': 1.0,
(main_task pid=1042684)                                  'kl_loss_coef': 0.001,
(main_task pid=1042684)                                  'kl_loss_type': 'low_var_kl',
(main_task pid=1042684)                                  'optim': {'lr': 1e-06,
(main_task pid=1042684)                                            'lr_warmup_steps_ratio': 0.0,
(main_task pid=1042684)                                            'min_lr_ratio': None,
(main_task pid=1042684)                                            'total_training_steps': -1,
(main_task pid=1042684)                                            'warmup_style': 'constant'},
(main_task pid=1042684)                                  'ppo_epochs': 1,
(main_task pid=1042684)                                  'ppo_max_token_len_per_gpu': 16384,
(main_task pid=1042684)                                  'ppo_micro_batch_size': 4,
(main_task pid=1042684)                                  'ppo_mini_batch_size': 64,
(main_task pid=1042684)                                  'shuffle': False,
(main_task pid=1042684)                                  'strategy': 'fsdp',
(main_task pid=1042684)                                  'ulysses_sequence_parallel_size': 1,
(main_task pid=1042684)                                  'use_dynamic_bsz': False,
(main_task pid=1042684)                                  'use_kl_loss': False},
(main_task pid=1042684)                        'hybrid_engine': True,
(main_task pid=1042684)                        'model': {'enable_gradient_checkpointing': True,
(main_task pid=1042684)                                  'external_lib': None,
(main_task pid=1042684)                                  'override_config': {},
(main_task pid=1042684)                                  'path': 'model/Qwen2.5-3B',
(main_task pid=1042684)                                  'use_remove_padding': False},
(main_task pid=1042684)                        'ref': {'fsdp_config': {'fsdp_size': -1,
(main_task pid=1042684)                                                'param_offload': False,
(main_task pid=1042684)                                                'wrap_policy': {'min_num_params': 0}},
(main_task pid=1042684)                                'log_prob_max_token_len_per_gpu': 16384,
(main_task pid=1042684)                                'log_prob_micro_batch_size': 2,
(main_task pid=1042684)                                'log_prob_use_dynamic_bsz': False,
(main_task pid=1042684)                                'ulysses_sequence_parallel_size': 1},
(main_task pid=1042684)                        'rollout': {'do_sample': True,
(main_task pid=1042684)                                    'dtype': 'bfloat16',
(main_task pid=1042684)                                    'enforce_eager': True,
(main_task pid=1042684)                                    'free_cache_engine': True,
(main_task pid=1042684)                                    'gpu_memory_utilization': 0.7,
(main_task pid=1042684)                                    'ignore_eos': False,
(main_task pid=1042684)                                    'load_format': 'dummy_dtensor',
(main_task pid=1042684)                                    'log_prob_max_token_len_per_gpu': 16384,
(main_task pid=1042684)                                    'log_prob_micro_batch_size': 4,
(main_task pid=1042684)                                    'log_prob_use_dynamic_bsz': False,
(main_task pid=1042684)                                    'max_num_batched_tokens': 8192,
(main_task pid=1042684)                                    'max_num_seqs': 1024,
(main_task pid=1042684)                                    'n': 1,
(main_task pid=1042684)                                    'name': 'vllm',
(main_task pid=1042684)                                    'prompt_length': 256,
(main_task pid=1042684)                                    'response_length': 1024,
(main_task pid=1042684)                                    'temperature': 1.0,
(main_task pid=1042684)                                    'tensor_model_parallel_size': 4,
(main_task pid=1042684)                                    'top_k': -1,
(main_task pid=1042684)                                    'top_p': 1}},
(main_task pid=1042684)  'algorithm': {'adv_estimator': 'gae',
(main_task pid=1042684)                'gamma': 1.0,
(main_task pid=1042684)                'kl_ctrl': {'kl_coef': 0.001, 'type': 'fixed'},
(main_task pid=1042684)                'kl_penalty': 'kl',
(main_task pid=1042684)                'lam': 1.0},
(main_task pid=1042684)  'critic': {'cliprange_value': 0.5,
(main_task pid=1042684)             'forward_max_token_len_per_gpu': 32768,
(main_task pid=1042684)             'forward_micro_batch_size': 4,
(main_task pid=1042684)             'grad_clip': 1.0,
(main_task pid=1042684)             'model': {'enable_gradient_checkpointing': True,
(main_task pid=1042684)                       'external_lib': None,
(main_task pid=1042684)                       'fsdp_config': {'fsdp_size': -1,
(main_task pid=1042684)                                       'grad_offload': False,
(main_task pid=1042684)                                       'optimizer_offload': False,
(main_task pid=1042684)                                       'param_offload': False,
(main_task pid=1042684)                                       'wrap_policy': {'min_num_params': 0}},
(main_task pid=1042684)                       'override_config': {},
(main_task pid=1042684)                       'path': 'model/Qwen2.5-3B',
(main_task pid=1042684)                       'tokenizer_path': 'model/Qwen2.5-3B',
(main_task pid=1042684)                       'use_remove_padding': False},
(main_task pid=1042684)             'optim': {'lr': 1e-05,
(main_task pid=1042684)                       'lr_warmup_steps_ratio': 0.0,
(main_task pid=1042684)                       'min_lr_ratio': None,
(main_task pid=1042684)                       'total_training_steps': -1,
(main_task pid=1042684)                       'warmup_style': 'constant'},
(main_task pid=1042684)             'ppo_epochs': 1,
(main_task pid=1042684)             'ppo_max_token_len_per_gpu': 32768,
(main_task pid=1042684)             'ppo_micro_batch_size': 4,
(main_task pid=1042684)             'ppo_mini_batch_size': 64,
(main_task pid=1042684)             'shuffle': False,
(main_task pid=1042684)             'strategy': 'fsdp',
(main_task pid=1042684)             'ulysses_sequence_parallel_size': 1,
(main_task pid=1042684)             'use_dynamic_bsz': False},
(main_task pid=1042684)  'data': {'max_prompt_length': 256,
(main_task pid=1042684)           'max_response_length': 1024,
(WorkerDict pid=1044330) Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████     | 1/2 [00:29<00:29, 29.36s/it]
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s] [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
Loading checkpoint shards: 100%|██████████| 2/2 [00:46<00:00, 23.17s/it]
Loading checkpoint shards:  50%|█████     | 1/2 [00:29<00:29, 29.76s/it] [repeated 3x across cluster]

(main_task pid=1042684)           'return_raw_chat': False,
(main_task pid=1042684)           'return_raw_input_ids': False,
(main_task pid=1042684)           'tokenizer': None,
(main_task pid=1042684)           'train_batch_size': 128,
(main_task pid=1042684)           'train_files': '/mnt/Data/wdxu/github/TinyZero_A100/dataset/train.parquet',
(main_task pid=1042684)           'val_batch_size': 640,
(main_task pid=1042684)           'val_files': '/mnt/Data/wdxu/github/TinyZero_A100/dataset/test.parquet'},
(main_task pid=1042684)  'reward_model': {'enable': False,
(main_task pid=1042684)                   'forward_max_token_len_per_gpu': 32768,
(main_task pid=1042684)                   'max_length': None,
(main_task pid=1042684)                   'micro_batch_size': 64,
(main_task pid=1042684)                   'model': {'external_lib': None,
(main_task pid=1042684)                             'fsdp_config': {'min_num_params': 0,
(main_task pid=1042684)                                             'param_offload': False},
(main_task pid=1042684)                             'input_tokenizer': 'model/Qwen2.5-3B',
(main_task pid=1042684)                             'path': '~/models/FsfairX-LLaMA3-RM-v0.1',
(main_task pid=1042684)                             'use_remove_padding': False},
(main_task pid=1042684)                   'strategy': 'fsdp',
(main_task pid=1042684)                   'ulysses_sequence_parallel_size': 1,
(main_task pid=1042684)                   'use_dynamic_bsz': False},
(main_task pid=1042684)  'trainer': {'critic_warmup': 0,
(main_task pid=1042684)              'default_hdfs_dir': None,
(main_task pid=1042684)              'default_local_dir': 'checkpoints/TinyZero/countdown-qwen2.5-3b',
(main_task pid=1042684)              'experiment_name': 'countdown-qwen2.5-3b',
(main_task pid=1042684)              'logger': ['wandb'],
(main_task pid=1042684)              'n_gpus_per_node': 4,
(main_task pid=1042684)              'nnodes': 1,
(main_task pid=1042684)              'project_name': 'TinyZero',
(main_task pid=1042684)              'save_freq': 10,
(main_task pid=1042684)              'test_freq': 10,
(main_task pid=1042684)              'total_epochs': 15,
(main_task pid=1042684)              'total_training_steps': None,
(main_task pid=1042684)              'val_before_train': False}}
(main_task pid=1042684) original dataset len: 327680
(main_task pid=1042684) filter dataset len: 327680
(main_task pid=1042684) original dataset len: 1024
(main_task pid=1042684) filter dataset len: 1024
(main_task pid=1042684) Size of train dataloader: 2560
(main_task pid=1042684) Size of val dataloader: 1
(main_task pid=1042684) Total training steps: 38400
(WorkerDict pid=1043246) Critic overriding config {'bos_token_id': None, 'eos_token_id': 151643, 'pad_token_id': 151643}
(WorkerDict pid=1043246) Qwen2ForTokenClassification contains 3.09B parameters
(WorkerDict pid=1043246) Before critic FSDP, memory allocated (GB): 11.497002601623535, memory reserved (GB): 11.498046875
(WorkerDict pid=1043246) NCCL version 2.20.5+cuda12.4
(WorkerDict pid=1044330) Total steps: 38400, num_warmup_steps: 0
(WorkerDict pid=1044330) Critic use_remove_padding=False
(WorkerDict pid=1043246) After critic FSDP, memory allocated (GB): 2.8744893074035645, memory reserved (GB): 24.876953125
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffa3b07b0532f2ee4758a6fb3901000000 Worker ID: d72700074b743ca3ea74c39b4ca1dafbeaa822671d1793e3aeb357e0 Node ID: 8b08680f6f77a892ba2ba3ef7e1f3a6f3ef0647ee0897833731d10a5 Worker IP address: 147.8.181.248 Worker port: 10137 Worker PID: 1044332 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(WorkerDict pid=1043246) Total steps: 38400, num_warmup_steps: 0 [repeated 2x across cluster]
(WorkerDict pid=1043246) Critic use_remove_padding=False [repeated 2x across cluster]
Error executing job with overrides: ['data.train_files=/mnt/Data/wdxu/github/TinyZero_A100/dataset/train.parquet', 'data.val_files=/mnt/Data/wdxu/github/TinyZero_A100/dataset/test.parquet', 'data.train_batch_size=128', 'data.val_batch_size=640', 'data.max_prompt_length=256', 'data.max_response_length=1024', 'actor_rollout_ref.model.path=model/Qwen2.5-3B', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.actor.ppo_mini_batch_size=64', 'actor_rollout_ref.actor.ppo_micro_batch_size=4', 'actor_rollout_ref.rollout.log_prob_micro_batch_size=4', 'actor_rollout_ref.rollout.tensor_model_parallel_size=4', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.7', 'actor_rollout_ref.ref.log_prob_micro_batch_size=2', 'actor_rollout_ref.model.enable_gradient_checkpointing=True', 'critic.model.enable_gradient_checkpointing=True', 'critic.optim.lr=1e-5', 'critic.model.path=model/Qwen2.5-3B', 'critic.ppo_micro_batch_size=4', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.logger=[wandb]', '+trainer.val_before_train=False', 'trainer.default_hdfs_dir=null', 'trainer.n_gpus_per_node=4', 'trainer.nnodes=1', 'trainer.save_freq=10', 'trainer.test_freq=10', 'trainer.project_name=TinyZero', 'trainer.experiment_name=countdown-qwen2.5-3b', 'trainer.total_epochs=15']
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/mnt/Data/wdxu/github/TinyZero_A100/verl/trainer/main_ppo.py", line 193, in <module>
    main()
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
           ^^^^^^
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
            ^^^^^^^^^^
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
        ^^^^^^^^^^^^^^^^
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/wdxu/github/TinyZero_A100/verl/trainer/main_ppo.py", line 103, in main
    ray.get(main_task.remote(config))
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/ray/_private/worker.py", line 2772, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/ray/_private/worker.py", line 919, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ActorDiedError): ray::main_task() (pid=1042684, ip=147.8.181.248)
  File "/mnt/Data/wdxu/github/TinyZero_A100/verl/trainer/main_ppo.py", line 188, in main_task
    trainer.init_workers()
  File "/mnt/Data/wdxu/github/TinyZero_A100/verl/trainer/ppo/ray_trainer.py", line 502, in init_workers
    self.critic_wg.init_model()
  File "/mnt/Data/wdxu/github/TinyZero_A100/verl/single_controller/ray/base.py", line 42, in func
    output = ray.get(output)
             ^^^^^^^^^^^^^^^
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
        class_name: create_colocated_worker_cls.<locals>.WorkerDict
        actor_id: a3b07b0532f2ee4758a6fb3901000000
        pid: 1044332
        name: 4KTOdzWorkerDict_0:3
        namespace: 6ddbb09f-fc97-427e-a920-9ab35440c00e
        ip: 147.8.181.248
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Loading checkpoint shards: 100%|██████████| 2/2 [00:46<00:00, 23.32s/it] [repeated 3x across cluster]
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff7743fbca0ec8f4671e329e9301000000 Worker ID: 634dbf1906ac4242529c59e188f817ad98ec3ab7b4bdf841ad794e07 Node ID: 8b08680f6f77a892ba2ba3ef7e1f3a6f3ef0647ee0897833731d10a5 Worker IP address: 147.8.181.248 Worker port: 10135 Worker PID: 1044330 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors. [repeated 3x across cluster]

Does anyone know how to solve such a problem? Thanks!
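For reference, the raylet message above already names the likely suspects and suggests checking the dead worker's logs. Below is a minimal sketch of how I would check that, assuming a single node and Ray's default temp directory; the PID 1044332 is just the one from the raylet message above:

# Look for kernel OOM-killer activity around the crash time (may require sudo).
dmesg -T 2>/dev/null | grep -iE 'out of memory|killed process' | tail -n 20

# Ray keeps per-worker logs under its session directory; the worker log file
# names usually embed the worker PID (1044332 in the raylet message above).
ls /tmp/ray/session_latest/logs/ | grep 1044332
tail -n 100 /tmp/ray/session_latest/logs/*1044332*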

@SefaZeng

I'm hitting the same issue. Have you fixed it?

@yuleiqin

I'm hitting the same issue. Have you fixed it?

Even without Slurm, I still hit the same problem on an 8xH20 machine with the 3B model.

hon.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils (total: 241)
    return func()
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.10/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/cfs/yuleiqin/code/TinyZero/verl/trainer/main_ppo.py", line 103, in main
    ray.get(main_task.remote(config))
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2772, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 919, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ActorDiedError): ray::main_task() (pid=1316729, ip=10.0.20.108)
  File "/cfs/yuleiqin/code/TinyZero/verl/trainer/main_ppo.py", line 189, in main_task
    trainer.fit()
  File "/cfs/yuleiqin/code/TinyZero/verl/trainer/ppo/ray_trainer.py", line 589, in fit
    gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
  File "/cfs/yuleiqin/code/TinyZero/verl/single_controller/ray/base.py", line 42, in func
    output = ray.get(output)
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
        class_name: create_colocated_worker_cls.<locals>.WorkerDict
        actor_id: 0ee402b7d125ceb05ea9b3d601000000
        pid: 1354236
        name: RG6dmfWorkerDict_0:1
        namespace: 1c4904ef-60a8-4f46-ac6f-74cec46ad571
        ip: 10.0.20.108
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffb6e0e4b29b4323cc425df51601000000 Worker ID: a62ab82fa09c4f3a7e740059b81f3d960b16b971ec00e64f2513294a Node ID: 532a39afafa9ef57d44f8df7522dd87729fde020af5a83fdd8ef2264 Worker IP address: 10.0.20.108 Worker port: 34303 Worker PID: 1353968 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
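
Since the raylet names the OOM killer as one likely cause, here is a hedged sketch of overrides that should reduce memory pressure. The key names are taken from the config dump in the original report, but the values are only illustrative and I have not verified them as a fix:

# Memory-saving Hydra overrides (illustrative values) that can be appended to the
# existing `python -m verl.trainer.main_ppo ...` command from the script above.
MEM_SAVING_OVERRIDES=(
    actor_rollout_ref.rollout.gpu_memory_utilization=0.4
    actor_rollout_ref.actor.fsdp_config.param_offload=True
    actor_rollout_ref.actor.fsdp_config.grad_offload=True
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True
    critic.model.fsdp_config.param_offload=True
    critic.model.fsdp_config.grad_offload=True
    critic.model.fsdp_config.optimizer_offload=True
    actor_rollout_ref.actor.ppo_micro_batch_size=2
    critic.ppo_micro_batch_size=2
)
# Usage: python -m verl.trainer.main_ppo <original overrides> "${MEM_SAVING_OVERRIDES[@]}"
# If host RAM is the bottleneck, a smaller --object-store-memory for `ray start`
# (the script above reserves 50 GiB) may also be worth trying.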

@yuleiqin

(Quoted the original issue report from @VincentXWD above.)

I switched to another machine and it ran without this problem, but then I got an OOM error with the Qwen 3B model even on 8xH100.

Same here; I even hit OOM on 8xH20 with the 3B model.

UPDATE: I solved this problem by following vllm-project/vllm#4392:

pip3 install nvidia-cublas-cu12==12.3.4.1
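
In case it helps others, the version that actually ends up installed can be double-checked like this (just a sketch; the package name is the one from the pip command above):

# Confirm the pinned cuBLAS wheel is the one in the active environment.
pip3 show nvidia-cublas-cu12 | grep -i '^version'
python3 -c "import importlib.metadata as m; print(m.version('nvidia-cublas-cu12'))"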

@VincentXWD
Author

VincentXWD commented Feb 12, 2025

(Quoted the original issue report from @VincentXWD above.)
Error executing job with overrides: ['data.train_files=/mnt/Data/wdxu/github/TinyZero_A100/dataset/train.parquet', 'data.val_files=/mnt/Data/wdxu/github/TinyZero_A100/dataset/test.parquet', 'data.train_batch_size=128', 'data.val_batch_size=640', 'data.max_prompt_length=256', 'data.max_response_length=1024', 'actor_rollout_ref.model.path=model/Qwen2.5-3B', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.actor.ppo_mini_batch_size=64', 'actor_rollout_ref.actor.ppo_micro_batch_size=4', 'actor_rollout_ref.rollout.log_prob_micro_batch_size=4', 'actor_rollout_ref.rollout.tensor_model_parallel_size=4', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.7', 'actor_rollout_ref.ref.log_prob_micro_batch_size=2', 'actor_rollout_ref.model.enable_gradient_checkpointing=True', 'critic.model.enable_gradient_checkpointing=True', 'critic.optim.lr=1e-5', 'critic.model.path=model/Qwen2.5-3B', 'critic.ppo_micro_batch_size=4', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.logger=[wandb]', '+trainer.val_before_train=False', 'trainer.default_hdfs_dir=null', 'trainer.n_gpus_per_node=4', 'trainer.nnodes=1', 'trainer.save_freq=10', 'trainer.test_freq=10', 'trainer.project_name=TinyZero', 'trainer.experiment_name=countdown-qwen2.5-3b', 'trainer.total_epochs=15']
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/mnt/Data/wdxu/github/TinyZero_A100/verl/trainer/main_ppo.py", line 193, in <module>
    main()
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
           ^^^^^^
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
            ^^^^^^^^^^
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
        ^^^^^^^^^^^^^^^^
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/Data/wdxu/github/TinyZero_A100/verl/trainer/main_ppo.py", line 103, in main
    ray.get(main_task.remote(config))
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/ray/_private/worker.py", line 2772, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wdxu/miniconda3/lib/python3.12/site-packages/ray/_private/worker.py", line 919, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ActorDiedError): ray::main_task() (pid=1042684, ip=147.8.181.248)
  File "/mnt/Data/wdxu/github/TinyZero_A100/verl/trainer/main_ppo.py", line 188, in main_task
    trainer.init_workers()
  File "/mnt/Data/wdxu/github/TinyZero_A100/verl/trainer/ppo/ray_trainer.py", line 502, in init_workers
    self.critic_wg.init_model()
  File "/mnt/Data/wdxu/github/TinyZero_A100/verl/single_controller/ray/base.py", line 42, in func
    output = ray.get(output)
             ^^^^^^^^^^^^^^^
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
        class_name: create_colocated_worker_cls.<locals>.WorkerDict
        actor_id: a3b07b0532f2ee4758a6fb3901000000
        pid: 1044332
        name: 4KTOdzWorkerDict_0:3
        namespace: 6ddbb09f-fc97-427e-a920-9ab35440c00e
        ip: 147.8.181.248
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Loading checkpoint shards: 100%|██████████| 2/2 [00:46<00:00, 23.32s/it] [repeated 3x across cluster]
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff7743fbca0ec8f4671e329e9301000000 Worker ID: 634dbf1906ac4242529c59e188f817ad98ec3ab7b4bdf841ad794e07 Node ID: 8b08680f6f77a892ba2ba3ef7e1f3a6f3ef0647ee0897833731d10a5 Worker IP address: 147.8.181.248 Worker port: 10135 Worker PID: 1044330 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors. [repeated 3x across cluster]

Does anyone know how to solve this problem? Thanks!
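
In case it helps with triage: the worker exit detail above lists the OOM killer as possible cause (1). A quick way to check that on the node is sketched below (paths assume Ray's default temp directory; PID 1044332 is the dead worker reported in the log above):

sudo dmesg -T | grep -i -E "out of memory|killed process"    # did the kernel OOM killer fire around the crash time?
ls /tmp/ray/session_latest/logs/ | grep 1044332              # log files of the dead worker PID
tail -n 50 /tmp/ray/session_latest/logs/raylet.err           # raylet-side view of why the worker exited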

I switched to a different machine and it ran without this problem, but I then hit an OOM error with Qwen2.5-3B even on 8x H100.

Same here; I also get OOM with a 3B model even on 8x H20.

UPDATE: I solved this problem with reference to vllm-project/vllm#4392:

pip3 install nvidia-cublas-cu12==12.3.4.1
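
For anyone hitting the same crash, it may be worth checking which cuBLAS wheel is currently installed before pinning it. A rough sketch (the exact version that fixes things may differ per setup; 12.3.4.1 is simply the one from the linked issue):

pip3 list | grep -i nvidia-cublas                     # show the cuBLAS wheel currently pulled in by torch/vllm
pip3 install nvidia-cublas-cu12==12.3.4.1             # pin the version reported in vllm-project/vllm#4392
python3 -c "import torch; print(torch.version.cuda)"  # sanity-check that torch still reports a consistent CUDA version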

[UPD] Thanks for your quick reply. After creating a new environment (mainly letting PyTorch be auto-installed as a dependency of vLLM), I solved this problem. Good job!
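
For completeness, a rough sketch of that kind of clean-environment setup (the Python and package versions below are illustrative assumptions, not taken from this thread):

conda create -n zero python=3.9 -y
conda activate zero
pip3 install vllm==0.6.3        # pulls in a matching torch build as a dependency
pip3 install ray
pip3 install -e .               # install the TinyZero/verl repo itself from its root
pip3 install wandb              # needed since trainer.logger=['wandb'] is used above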
