Training Failed, SIGTERM exitcode: -9 #94
Open · JimikoSK opened this issue Sep 2, 2024 · 2 comments

JimikoSK commented Sep 2, 2024

[2024-09-02 00:39:59,451] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  NVIDIA Inference is only supported on Ampere and newer architectures
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
 [WARNING]  using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias=None):
/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
W0902 00:40:01.940000 133343758611968 torch/distributed/run.py:779]
W0902 00:40:01.940000 133343758611968 torch/distributed/run.py:779] *****************************************
W0902 00:40:01.940000 133343758611968 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0902 00:40:01.940000 133343758611968 torch/distributed/run.py:779] *****************************************
[2024-09-02 00:40:12,774] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-02 00:40:12,776] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  NVIDIA Inference is only supported on Ampere and newer architectures
 [WARNING]  NVIDIA Inference is only supported on Ampere and newer architectures
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
 [WARNING]  using untested triton version (3.0.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
 [WARNING]  using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias=None):
/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias=None):
/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
[2024-09-02 00:40:13,500] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-02 00:40:13,501] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-09-02 00:40:13,508] [INFO] [comm.py:637:init_distributed] cdb=None
/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/accelerate/accelerator.py:401: UserWarning: `log_with=wandb` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
09/02/2024 00:40:13 - INFO - __main__ - Distributed environment: DEEPSPEED  Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: bf16
ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 2, 'zero_optimization': {'stage': 2, 'offload_optimizer': {'device': 'none', 'nvme_path': None}, 'offload_param': {'device': 'none', 'nvme_path': None}, 'stage3_gather_16bit_weights_on_model_save': False}, 'gradient_clipping': 1.0, 'steps_per_print': inf, 'bf16': {'enabled': True}, 'fp16': {'enabled': False}}

/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/accelerate/accelerator.py:401: UserWarning: `log_with=wandb` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
09/02/2024 00:40:13 - INFO - __main__ - Distributed environment: DEEPSPEED  Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1

Mixed precision type: bf16
ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 2, 'zero_optimization': {'stage': 2, 'offload_optimizer': {'device': 'none', 'nvme_path': None}, 'offload_param': {'device': 'none', 'nvme_path': None}, 'stage3_gather_16bit_weights_on_model_save': False}, 'gradient_clipping': 1.0, 'steps_per_print': inf, 'bf16': {'enabled': True}, 'fp16': {'enabled': False}}

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.72it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.84it/s]
Init model
Init model
Loading checkpoint
Loading checkpoint
Init AE
Init AE
W0902 00:42:50.212000 133343758611968 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3788 closing signal SIGTERM
E0902 00:42:54.953000 133343758611968 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 0 (pid: 3787) of binary: /mnt/data/miniconda3/envs/x-flux/bin/python
Traceback (most recent call last):
  File "/mnt/data/miniconda3/envs/x-flux/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1067, in launch_command
    deepspeed_launcher(args)
  File "/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/accelerate/commands/launch.py", line 771, in deepspeed_launcher
    distrib_run.run(args)
  File "/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
train_flux_lora_deepspeed.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-02_00:42:50
  host      : localhost
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: XXXX)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID XXXX
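For context (this reading is mine, not confirmed in the thread): exit code -9 means the rank-0 process received SIGKILL from outside, and on Linux that is almost always the kernel OOM killer reclaiming host RAM. The kill lands right after "Init AE", i.e. while the models are still being loaded, and `dmesg` or `journalctl -k` should show an "Out of memory: Killed process ..." line if so. A minimal diagnostic sketch, not part of x-flux (psutil is an assumed extra dependency, and log_host_memory is a hypothetical helper), for logging host RAM around the loading steps:

import psutil  # assumed extra dependency, not required by x-flux

def log_host_memory(tag: str) -> None:
    # Print how much physical RAM the host is using so you can see
    # whether the run approaches the limit before SIGKILL arrives.
    vm = psutil.virtual_memory()
    print(f"[{tag}] host RAM used: {vm.used / 2**30:.1f} / "
          f"{vm.total / 2**30:.1f} GiB ({vm.percent:.0f}%)")

log_host_memory("before text encoder load")
# ... load T5/CLIP text encoders ...
log_host_memory("before flux checkpoint load")
# ... load the flux transformer ...
log_host_memory("before AE load")

If the last line printed before the crash is already near 100%, the -9 is host-RAM exhaustion rather than anything GPU-side.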
model_name: "flux-dev"
data_config:
  train_batch_size: 1
  num_workers: 4
  img_size: 512
  img_dir: /The/Dir/
report_to: wandb
train_batch_size: 1
output_dir: /The/Dir/
max_train_steps: 100000
learning_rate: 1e-5
lr_scheduler: constant
lr_warmup_steps: 10
adam_beta1: 0.9
adam_beta2: 0.999
adam_weight_decay: 0.01
adam_epsilon: 1e-8
max_grad_norm: 1.0
fp8_base: true
logging_dir: logs
mixed_precision: "bf16"
checkpointing_steps: 2500
checkpoints_total_limit: 10
tracker_project_name: lora_test
resume_from_checkpoint: latest
gradient_accumulation_steps: 2
rank: 16
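
One quick arithmetic check on this config (my note, not from the report): with train_batch_size: 1, gradient_accumulation_steps: 2, and num_processes: 2 from the accelerate config below, the effective global batch size works out to 4.

# Effective global batch size implied by the two configs:
# per-GPU micro batch * gradient accumulation steps * number of processes.
micro_batch = 1     # train_batch_size
grad_accum = 2      # gradient_accumulation_steps
num_processes = 2   # from the accelerate config below
print(micro_batch * grad_accum * num_processes)  # 4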

Accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
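
A plausible explanation for the crash timing (again my reading, not confirmed here): with num_processes: 2, each rank independently loads the text encoders, the roughly 24 GB bf16 flux-dev transformer, and the AE into host RAM before moving them to its own GPU, so peak host memory during "Init AE" is about double that of a single-process run. A generic low-RAM loading sketch, not x-flux's actual loader (the file name and device handling are assumptions for illustration): safetensors can materialize tensors directly on the GPU instead of building a full host-RAM copy first.

# Generic sketch, NOT x-flux's loader: load a safetensors checkpoint with
# tensors placed directly on the GPU, avoiding a second full copy in host RAM.
from safetensors.torch import load_file

state_dict = load_file("flux1-dev.safetensors", device="cuda:0")  # hypothetical path
# model.load_state_dict(state_dict)  # then assign into the module as usual

Cheaper first checks before touching any code: watch free -h and dmesg while the run starts, add swap, or launch with num_processes: 1 to confirm it really is host-RAM pressure.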
ymz-123 commented Oct 7, 2024

I have the same problem.

huayecaibcc commented

Does anyone know what's going on? I've encountered this problem too. @JimikoSK @ymz-123
