[2024-09-02 00:39:59,451] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
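The four [WARNING] lines above come from DeepSpeed's op-builder compatibility report and are usually informational for LoRA training (the sparse_attn and inference kernels are not used here). To silence the CUTLASS one, point $CUTLASS_PATH at a local clone of NVIDIA's CUTLASS repo before DeepSpeed is imported; a minimal sketch, with a placeholder path:

```python
import os

# Placeholder path: point this at your own checkout of https://github.com/NVIDIA/cutlass,
# or export CUTLASS_PATH in the shell before `accelerate launch` instead.
os.environ.setdefault("CUTLASS_PATH", "/opt/cutlass")
```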
/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
def forward(ctx, input, weight, bias=None):
/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
def backward(ctx, grad_output):
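The two FutureWarnings originate inside DeepSpeed's own zero/linear.py, so there is nothing to change in the training script; they only note that the decorators have moved. For reference, the replacement API the warning points at looks roughly like this (a minimal sketch on a hypothetical autograd Function, not DeepSpeed's code; 2-D inputs assumed):

```python
import torch

class LinearFn(torch.autograd.Function):
    # New-style decorators; the old torch.cuda.amp.custom_fwd/custom_bwd
    # still work on torch 2.4 but emit the FutureWarning seen above.
    @staticmethod
    @torch.amp.custom_fwd(device_type="cuda")
    def forward(ctx, input, weight, bias=None):
        ctx.save_for_backward(input, weight, bias)
        out = input.matmul(weight.t())
        return out + bias if bias is not None else out

    @staticmethod
    @torch.amp.custom_bwd(device_type="cuda")
    def backward(ctx, grad_output):
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_output.matmul(weight)
        grad_weight = grad_output.t().matmul(input)
        grad_bias = grad_output.sum(0) if bias is not None else None
        return grad_input, grad_weight, grad_bias
```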
W0902 00:40:01.940000 133343758611968 torch/distributed/run.py:779]
W0902 00:40:01.940000 133343758611968 torch/distributed/run.py:779] *****************************************
W0902 00:40:01.940000 133343758611968 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0902 00:40:01.940000 133343758611968 torch/distributed/run.py:779] *****************************************
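torch.distributed.run pins OMP_NUM_THREADS to 1 per worker by default to keep the node from being oversubscribed. If CPU-side work (data loading, preprocessing) turns out to be a bottleneck, it can be raised; a minimal sketch, assuming the value is set before torch is imported (an `export OMP_NUM_THREADS=...` in the shell before `accelerate launch` works just as well):

```python
import os

# Example value only; tune to roughly physical_cores / num_processes.
# Must be set before `import torch` so OpenMP picks it up.
os.environ["OMP_NUM_THREADS"] = "4"

import torch  # noqa: E402
```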
[2024-09-02 00:40:12,774] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-02 00:40:12,776] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
def forward(ctx, input, weight, bias=None):
/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
def backward(ctx, grad_output):
/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
def forward(ctx, input, weight, bias=None):
/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
def backward(ctx, grad_output):
[2024-09-02 00:40:13,500] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-02 00:40:13,501] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-09-02 00:40:13,508] [INFO] [comm.py:637:init_distributed] cdb=None
/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/accelerate/accelerator.py:401: UserWarning: `log_with=wandb` was passed but no supported trackers are currently installed.
warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
09/02/2024 00:40:13 - INFO - __main__ - Distributed environment: DEEPSPEED Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: bf16
ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 2, 'zero_optimization': {'stage': 2, 'offload_optimizer': {'device': 'none', 'nvme_path': None}, 'offload_param': {'device': 'none', 'nvme_path': None}, 'stage3_gather_16bit_weights_on_model_save': False}, 'gradient_clipping': 1.0, 'steps_per_print': inf, 'bf16': {'enabled': True}, 'fp16': {'enabled': False}}
/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/accelerate/accelerator.py:401: UserWarning: `log_with=wandb` was passed but no supported trackers are currently installed.
warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
09/02/2024 00:40:13 - INFO - __main__ - Distributed environment: DEEPSPEED Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1
Mixed precision type: bf16
ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 2, 'zero_optimization': {'stage': 2, 'offload_optimizer': {'device': 'none', 'nvme_path': None}, 'offload_param': {'device': 'none', 'nvme_path': None}, 'stage3_gather_16bit_weights_on_model_save': False}, 'gradient_clipping': 1.0, 'steps_per_print': inf, 'bf16': {'enabled': True}, 'fp16': {'enabled': False}}
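For reference, the ds_config dumped above (ZeRO stage 2, bf16, gradient accumulation 2, no CPU/NVMe offload) corresponds roughly to the following programmatic accelerate setup; this is a sketch of an equivalent configuration, not the code actually used by train_flux_lora_deepspeed.py:

```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Mirrors the printed ds_config: ZeRO-2, no offload, grad accumulation 2,
# gradient clipping 1.0, bf16 mixed precision, batch sizes left on 'auto'.
deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,
    gradient_accumulation_steps=2,
    gradient_clipping=1.0,
    offload_optimizer_device="none",
    offload_param_device="none",
    zero3_save_16bit_model=False,
)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=deepspeed_plugin)
```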
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.72it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.84it/s]
Init model
Init model
Loading checkpoint
Loading checkpoint
Init AE
Init AE
W0902 00:42:50.212000 133343758611968 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3788 closing signal SIGTERM
E0902 00:42:54.953000 133343758611968 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -9) local_rank: 0 (pid: 3787) of binary: /mnt/data/miniconda3/envs/x-flux/bin/python
Traceback (most recent call last):
File "/mnt/data/miniconda3/envs/x-flux/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1067, in launch_command
deepspeed_launcher(args)
File "/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/accelerate/commands/launch.py", line 771, in deepspeed_launcher
distrib_run.run(args)
File "/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/mnt/data/miniconda3/envs/x-flux/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
train_flux_lora_deepspeed.py FAILED
-----------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-02_00:42:50
host : localhost
rank : 0 (local_rank: 0)
exitcode : -9 (pid: XXXX)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID XXXX
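Exit code -9 means rank 0 was killed with SIGKILL rather than raising a Python exception, which is why there is no traceback from the script itself. On a Linux host this is most commonly the kernel OOM killer reclaiming RAM while both ranks load the full FLUX transformer and autoencoder weights at the same time (the kill arrives right after "Init AE"). A hypothetical helper to confirm that, using psutil (assumed installed), called around the model/checkpoint/AE init steps:

```python
import os

import psutil


def log_host_memory(tag: str) -> None:
    """Print this process's RSS and the host's remaining RAM.

    Call it around the 'Init model' / 'Loading checkpoint' / 'Init AE'
    steps; if available RAM collapses toward zero just before the
    SIGKILL, the OOM killer is the likely culprit.
    """
    rss_gib = psutil.Process(os.getpid()).memory_info().rss / 1024**3
    avail_gib = psutil.virtual_memory().available / 1024**3
    print(f"[mem:{tag}] rss={rss_gib:.1f} GiB, available={avail_gib:.1f} GiB", flush=True)
```

Checking `dmesg` or `journalctl -k` for an oom-killer entry right after the crash gives the same answer without touching the script.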