While testing examples/nlp_example.py, I added a small print log at line 175 (https://github.com/huggingface/accelerate/blob/main/examples/nlp_example.py#L145).
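The debug log was roughly of the following form (a sketch, not the exact line I used; it just shows which token ids each rank sees at each step):

```python
# Sketch of the debug print inside the training loop of examples/nlp_example.py,
# where `accelerator`, `model` and `train_dataloader` are the objects already
# defined in training_function():
for step, batch in enumerate(train_dataloader):
    batch.to(accelerator.device)
    # Log which samples this rank sees at this step; the first few token ids
    # are enough to compare the two GPUs.
    print(
        f"rank={accelerator.process_index} step={step} "
        f"input_ids[0][:8]={batch['input_ids'][0][:8].tolist()}"
    )
    outputs = model(**batch)
```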
You can see that different GPUs run different data in the same step, which is as expected.

the default_config.yaml:

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
use_cpu: false
```
But when using DeepSpeed, different GPUs run the same data in the same step. How should I understand this? Are different cards really running the same data?
![Screenshot 2023-03-02 at 6.12.50 PM](https://user-images.githubusercontent.com/23132307/222399118-12c7f5b8-6019-4f19-b7e9-8b1fc0bdfebc.png)
the default_config.yaml:

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: ./config_blocklm.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
use_cpu: false
```
Hello @Mryangkaitong, can you check if PR #1126 fixes the above issue? Currently, if `train_micro_batch_size_per_gpu` isn't `auto`, dataloaders aren't prepared. The above PR should resolve it.
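As a quick sanity check (a minimal sketch, assuming `./config_blocklm.json` sits in the working directory), you can inspect what the DeepSpeed config file actually sets:

```python
import json

# Diagnostic sketch: per the comment above, the dataloaders are only prepared
# (i.e. sharded across ranks) when train_micro_batch_size_per_gpu is left as
# "auto" in the DeepSpeed config file.
with open("./config_blocklm.json") as f:
    ds_config = json.load(f)

value = ds_config.get("train_micro_batch_size_per_gpu")
print("train_micro_batch_size_per_gpu =", value)
if value != "auto":
    print(
        "Not 'auto': on the current release the dataloader is not prepared, "
        "so every rank iterates over the same batches (see PR #1126)."
    )
```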