use DeepSpeed dataloader duplicate per batch #1136

Closed
Mryangkaitong opened this issue Mar 2, 2023 · 2 comments · Fixed by #1126

@Mryangkaitong

Mryangkaitong commented Mar 2, 2023

When I tested examples/nlp_example.py, I added a small print statement at line 175 (https://github.com/huggingface/accelerate/blob/main/examples/nlp_example.py#L145) to log the batch each GPU receives at each step.

[Screenshot, 2023-03-02 6:11:41 PM]

You can see that different GPUs process different data in the same step, which is expected.

The default_config.yaml:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
use_cpu: false
[Screenshot, 2023-03-02 6:19:08 PM]
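The MULTI_GPU result is what a prepared (sharded) dataloader should produce. As a simplified, pure-Python model, assume batch-level round-robin sharding where batch i goes to process i % num_processes, which is roughly how accelerate splits batches across processes when split_batches is off. The helper names below are hypothetical, and shuffling and padding of an uneven final batch are deliberately ignored.

```python
from typing import List


def make_batches(num_samples: int, batch_size: int) -> List[List[int]]:
    """Split sample indices 0..num_samples-1 into consecutive batches."""
    indices = list(range(num_samples))
    return [indices[i:i + batch_size] for i in range(0, num_samples, batch_size)]


def shard_batches(
    batches: List[List[int]], process_index: int, num_processes: int
) -> List[List[int]]:
    """Hypothetical helper: round-robin batch sharding (batch i -> process i % num_processes).

    Simplified model only: no shuffling, and an uneven tail is left
    unbalanced instead of being padded.
    """
    return batches[process_index::num_processes]


batches = make_batches(num_samples=16, batch_size=4)
rank0 = shard_batches(batches, process_index=0, num_processes=2)
rank1 = shard_batches(batches, process_index=1, num_processes=2)

# Print what each simulated GPU would see at each step.
for step, (b0, b1) in enumerate(zip(rank0, rank1)):
    print(f"step {step}: rank 0 -> {b0}, rank 1 -> {b1}")
```

With 16 samples, a batch size of 4, and 2 processes, rank 0 gets batches [0, 1, 2, 3] and [8, 9, 10, 11] while rank 1 gets [4, 5, 6, 7] and [12, 13, 14, 15], so no step repeats data across GPUs.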

But when I use DeepSpeed, different GPUs process the same data in the same step. How should I understand this? Why would different GPUs run on the same data?
[Screenshot, 2023-03-02 6:12:50 PM]
The default_config.yaml:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: ./config_blocklm.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
use_cpu: false
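One way to confirm this symptom from per-GPU logs like the ones in the screenshots is to collect the sample ids each rank saw at each step and flag the steps where every rank got an identical batch. The sketch below assumes the logged values have already been gathered into a per-rank list of batches; the helper name duplicated_steps is hypothetical. It also models an unprepared dataloader, where each process iterates the full, unsharded batch sequence in the same order (assuming every process uses the same seed), next to a round-robin sharded one for comparison.

```python
from typing import List, Sequence


def duplicated_steps(per_rank_batches: Sequence[Sequence[Sequence[int]]]) -> List[int]:
    """Return the step indices at which every rank received an identical batch.

    per_rank_batches[r][s] is the list of sample ids rank r saw at step s.
    Assumes at least two ranks; steps are compared only up to the
    shortest rank's length.
    """
    num_steps = min(len(batches) for batches in per_rank_batches)
    duplicated = []
    for step in range(num_steps):
        first = list(per_rank_batches[0][step])
        if all(list(batches[step]) == first for batches in per_rank_batches[1:]):
            duplicated.append(step)
    return duplicated


# Model of an unprepared dataloader: no sharding, so each of the 2 processes
# walks the same full sequence of batches in the same order.
full_sequence = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
unprepared = [full_sequence, full_sequence]

# Model of a prepared dataloader: round-robin batch sharding across 2 processes.
sharded = [full_sequence[0::2], full_sequence[1::2]]

print("unprepared:", duplicated_steps(unprepared))
print("sharded:", duplicated_steps(sharded))
```

For the unprepared model every step is flagged as duplicated ([0, 1, 2, 3]), while the sharded model flags none ([]).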

@pacman100
Contributor

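Hello @Mryangkaitong, can you check whether PR #1126 fixes the above issue? Currently, if train_micro_batch_size_per_gpu in the DeepSpeed config isn't set to auto, the dataloaders aren't prepared, so they are not sharded across processes and every GPU iterates over the same batches. The above PR should resolve it.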
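On versions without PR #1126, the comment above suggests a workaround: set train_micro_batch_size_per_gpu to "auto" in the DeepSpeed config file so that accelerate fills it in from the dataloader's batch size and prepares the dataloaders. The sketch below builds such a config with Python's json module. The contents of the reporter's config_blocklm.json are not shown in the issue, so every key other than train_micro_batch_size_per_gpu is a placeholder assumption.

```python
import json

# Hypothetical minimal DeepSpeed config. Only the "auto" micro-batch size is
# the point here; the remaining keys are placeholder assumptions, not the
# contents of the reporter's config_blocklm.json.
ds_config = {
    # "auto" asks accelerate to fill this in from the dataloader's batch size.
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": 1,
    "zero_optimization": {"stage": 2},
}

# Serialize to JSON; "auto" stays a string, not a number.
config_text = json.dumps(ds_config, indent=2)
print(config_text)
```

Saving this text to a file and pointing deepspeed_config_file in default_config.yaml at it would then let accelerate resolve the micro-batch size from the prepared dataloader.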

@Mryangkaitong
Author

Thanks, with that PR each GPU now gets different data in the same step.
