[Benchmark] DeepSpeed + fp16/bf16 on an 8xA100 node #14913
Comments
oh, and tagging @stas00 because he is the DeepSpeed expert here.
Information about the cards:
You need to understand how the ZeRO stages work and their speed relative to each other:

Z0: no ZeRO, no sharding - the fastest of all.
Z1: shards only the optimizer states - the fastest ZeRO stage.
Z2: shards optimizer states and gradients - slower.
Z3: shards optimizer states, gradients and parameters - the slowest.

The more sharding a stage has to do, the slower it becomes, as it has to communicate a lot more data between processes. You choose which stage to use depending on your model's size: if you can fit it with a desirable batch size on Z0, use that; if you can't, try Z1, then Z2, and only if Z2 is not enough use Z3. Again, Z0 means no DeepSpeed sharding at all. In the reverse direction (Z3 -> Z2 -> Z1 -> Z0) your memory requirements grow, so it's a trade-off between memory and speed.
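For reference (not the poster's files), the stage is selected via the zero_optimization.stage key of the JSON config passed to --deepspeed; a minimal sketch that relies on the HF integration's "auto" placeholders could look like this:

```json
{
  "fp16": { "enabled": "auto" },
  "zero_optimization": { "stage": 1 },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto"
}
```

Changing "stage" to 0, 2 or 3 switches between the stages above while keeping everything else identical, which is the cleanest way to benchmark them against each other.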
I'm not sure what you mean, perhaps paste the metrics you're referring to? e.g. a sample output from HF Trainer:
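For readers following along, the end-of-training metrics block that the HF Trainer logs looks roughly like this (the values below are placeholders, not numbers from this issue):

```
***** train metrics *****
  epoch                    =       12.0
  train_loss               =     2.9123
  train_runtime            = 0:41:23.45
  train_samples            =      10000
  train_samples_per_second =     48.312
  train_steps_per_second   =      0.755
```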
For benchmarking I think samples/sec is the most interesting and consistent metric, but of course others are fine as well, e.g. see #14608
What do you mean by how you could extend this to multi-node? It should just work, and if it doesn't, please let us know what specifically doesn't work. Additionally, for multi-node benchmark reports please specify the type of interconnect (InfiniBand, OPA, etc.), as these make a big difference.
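As a sketch of what that looks like in practice (hostnames and node count below are hypothetical): the deepspeed launcher takes a hostfile listing the nodes and their GPU slots, and the training arguments stay the same as in the single-node runs below.

```
# hostfile - one line per node, hypothetical hostnames, 8 GPUs each
node-1 slots=8
node-2 slots=8
node-3 slots=8
node-4 slots=8
```

```bash
# launcher flags for 4 nodes x 8 GPUs; append the same training arguments as in the single-node command below
deepspeed --hostfile hostfile --num_nodes 4 --num_gpus 8 5.run_clm-post.py --deepspeed config2.json --fp16
```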
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
🖥 Benchmarking transformers

Benchmark

Which part of transformers did you benchmark?

DeepSpeed with template ZeRO 1, 2 and 3 configurations, using fp16 and bf16. deepspeed does not report completion percentages nor time estimates. If there is a way to do this, please let me know and I'll extend the benchmark to 4 x (8xA100).

Set-up
What did you run your benchmarks on? Please include details, such as: CPU, GPU? If using multiple GPUs, which parallelization did you use?
My system:
The command is always:
deepspeed 5.run_clm-post.py --model_name_or_path /path/to/gpt2-large/ --train_file sample.txt --tokenizer_name embeddings --do_train --do_eval --output_dir ./output --evaluation_strategy steps --eval_steps 1000 --save_steps 1000 --num_train_epochs 12 --per_device_train_batch_size 8 --cache_dir .cache2/ --save_total_limit 2 --dataloader_drop_last True --learning_rate 1e-06
And then I add one of the following (a fully assembled example is shown after the list):
--deepspeed config1.json --fp16
--deepspeed config2.json --fp16
--deepspeed config3.json --fp16
--deepspeed config_2.json --fp16
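For example, the third combination fully assembled (just the base command above plus the ZeRO-3 flags, reformatted for readability):

```bash
deepspeed 5.run_clm-post.py \
    --model_name_or_path /path/to/gpt2-large/ \
    --train_file sample.txt \
    --tokenizer_name embeddings \
    --do_train --do_eval \
    --output_dir ./output \
    --evaluation_strategy steps --eval_steps 1000 --save_steps 1000 \
    --num_train_epochs 12 \
    --per_device_train_batch_size 8 \
    --cache_dir .cache2/ \
    --save_total_limit 2 \
    --dataloader_drop_last True \
    --learning_rate 1e-06 \
    --deepspeed config3.json --fp16
```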
Where the config files are:
config1.json:
config2.json:
config3.json:
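For reference, the HF DeepSpeed template configs differ mainly in the zero_optimization section; a hedged sketch of what a stage-2 config with fp16 typically contains (not necessarily the exact files used here) is:

```json
{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```

config1.json would set "stage": 1 (the bucket and partition options are not needed there), and config3.json would set "stage": 3 plus the stage-3 specific knobs (e.g. stage3_prefetch_bucket_size, stage3_param_persistence_threshold, stage3_max_live_parameters).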
Then config_2.json is the same as the above config2 but replacing the fp16 part with:
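For reference, the bf16 counterpart of the fp16 section in a DeepSpeed config is simply (a sketch of the standard section, assuming the HF "auto" placeholder):

```json
"bf16": {
  "enabled": "auto"
}
```

On the command line this is normally paired with the Trainer's --bf16 flag rather than --fp16.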
Results
Somehow the units in the fp16 + DeepSpeed ZeRO-1 case are reported in it/s; since s/it is just the reciprocal of it/s, for the sake of comparison that translates to about 0.43 s/it. I am puzzled by the results, because I'd expect ZeRO-2 and ZeRO-3 to work faster, but ZeRO-1 turned out to be around 10 times faster. So let me know if I am doing anything wrong. Also, let me know how I could extend this to multi-node, if it is interesting for somebody else.
Thanks