[Benchmark] HF Trainer optimizers (Mar-2023) #22101

Closed · stas00 opened this issue Mar 11, 2023 · 3 comments

Labels: Benchmarks, Performance

stas00 commented Mar 11, 2023

This is a rerun of the earlier Adam torch vs. apex vs. HF vs. adafactor benchmark (RTX-3090, A100), but this time with BNB's 8-bit Adam optimizer added; the software has probably improved/changed over the intervening 14 months as well.


Actually, this time it was run on a desktop PCIe 80GB A100, so it's not the same hardware as the previous benchmark, which used an SXM 40GB A100.

I'm using the HF Trainer benchmarking tool that I developed specifically to make such benchmarks trivial to run and to produce report tables automatically.

So I'm running:

CUDA_VISIBLE_DEVICES=0 python scripts/benchmark/trainer-benchmark.py --base-cmd ' \
examples/pytorch/translation/run_translation.py --model_name_or_path t5-base --output_dir output_dir \
--do_train --label_smoothing 0.1 --logging_strategy no --save_strategy no --per_device_train_batch_size 32 \
--max_source_length 512 --max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: "  --warmup_steps 50 \
--max_train_samples 20000 --dataloader_num_workers 2 \
' --target-metric-key train_samples_per_second --repeat-times 1 --variations '--optim adamw_torch|--optim adamw_bnb_8bit|--optim adamw_hf|--optim adafactor|--optim adamw_apex_fused' --report-metric-keys train_loss --base-variation '--optim adamw_torch'

You can see that I'm telling the tool to compare 5 optimizers: adamw_torch, adamw_bnb_8bit, adamw_hf, adafactor, adamw_apex_fused.
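For reference, the same optimizer selection can also be made from Python via `TrainingArguments` instead of the CLI flag. A minimal sketch (the `optim` values mirror the `--optim` variations above; everything else here is just illustrative):

```python
from transformers import TrainingArguments

# Any of the benchmarked values can be passed as `optim`:
# "adamw_torch", "adamw_bnb_8bit", "adamw_hf", "adafactor", "adamw_apex_fused"
args = TrainingArguments(
    output_dir="output_dir",           # illustrative
    per_device_train_batch_size=32,    # illustrative
    num_train_epochs=1,                # illustrative
    optim="adamw_bnb_8bit",
)
```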

Memory-usage-wise, the per-parameter optimizer-state cost is:

  • 2 bytes: adamw_bnb_8bit
  • 4 bytes: adafactor
  • 8 bytes: adamw_torch, adamw_hf, adamw_apex_fused
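
To put those numbers in perspective, here is a back-of-the-envelope estimate for a t5-base-sized model (the ~220M parameter count is an assumed round figure for illustration, not something measured in this benchmark):

```python
# Rough optimizer-state memory estimate from the bytes-per-parameter figures above.
n_params = 220_000_000  # assumed approximate size of t5-base

bytes_per_param = {
    "adamw_bnb_8bit": 2,                              # two 8-bit moments
    "adafactor": 4,
    "adamw_torch / adamw_hf / adamw_apex_fused": 8,   # two fp32 moments
}

for name, nbytes in bytes_per_param.items():
    print(f"{name}: ~{n_params * nbytes / 2**30:.2f} GiB of optimizer state")
```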

*** Setup

When publishing benchmarks it's crucial to log the versions of the software used to run them, so here we go:

Datetime    : 2023-03-10 20:55:38

Software:
transformers: 4.27.0.dev0
torch       : 1.13.1
cuda        : 11.7
python      : 3.8.15

Hardware:
1 GPUs      : NVIDIA A100 80GB PCIe, 79.21GB
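
(As an aside, a tiny sketch of how one could gather the same version/hardware info programmatically:)

```python
import platform

import torch
import transformers

print("transformers:", transformers.__version__)
print("torch       :", torch.__version__)
print("cuda        :", torch.version.cuda)
print("python      :", platform.python_version())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"gpu         : {props.name}, {props.total_memory / 2**30:.2f}GB")
```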

*** Results

Last year's benchmark showed that the speed-up percentages were about the same across fp16/bf16/fp32. Let's see what this year brings, plus a new optimizer.

FP32

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch | 102.77 | 0 | 2.21 |
| --optim adamw_bnb_8bit | 104.99 | 2 | 2.15 |
| --optim adamw_hf | 103.64 | 1 | 2.21 |
| --optim adafactor | 97.22 | -5 | 2.21 |
| --optim adamw_apex_fused | 106.12 | 3 | 2.21 |

Observations:

  • The results are very different from the previous year's benchmark. While Adafactor is still the slowest, the rest are pretty close to each other.
  • Very surprisingly, the quantized 8-bit BNB Adam optimizer is faster than pytorch's 8-byte Adam optimizer, while using 1/4 of the latter's memory! And its loss is even better!

BF16

(added --bf16 to the base command line)

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch | 323.18 | 0 | 2.22 |
| --optim adamw_bnb_8bit | 348.29 | 8 | 2.16 |
| --optim adamw_hf | 333.07 | 3 | 2.22 |
| --optim adafactor | 274.36 | -15 | 2.22 |
| --optim adamw_apex_fused | 359.46 | 11 | 2.22 |

Observations:

  • Again BNB beats every other optimizer at loss, while being second only to apex in speed.

FP16

(added --fp16 to the base command line)

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch | 370.09 | 0 | 2.55 |
| --optim adamw_bnb_8bit | 383.21 | 4 | 2.45 |
| --optim adamw_hf | 373.66 | 1 | 2.55 |
| --optim adafactor | 356.84 | -4 | 2.53 |
| --optim adamw_apex_fused | 380.50 | 3 | 2.55 |

Observations:

  • Here BNB even managed to beat apex. But since I ran each variation only once, re-running multiple times might show a slightly different outcome.
  • Somehow BF16 appears to be slower than fp16, yet it gives a much better loss (the same loss as fp32). I wonder why?!
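
As a quick, non-authoritative data point on that last question: bf16 keeps fp32's exponent range while fp16 trades range for extra mantissa precision, which is a plausible reason the bf16 loss matches fp32 while fp16 drifts:

```python
import torch

# Compare the numeric range/precision of the three dtypes used in this benchmark.
for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15} max={info.max:.3e}  eps={info.eps:.3e}")
```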

new addition! --optim adamw_torch_fused

edit: we have since added --optim adamw_torch_fused to HF Trainer, which runs almost as fast as --optim adamw_apex_fused. This option requires torch>=2.0 for fp32 and bf16, and torch>2.0 for fp16, since some fp16-related bugs were only fixed after torch==2.0. For example, here is the fp16 comparison:

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch_fused | 387.10 | 3 | 2.66 |
| --optim adamw_torch | 377.61 | 0 | 2.66 |
| --optim adamw_apex_fused | 389.49 | 3 | 2.66 |
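
In plain PyTorch the fused variant roughly corresponds to passing `fused=True` to `torch.optim.AdamW` (a sketch with an illustrative placeholder model and lr; with the Trainer you simply select `--optim adamw_torch_fused`):

```python
import torch

model = torch.nn.Linear(512, 512).cuda()  # placeholder model for illustration
# fused=True requires CUDA parameters and a sufficiently recent torch
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)
```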
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.


R4ZZ3 commented Apr 30, 2023

Could you add some Lion benchmarks?


stas00 commented May 3, 2023

It's not in the HF Trainer's arsenal of optimizers; if you'd like to make a PR to integrate it, that can be done.
