[Benchmark] HF Trainer optimizers (Mar-2023) #22101

Closed · stas00 opened this issue Mar 11, 2023 · 3 comments

Labels: Benchmarks, Performance

stas00 commented Mar 11, 2023

This is a rerun of the earlier Adam torch vs. apex vs. HF vs. adafactor benchmark (RTX-3090, A100), but this time with BNB's 8-bit Adam optimizer added; the software has probably improved/changed over the intervening 14 months as well.


Actually, this time it was run on a desktop PCIe 80GB A100, so it's not the same hardware as the previous benchmark, which used an SXM 40GB A100.

I'm using the HF Trainer benchmarking tool that I developed specifically to make such benchmarks trivial to run and to produce report tables automatically.

So I'm running:

CUDA_VISIBLE_DEVICES=0 python scripts/benchmark/trainer-benchmark.py --base-cmd ' \
examples/pytorch/translation/run_translation.py --model_name_or_path t5-base --output_dir output_dir \
--do_train --label_smoothing 0.1 --logging_strategy no --save_strategy no --per_device_train_batch_size 32 \
--max_source_length 512 --max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: "  --warmup_steps 50 \
--max_train_samples 20000 --dataloader_num_workers 2 \
' --target-metric-key train_samples_per_second --repeat-times 1 --variations '--optim adamw_torch|--optim adamw_bnb_8bit|--optim adamw_hf|--optim adafactor|--optim adamw_apex_fused' --report-metric-keys train_loss --base-variation '--optim adamw_torch'

You can see that I'm telling the tool to compare 5 optimizers: adamw_torch, adamw_bnb_8bit, adamw_hf, adafactor, adamw_apex_fused.
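For reference, the same optimizer selection can also be made from Python via `TrainingArguments` instead of the CLI flag. A minimal sketch (the `optim` values mirror the `--optim` variations above; everything else here is just illustrative):

```python
from transformers import TrainingArguments

# Any of the benchmarked values can be passed as `optim`:
# "adamw_torch", "adamw_bnb_8bit", "adamw_hf", "adafactor", "adamw_apex_fused"
args = TrainingArguments(
    output_dir="output_dir",           # illustrative
    per_device_train_batch_size=32,    # illustrative
    num_train_epochs=1,                # illustrative
    optim="adamw_bnb_8bit",
)
```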

Memory-usage-wise, the per-parameter optimizer-state cost is:

  • 2 bytes: adamw_bnb_8bit
  • 4 bytes: adafactor
  • 8 bytes: adamw_torch, adamw_hf, adamw_apex_fused
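
To put those numbers in perspective, here is a back-of-the-envelope estimate for a t5-base-sized model (the ~220M parameter count is an assumed round figure for illustration, not something measured in this benchmark):

```python
# Rough optimizer-state memory estimate from the bytes-per-parameter figures above.
n_params = 220_000_000  # assumed approximate size of t5-base

bytes_per_param = {
    "adamw_bnb_8bit": 2,                              # two 8-bit moments
    "adafactor": 4,
    "adamw_torch / adamw_hf / adamw_apex_fused": 8,   # two fp32 moments
}

for name, nbytes in bytes_per_param.items():
    print(f"{name}: ~{n_params * nbytes / 2**30:.2f} GiB of optimizer state")
```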

*** Setup

When publishing benchmarks it's crucial to log the versions of the software used to run them, so here we go:

Datetime    : 2023-03-10 20:55:38

Software:
transformers: 4.27.0.dev0
torch       : 1.13.1
cuda        : 11.7
python      : 3.8.15

Hardware:
1 GPUs      : NVIDIA A100 80GB PCIe, 79.21GB
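
(As an aside, a tiny sketch of how one could gather the same version/hardware info programmatically:)

```python
import platform

import torch
import transformers

print("transformers:", transformers.__version__)
print("torch       :", torch.__version__)
print("cuda        :", torch.version.cuda)
print("python      :", platform.python_version())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"gpu         : {props.name}, {props.total_memory / 2**30:.2f}GB")
```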

*** Results

Last year's benchmark showed that the speed-up percentages were about the same across fp16/bf16/fp32. Let's see what this year brings, plus a new optimizer.

FP32

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch | 102.77 | 0 | 2.21 |
| --optim adamw_bnb_8bit | 104.99 | 2 | 2.15 |
| --optim adamw_hf | 103.64 | 1 | 2.21 |
| --optim adafactor | 97.22 | -5 | 2.21 |
| --optim adamw_apex_fused | 106.12 | 3 | 2.21 |

Observations:

  • The results are very different from the previous year's benchmark. While Adafactor is still the slowest, the rest are pretty close to each other.
  • Very surprisingly, the quantized 8-bit BNB Adam optimizer is faster than pytorch's 8-byte Adam optimizer, while using 1/4 of the latter's memory! And its loss is even better!

BF16

(added --bf16 to the base command line)

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch | 323.18 | 0 | 2.22 |
| --optim adamw_bnb_8bit | 348.29 | 8 | 2.16 |
| --optim adamw_hf | 333.07 | 3 | 2.22 |
| --optim adafactor | 274.36 | -15 | 2.22 |
| --optim adamw_apex_fused | 359.46 | 11 | 2.22 |

Observations:

  • Again BNB beats every other optimizer at loss, while being second only to apex in speed.

FP16

(added --fp16 to the base command line)

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch | 370.09 | 0 | 2.55 |
| --optim adamw_bnb_8bit | 383.21 | 4 | 2.45 |
| --optim adamw_hf | 373.66 | 1 | 2.55 |
| --optim adafactor | 356.84 | -4 | 2.53 |
| --optim adamw_apex_fused | 380.50 | 3 | 2.55 |

Observations:

  • Here BNB even managed to beat apex. But since I ran each variation only once, re-running multiple times might show a slightly different outcome.
  • Somehow BF16 appears to be slower than fp16, yet it gives a much better loss (the same loss as fp32). I wonder why?!
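
As a quick, non-authoritative data point on that last question: bf16 keeps fp32's exponent range while fp16 trades range for extra mantissa precision, which is a plausible reason the bf16 loss matches fp32 while fp16 drifts:

```python
import torch

# Compare the numeric range/precision of the three dtypes used in this benchmark.
for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15} max={info.max:.3e}  eps={info.eps:.3e}")
```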

new addition! --optim adamw_torch_fused

edit: we have since added --optim adamw_torch_fused to HF Trainer, which runs almost as fast as --optim adamw_apex_fused. This option requires torch>=2.0 for fp32 and bf16, and torch>2.0 for fp16, since some fp16-related bugs were only fixed after torch==2.0. For example, here is the fp16 comparison:

| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch_fused | 387.10 | 3 | 2.66 |
| --optim adamw_torch | 377.61 | 0 | 2.66 |
| --optim adamw_apex_fused | 389.49 | 3 | 2.66 |
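
In plain PyTorch the fused variant roughly corresponds to passing `fused=True` to `torch.optim.AdamW` (a sketch with an illustrative placeholder model and lr; with the Trainer you simply select `--optim adamw_torch_fused`):

```python
import torch

model = torch.nn.Linear(512, 512).cuda()  # placeholder model for illustration
# fused=True requires CUDA parameters and a sufficiently recent torch
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)
```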
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.


R4ZZ3 commented Apr 30, 2023

Could you add some Lion benchmarks?


stas00 commented May 3, 2023

It's not in the HF Trainer's arsenal of optimizers; if you'd like to make a PR to integrate it, that can be done.
