This is a rerun of Adam torch vs. apex vs HF vs adafactor RTX-3090, A100, but with BNB's 8-bit Adam optimizer added; the software has probably also improved/changed over the intervening 14 months.
Note that this time it was run on a desktop PCIe 80GB A100, so it's not the same hardware as the previous benchmark, which used an SXM 40GB A100.
I'm using the HF Trainer benchmarking tool that I wrote specifically to make such benchmarks trivial to run and to produce report tables automatically.
I'm telling the tool to compare 5 optimizers: adamw_torch, adamw_bnb_8bit, adamw_hf, adafactor, adamw_apex_fused.
Memory-usage-wise, per parameter, from smallest to largest footprint: adamw_bnb_8bit, then adafactor, then adamw_torch / adamw_hf / adamw_apex_fused.
When publishing benchmarks it's crucial to log the software versions that were used for the runs.
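For readers who want to reproduce a single configuration without the benchmarking tool, here is a minimal sketch of how the same `--optim` values map onto `TrainingArguments`. This is not the benchmark script itself: the output dirs, batch size and step count are placeholders, and you would plug in your own model and dataset.

```python
# Sketch only: sweep the same five optimizer choices via TrainingArguments.
# Model/dataset and all hyperparameters below are placeholders, not the
# settings used for the benchmark numbers in this post.
from transformers import TrainingArguments

for optim_name in [
    "adamw_torch",
    "adamw_bnb_8bit",
    "adamw_hf",
    "adafactor",
    "adamw_apex_fused",
]:
    args = TrainingArguments(
        output_dir=f"out-{optim_name}",   # placeholder output dir
        optim=optim_name,                 # same values as the --optim CLI flag
        bf16=False,                       # set True to reproduce the BF16 runs
        fp16=False,                       # set True to reproduce the FP16 runs
        per_device_train_batch_size=16,   # placeholder
        max_steps=100,                    # placeholder
        report_to="none",
    )
    # trainer = Trainer(model=model, args=args, train_dataset=train_ds)
    # trainer.train()
```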
Last year's benchmark showed that the speed-up percentages were about the same across fp16/bf16/fp32. Let's see what this year brings, plus a new optimizer.
FP32
| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch | 102.77 | 0 | 2.21 |
| --optim adamw_bnb_8bit | 104.99 | 2 | 2.15 |
| --optim adamw_hf | 103.64 | 1 | 2.21 |
| --optim adafactor | 97.22 | -5 | 2.21 |
| --optim adamw_apex_fused | 106.12 | 3 | 2.21 |
Observations:
The results are very different from the previous year's benchmark. While Adafactor is still the slowest, the rest are pretty close.
Very surprisingly, the quantized 8-bit BNB Adam optimizer is faster than pytorch's 8-byte-per-parameter Adam optimizer, while using 1/4th of the latter's memory! And its loss is even better!
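To make the memory claim concrete, here is a small sketch of the per-parameter optimizer-state arithmetic, plus how one might drive bitsandbytes' 8-bit Adam by hand outside of HF Trainer. It assumes a CUDA device with bitsandbytes installed; the 124M parameter count is an arbitrary example (roughly gpt2-small), not a measurement from this benchmark.

```python
# Back-of-the-envelope: Adam keeps two fp32 states per parameter (exp_avg and
# exp_avg_sq) = 8 bytes/param; the 8-bit variant stores the same two states
# quantized to 1 byte each = 2 bytes/param (ignoring small quantization-
# constant overhead), i.e. 1/4 of the memory.
import torch
import bitsandbytes as bnb

n_params = 124e6                     # assumed model size, for illustration only
adam_fp32_state = n_params * 8       # 2 states x 4 bytes
adam_8bit_state = n_params * 2       # 2 states x 1 byte
print(f"fp32 Adam state:  {adam_fp32_state / 2**20:.0f} MiB")
print(f"8-bit Adam state: {adam_8bit_state / 2**20:.0f} MiB")

# Plugging the 8-bit optimizer into a toy model by hand:
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)
loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```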
BF16
(added --bf16 to the base command line)
| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch | 323.18 | 0 | 2.22 |
| --optim adamw_bnb_8bit | 348.29 | 8 | 2.16 |
| --optim adamw_hf | 333.07 | 3 | 2.22 |
| --optim adafactor | 274.36 | -15 | 2.22 |
| --optim adamw_apex_fused | 359.46 | 11 | 2.22 |
Observations:
Again BNB beats every other optimizer on loss, while being second only to apex in speed.
FP16
(added --fp16 to the base command line)
| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch | 370.09 | 0 | 2.55 |
| --optim adamw_bnb_8bit | 383.21 | 4 | 2.45 |
| --optim adamw_hf | 373.66 | 1 | 2.55 |
| --optim adafactor | 356.84 | -4 | 2.53 |
| --optim adamw_apex_fused | 380.50 | 3 | 2.55 |
Observations:
Here BNB even managed to beat apex. But since I ran each configuration only once, re-running multiple times might show a slightly different outcome.
Somehow BF16 appears to be slower than fp16, yet it gives a much better loss (the same loss as fp32). I wonder why?!
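On the loss gap, the usual explanation (not something measured here) is numeric range: bf16 keeps fp32's exponent range at reduced precision, while fp16 has a much narrower range and relies on loss scaling. A quick sketch to see the difference:

```python
# Compare the numeric properties of fp32/bf16/fp16. This illustrates why bf16
# can track the fp32 loss while fp16 drifts; it does NOT explain the
# fp16 > bf16 throughput gap observed above, which is likely kernel/hardware
# specific.
import torch

for dtype in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} max={info.max:.3e}  tiny={info.tiny:.3e}  eps={info.eps:.3e}")

# fp16 overflows where bf16 does not (fp16 max is ~65504):
print(torch.tensor(70000.0, dtype=torch.float16))   # inf
print(torch.tensor(70000.0, dtype=torch.bfloat16))  # finite, just rounded
```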
new addition: --optim adamw_torch_fused
edit: we added adamw_torch_fused to HF Trainer, and it runs almost as fast as --optim adamw_apex_fused. This option requires torch>=2.0 for fp32 and bf16, and torch>2.0 for fp16, as some bugs were fixed in torch==2.0. For example, here is the fp16 comparison:
| Variation | Train samples per second | Diff % | Train loss |
|---|---|---|---|
| --optim adamw_torch_fused | 387.10 | 3 | 2.66 |
| --optim adamw_torch | 377.61 | 0 | 2.66 |
| --optim adamw_apex_fused | 389.49 | 3 | 2.66 |
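For reference, a minimal sketch of what the fused torch path presumably boils down to: torch's own AdamW with fused=True, which performs the update as a single multi-tensor CUDA kernel. It needs CUDA parameters and a sufficiently recent torch (per the version note above); the layer sizes and hyperparameters are arbitrary.

```python
# Sketch: torch's native fused AdamW on a toy model (hyperparameters arbitrary).
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda()

# fused=True selects the fused multi-tensor CUDA implementation of AdamW
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, fused=True)

x = torch.randn(32, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```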