All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
- AdaScale: Added gradient accumulation feature (#202); see the AdaScale sketch after this list
- AdaScale: Added support for torch.lr_scheduler (#229)
- AdaScale: Fixed the smoothing factor value when using gradient accumulation (#235)
- Pipe: documentation on balancing functions (#243); see the Pipe sketch after this list
- Make sure the pip package includes header files (#221)
- ShardedDataParallel with autoreduce (#157); see the sharded-training sketch after this list
- CPU support for Pipe (#188)
- ShardedOptim: Distributed Grad Scaler (for torch AMP) (#182)
- OSS-aware clip grads, bridge sharded states (#167)
- oss: add rank_local_state_dict staticmethod (#174)
- support for PyTorch 1.7.0 (#171)
- Add implementation of AdaScale (#139)
- pip package install (#196, #200)
- multi-process pipe
- multiple OSS fixes
- MegaTron+OSS DDP fix
- add a DDP wrapper that works with OSS using reduce() instead of all_reduce() (#19)
- support for PyTorch v1.6
- add mixed precision Adam (#40)
- Adam optimizer state scaling (#44)
- properly restore a sharded optim state (#39)
- OSS restore state to proper device (#46)
- optim/oss: support optimizers with additional step kwargs (#53)
- optim/oss: fix state cast (#56)
- fix eval for oss_ddp (#55)
- optim/oss: work correctly with LRScheduler (#58)
- Initial release.
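
The AdaScale entries (#139, #202, #229) describe wrapping a standard torch optimizer. Below is a minimal, single-process sketch of what that usage might look like; the `num_gradients_to_accumulate` argument and handing the wrapper straight to `LambdaLR` are assumptions drawn from the changelog wording, not a verified API reference.

```python
# A hedged sketch, not fairscale's documented example: AdaScale wrapping SGD
# with gradient accumulation and a torch LR scheduler. The
# num_gradients_to_accumulate argument is an assumption from the changelog.
import torch
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR
from fairscale.optim import AdaScale

model = nn.Linear(16, 2)
data = [(torch.randn(8, 16), torch.randint(0, 2, (8,))) for _ in range(32)]
loss_fn = nn.CrossEntropyLoss()

# AdaScale proxies the inner optimizer, so a torch.lr_scheduler (#229) can be
# attached to it directly; world_size=1 keeps the sketch single-process.
optim = AdaScale(
    SGD(model.parameters(), lr=0.1),
    world_size=1,
    num_gradients_to_accumulate=2,  # gradient accumulation feature (#202)
)
scheduler = LambdaLR(optim, lr_lambda=lambda epoch: 0.95 ** epoch)

for step, (x, y) in enumerate(data):
    loss_fn(model(x), y).backward()
    if (step + 1) % 2 == 0:  # one optimizer step per accumulation window
        optim.step()
        optim.zero_grad()
scheduler.step()  # decay the LR once per pass over the toy dataset
```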
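
The sharded-training entries (#157, #167, #174, #182) combine the OSS sharded optimizer with a reduce-based data-parallel wrapper. The sketch below shows one plausible wiring, assuming the `ShardedDataParallel(module, sharded_optimizer)` form and an `OSS.clip_grad_norm` method; it spins up a one-rank gloo group purely so it can run standalone. The ShardedGradScaler mentioned in #182 would slot in where a regular torch.cuda.amp.GradScaler normally goes.

```python
# A hedged sketch of OSS + ShardedDataParallel, assuming the
# ShardedDataParallel(module, sharded_optimizer) form and an OSS.clip_grad_norm
# method; a real job would run one process per rank (e.g. via torchrun).
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from fairscale.nn.data_parallel import ShardedDataParallel
from fairscale.optim.oss import OSS

# One-rank gloo group so the sketch can run standalone.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 2))

# OSS shards optimizer state across ranks behind the usual torch.optim interface.
optimizer = OSS(params=model.parameters(), optim=torch.optim.SGD, lr=0.1)
# ShardedDataParallel reduces each gradient to the rank that owns it (#157).
model = ShardedDataParallel(model, optimizer)

loss_fn = nn.CrossEntropyLoss()
for _ in range(4):
    x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.clip_grad_norm(1.0)  # OSS-aware gradient clipping (#167)
    optimizer.step()

dist.destroy_process_group()
```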
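
The Pipe entries (#188, #243) add CPU support and document the balancing helpers. A minimal CPU-only sketch follows; repeating "cpu" in the `devices` argument and the `balance_by_time` helper name are assumptions based on the changelog, not verified signatures.

```python
# A hedged sketch of Pipe running entirely on CPU (#188). The devices argument
# (repeating "cpu" for both partitions) reflects an assumption, not a verified
# signature.
import torch
import torch.nn as nn
from fairscale.nn import Pipe

# Three layers split into two partitions: the `balance` list puts two layers
# on the first partition and one on the second. The balancing helpers
# documented in #243 (e.g. balance_by_time) can compute such a split
# automatically from a sample batch.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
pipe = Pipe(model, balance=[2, 1], devices=["cpu", "cpu"], chunks=4)

x = torch.randn(8, 16)   # batch of 8 split into 4 micro-batches
print(pipe(x).shape)     # torch.Size([8, 2])
```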