Commit
adjust local learning rate and decay according to gradient accumulation
Divide the local rate by `iter_size` so the accumulated gradient is normalized by the full minibatch size rather than only the computational batch size. Multiply the local decay by `iter_size` to counter the division of the local learning rate, since the decay is multiplied by the rate in the update equation.
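A minimal sketch of the intended arithmetic. The names (`base_lr`, `lr_mult`, `weight_decay`, `decay_mult`, `iter_size`) echo Caffe's solver conventions, but the function itself is illustrative, not the actual solver code:

```cpp
#include <vector>
#include <cstddef>

// Sketch of a per-parameter SGD update under gradient accumulation.
// accumulated_grads holds the gradient SUM over iter_size computational
// batches; the scaling below turns that sum into a full-minibatch update.
void sgd_update(std::vector<float>& weights,
                const std::vector<float>& accumulated_grads,
                float base_lr, float lr_mult,
                float weight_decay, float decay_mult,
                int iter_size) {
  // Dividing by iter_size converts the accumulated gradient sum into an
  // average over the full minibatch, not just one computational batch.
  float local_rate = base_lr * lr_mult / iter_size;
  // Multiplying the decay by iter_size cancels the division above, so the
  // effective decay term local_rate * local_decay =
  // base_lr * lr_mult * weight_decay * decay_mult is unchanged.
  float local_decay = weight_decay * decay_mult * iter_size;
  for (std::size_t i = 0; i < weights.size(); ++i) {
    // Update: w -= local_rate * (grad_sum + local_decay * w)
    weights[i] -= local_rate * (accumulated_grads[i] + local_decay * weights[i]);
  }
}
```

Expanding the decay term shows why both adjustments are needed: `local_rate * local_decay * w` works out to `base_lr * lr_mult * weight_decay * decay_mult * w`, with `iter_size` cancelling, so regularization strength is independent of how the minibatch is split.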