Abstract: First-order optimization methods are currently the mainstream for training deep neural networks (DNNs). Optimizers like Adam incorporate limited curvature information by preconditioning the stochastic gradient with a diagonal matrix during training. Although second-order optimization algorithms exhibit superior convergence properties compared to first-order counterparts such as Adam and SGD, their practicality for training DNNs remains limited by increased per-iteration computation and suboptimal accuracy relative to first-order methods. We present AdaFisher, an adaptive second-order optimizer that leverages a block-diagonal approximation of the Fisher information matrix for adaptive gradient preconditioning. AdaFisher aims to bridge the gap between enhanced convergence and computational efficiency in second-order optimization for training DNNs. Despite the traditionally slow pace of second-order optimizers, we show that AdaFisher can be reliably adopted for image classification and language modeling, and that it stands out for its stability and robustness to hyperparameter tuning. We demonstrate that AdaFisher outperforms SOTA optimizers in terms of both accuracy and convergence speed.
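As a rough intuition for the preconditioning described above, the sketch below shows how the diagonal of a Kronecker-factored Fisher block can rescale a layer's gradient. This is a conceptual sketch only, not AdaFisher's actual update rule (which additionally maintains running estimates and momentum inside the optimizer); all shapes and names here are hypothetical.

import numpy as np

def diag_kron(diag_A, diag_B):
    # Diagonal of kron(A, B) computed from the diagonals of A and B:
    # diag(A ⊗ B) = outer(diag(A), diag(B)) flattened row-wise.
    return np.outer(diag_A, diag_B).reshape(-1)

rng = np.random.default_rng(0)
# Symmetric PSD stand-ins for one layer's two Kronecker factors.
A = rng.standard_normal((4, 4))
A = A @ A.T                                    # stand-in for the activation factor
B = rng.standard_normal((3, 3))
B = B @ B.T                                    # stand-in for the gradient factor

g = rng.standard_normal(4 * 3)                 # flattened layer gradient
d = diag_kron(np.diag(A), np.diag(B))          # diagonal of the Fisher block
preconditioned_g = g / (d + 1e-8)              # diagonal (Adam-style) preconditioning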
- Enhanced Kronecker Factor Analysis and Optimizer Trajectory Visualization on Loss Landscapes
- Image Classification
- Language Modeling
- Distributed Training
- Getting Started
- License
- Citation
Overview: We analyzed the Kronecker factors using three complementary techniques (a short illustrative sketch follows the list below):
- Gershgorin Circle Theorem: Used to estimate eigenvalue bounds, providing insights into layer conditioning and stability during training.
- Eigenvalue Perturbation: Adding noise to the off-diagonal elements showed that eigenvalues above the Kaiser rule remain unchanged, highlighting the diagonal dominance of the Kronecker factors.
- Frequency Field Analysis: Leveraged FFT to identify dominant frequency patterns in Kronecker factors, shedding light on learning dynamics and feature extraction.
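As a concrete illustration of the first two points, the snippet below computes Gershgorin-circle eigenvalue bounds for a symmetric matrix standing in for a Kronecker factor. It is a minimal sketch using a random stand-in factor, not the repository's analysis code.

import numpy as np

def gershgorin_bounds(M):
    # Gershgorin circle theorem: every eigenvalue of M lies in at least one
    # disc centered at M[i, i] with radius sum_{j != i} |M[i, j]|.
    centers = np.diag(M)
    radii = np.sum(np.abs(M), axis=1) - np.abs(centers)
    return (centers - radii).min(), (centers + radii).max()

# Random symmetric stand-in for a Kronecker factor (e.g. an activation covariance).
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 16))
H = X.T @ X / X.shape[0]

lo, hi = gershgorin_bounds(H)
eigs = np.linalg.eigvalsh(H)
print(f"Gershgorin bounds: [{lo:.3f}, {hi:.3f}]")
print(f"True eigenvalue range: [{eigs.min():.3f}, {eigs.max():.3f}]")
# Small radii relative to the diagonal entries indicate diagonal dominance,
# which is what the eigenvalue-perturbation experiment above probes.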
We also developed a new technique that visualizes optimizer trajectories on the loss landscape for a better understanding of model behavior (please refer to Visualization_Loss_Landscape.ipynb).
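For orientation, here is a rough sketch of one common way such visualizations are built: project the saved parameter checkpoints onto their two leading PCA directions and evaluate the loss on that 2D plane. This is an assumption-based outline and may differ from the notebook's exact method; `checkpoints` and `eval_loss` are hypothetical inputs you would supply.

import numpy as np

def trajectory_plane(checkpoints):
    # checkpoints: list of flattened parameter vectors saved during training.
    W = np.stack(checkpoints)                    # (num_checkpoints, num_params)
    center = W[-1]                               # use the final iterate as the origin
    D = W - center
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    d1, d2 = Vt[0], Vt[1]                        # two leading PCA directions
    coords = np.stack([D @ d1, D @ d2], axis=1)  # 2D trajectory coordinates
    return center, d1, d2, coords

def loss_grid(center, d1, d2, eval_loss, span=1.0, steps=25):
    # Evaluate the loss over a grid in the (d1, d2) plane around `center`;
    # `eval_loss` maps a flattened parameter vector to a scalar loss.
    xs = np.linspace(-span, span, steps)
    ys = np.linspace(-span, span, steps)
    Z = np.empty((steps, steps))
    for i, a in enumerate(xs):
        for j, b in enumerate(ys):
            Z[i, j] = eval_loss(center + a * d1 + b * d2)
    return xs, ys, Z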
This repository provides all necessary code to employ AdaFisher and compare it with other leading SOTA optimizers. It is structured based on the RMSGD framework (paper branch). For additional details, refer to the RMSGD project.
To facilitate replication of our experimental results, we include multiple configuration files within the configs directory. To run a training session using these configurations, execute the following command:
python train.py --config config.yaml
For a list of available command-line options, you can run:
python train.py --help
AdaFisher and AdaFisherW can be seamlessly integrated into your training pipeline like any standard optimizer. Here’s how to set them up with your model:
from AdaFisher import AdaFisher, AdaFisherW
...
optimizer = AdaFisher(network, ...)  # or AdaFisherW(network, ...)
...
for input, target in data_loader:
    optimizer.zero_grad()
    output = network(input)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
Important: When initializing AdaFisher, pass the entire network object to the optimizer, not just its parameters. This ensures that the optimizer has full context of the model architecture, which is essential for leveraging the Fisher information effectively. Also, avoid using in-place operations in the model. For instance:
from torch.nn import ReLU

# Instead of using in-place addition:
out = layer1
out = out + layer5  # Correct way: out-of-place addition
# Avoid this:
# out += layer5

# Instead of using in-place ReLU:
relu = ReLU(inplace=False)  # Correct way: out-of-place ReLU
# Avoid this:
# relu = ReLU(inplace=True)
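To make the guidance concrete, here is an illustrative residual block written entirely with out-of-place operations (a sketch of the pattern above, not code taken from the repository):

import torch.nn as nn

class ResidualBlock(nn.Module):
    # A residual block that follows the guidance above: no in-place additions
    # and no in-place activations.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=False)   # out-of-place ReLU

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        out = out + x                        # out-of-place skip connection
        return self.relu(out)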
To train using pretrained weights, utilize the following command:
python train.py --config config.yaml --pretrained
This allows the optimizers to leverage previously learned weights, potentially improving convergence speed and final model accuracy on complex image classification tasks.
This section of the repository focuses on training various optimizers using the WikiText-2 dataset with a compact GPT-1 network.
The dataset for this project is hosted on Google Drive. You can download it using the following link: Download Dataset
To replicate our experimental results accurately, we provide a series of executable bash scripts in the Language_Model directory. These scripts are pre-configured with all necessary settings to facilitate straightforward experimentation.
For detailed information on all available command-line options, execute the following command:
python train.py --help
AdaFisher is fully compatible with multi-GPU environments through its distributed version. To enable this functionality, ensure that the dist parameter in the AdaFisher YAML configuration file is set to True when using a distributed environment.
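For example, in the YAML configuration file (only the dist key below comes from the documentation above; any other settings depend on your experiment):

dist: True  # enable the distributed version of AdaFisher

The training script can then be launched with a standard multi-process launcher, for instance torchrun (the choice of launcher is an assumption here, not something prescribed by the repository):

torchrun --nproc_per_node=4 train.py --config config.yaml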
To set up the required environment, you can either use pip or conda.
You can install all required packages using:
pip install -r requirements.txt
pip install notebook
These commands require Python 3.9 or later.
First, create a new Conda environment with Python:
conda create -n AdaFisher python
Activate the environment:
conda activate AdaFisher
Then, install the required packages from the requirements.txt file:
pip install -r requirements.txt
Finally, register the environment as a Jupyter kernel so you can run the notebooks:
conda install -n AdaFisher ipykernel --update-deps --force-reinstall
This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.
If you find this repository useful, please consider giving a star ⭐ and citing our work 📝:)
@inproceedings{gomes2025adafisher,
title={AdaFisher: Adaptive Second Order Optimization via Fisher Information},
author={Damien MARTINS GOMES and Yanlei Zhang and Eugene Belilovsky and Guy Wolf and Mahdi S. Hosseini},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=puTxuiK2qO}
}