This repository hosts the PyTorch implementation of the paper "Adaptive-saturated RNN: Remember more with less instability" (ICLR 2023, Tiny Paper Track).
Authors: Khoi Minh Nguyen-Duy, Quang Pham and Binh Thanh Nguyen
Please open a GitHub issue or email ngtbinh@hcmus.edu.vn if you need further information.
If you find the paper or the source code useful, please consider supporting our work by citing:
@misc{
nguyen-duy2023adaptivesaturated,
title={Adaptive-saturated {RNN}: Remember more with less instability},
author={Khoi Minh Nguyen-Duy and Quang Pham and Binh T. Nguyen},
year={2023},
url={https://openreview.net/forum?id=Ihzsru2bw2}
}
Orthogonal parameterization has offered a compelling solution to the vanishing gradient problem (VGP) in recurrent neural networks (RNNs). Thanks to orthogonal parameters and non-saturated activation functions, gradients in such models are constrained to unit norms. On the other hand, although traditional vanilla RNNs have been observed to possess higher memory capacity, they suffer from the VGP and perform poorly in many applications. This work connects the two approaches by proposing Adaptive-Saturated RNNs (asRNN), a variant that dynamically adjusts its saturation level between them. Consequently, asRNN enjoys both the capacity of a vanilla RNN and the training stability of orthogonal RNNs. Our experiments show encouraging results for asRNN on challenging sequence learning benchmarks compared to several strong competitors.
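A short worked derivation may make the contrast between the two families concrete. It assumes the standard recurrence $h_t = \phi(z_t)$ with $z_t = W_{hh} h_{t-1} + W_{xh} x_t + b$; the symbols $\phi$, $z_t$, $W_{xh}$, $b$ and $D_k$ are notation introduced here for illustration and are not taken from the paper.

$$
\frac{\partial h_T}{\partial h_t} = D_T W_{hh} \, D_{T-1} W_{hh} \cdots D_{t+1} W_{hh}, \qquad D_k = \mathrm{diag}\big(\phi'(z_k)\big)
$$

$$
W_{hh}^\top W_{hh} = I \ \text{and} \ D_k = I \ \text{for all } k \quad \Longrightarrow \quad \left\lVert \frac{\partial h_T}{\partial h_t} \right\rVert_2 = \left\lVert W_{hh}^{\,T-t} \right\rVert_2 = 1
$$

$$
\phi = \tanh \ \Rightarrow \ \phi'(z) = 1 - \tanh^2(z) \in (0, 1], \qquad \left\lVert \frac{\partial h_T}{\partial h_t} \right\rVert_2 \le \prod_{k=t+1}^{T} \lVert D_k \rVert_2 \, \lVert W_{hh} \rVert_2
$$

In the second line the activation is in its non-saturated regime, so the Jacobian norm stays at one. In the third line, even with an orthogonal $W_{hh}$ (so $\lVert W_{hh} \rVert_2 = 1$), each factor $\lVert D_k \rVert_2$ falls below one as units saturate, and the bound can decay toward zero over long horizons.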
Formulation We formally define the hidden cell of asRNN as:
- Details of the implementation can be found in Appendix A.4 of our paper
- Details of the hyperparameter setting can be found in Hyperparameter.md
Model | #PARAMS | hidden_size | sMNIST | pMNIST |
---|---|---|---|---|
asRNN | ||||
expRNN | ||||
scoRNN | ||||
asRNN | ||||
expRNN | ||||
scoRNN | ||||
LSTM | ||||
asRNN | ||||
expRNN | ||||
scoRNN |
Recall Length
Recall Length
Model | #PARAMS | hidden_size | ||
---|---|---|---|---|
LSTM | ||||
asRNN | ||||
expRNN |
To replicate the asRNN results, use the default settings. Otherwise, please refer to the hyperparameter settings in Hyperparameter.md.
Linux environment variable:
export 'CUBLAS_WORKSPACE_CONFIG=:4096:8'
echo $CUBLAS_WORKSPACE_CONFIG
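The variable above is the one that PyTorch's reproducibility notes ask for when deterministic cuBLAS kernels are requested on CUDA 10.2 or later. The following is a minimal Python sketch of an equivalent setup, assuming it runs before any CUDA work starts. `configure_determinism` is a hypothetical helper, not a function from this repository, and whether the training scripts call `torch.use_deterministic_algorithms` themselves is an assumption.

```python
import os
import random


def configure_determinism(seed: int) -> str:
    """Set the cuBLAS workspace variable and seed the available RNGs.

    Hypothetical helper for illustration; not part of this repository.
    """
    # Same value as the shell export; must be set before CUDA is initialised.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    random.seed(seed)
    try:
        import torch  # optional, so the sketch also runs without PyTorch
    except ImportError:
        return os.environ["CUBLAS_WORKSPACE_CONFIG"]
    torch.manual_seed(seed)
    # Raise an error instead of silently using nondeterministic kernels.
    torch.use_deterministic_algorithms(True)
    return os.environ["CUBLAS_WORKSPACE_CONFIG"]


if __name__ == "__main__":
    print(configure_determinism(seed=0))
```

Setting the variable from Python only helps if it happens before the first CUDA call, which is why exporting it in the shell is the more robust choice.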
python copytask.py [args]
Options:
- recall_length
- delay_length
- random-seed
- iterations
- batch_size
- hidden_size
- rmsprop_lr: learning rate
- rmsprop_constr_lr: learning rate of $W_{hh}$
- alpha: rmsprop smoothing constant
- clip_norm: norm threshold for gradient clipping. Set negative to disable.
- mode: choices=["exprnn", "dtriv", "cayley", "lstm", "rnn"] (see https://github.com/Lezcano/expRNN)
- init: choices=["cayley", "henaff"] - $\ln(W_{hh})$ initialization scheme
- nonlinear: choices=["asrnn", "modrelu"]
- a: asRNN hyperparameter
- b: asRNN hyperparameter
- eps: asRNN hyperparameter
- rho_rat_den: $\frac{1}{\rho}$, a hyperparameter for scoRNN.
- forget_bias
- K: see here
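To make `recall_length` and `delay_length` concrete, here is a small sketch of how one example of the copying memory task is commonly constructed in the orthogonal-RNN literature. It assumes an alphabet of 8 symbols plus a blank (0) and a marker (9); the function names and the `n_symbols` parameter are hypothetical, and `copytask.py` may build its batches differently.

```python
import math
import random


def make_copy_example(recall_length, delay_length, n_symbols=8, rng=None):
    """Build one (inputs, targets) pair for the copying memory task.

    Assumed layout: payload, blanks for the delay, one marker, then blanks
    while the model is expected to reproduce the payload.
    """
    rng = rng or random.Random()
    blank, marker = 0, n_symbols + 1
    payload = [rng.randint(1, n_symbols) for _ in range(recall_length)]
    inputs = (payload + [blank] * delay_length
              + [marker] + [blank] * (recall_length - 1))
    targets = [blank] * (recall_length + delay_length) + payload
    return inputs, targets


def memoryless_baseline(recall_length, delay_length, n_symbols=8):
    """Cross-entropy of a model that outputs blanks, then guesses uniformly."""
    total_steps = delay_length + 2 * recall_length
    return recall_length * math.log(n_symbols) / total_steps


if __name__ == "__main__":
    x, y = make_copy_example(recall_length=3, delay_length=5,
                             rng=random.Random(0))
    print(x)
    print(y)
    print(memoryless_baseline(recall_length=10, delay_length=1000))
```

Both sequences have length `delay_length + 2 * recall_length`, and a model only beats the memoryless baseline if it carries the payload across the whole delay.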
python MNIST.py [args]
Options:
- permute: True or False (pMNIST or sMNIST).
- random-seed
- epochs
- batch_size
- hidden_size
- rmsprop_lr: learning rate
- rmsprop_constr_lr: learning rate of $W_{hh}$
- alpha: rmsprop smoothing constant
- clip_norm: norm threshold for gradient clipping. Set negative to disable.
- mode: choices=["exprnn", "dtriv", "cayley", "lstm", "rnn"] (see https://github.com/Lezcano/expRNN)
- init: choices=["cayley", "henaff"] - $\ln(W_{hh})$ initialization scheme
- nonlinear: choices=["asrnn", "modrelu"]
- a: asRNN hyperparameter
- b: asRNN hyperparameter
- eps: asRNN hyperparameter
- rho_rat_den: $\frac{1}{\rho}$, a hyperparameter for scoRNN.
- forget_bias
- K: see here
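The `permute` option switches between sMNIST and pMNIST. In the usual setup, each 28x28 image is fed to the RNN one pixel at a time, and pMNIST applies one fixed random permutation to every pixel sequence. The sketch below illustrates that idea with plain Python lists; `make_permutation` and `to_pixel_sequence` are hypothetical names, and `MNIST.py` may implement the permutation differently.

```python
import random


def make_permutation(n_pixels=784, seed=0):
    """Draw one fixed permutation that is reused for every image."""
    perm = list(range(n_pixels))
    random.Random(seed).shuffle(perm)
    return perm


def to_pixel_sequence(image, permutation=None):
    """Flatten a 2-D image row by row into a length-784 sequence.

    With permutation=None this is the sMNIST ordering; with a fixed
    permutation it is the pMNIST ordering.
    """
    flat = [pixel for row in image for pixel in row]
    if permutation is not None:
        flat = [flat[i] for i in permutation]
    return flat


if __name__ == "__main__":
    # Toy "image" whose pixel values equal their row-major index.
    image = [[r * 28 + c for c in range(28)] for r in range(28)]
    perm = make_permutation(seed=0)
    print(to_pixel_sequence(image)[:5])
    print(to_pixel_sequence(image, perm)[:5])
```

The permutation must stay the same across training and evaluation; drawing a new one per image would destroy the structure the model is meant to learn.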
Prepare the dataset:
- Download the dataset here
- Extract `ptb.char.train.txt`, `ptb.char.valid.txt`, `ptb.char.test.txt` from ./simple-examples/data into ./Dataset/PTB
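The extraction step can also be scripted. The sketch below copies the three character-level files from the unpacked archive into the dataset folder, assuming the archive has already been downloaded and unpacked so that `./simple-examples/data` exists. `copy_ptb_char_files` is a hypothetical helper, not part of this repository.

```python
import shutil
from pathlib import Path

PTB_FILES = ("ptb.char.train.txt", "ptb.char.valid.txt", "ptb.char.test.txt")


def copy_ptb_char_files(src_dir="./simple-examples/data",
                        dst_dir="./Dataset/PTB"):
    """Copy the three PTB character files into the dataset folder."""
    src, dst = Path(src_dir), Path(dst_dir)
    # Fail early with a clear message if the archive was not unpacked.
    missing = [name for name in PTB_FILES if not (src / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {src}: {', '.join(missing)}")
    dst.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(src / name, dst / name)) for name in PTB_FILES]


if __name__ == "__main__":
    for path in copy_ptb_char_files():
        print(path)
```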
python pennchar.py [args]
Options:
- bptt: back propagation through time length
- emsize: size of word embedding
- log-interval: report batch interval
- epochs: number of iterations
- batch_size
- hidden_size
- rmsprop_lr: learning rate
- rmsprop_constr_lr: learning rate of $W_{hh}$
- alpha: rmsprop smoothing constant
- clip_norm: norm threshold for gradient clipping. Set negative to disable.
- mode: choices=["exprnn", "dtriv", "cayley", "lstm", "rnn"] (see https://github.com/Lezcano/expRNN)
- init: choices=["cayley", "henaff"] - $\ln(W_{hh})$ initialization scheme
- nonlinear: choices=["asrnn", "modrelu"]
- a: asRNN hyperparameter
- b: asRNN hyperparameter
- eps: asRNN hyperparameter
- rho_rat_den: $\frac{1}{\rho}$, a hyperparameter for scoRNN.
- forget_bias
- K: see here
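The `bptt` option sets how many time steps each truncated back-propagation window covers. As a rough illustration, the sketch below splits a character stream into input/target windows of at most `bptt` steps, with targets shifted by one position for next-character prediction. `bptt_windows` is a hypothetical name, and the batching in `pennchar.py` may differ (for example, by processing several streams in parallel).

```python
def bptt_windows(tokens, bptt):
    """Yield (inputs, targets) windows of at most `bptt` steps.

    Targets are the inputs shifted by one position, so the final token
    of the stream only ever appears as a target.
    """
    for start in range(0, len(tokens) - 1, bptt):
        length = min(bptt, len(tokens) - 1 - start)
        inputs = tokens[start:start + length]
        targets = tokens[start + 1:start + 1 + length]
        yield inputs, targets


if __name__ == "__main__":
    for inputs, targets in bptt_windows(list("abcdefgh"), bptt=3):
        print(inputs, targets)
```

Gradients are propagated only within a window, so a larger `bptt` lets the model learn longer dependencies at the cost of more memory per update.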