Activation functions (or transfer functions) are linear or non-linear equations which process the output of a Neural Network neuron and bound it within a limited range of values (commonly $[0, 1]$ or $[-1, 1]$).
Many activation functions have been proposed over the years and each one has its own characteristics, but no well-defined field of application.
The best activation function to use in a given situation (i.e., for a particular problem) is still an open question.
Each one has its pros and cons in particular situations, so every Neural Network library implements a wide range of them and leaves it to the user to perform his own tests.
In the following Table we show the list of activation functions implemented in our NumPyNet and Byron libraries, with their mathematical formulation and corresponding derivative (ref. activations.py for the code implementation).
An important feature of any activation function, in fact, is its differentiability, since the main procedure of model optimization relies on the back-propagation of the error gradients.
Name | Equation | Derivative |
---|---|---|
Linear | $$f(x) = x$$ | $$f'(x) = 1$$ |
Logistic | $$f(x) = \frac{1}{1 + e^{-x}}$$ | $$f'(x) = (1 - f(x)) * f(x)$$ |
Loggy | $$f(x) = \frac{2}{1 + e^{-x}} - 1$$ | $$f'(x) = 2 * \left(1 - \frac{f(x) + 1}{2}\right) * \frac{f(x) + 1}{2}$$ |
Relu | $$f(x) = \left\{\begin{array}{rlr} x & \mbox{if} & x > 0 \\ 0 & \mbox{if} & x \leq 0 \end{array}\right.$$ | $$f'(x) = \left\{\begin{array}{rlr} 1 & \mbox{if} & f(x) > 0 \\ 0 & \mbox{if} & f(x) \leq 0 \end{array}\right.$$ |
Elu | $$f(x) = \left\{\begin{array}{rlr} x & \mbox{if} & x \geq 0 \\ e^x - 1 & \mbox{if} & x < 0 \end{array}\right.$$ | $$f'(x) = \left\{\begin{array}{rlr} 1 & \mbox{if} & f(x) \geq 0 \\ f(x) + 1 & \mbox{if} & f(x) < 0 \end{array}\right.$$ |
Relie | $$f(x) = \left\{\begin{array}{rlr} x & \mbox{if} & x > 0 \\ x * 1e-2 & \mbox{if} & x \leq 0 \end{array}\right.$$ | $$f'(x) = \left\{\begin{array}{rlr} 1 & \mbox{if} & f(x) > 0 \\ 1e-2 & \mbox{if} & f(x) \leq 0 \end{array}\right.$$ |
Ramp | $$f(x) = \left\{\begin{array}{rlr} x^2 + 0.1 * x^2 & \mbox{if} & x > 0 \\ 0 & \mbox{if} & x \leq 0 \end{array}\right.$$ | $$f'(x) = \left\{\begin{array}{rlr} f(x) + 1 & \mbox{if} & f(x) > 0 \\ f(x) & \mbox{if} & f(x) \leq 0 \end{array}\right.$$ |
Tanh | $$f(x) = \tanh(x)$$ | $$f'(x) = 1 - f(x)^2$$ |
Plse | $$f(x) = \left\{\begin{array}{rlr} (x + 4) * 1e-2 & \mbox{if} & x < -4 \\ (x - 4) * 1e-2 + 1 & \mbox{if} & x > 4 \\ x * 0.125 + 0.5 & \mbox{if} & -4 \leq x \leq 4 \end{array}\right.$$ | $$f'(x) = \left\{\begin{array}{rlr} 1e-2 & \mbox{if} & f(x) < 0 \mbox{ or } f(x) > 1 \\ 0.125 & \mbox{if} & 0 \leq f(x) \leq 1 \end{array}\right.$$ |
Leaky | $$f(x) = \left\{\begin{array}{rlr} x * C & \mbox{if} & x \leq 0 \\ x & \mbox{if} & x > 0 \end{array}\right.$$ | $$f'(x) = \left\{\begin{array}{rlr} 1 & \mbox{if} & f(x) > 0 \\ C & \mbox{if} & f(x) \leq 0 \end{array}\right.$$ |
HardTan | $$f(x) = \left\{\begin{array}{rlr} -1 & \mbox{if} & x < -1 \\ +1 & \mbox{if} & x > 1 \\ x & \mbox{if} & -1 \leq x \leq 1 \end{array}\right.$$ | $$f'(x) = \left\{\begin{array}{rlr} 0 & \mbox{if} & f(x) < -1 \mbox{ or } f(x) > 1 \\ 1 & \mbox{if} & -1 \leq f(x) \leq 1 \end{array}\right.$$ |
LhTan | $$f(x) = \left\{\begin{array}{rlr} x * 1e-3 & \mbox{if} & x < 0 \\ (x - 1) * 1e-3 + 1 & \mbox{if} & x > 1 \\ x & \mbox{if} & 0 \leq x \leq 1 \end{array}\right.$$ | $$f'(x) = \left\{\begin{array}{rlr} 1e-3 & \mbox{if} & f(x) < 0 \mbox{ or } f(x) > 1 \\ 1 & \mbox{if} & 0 \leq f(x) \leq 1 \end{array}\right.$$ |
Selu | $$f(x) = \left\{\begin{array}{rlr} 1.0507 * 1.6732 * (e^x - 1) & \mbox{if} & x < 0 \\ x * 1.0507 & \mbox{if} & x \geq 0 \end{array}\right.$$ | $$f'(x) = \left\{\begin{array}{rlr} f(x) + 1.0507 * 1.6732 & \mbox{if} & f(x) < 0 \\ 1.0507 & \mbox{if} & f(x) \geq 0 \end{array}\right.$$ |
SoftPlus | $$f(x) = \log(1 + e^x)$$ | $$f'(x) = \frac{e^x}{1 + e^x}$$ |
SoftSign | $$f(x) = \frac{x}{\vert x \vert + 1}$$ | $$f'(x) = \frac{1}{(\vert x \vert + 1)^2}$$ |
Elliot | $$f(x) = \frac{\frac{1}{2} * S * x}{1 + \vert x * S \vert} + \frac{1}{2}$$ | $$f'(x) = \frac{\frac{1}{2} * S}{(1 + \vert x * S \vert)^2}$$ |
SymmElliot | $$f(x) = \frac{S * x}{1 + \vert x * S \vert}$$ | $$f'(x) = \frac{S}{(1 + \vert x * S \vert)^2}$$ |

In the formulas above, $C$ and $S$ denote fixed coefficients (the Leaky slope and the Elliot steepness, respectively).
As can be seen in the Table, in most cases it is easier to compute the derivative of the activation function as a function of the activation output itself. This is a well-known and important computational optimization, since it reduces the number of operations and allows the backward gradient to be applied directly to the already-computed output.
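As a minimal illustration of this point, the sketch below (plain NumPy, not the NumPyNet implementation itself; the function names are ours) computes the Logistic activation in the forward pass and then evaluates its gradient using only the stored output $f(x)$, exactly as reported in the Table.

```python
import numpy as np

def logistic_activate(x):
  '''Forward pass: f(x) = 1 / (1 + exp(-x)).'''
  return 1. / (1. + np.exp(-x))

def logistic_gradient(y, delta):
  '''Backward pass: f'(x) = (1 - f(x)) * f(x), evaluated on the
  stored output y = f(x); the original input x is not needed.'''
  return delta * (1. - y) * y

x = np.random.uniform(low=-2., high=2., size=(5,))
y = logistic_activate(x)   # output stored during the forward pass
delta = np.ones_like(y)    # error gradient coming from the next layer
grad = logistic_gradient(y, delta)
```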
To better understand the effects of activation functions, we can apply these functions to a test image. This can be easily done using the example scripts provided inside our NumPyNet library.
In Fig. 1 the effects of the previously described functions on a test image are reported.
For each function we show the output of the activation function and its gradient.
For visualization purposes the image values have been rescaled before applying each activation function.
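A possible sketch of such a test is shown below (plain NumPy and Matplotlib; the file name `test_image.png` is a placeholder, and this is not the NumPyNet example script itself): the image is rescaled, the Relu activation and its gradient from the Table are applied, and the three results are displayed side by side.

```python
import numpy as np
import matplotlib.pyplot as plt

# Load a generic test image (placeholder file name) and convert it to grayscale.
image = plt.imread('test_image.png').astype(float)
if image.ndim == 3:
  image = image.mean(axis=-1)

# Rescale the pixel values so that both positive and negative inputs
# are fed to the activation function.
image = 2. * (image - image.min()) / (image.max() - image.min()) - 1.

def relu(x):
  return np.maximum(x, 0.)

def relu_gradient(y):
  # derivative expressed as a function of the output, as in the Table
  return (y > 0.).astype(float)

activated = relu(image)
gradient = relu_gradient(activated)

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12, 4))
for ax, img, title in zip(axes, (image, activated, gradient),
                          ('Input', 'Relu output', 'Relu gradient')):
  ax.imshow(img, cmap='gray')
  ax.set_title(title)
  ax.axis('off')
plt.show()
```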
From the results shown in the Figure we can better appreciate the differences between the various mathematical formulations: a simple Logistic function does not produce evident effects on the test image, while a Relu activation tends to overshadow the image pixels, setting the negative values to zero. This feature of the Relu activation function is very useful in Neural Network models and also determines important theoretical consequences, which have made it one of the most prominent solutions for many Neural Network models.
The ReLU (Rectified Linear Unit) activation function is, in fact, the most widely used in modern Neural Network models. Its popularity is due to its numerical efficiency and to the benefits it brings [Glorot2011Relu]:
- Information disentangling: the main purpose of a Neural Network model is to tune a discriminant function able to associate a set of inputs to prior-known output classes. A dense information representation is considered entangled because small differences in the input strongly modify the data representation inside the network. On the other hand, a sparse representation tends to guarantee the conservation of the learned features.
- Information representation: different inputs can carry different amounts of useful information. The possibility of having null values in output (ref. Table) allows a better use of the representation dimensions inside the network, which can vary with the input.
- Sparsity: a sparse representation of the data is exponentially more efficient than a dense one, where the exponent is given by the number of non-null features [Glorot2011Relu] (see the sketch after this list).
- Vanishing gradient reduction: if the activation output is positive the gradient is constant, so it does not saturate toward zero as in bounded activation functions.
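The sparsity and gradient properties listed above can be verified numerically with a short sketch (plain NumPy on generic random inputs, not tied to any specific model): the Relu produces a large fraction of exactly-zero outputs and a gradient equal to 1 on every active unit, while the Logistic gradient is bounded by 0.25 and shrinks for large inputs.

```python
import numpy as np

np.random.seed(42)
x = np.random.randn(10000)  # generic pre-activation values

relu_out = np.maximum(x, 0.)
logistic_out = 1. / (1. + np.exp(-x))

# Sparsity: fraction of exactly-zero activations
print('Relu sparsity    :', np.mean(relu_out == 0.))      # about 0.5
print('Logistic sparsity:', np.mean(logistic_out == 0.))  # 0.0

# Gradient behaviour (derivatives from the Table, as functions of the output)
relu_grad = (relu_out > 0.).astype(float)
logistic_grad = (1. - logistic_out) * logistic_out
print('Relu gradient on active units:', relu_grad[relu_out > 0.].min())  # 1.0
print('max Logistic gradient        :', logistic_grad.max())             # <= 0.25
```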
In the next sections we will discuss different kinds of Neural Network models; in all of them we chose to use the Relu activation function in the majority of the layers.
Footnotes

- We assume for simplicity a fully connected Neural Network neuron.