Activation Function
Sigmoid Function
Sigmoid is an activation function used in neural networks. It takes a real-valued number and maps it to a value between 0 and 1, which makes it useful for decision making and binary classification.
The sigmoid function is defined as

\sigma(x) = \frac{1}{1 + e^{-x}}

where x is the input and e is the base of the natural logarithm, approximately equal to 2.71828. The output approaches 0 for large negative values of x and approaches 1 for large positive values. The function has an "S"-shaped curve, which makes it smooth and continuous.
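As a small illustration, here is a minimal NumPy sketch of the sigmoid function; the function name and the sample inputs are chosen for illustration and are not from the text.

import numpy as np

def sigmoid(x):
    # 1 / (1 + e^(-x)), applied element-wise
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))
# Large negative inputs map close to 0 and large positive inputs close to 1,
# roughly [4.5e-05, 0.269, 0.5, 0.731, 0.99995].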
The derivative of the sigmoid function is

\sigma'(x) = \sigma(x)\,(1 - \sigma(x))

This derivative is used during backpropagation to update the weights. However, its value is at most 0.25, so the gradients become very small as they are propagated backward through the layers of the network during training, and the learning process slows down. This is the vanishing gradient problem: the more layers the network has, the more times these small derivatives are multiplied together as they are backpropagated, so the gradient becomes exponentially small and effectively vanishes.
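The shrinking effect can be seen with a short sketch that multiplies the per-layer sigmoid derivative together; the depth of 10 layers and the pre-activation value of 0 (where the derivative is largest, 0.25) are illustrative assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)), never larger than 0.25
    s = sigmoid(x)
    return s * (1.0 - s)

grad = 1.0
for layer in range(1, 11):
    grad *= sigmoid_derivative(0.0)
    print(f"after {layer:2d} layers the gradient factor is {grad:.2e}")
# The product shrinks like 0.25**n, i.e. exponentially fast
# (about 9.5e-07 after 10 layers).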
tanh Function
tanh is another activation function used in neural networks. Like the sigmoid function, it introduces non-linearity into the model. The tanh function is defined as

\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}

where x is the input and e is the base of the natural logarithm.
The output lies in the range -1 to 1: it approaches 1 for large positive values and -1 for large negative values.
The tanh function can lead to faster convergence during training because it is zero-centered: negative inputs are mapped to strongly negative outputs and inputs near zero are mapped to outputs near zero, so weight updates can move in either direction.
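A small sketch, assuming NumPy, that computes tanh directly from the exponentials above, checks it against the built-in np.tanh, and shows the zero-centered behaviour on a symmetric set of sample inputs (the inputs are illustrative):

import numpy as np

def tanh_manual(x):
    # (e^x - e^-x) / (e^x + e^-x)
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(tanh_manual(x))                            # all outputs lie between -1 and 1
print(np.allclose(tanh_manual(x), np.tanh(x)))   # True: matches the built-in
print(tanh_manual(x).mean())                     # approximately 0: zero-centered,
                                                 # unlike sigmoid, whose outputs are all positive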
The derivative of the tanh function is

\frac{d}{dx}\tanh(x) = 1 - \tanh^{2}(x)

This derivative is used in backpropagation to compute the gradients.
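As a sketch, the analytic derivative can be checked against a numerical finite-difference estimate; the test point x0 = 0.5 and the step size are arbitrary choices for illustration.

import numpy as np

def tanh_derivative(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

x0, h = 0.5, 1e-6
analytic = tanh_derivative(x0)
numeric = (np.tanh(x0 + h) - np.tanh(x0 - h)) / (2 * h)
print(analytic, numeric)                 # both are about 0.7864
print(abs(analytic - numeric) < 1e-8)    # True: the two estimates agree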
tanh is popular in recurrent neural networks (RNNs).
ReLU Function
The Rectified Linear Unit (ReLU) activation is widely used in deep learning, particularly in convolutional neural networks (CNNs).
The ReLU function is defined as:

f(x) = \max(0, x)

If x > 0, then f(x) = x.
If x \le 0, then f(x) = 0.
ReLU captures non-linear relationships in the data. It outputs zero for all negative inputs, which leads to sparsity, that is, some neurons being turned off for certain inputs. This improves model efficiency.
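A minimal sketch of ReLU applied to a small batch of values, showing the zeros (sparsity) it produces; the sample values are assumptions made for illustration.

import numpy as np

def relu(x):
    # max(0, x), applied element-wise
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
out = relu(x)
print(out)                   # [0.  0.  0.  0.5 2. ]
# Every non-positive input is mapped to exactly 0, so those neurons are
# "off" for this input -- a sparse activation pattern.
print(np.mean(out == 0.0))   # fraction of inactive outputs: 0.6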
The derivative of ReLU, denoted as f'(x), is given by:

f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \end{cases}

(The derivative is undefined at x = 0; in practice it is usually set to 0 or 1 there.) This derivative is used in the backpropagation process, and because it is exactly 1 for positive inputs, the gradient does not shrink as it is backpropagated through the layers, which avoids the vanishing gradient problem seen with sigmoid and tanh.
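A short sketch comparing the per-layer gradient factor of ReLU and sigmoid for an active (positive) unit; treating the ReLU derivative at exactly 0 as 0 is a common convention assumed here, and the depth of 10 layers and the input value of 1.0 are illustrative.

import numpy as np

def relu_derivative(x):
    # 1 for x > 0, 0 for x <= 0 (the value at exactly 0 is a convention)
    return np.where(x > 0, 1.0, 0.0)

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

relu_grad, sigmoid_grad = 1.0, 1.0
for _ in range(10):
    relu_grad *= relu_derivative(1.0)        # factor of exactly 1 each layer
    sigmoid_grad *= sigmoid_derivative(1.0)  # factor of about 0.197 each layer

print(relu_grad)     # 1.0 -- the gradient does not shrink
print(sigmoid_grad)  # about 8.6e-08 -- shrinks exponentially toward zero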