Underfitting and overfitting are easily the most common problems when training a neural network, since a network can readily do either depending on the number of neurons. A more overlooked problem is that of exploding and vanishing gradients.
Sometimes they can completely stall the progress of your network or break it, so it is important to take adequate steps to prevent them from happening.
Causes
Using Sigmoid and tanh Activation Functions
Vanishing gradients usually occur when using the sigmoid or tanh activation function.
The derivative of the sigmoid function approaches zero as the magnitude of the input grows large. Hence, during backpropagation the gradient becomes very small, resulting in only a minimal change in the weights, which is undesirable.
The derivative of the tanh activation function behaves similarly: for inputs of large magnitude it also approaches zero.
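The saturation described above can be checked numerically. The following is a minimal sketch in plain Python; the helper names `sigmoid_grad` and `tanh_grad` are chosen here for illustration and are not part of Keras.

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

# Both derivatives peak at x = 0 and shrink towards zero as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.2e}  tanh'={tanh_grad(x):.2e}")
```

Note that the sigmoid derivative is at most 0.25 (at x = 0), while the tanh derivative is at most 1, so sigmoid shrinks the gradient even when it is not saturated.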
This effect is not prominent when the number of layers in the neural network is relatively small. But when the network has a significant number of layers, the chain rule multiplies the derivatives of each layer together as the gradient propagates back towards the input. Hence the change in the weights of the initial layers will be very small. This is usually undesirable, since the initial layers are used to distinguish the most basic features.
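A rough back-of-the-envelope calculation shows how quickly this multiplication shrinks the gradient. The sketch below assumes sigmoid activations and deliberately ignores the weight terms, so it only bounds the contribution of the activation derivatives; `activation_grad_bound` is a name made up for this example.

```python
# Upper bound on the product of sigmoid derivatives across n layers.
# Assumption: weight terms are ignored, so this only shows the
# contribution of the activation derivatives to the chain rule.
MAX_SIGMOID_GRAD = 0.25  # value of sigmoid'(x) at x = 0, its maximum

def activation_grad_bound(n_layers):
    """Largest possible product of n sigmoid derivatives."""
    return MAX_SIGMOID_GRAD ** n_layers

for n in (2, 5, 10, 20):
    print(f"{n:2d} layers: product <= {activation_grad_bound(n):.3e}")
```

With two layers the bound is still 0.0625, but by ten layers it has already dropped below one millionth.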
Weight Initialization
Sometimes, when the weights are initialized randomly, some of them may be very small; this can also lead to the vanishing gradient problem.
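To see why small weights matter, consider a heavily simplified chain in which every layer has a single tanh neuron. This is an illustrative assumption, not the network used in this article: the function `input_gradient`, the starting input of 0.5 and the two weight ranges are all made up for the sketch.

```python
import math
import random

def input_gradient(weights, x=0.5):
    """Return d(output)/d(input) for a chain of single-neuron tanh layers.

    Each layer computes h = tanh(w * h_prev); by the chain rule the
    gradient is the product of w * (1 - h^2) over all layers.
    """
    h, grad = x, 1.0
    for w in weights:
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)
    return grad

rng = random.Random(0)  # fixed seed so the run is repeatable
n_layers = 10
small = [rng.uniform(0.0, 0.1) for _ in range(n_layers)]     # very small weights
moderate = [rng.uniform(0.8, 1.2) for _ in range(n_layers)]  # weights near 1

print("small weights:   ", input_gradient(small))
print("moderate weights:", input_gradient(moderate))
```

Every factor in the product contains the weight itself, so ten weights below 0.1 push the gradient below 1e-10 regardless of the activation.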
Let's demonstrate this effect using a simple neural network.
# imports (assuming the Keras bundled with TensorFlow)
from sklearn.datasets import make_circles
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.initializers import RandomUniform
from matplotlib import pyplot as plt
# generate a two-class dataset and scale the inputs to [-1, 1]
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
scaler = MinMaxScaler(feature_range=(-1, 1))
X = scaler.fit_transform(X)
# dataset splitting
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# model definition
init = RandomUniform(minval=0, maxval=1)
model = Sequential()
model.add(Dense(5, input_dim=2, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(1, activation='sigmoid', kernel_initializer=init))
# compile model (older Keras versions name this argument lr)
opt = SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=500)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='test')
plt.legend()
plt.show()
By observing the graph of the training and testing accuracy, we can see that:
- The model is not very accurate (test accuracy: 56%)
- The model is very slow to learn (the accuracy stays flat for the first 200 epochs and only then increases gradually)
This behaviour points to vanishing gradients as the likely culprit. We can illustrate this further by inspecting the weights of the first layer, which can be done using Keras callbacks.
...
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(1, activation='sigmoid', kernel_initializer=init))
# print the first layer's weights at the end of every epoch
# (requires: from tensorflow.keras.callbacks import LambdaCallback)
print_weights = LambdaCallback(on_epoch_end=lambda epoch, logs: print(model.layers[0].get_weights()))
...
history = model.fit(trainX, trainy, validation_data=(testX, testy), callbacks=[print_weights], epochs=500)
...
From the output, let's compare the weights of the first layer at two different epochs.
weights of the first layer during the 50th epoch = [0.5584462 , 0.5673716 , 0.10358545, 0.16718468, 0.5542617 ],[0.13693556, 0.81074333, 0.65430593, 0.4081624 , 0.5088767 ]
weights of the first layer during the 100th epoch = [0.5276652 , 0.56785566, 0.08765446, 0.16682781, 0.5214859 ],[0.13223608, 0.7922534 , 0.6300024 , 0.39186594, 0.502041 ]
We can clearly see that the weights barely change between the 50th and the 100th epoch.
Prevention Techniques
ReLU activation function
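Using the ReLU activation instead of sigmoid or tanh helps mitigate vanishing gradients, because its derivative is 1 for every positive input and therefore does not shrink the gradient as it is multiplied across layers.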
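The difference can be sketched with a few made-up numbers. The pre-activation values below are hypothetical and chosen only for illustration; `relu_grad` and `tanh_grad` are helper names defined here, not Keras functions.

```python
import math

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

# Hypothetical pre-activation values, one per layer, chosen for illustration.
pre_activations = [0.5, 2.0, 3.0, 1.5, 4.0]

# Product of the activation derivatives along the chain (weights ignored).
relu_product = math.prod(relu_grad(z) for z in pre_activations)
tanh_product = math.prod(tanh_grad(z) for z in pre_activations)

print("ReLU product:", relu_product)
print("tanh product:", tanh_product)
```

For positive inputs the ReLU product stays at exactly 1, while the tanh product collapses. The trade-off is that ReLU's derivative is 0 for negative inputs, so a unit that only ever receives negative inputs passes no gradient at all.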
...
model = Sequential()
model.add(Dense(5, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='sigmoid', kernel_initializer=init))
...
In this case the model converges much faster and provides higher accuracy.
weights of the first layer during the 50th epoch = [-1.2463173 , 1.0977441 , 0.606711 , -0.21142364, -0.22407277],[-1.3764595 , -1.158447 , 1.4483132 , -1.476005 , -0.7930375 ]
weights of the first layer during the 100th epoch = [-2.1367993 , 1.6104459 , 0.3941082 , -0.66891336, -0.24889009],[-0.33443457, -1.1000867 , 1.9572364 , -1.7306052 , -0.7959063 ]
As observed, the weights in this case change at a much faster pace than when using the tanh activation function.
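The comparison can be made concrete by computing the mean absolute change between the two snapshots printed above for each network. This is only a rough indicator, since it looks at two epochs of a single run; `mean_abs_change` is a helper written for this sketch.

```python
def mean_abs_change(before, after):
    """Mean absolute element-wise difference between two weight matrices."""
    diffs = [
        abs(a - b)
        for row_before, row_after in zip(before, after)
        for b, a in zip(row_before, row_after)
    ]
    return sum(diffs) / len(diffs)

# First-layer weights of the tanh network, copied from the output above.
tanh_epoch_50 = [[0.5584462, 0.5673716, 0.10358545, 0.16718468, 0.5542617],
                 [0.13693556, 0.81074333, 0.65430593, 0.4081624, 0.5088767]]
tanh_epoch_100 = [[0.5276652, 0.56785566, 0.08765446, 0.16682781, 0.5214859],
                  [0.13223608, 0.7922534, 0.6300024, 0.39186594, 0.502041]]

# First-layer weights of the ReLU network, copied from the output above.
relu_epoch_50 = [[-1.2463173, 1.0977441, 0.606711, -0.21142364, -0.22407277],
                 [-1.3764595, -1.158447, 1.4483132, -1.476005, -0.7930375]]
relu_epoch_100 = [[-2.1367993, 1.6104459, 0.3941082, -0.66891336, -0.24889009],
                  [-0.33443457, -1.1000867, 1.9572364, -1.7306052, -0.7959063]]

tanh_change = mean_abs_change(tanh_epoch_50, tanh_epoch_100)
relu_change = mean_abs_change(relu_epoch_50, relu_epoch_100)
print(f"tanh: {tanh_change:.4f}  relu: {relu_change:.4f}")
```

On these numbers the ReLU network's first-layer weights move more than ten times as far between the two epochs as those of the tanh network.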
Employing different Weight Initialization techniques
By using different weight initialization techniques, we can control the scale of the randomly assigned weights so that they are not very small.
The code in this article uses the RandomUniform initializer to set the minimum and maximum values for the weights.
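Instead of picking the range by hand, schemes such as Glorot (Xavier) uniform and He uniform derive it from the layer size. The sketch below reproduces the limits those schemes use, in plain Python rather than through Keras; the function names and the 5-by-5 layer shape are assumptions made for illustration.

```python
import math
import random

def glorot_uniform_limit(fan_in, fan_out):
    """Glorot/Xavier uniform samples from [-limit, limit] with this limit."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def he_uniform_limit(fan_in):
    """He uniform samples from [-limit, limit] with this limit."""
    return math.sqrt(6.0 / fan_in)

def sample_weights(fan_in, fan_out, limit, rng):
    """Draw a fan_in x fan_out weight matrix uniformly from [-limit, limit]."""
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

rng = random.Random(42)  # fixed seed so the run is repeatable
fan_in, fan_out = 5, 5   # a hidden layer with 5 inputs and 5 units

glorot_limit = glorot_uniform_limit(fan_in, fan_out)
he_limit = he_uniform_limit(fan_in)
he_weights = sample_weights(fan_in, fan_out, he_limit, rng)

print(f"Glorot uniform limit: {glorot_limit:.4f}")
print(f"He uniform limit:     {he_limit:.4f}")
```

In Keras these correspond to the 'glorot_uniform' and 'he_uniform' initializer names; He initialization is the one usually paired with ReLU, as in the ReLU model above.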