Vanishing Gradients: Causes and Prevention Measures

4 min read · Sep 22, 2022

Underfitting and overfitting are easily the most common problems when training a neural network, since a network can readily overfit or underfit depending on the number of neurons. But the most overlooked problem is that of exploding and vanishing gradients.

Sometimes they can completely stall the training of your network or break it. Hence it is of utmost importance to take adequate steps to prevent them from happening.

Causes

Using Sigmoid and tanh Activation Function

Vanishing gradients usually occur when using the sigmoid or tanh activation function.

The derivative of the sigmoid function approaches zero when the input becomes large in magnitude, whether positive or negative. Hence, during backpropagation, the gradient becomes very small, resulting in only a minimal change in the weights, which is not desirable.

The derivative of the tanh activation function behaves very similarly in this regard: for inputs of large magnitude the derivative approaches zero.
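To make the saturation concrete, here is a minimal sketch that evaluates both derivatives at a few inputs. It uses only the Python standard library, and the helper names (sigmoid_grad, tanh_grad) are made up for this illustration rather than taken from any library.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # derivative of sigmoid: s(x) * (1 - s(x)), at most 0.25 (reached at x = 0)
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    # derivative of tanh: 1 - tanh(x)^2, at most 1.0 (reached at x = 0)
    return 1.0 - math.tanh(x) ** 2

# the further the input moves away from zero, the closer both derivatives get to zero
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  tanh'={tanh_grad(x):.6f}")
```

Both derivatives are symmetric around zero, so large negative inputs saturate in the same way.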

This effect is not prominent when the number of layers in the neural network is relatively small. But when the network has a significant number of layers, the chain rule multiplies the derivatives of each layer together as the gradient is propagated back through the network. Hence the change in the weights of the initial layers will be very small. This is usually undesirable, since the initial layers are the ones that learn to distinguish the most basic features.
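As a rough illustration of the chain-rule effect, the sketch below multiplies the largest possible sigmoid derivative (0.25) once per layer. This is a simplification: in a real network each factor also includes the layer's weights, so treat the numbers as an indication of the trend only. The function name max_gradient_factor is hypothetical.

```python
SIGMOID_MAX_GRAD = 0.25  # the largest value the sigmoid derivative can take

def max_gradient_factor(n_layers, per_layer_grad=SIGMOID_MAX_GRAD):
    # product of one activation derivative per layer (weights ignored on purpose)
    return per_layer_grad ** n_layers

# even in the best case the factor shrinks geometrically with depth
for n in (1, 3, 5, 10, 20):
    print(f"{n:2d} layers -> gradient scaled by at most {max_gradient_factor(n):.3e}")
```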

Weight Initialization

Sometimes, when the weights are initialized randomly, some of them end up very small. This can also lead to the vanishing gradient problem.
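A toy way to see why small weights hurt: take a chain of one-neuron tanh layers that all share the same weight w (an assumption made purely to keep the arithmetic simple) and compute how strongly the output reacts to the input. Every layer contributes a factor of w times the tanh derivative, so a small w shrinks the gradient at every step. The function name input_gradient is hypothetical.

```python
import math

def input_gradient(w, x, n_layers):
    """d(output)/d(input) for a chain of n one-neuron tanh layers sharing weight w."""
    a, grad = x, 1.0
    for _ in range(n_layers):
        a = math.tanh(w * a)
        # chain rule: d tanh(w * a_prev) / d a_prev = w * (1 - tanh(w * a_prev)^2)
        grad *= w * (1.0 - a * a)
    return grad

for w in (0.1, 0.5, 1.0):
    print(f"w={w}: gradient through 5 layers = {input_gradient(w, 0.5, 5):.3e}")
```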

Let's try to demonstrate this effect using a simple neural network.

# imports (assuming the Keras bundled with TensorFlow)
from sklearn.datasets import make_circles
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.initializers import RandomUniform
import matplotlib.pyplot as plt

# generate a two-circles classification dataset and scale it to [-1, 1]
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
scaler = MinMaxScaler(feature_range=(-1, 1))
X = scaler.fit_transform(X)

# dataset splitting
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]

# model definition: five tanh hidden layers and a sigmoid output
init = RandomUniform(minval=0, maxval=1)
model = Sequential()
model.add(Dense(5, input_dim=2, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(1, activation='sigmoid', kernel_initializer=init))

# compile model
opt = SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])

# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=500)

# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))

# plot training and testing accuracy per epoch
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='test')
plt.legend()
plt.show()

Training and Testing Accuracy with tanh activation function

By observing the graph of the training and testing accuracy, we can see that:

The model is not very accurate (test accuracy: 56%).
The model is very slow to learn (the accuracy stays flat for the first 200 epochs and only then increases gradually).

This points to vanishing gradients as the likely culprit. We can check this further by looking at the weights of the first layer, which can be printed during training using a Keras callback.

from tensorflow.keras.callbacks import LambdaCallback
...
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(1, activation='sigmoid', kernel_initializer=init))

# print the weights of the first layer at the end of every epoch
print_weights = LambdaCallback(on_epoch_end=lambda epoch, logs: print(model.layers[0].get_weights()))
...
# callbacks expects a flat list of callback objects
history = model.fit(trainX, trainy, validation_data=(testX, testy), callbacks=[print_weights], epochs=500)
...

From the output, let's compare the weights of the first layer at two different epochs.

weights of the first layer during the 50th epoch = [0.5584462 , 0.5673716 , 0.10358545, 0.16718468, 0.5542617 ],[0.13693556, 0.81074333, 0.65430593, 0.4081624 , 0.5088767 ]

weights of the first layer during the 100th epoch = [0.5276652 , 0.56785566, 0.08765446, 0.16682781, 0.5214859 ],[0.13223608, 0.7922534 , 0.6300024 , 0.39186594, 0.502041 ]

We can clearly see that the weights barely change between the 50th and the 100th epoch.

Prevention Techniques

ReLU activation function

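Using the ReLU activation instead of the sigmoid and tanh activation functions helps mitigate the vanishing gradient problem: the derivative of ReLU is 1 for every positive input, so it does not shrink the gradient as it is passed back through the layers.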

...
model = Sequential()
model.add(Dense(5, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='sigmoid', kernel_initializer=init))
...

Training and Testing Accuracy with ReLU activation function

In this case the model converges much faster and provides higher accuracy.
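One way to see why ReLU behaves differently is to compare its derivative with that of tanh at a few pre-activation values. The sketch below uses only the standard library, and the helper names are made up for this illustration. Note that the ReLU derivative is exactly 0 for negative inputs, so units that stay negative stop learning; ReLU reduces the problem rather than removing every way a gradient can die.

```python
import math

def relu_grad(x):
    # derivative of ReLU: 1 for positive inputs, 0 otherwise
    return 1.0 if x > 0 else 0.0

def tanh_grad(x):
    # derivative of tanh: 1 - tanh(x)^2
    return 1.0 - math.tanh(x) ** 2

# tanh saturates as the input grows, while ReLU keeps passing the gradient through unchanged
for x in (0.5, 3.0, 8.0, -1.0):
    print(f"x={x:5.1f}  relu'={relu_grad(x):.1f}  tanh'={tanh_grad(x):.2e}")
```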

weights of the first layer during the 50th epoch = [-1.2463173 , 1.0977441 , 0.606711 , -0.21142364, -0.22407277],[-1.3764595 , -1.158447 , 1.4483132 , -1.476005 , -0.7930375 ]

weights of the first layer during the 100th epoch = [-2.1367993 , 1.6104459 , 0.3941082 , -0.66891336, -0.24889009],[-0.33443457, -1.1000867 , 1.9572364 , -1.7306052 , -0.7959063 ]

As observed, the weights in this case change at a much faster pace than with the tanh activation function.
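To put a number on "faster", the sketch below takes the first-layer weights printed above for both runs and computes the average absolute change between the 50th and the 100th epoch. The values are copied from the output shown in this article; the function name mean_abs_change is just a helper written for this comparison.

```python
# first-layer weights copied from the printed output above
tanh_epoch_50 = [[0.5584462, 0.5673716, 0.10358545, 0.16718468, 0.5542617],
                 [0.13693556, 0.81074333, 0.65430593, 0.4081624, 0.5088767]]
tanh_epoch_100 = [[0.5276652, 0.56785566, 0.08765446, 0.16682781, 0.5214859],
                  [0.13223608, 0.7922534, 0.6300024, 0.39186594, 0.502041]]
relu_epoch_50 = [[-1.2463173, 1.0977441, 0.606711, -0.21142364, -0.22407277],
                 [-1.3764595, -1.158447, 1.4483132, -1.476005, -0.7930375]]
relu_epoch_100 = [[-2.1367993, 1.6104459, 0.3941082, -0.66891336, -0.24889009],
                  [-0.33443457, -1.1000867, 1.9572364, -1.7306052, -0.7959063]]

def mean_abs_change(before, after):
    # average of |after - before| over every weight in the layer
    diffs = [abs(a - b)
             for row_before, row_after in zip(before, after)
             for b, a in zip(row_before, row_after)]
    return sum(diffs) / len(diffs)

tanh_change = mean_abs_change(tanh_epoch_50, tanh_epoch_100)
relu_change = mean_abs_change(relu_epoch_50, relu_epoch_100)
print(f"tanh: {tanh_change:.4f}  relu: {relu_change:.4f}")
```

On these numbers the average change is about 0.015 for the tanh network and about 0.40 for the ReLU network, a difference of more than an order of magnitude.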

Employing different Weight Initialization techniques

By using different weight initialization techniques we can control the range of the random initial weights so that they do not start out very small.

The first example in this article uses the RandomUniform initializer to set the minimum and maximum values of the weights, while the ReLU example switches its hidden layers to the he_uniform initializer.
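For intuition about what he_uniform does, here is a small standard-library sketch of the idea. He uniform initialization draws each weight from a uniform distribution on [-limit, limit] with limit = sqrt(6 / fan_in), where fan_in is the number of inputs to the layer. The function below is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
import math
import random

def he_uniform_weights(fan_in, fan_out, seed=0):
    """Sample a fan_in x fan_out weight matrix using the He uniform rule."""
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / fan_in)  # the range adapts to the size of the layer
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

# same layer shapes as the network in this article: 2 -> 5, then 5 -> 5
for fan_in, fan_out in ((2, 5), (5, 5)):
    limit = math.sqrt(6.0 / fan_in)
    weights = he_uniform_weights(fan_in, fan_out)
    print(f"fan_in={fan_in}: weights drawn from [-{limit:.3f}, {limit:.3f}]")
    print(weights[0])
```

Unlike RandomUniform(minval=0, maxval=1), which uses one fixed range for every layer, this rule scales the range with the layer's input size and allows both positive and negative weights.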