Vanishing Gradients: Causes and Prevention Measures

Causes

Using Sigmoid and tanh Activation Functions

The vanishing gradient problem usually occurs when the sigmoid or tanh activation function is used. Both functions saturate for large positive or negative inputs, so their derivatives become very small, and multiplying many of these small derivatives together during backpropagation shrinks the gradient towards zero in the earliest layers.

Sigmoid function and its derivative.
tanh function and its derivative.
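
The shapes of these curves can be reproduced with a few lines of NumPy and Matplotlib (a sketch added here, not part of the original listing). It makes the cause visible: the sigmoid derivative never exceeds 0.25 and the tanh derivative never exceeds 1, and both fall towards zero as soon as the input moves away from zero, so a product of many such factors across layers quickly becomes tiny.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 500)

# sigmoid and its derivative: s'(x) = s(x) * (1 - s(x)), maximum 0.25 at x = 0
s = 1 / (1 + np.exp(-x))
plt.plot(x, s, label='sigmoid')
plt.plot(x, s * (1 - s), label='sigmoid derivative')

# tanh and its derivative: tanh'(x) = 1 - tanh(x)^2, maximum 1 at x = 0
t = np.tanh(x)
plt.plot(x, t, label='tanh')
plt.plot(x, 1 - t ** 2, label='tanh derivative')

plt.legend()
plt.show()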

Weight Initialization

Sometimes, when the weights are initialized randomly, some of them can end up with an unsuitable scale, and this can also lead to the vanishing gradient problem. The example below trains a deep MLP on the two-circles dataset with tanh activations and weights drawn uniformly from [0, 1] to demonstrate this.

# imports added so the listing runs on its own
from sklearn.datasets import make_circles
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.initializers import RandomUniform
from tensorflow.keras.optimizers import SGD
from matplotlib import pyplot as plt

# generate the two-circles dataset and scale the inputs to [-1, 1]
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
scaler = MinMaxScaler(feature_range=(-1, 1))
X = scaler.fit_transform(X)
# dataset splitting
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# model definition: five tanh hidden layers, weights drawn uniformly from [0, 1]
init = RandomUniform(minval=0, maxval=1)
model = Sequential()
model.add(Dense(5, input_dim=2, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(1, activation='sigmoid', kernel_initializer=init))
# compile model
opt = SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=500)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot train and test accuracy over epochs
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='test')
plt.legend()
plt.show()
Training and Testing Accuracy with tanh activation function
  • The model is not very accurate (test accuracy: 56%).
  • The model is very slow to learn: the accuracy stays flat for roughly the first 200 epochs and only then starts to increase gradually.

To confirm that the earliest layers are barely learning, the weights of the first hidden layer can be printed at the end of every epoch with a LambdaCallback; if the gradient is vanishing, they will change very little over training.

...
# import added so the snippet runs
from tensorflow.keras.callbacks import LambdaCallback
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(1, activation='sigmoid', kernel_initializer=init))
# print the first hidden layer's weights after every epoch
print_weights = LambdaCallback(on_epoch_end=lambda epoch, logs: print(model.layers[0].get_weights()))
...
history = model.fit(trainX, trainy, validation_data=(testX, testy), callbacks=[print_weights], epochs=500)
...
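
Printing weights is only an indirect check. If the backend is TensorFlow, the gradients themselves can be inspected; the sketch below is an addition of mine (not part of the original article) that reuses the model, trainX and trainy defined above and prints the mean absolute gradient of each layer's kernel for one small batch. With tanh activations and the uniform [0, 1] initialization, the earliest layers should show the smallest values.

import tensorflow as tf

loss_fn = tf.keras.losses.BinaryCrossentropy()
with tf.GradientTape() as tape:
    preds = model(trainX[:32], training=True)          # forward pass on one batch
    loss = loss_fn(trainy[:32].reshape(-1, 1), preds)  # binary cross-entropy loss
grads = tape.gradient(loss, model.trainable_variables) # gradients w.r.t. all weights

# mean absolute gradient per kernel, from the first hidden layer to the output layer
for var, grad in zip(model.trainable_variables, grads):
    if 'kernel' in var.name:
        print(var.name, float(tf.reduce_mean(tf.abs(grad))))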

Prevention Techniques

ReLU activation function

Using the ReLU activation function instead of sigmoid or tanh helps prevent vanishing gradients, because its derivative is 1 for all positive inputs and therefore does not shrink the gradient as it is propagated backwards.

...
model = Sequential()
model.add(Dense(5, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='sigmoid', kernel_initializer=init))
...
Training and Testing Accuracy with ReLU activation function

Employing different Weight Initialization techniques

By choosing an appropriate weight initialization scheme, we can control the scale of the initial weights so that they are neither too small nor too large. In the ReLU model above this is done with He uniform initialization ('he_uniform'), which sets the range of the initial weights from the number of inputs to each layer; a sketch of the built-in options is shown below.
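
As a rule of thumb (my addition rather than something stated in the article), Glorot/Xavier initialization is the usual companion of sigmoid and tanh layers, while He initialization is the usual companion of ReLU layers, since both set the scale of the initial weights from the number of connections into (and, for Glorot, out of) each layer. In Keras the scheme can be selected per layer, either by string name or as an initializer object:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.initializers import HeUniform

# Dense layers default to 'glorot_uniform' (Xavier); for ReLU layers a He
# initializer is a common choice, by string name or as an object with a seed
model = Sequential()
model.add(Dense(5, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer=HeUniform(seed=1)))
model.add(Dense(1, activation='sigmoid', kernel_initializer='glorot_uniform'))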
