Table of contents
{: .text-delta }
Before you Begin
Improving over momentum update
Previously, we saw how the derivatives change across subsequent steps (as we did in simple momentum) and took a step based on the weighted average of the current and prior steps (the actual implementation used a running average).
Now, we'll also consider how much these derivatives vary (this is called the second moment), which takes care of the variance in gradient shifts.
The second moment can be implemented as shown below. As seen in our prior image, since we had high variation along y and low variation along x, we will do:
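A minimal sketch of the idea (notation mine, since the original figure isn't reproduced): accumulate the mean squared derivative per parameter, then shrink the step along directions where it is large:

$$ s_i = \mathbb{E}\left[\left(\frac{\partial D}{\partial w_i}\right)^2\right], \qquad \Delta w_i \propto -\frac{1}{\sqrt{s_i}}\,\frac{\partial D}{\partial w_i} $$

So the high-variance y direction takes a smaller effective step than the low-variance x direction.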
Commonly Used Methods that use the Second Moment
RMSProp
Here, let's do a running average as in simple momentum, but on the square of the gradient (the second moment, not a second derivative). The gamma value is just a weighting factor: gamma weights the prior step's estimate (k-1) and (1 - gamma) weights the current step's squared gradient:
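A hedged reconstruction of the missing formula (symbols follow the text: gamma is the weighting factor, D the divergence):

$$ E[g^2]_k = \gamma\, E[g^2]_{k-1} + (1-\gamma)\left(\frac{\partial D}{\partial w}\right)^2_k $$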
Now, the way we include this in our update is to normalize the learning rate using this second moment:
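The update rule, as I understand it (the epsilon term is a standard numerical-stability addition and may not appear in the original figure):

$$ w_{k+1} = w_k - \frac{\eta}{\sqrt{E[g^2]_k + \epsilon}}\,\frac{\partial D}{\partial w}\bigg|_k $$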
Just for comparison, here is how the update step for simple momentum only scaled the previous step's update magnitude and did not touch the learning rate:
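A sketch of the simple momentum update for reference (beta as the momentum factor is my notation, not necessarily the original's):

$$ \Delta w_k = \beta\, \Delta w_{k-1} - \eta\,\frac{\partial D}{\partial w}\bigg|_k, \qquad w_{k+1} = w_k + \Delta w_k $$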
ADAM (RMSProp with momentum)
The reason the first and second moments are rescaled by their weighting factors (bias correction) is that at the beginning of training the running averages are still near their zero initialization; without the correction, the weighting terms would dominate and shrink the estimates (it'll slow us down).
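A hedged reconstruction of the ADAM update, using delta and gamma for the two weighting factors to match the style above (standard references write beta_1 and beta_2):

$$ m_k = \delta\, m_{k-1} + (1-\delta)\,\frac{\partial D}{\partial w}, \qquad v_k = \gamma\, v_{k-1} + (1-\gamma)\left(\frac{\partial D}{\partial w}\right)^2 $$

$$ \hat{m}_k = \frac{m_k}{1-\delta^k}, \qquad \hat{v}_k = \frac{v_k}{1-\gamma^k}, \qquad w_{k+1} = w_k - \frac{\eta\,\hat{m}_k}{\sqrt{\hat{v}_k} + \epsilon} $$

Dividing by (1 - delta^k) and (1 - gamma^k) is the bias correction referred to above: it compensates for the zero initialization during the first few steps.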
Batch Normalization
Problem with covariate shifts
Solution to covariate shifts
Batch Norm Theory
- We typically apply this correction for covariate shifts at the location of the affine sum (Wx + b)
- The above step (first yellow box) will cause all training instances to have mean = 0 and variance = 1
- Now, we shift and scale the normalized data to a new, appropriate location (second yellow box) as defined by gamma and beta (see the equations after this list).
- How do we get this gamma and beta?
- Ans. They are learned; we don't define or derive them (initialize gamma to 1 and beta to 0 and let them be learned)
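A hedged reconstruction of the two yellow-box equations (standard batch norm, with mu_B and sigma_B^2 the mini-batch mean and variance):

$$ \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \qquad \text{(first yellow box: mean 0, variance 1)} $$

$$ y_i = \gamma\,\hat{x}_i + \beta \qquad \text{(second yellow box: shift and scale)} $$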
Q. Why is batch_norm applied before the activation function? Ans. It's debatable. But if it's used after the activation, normalization may partially undo the activation's effect, e.g. re-centering ReLU outputs would reintroduce negative values.
Note. Understand the vocabulary: normalization typically rescales values into a fixed range (e.g. [0, 1]), while standardization shifts them to zero mean and unit variance (which is what batch norm does before applying gamma and beta).
Now, it's nice to see the data having low variance. However, the real issue arises when we try to do backprop.
Backprop through Batch Norm
Conventional backprop happens by taking a derivative of the divergence function as shown below:
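The figure isn't reproduced here; a minimal statement of the chain-rule step it shows (indices assumed), where z is the affine sum and y the activation:

$$ \frac{\partial Div}{\partial w_{ij}} = \frac{\partial Div}{\partial y_j}\cdot\frac{\partial y_j}{\partial z_j}\cdot\frac{\partial z_j}{\partial w_{ij}}, \qquad z_j = \sum_i w_{ij}\,y_i + b_j $$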
However, after batch norm it gets tricky: the divergence no longer depends only on each individual training sample in the mini-batch; it now also depends on the mean and variance of the entire mini-batch, since every sample was scaled and shifted using those statistics.
Derivation
The derivation is shown below:
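The derivation figure isn't reproduced here; the following is the standard result from the batch norm paper (Ioffe & Szegedy, 2015), written in the notation above, for m samples in the mini-batch:

$$ \frac{\partial D}{\partial \hat{x}_i} = \frac{\partial D}{\partial y_i}\,\gamma $$

$$ \frac{\partial D}{\partial \sigma_B^2} = \sum_{i=1}^{m} \frac{\partial D}{\partial \hat{x}_i}\,(x_i - \mu_B)\cdot\left(-\tfrac{1}{2}\right)(\sigma_B^2+\epsilon)^{-3/2} $$

$$ \frac{\partial D}{\partial \mu_B} = -\frac{1}{\sqrt{\sigma_B^2+\epsilon}}\sum_{i=1}^{m} \frac{\partial D}{\partial \hat{x}_i} + \frac{\partial D}{\partial \sigma_B^2}\cdot\frac{\sum_{i=1}^{m} -2(x_i-\mu_B)}{m} $$

$$ \frac{\partial D}{\partial x_i} = \frac{\partial D}{\partial \hat{x}_i}\,\frac{1}{\sqrt{\sigma_B^2+\epsilon}} + \frac{\partial D}{\partial \sigma_B^2}\,\frac{2(x_i-\mu_B)}{m} + \frac{\partial D}{\partial \mu_B}\,\frac{1}{m} $$

$$ \frac{\partial D}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial D}{\partial y_i}\,\hat{x}_i, \qquad \frac{\partial D}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial D}{\partial y_i} $$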
Batch Norm at Test Time
At test time we also need estimates of the mean and variance to normalize a test input, but we may only have a single sample. We get those estimates by keeping a running average of the batch statistics over the training batches.
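A minimal sketch of how those running statistics might be kept (the `momentum` name and its value are assumptions mirroring common library conventions, not the notes):

```python
import numpy as np

num_features = 128      # illustrative layer width
momentum = 0.9          # weight given to the old running estimate (assumed value)
running_mean = np.zeros(num_features)
running_var = np.ones(num_features)

def update_running_stats(batch):
    """Update running statistics from one training mini-batch."""
    global running_mean, running_var
    batch_mean = batch.mean(axis=0)
    batch_var = batch.var(axis=0)
    running_mean = momentum * running_mean + (1 - momentum) * batch_mean
    running_var = momentum * running_var + (1 - momentum) * batch_var

def batch_norm_test(x, gamma, beta, eps=1e-5):
    """At test time, normalize with the stored running statistics."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta
```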
Overfitting
We essentially need a way to smooth the above curve so that it fills in the gaps nicely.
There are several ways of doing this, but the most common ones are:
Smoothness through weight manipulation
Think of the sigmoid function
Now, if the weight w scaling our input x increases a lot, the curve sigmoid(wx) changes from a nice smooth curve to something a lot more steep, approaching a step function:
Therefore, simply constraining the weights to be small ensures the perceptron's output stays smooth.
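One line of math makes this concrete (a standard identity, not from the notes): the slope of the sigmoid scales linearly with the weight,

$$ \frac{d}{dx}\,\sigma(wx) = w\,\sigma(wx)\,(1-\sigma(wx)) $$

so larger |w| means a steeper, less smooth transition.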
Smoothness through weight constraints (Regularization)
This is basically regularization, where we penalize the model for large weights.
Now, this is also easy to backprop as shown below:
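A hedged reconstruction of the regularized objective and its gradient (L2 regularization, with lambda as the regularization strength):

$$ L(w) = Div(w) + \frac{\lambda}{2}\,\lVert w \rVert^2 \quad\Rightarrow\quad \frac{\partial L}{\partial w} = \frac{\partial Div}{\partial w} + \lambda w $$

So the update simply gains a weight-decay term: w ← w − η(∂Div/∂w + λw).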
Smoothness through network structure
- As we saw in the MLP section on why depth matters, each layer of an MLP imposes constraints, i.e. each layer creates some decision boundary.
- In the picture below, after the first layer, we know that our input is in either a pentagon region or a triangle region (but we don't know where inside it!)
- Therefore, deeper models have a natural tendency to restrict the shapes they can model, and this gives the natural smoothness required.
Example for further clarity
In the above example, the earlier layers produce poorly fitting shapes. As we go deeper, the smoothness naturally increases.
Dropout
- During train time, each neuron is active with probability alpha: (number_of_instances_neuron_is_active / total_number_of_instances) = alpha, i.e. if the chance of a neuron being active is say 0.7, then alpha = 0.7
- By following the above steps, the effective network is different for different sets of inputs. Additionally, the gradients are also updated differently
- Like any Bernoulli distribution, each event has 2 outcomes. Therefore a statistical interpretation would yield the below picture:
- I think it also serves as a form of augmentation, where instead of blacking out certain parts of the image, we make the object recognizable only at certain receptive fields
- Dropout also has the tendency of removing redundancies in learning, i.e. the network learns a cat even if it doesn't have a tail, or if it doesn't have pointy ears
Implementing Dropout during Training
Dropout is applied at the activation layer as an additional (like an if condition) constraint, as shown below:
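A minimal sketch of train-time dropout as described (function and parameter names are mine, not the notes'):

```python
import numpy as np

def dropout_train(activations, alpha=0.7):
    """Keep each neuron's output with probability alpha, zero it otherwise."""
    mask = np.random.binomial(1, alpha, size=activations.shape)  # Bernoulli(alpha) per neuron
    return activations * mask
```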
Now, we will use this alpha value everywhere at test time.
Implementing Dropout during Inference
- We could multiply the activation of every neuron by alpha (the Bernoulli factor), mirroring train time
- Or we could multiply every weight by alpha (we will effectively be scaling the connections instead of the neurons)
- Instead of applying alpha as the chance of a neuron being active during train time, divide the kept activations by alpha (inverted dropout). Then during test time, we just don't use alpha at all!! (see the sketch below)
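A minimal sketch of the third option, inverted dropout (function names assumed):

```python
import numpy as np

def inverted_dropout_train(activations, alpha=0.7):
    """Drop neurons and rescale the survivors by 1/alpha at train time."""
    mask = np.random.binomial(1, alpha, size=activations.shape)
    return activations * mask / alpha  # expected value now matches test time

def inverted_dropout_test(activations):
    """At test time the network is used as-is: no alpha anywhere."""
    return activations
```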
Augmentation
- Mosaicing
- Flipping
- Rotating
- Blurring
- Warp (Distort the image)
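A hedged sketch of some of these using torchvision transforms (mosaicing is usually custom code, e.g. in YOLO-style pipelines, so it is omitted; all parameter values are illustrative):

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),        # flipping
    transforms.RandomRotation(degrees=15),         # rotating
    transforms.GaussianBlur(kernel_size=3),        # blurring
    transforms.RandomAffine(degrees=0, shear=10),  # a simple warp/distortion
])
```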
Other Tricks
- Normalize the input (see the covariate shift discussion in the Batch Normalization section)
- Xavier Initialization
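A minimal sketch of Xavier (Glorot) initialization, normal-distribution variant; the fan_in/fan_out naming is an assumption following common convention:

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    """Draw weights with variance 2 / (fan_in + fan_out) (Glorot & Bengio, 2010)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.randn(fan_in, fan_out) * std
```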