Table of contents {: .text-delta }
Before you Begin
Binary Classifiers and Cross Entropy Loss
Cross-entropy loss is usually used for classification. For a binary case, it is given by the following formula:
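In the notation used below (y = predicted probability, d = ground-truth label), the standard binary cross-entropy divergence is:

$$\text{Div}(y, d) = -\bigl[\,d \log y + (1-d)\log(1-y)\,\bigr]$$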
Notice that the equation uses the term divergence here. This is the underlying quantity behind 'loss' in our situation: divergence tells us how far off we are from the correct solution.
Note. Divergence is not direction dependent; it just tells us how far away we are from the desired output.
More formally, loss = average divergence of our outputs w.r.t. the ground truth
Therefore, in a binary setting, if we use softmax as our activation function, we get the class probability scores as the output. In the binary case the output is (y, 1-y).
- let y = output of softmax for each class (we have 2 classes)
- let d = ground truth
- Now, plugging in (y, 1-y) into the Cross Entropy loss formula above
- We see what happens at the extremes y = 0 and y = 1, since log(0) = -infinity and log(1) = 0
- Therefore, if d=0 and y=1 we get an infinite loss, and if d=1 and y=0 we also get an infinite loss (a small numeric check follows below)
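A minimal numeric check of these cases (the function name here is just illustrative), nudging y slightly away from 0 and 1 so we never evaluate log(0) exactly:

```python
import numpy as np

def binary_cross_entropy(y, d):
    """Binary cross-entropy divergence for a predicted probability y in (0, 1)."""
    return -(d * np.log(y) + (1 - d) * np.log(1 - y))

# Confident and correct -> divergence near 0
print(binary_cross_entropy(0.999, 1))      # ~0.001
print(binary_cross_entropy(0.001, 0))      # ~0.001

# Confident and wrong -> divergence blows up towards infinity
print(binary_cross_entropy(1e-10, 1))      # ~23.0
print(binary_cross_entropy(1 - 1e-10, 0))  # ~23.0
```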
Now it’s also interesting to observe the derivative of cross entropy loss function. The derivative is shown below:
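Differentiating the divergence above w.r.t. y gives the standard result used in the cases that follow:

$$\frac{\partial \text{Div}}{\partial y} = -\frac{d}{y} + \frac{1-d}{1-y}$$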
Now, notice the following cases:
- Case 1: When d=1 and y=1, plugging into the above formula, we get
derivative(CE_loss) = -1
- Case 2: When d=0 and y=0, plugging into the above formula we get
derivative(CE_loss) = 1
- Note: if you assumed that output(y) = desired(d) would give a zero gradient, you're wrong!
- The above two cases are plotted below (left: Case 1; right: Case 1 and Case 2)
However, if instead of cross-entropy loss we were to use a simple L2 error (sum of squared differences, a quadratic function), we would get a bowl-shaped surface instead, like:
An extract from the Xavier Initialization paper shows this more accurately
From the above picture, one can see that the cross entropy loss surface (black) is much steeper than the quadratic surface (red)
Why is Cross Entropy better than L2?
Ans. The L2 loss is a quadratic function, i.e. a smooth bowl. From the above picture, you can see that gradient descent on the shallow L2 surface takes much longer than on the steeper curve of the cross-entropy loss!
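A quick sketch of the gradient magnitudes when the prediction is badly wrong (the values here are hypothetical): the cross-entropy gradient is far larger than the L2 gradient, so gradient descent moves much faster.

```python
y, d = 0.01, 1.0          # confident but wrong prediction, positive ground truth

# Cross-entropy gradient w.r.t. y:  -d/y + (1-d)/(1-y)
ce_grad = -d / y + (1 - d) / (1 - y)   # -100.0

# L2 (squared-error) gradient w.r.t. y:  2*(y - d)
l2_grad = 2 * (y - d)                  # -1.98

print(ce_grad, l2_grad)                # CE pushes ~50x harder here
```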
Multi-Class Cross Entropy
Here only y_i appears in the loss (i.e. the one class we are looking for in our loss function), so the derivative looks different from the binary case seen above:
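With a one-hot target d, the multi-class cross-entropy reduces to the log-probability of the correct class c, so the derivative is non-zero only for that class:

$$\text{Div}(Y, d) = -\sum_i d_i \log y_i = -\log y_c, \qquad \frac{\partial \text{Div}}{\partial y_i} = \begin{cases} -\dfrac{1}{y_c} & i = c \\[4pt] 0 & i \ne c \end{cases}$$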
The problem with the above definition of CE (cross entropy) is that the derivative of the loss w.r.t. all other classes is zero, which isn't desirable for fast convergence. Therefore, we slightly modify the labels 'd' as shown below:
Label Smoothing
Here we change the target label for the correct class to (1-(K-1)*e) instead of just 1, and give each incorrect class a small value e
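A minimal sketch of this relabeling (names here are illustrative):

```python
import numpy as np

def smooth_labels(one_hot, e):
    """Correct class: 1 - (K-1)*e.  Incorrect classes: e."""
    K = one_hot.shape[-1]
    return np.where(one_hot == 1, 1 - (K - 1) * e, e)

d = np.array([0, 0, 1, 0])            # one-hot target, K = 4 classes
print(smooth_labels(d, e=0.05))       # [0.05 0.05 0.85 0.05]  (still sums to 1)
```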
Simple 2 layer network (beautiful diagram!)
Backprop
- The derivative of the loss (w.r.t. the network output) was already computed in the previous section.
- Now, the derivative w.r.t. the activation function is shown below
Example: Sigmoid Derivative
Example: Tanh Derivative
Example: Logistic Derivative
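For reference, the standard forms of these derivatives (with σ denoting the sigmoid/logistic function) are:

$$\sigma'(z) = \sigma(z)\,\bigl(1-\sigma(z)\bigr), \qquad \frac{d}{dz}\tanh(z) = 1 - \tanh^2(z)$$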
Note. tanh is a scaled and shifted version of sigmoid. This is shown below by just rearranging some terms:
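One way to do the rearrangement (using $\sigma(x) = \frac{1}{1+e^{-x}}$):

$$\tanh(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} = \frac{1-e^{-2x}}{1+e^{-2x}} = \frac{2}{1+e^{-2x}} - 1 = 2\,\sigma(2x) - 1$$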
- Computing the derivative w.r.t. one weight (one weight connects one neuron in layer N-1 to a neuron in layer N)
- Computing the derivative w.r.t. y (y = output of the activation function)
Here, one neuron affects all the neurons in the next layer. This is why we need to sum the derivatives of z (z = wx + b of the next layer) w.r.t. y (from the previous layer). (This is explained better in Scalar vs Vector activations below.)
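In symbols (using $y_i$ for an output of the previous layer, $z_j$ for the affine outputs of the next layer, and $w_{ij}$ for the weight between them; the exact index convention here is an assumption):

$$\frac{\partial \text{Div}}{\partial y_i} \;=\; \sum_j \frac{\partial z_j}{\partial y_i}\,\frac{\partial \text{Div}}{\partial z_j} \;=\; \sum_j w_{ij}\,\frac{\partial \text{Div}}{\partial z_j}$$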
Special Cases
Scalar vs Vector activations
We assumed activation to be neuron specific. However, this may not be the case!
Also, the backprop gets a bit murky as well
| Scalar Activation | Vector Activation |
|---|---|
| (figure) | (figure) |
Note. The important aspect to remember is that for a vector activation, the derivative of the divergence w.r.t. any input (input to the activation function) is a sum of partial derivatives over every output neuron of the activation, as seen in the picture above
Example of a vector activation: softmax activation
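For the softmax, $y_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$, every output depends on every input, and the standard Jacobian entries are:

$$\frac{\partial y_i}{\partial z_j} = y_i\,(\delta_{ij} - y_j)$$

so the derivative of the divergence w.r.t. any $z_j$ is a sum over all outputs $y_i$, exactly as described above.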
Sub-gradients
For ReLUs, the function is not smooth at the origin, so the gradient there cannot be computed. Instead we use a sub-gradient, shown as multiple lines in the figure below. In practice we just pick one sub-gradient line, the one parallel to the x-axis, i.e. we define the gradient to be 0 at the origin
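A minimal sketch of this convention (illustrative names; the gradient at exactly 0 is taken to be 0 here):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Sub-gradient convention: 1 for z > 0, 0 for z <= 0 (including the origin)
    return (z > 0).astype(float)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))        # [0. 0. 3.]
print(relu_grad(z))   # [0. 0. 1.]
```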
Training Process
Vector Formulation
Forward Pass
In the below picture, the first row of the weight matrix represents all of the weights going into the first neuron.
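A minimal vectorized forward pass under this convention (each row of W holds the weights going into one neuron; all names and sizes here are illustrative):

```python
import numpy as np

def forward_layer(W, b, x, activation=np.tanh):
    """One layer: affine z = W @ x + b, then elementwise activation y = f(z)."""
    z = W @ x + b
    return activation(z), z

rng = np.random.default_rng(0)
x = rng.standard_normal(4)          # input vector (4 features)
W = rng.standard_normal((3, 4))     # 3 neurons; row i = weights into neuron i
b = np.zeros(3)

y, z = forward_layer(W, b, x)
print(y.shape)                      # (3,)
```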
Backward Pass
Now, let z = the output of the affine function (wx + b) and y = f(z) = the output of the activation.
Having vectorized activations leads to the backprop below through y = f(z).
The Jacobian will therefore be the multivariate form of the gradient, giving us the direction in which increasing delta(z) will cause the maximum increase in delta(y)
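Concretely, for $y = f(z)$ with vectors $y$ and $z$, the Jacobian collects all the partial derivatives (for a scalar activation it is a diagonal matrix):

$$\left[\frac{\partial y}{\partial z}\right]_{ij} = \frac{\partial y_i}{\partial z_j}, \qquad \Delta y \approx \frac{\partial y}{\partial z}\,\Delta z$$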
Rule of thumb: the derivative of a scalar y = f(z), where z is a matrix, has the shape of the transposed matrix (z.T)
Extension: the derivative of a scalar w.r.t. a row vector is a column vector
Special Cases
Backward pass summary
- For the first gradient, we calculate the direction in Y along which the loss increases fastest, where the loss is given as
The derivative therefore becomes
Since Y is a column vector, the derivative of the Divergence (loss) w.r.t. Y is the gradient grad_Y(Div) shown in the picture above. This grad(Y) is a row vector!
- Now we compute the derivative through the affine variable z = wx+b; then we do
- Now, the derivative w.r.t. the previous Y (Y(n-1)) will be
- Remember, as we go back we just post multiply by:
- A jacobian if it’s an activation layer (vector activation)
- A weight matrix for an affine layer (scalar activation)
- Now, two more things remain: the derivatives w.r.t. the weights and the biases!
- Since the bias should be the same size as z, the derivative w.r.t. the bias is
- Similarly, the derivative w.r.t. the weights is given by
which will have the same shape as the weight matrix
- Remember, the shapes of the derivatives should always match the shapes of the weights or the bias itself (a small sketch follows below)
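A compact sketch of one backward step through an affine layer z = W·y_prev + b, following the row-vector gradient convention used above (all names here are illustrative, not the lecture's notation):

```python
import numpy as np

def affine_backward(grad_z, W, y_prev):
    """grad_z: row-vector gradient dDiv/dz with shape (1, n_out)."""
    grad_W = grad_z.T @ y_prev.reshape(1, -1)   # shape (n_out, n_in), same as W
    grad_b = grad_z.T                           # same size as the bias / z
    grad_y_prev = grad_z @ W                    # post-multiply by the weight matrix
    return grad_W, grad_b, grad_y_prev

grad_z = np.array([[0.1, -0.2, 0.3]])           # (1, 3)
W = np.zeros((3, 4))
y_prev = np.ones(4)
gW, gb, gy = affine_backward(grad_z, W, y_prev)
print(gW.shape, gb.shape, gy.shape)             # (3, 4) (3, 1) (1, 4)
```

For an activation layer one would post-multiply by its Jacobian instead, as noted in the bullets above.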
Loss Surface
- The common hypothesis is that in large networks there are a lot more saddle points than global minima.
- A saddle point is one where moving in one direction increases the loss and moving in another direction decreases it, i.e. depending on which direction you look along, you can appear to be at a minimum or a maximum. Also, the gradient at a saddle point is zero (therefore, you'll get stuck with gradient descent)
Issues with Convergence
Convergence for a Convex Problem
Consider a simple quadratic case
Note. Optimizing w.r.t the second order is called Newton’s method
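For the scalar quadratic $E(w) = \tfrac{1}{2}aw^2 + bw + c$ (the standard form; the exact notation is an assumption here), a step size equal to the inverse of the second derivative reaches the minimum in a single step:

$$w \;\leftarrow\; w - \frac{E'(w)}{E''(w)} = w - \frac{aw+b}{a} = -\frac{b}{a}$$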
Multivariate Convex
Now, the A matrix will introduce different slopes in different axes since it’s multivariate. To mitigate this we do
Now, you see that the A matrix has been removed (think of it as having become identity)
Then, as we saw in our derivation above, the optimal step size becomes inverse(A) = inverse(I) = 1.
Therefore, the optimal step size is now the same in all dimensions
The math for the above steps
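A sketch of the math for those steps, using the standard quadratic form (the notation here is an assumption): for $E(w) = \tfrac{1}{2}w^{\top}Aw + b^{\top}w + c$, rescaling with $A^{1/2}$ makes the curvature the identity, and the resulting single-step (Newton-style) update is

$$\hat{w} = A^{1/2}w \;\Rightarrow\; E = \tfrac{1}{2}\hat{w}^{\top}\hat{w} + b^{\top}A^{-1/2}\hat{w} + c, \qquad w \;\leftarrow\; w - A^{-1}\nabla_w E(w)$$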
Points to note
- In the simple scalar quadratic space the optimal step size was one value
- In the multivariate space, we need to scale and then find the optimal step
- However, after scaling we can still achieve a single-step move to the global minimum even in the multivariate space; we just need to find the inverse of a matrix
General Case of Convex Functions (function has higher order, i.e. not a quadratic)
- Even in such cases we can find a Taylor expansion and just truncate it up to second order
- In such cases it’ll just be an approximation, but let’s live with that
- Here we see that the second derivative is replaced by a Hessian
- Therefore in this case, the optimum step size would be the inverse(Hessian)
- The normalized and optimal update step in gradient descent form is shown below
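In gradient-descent form, this is the Newton update:

$$w^{(k+1)} = w^{(k)} - \mathbf{H}^{-1}\!\bigl(w^{(k)}\bigr)\,\nabla_w E\bigl(w^{(k)}\bigr)$$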
Issues with the above process
Solutions
In the momentum method, the first term on the RHS is a scaled version of the previous weight update, added to the current update step.
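For reference, the usual form of the momentum update, which the bullets below refer to (β = momentum coefficient, η = learning rate; the exact symbols are assumptions):

$$\Delta W^{(k)} = \beta\,\Delta W^{(k-1)} - \eta\,\nabla_W \mathrm{Loss}\bigl(W^{(k-1)}\bigr), \qquad W^{(k)} = W^{(k-1)} + \Delta W^{(k)}$$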
- Big red vector = previous update step
- Blue vector = 2nd term of RHS above
- Small red vector = scaled version of big red vector
- Black vector = final update (LHS term)
SGD vs Batch GD
SGD
In SGD we update after every training instance. The caveat for convergence is that the increments should be small, not too large. The increments should also shrink over time so that we don't keep shifting the decision boundary around too much due to just one training instance.
If we define epsilon to be the margin within which we need to be to have 'converged', then using the above shrinking learning rate of (1/k), where k = iteration number, we see that after k iterations we should be within roughly (1/k) of the desired range.
Therefore, if we only need to be within epsilon of the desired range, we can reach it in O(1/epsilon) iterations
Batch GD
Problems with SGD
- If our job were to minimize the shaded area in the below picture (shaded area = divergence), then we would want to push the red line up or down (blue = ground truth)
- If we look at the curve at its current location, we would want to move the red curve down drastically.
- In the below picture, we would want to push our red curve up drastically
- Therefore, the problem becomes that the estimated loss and the subsequent update have too high a variance!
- However, despite all this, SGD is fast since it works on only 1 sample at a time
Middle Ground Solution: Mini Batches
Here we compute the average loss over a mini-batch and use this averaged loss for update
But how does the variance of mini-batch compare to that of full-batch gradient descent?
- The variance of mini-batch GD, where b = batch size, is:
- The variance of full-batch GD will be (1/N) instead of (1/b)
- Now, if we have 1000 training samples, it can be seen that 1/100 is small enough that it won't make much of a difference whether it's 1/100 or 1/1000. This is why mini-batching works, i.e. even with 100 samples we capture almost the same variance as we would if we took all the training samples into consideration (a quick empirical sketch follows below)
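A quick empirical sketch of this claim (synthetic per-sample "gradients"; all numbers are illustrative): the variance of the averaged gradient shrinks roughly like 1/b, so b = 100 already behaves almost like the full batch of 1000.

```python
import numpy as np

rng = np.random.default_rng(0)
per_sample_grads = rng.standard_normal(1000)    # stand-in for N = 1000 per-sample gradients

def estimator_variance(grads, b, trials=5000):
    """Empirical variance of the mini-batch mean gradient for batch size b."""
    batches = rng.choice(grads, size=(trials, b), replace=True)
    return batches.mean(axis=1).var()

for b in (1, 10, 100, 1000):
    print(b, estimator_variance(per_sample_grads, b))
# Variance falls roughly like 1/b; the jump from b=100 to b=1000 changes little.
```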