Table of contents {: .text-delta }
  1. Before you Begin
  2. Multi-Layer Perceptrons Basics
    1. Perceptron as a boolean gate
      1. Recap types of gates:
      2. XOR Gate
    2. Why do we need depth?
    3. Perceptrons as Linear Classifiers
      1. Complex Decision Boundaries
      2. Another case for depth
    4. Sufficiency of Architecture
  3. Further on MLPs
    1. Include bias as an input for simplifying downstream computations
    2. Proceeding from simple boolean functions
    3. The Perceptron algorithm
    4. Why is the perceptron algorithm not good?
      1. The primary issue is that the simple perceptron is flat and non-differentiable.
      2. Data is never fully clean
      3. The solution: Differentiable activations
  4. Thinking about Derivatives

Before you Begin

Ref: 11-785

Multi-Layer Perceptrons Basics

These are machines that can model any function in the world! For now, let’s start with simple functions like boolean gates and build our way up.

The basic working is shown below:

Perceptron as a boolean gate

  • Each perceptron seen above is an addition gate: it computes a weighted sum of its inputs
  • The sum is compared against the threshold value given by the number inside the circle
  • Therefore, the threshold dictates what type of gate the perceptron functions as (see the sketch below)
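
As a minimal sketch of this thresholding idea (the unit weights and the specific thresholds below are assumptions chosen to reproduce AND and OR, not values taken from the figure):

```python
import numpy as np

def perceptron(x, w, threshold):
    """Fires (returns 1) iff the weighted sum of inputs reaches the threshold."""
    return int(np.dot(w, x) >= threshold)

# With unit weights, the threshold alone decides the gate:
# threshold = 2 -> AND, threshold = 1 -> OR (for two binary inputs).
for a in (0, 1):
    for b in (0, 1):
        x = np.array([a, b])
        print(a, b,
              "AND:", perceptron(x, np.array([1, 1]), 2),
              "OR:",  perceptron(x, np.array([1, 1]), 1))
```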

Recap types of gates:

Andrej Reference

  • Add gate
  • Max gate
  • Multiply gate
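
A brief sketch of how these gates pass gradients during backprop (standard behaviour; the function names here are just for illustration): the add gate distributes the incoming gradient to both inputs, the max gate routes it to the larger input, and the multiply gate scales it by the other input.

```python
def add_gate(x, y, dout):
    # Forward: x + y. Backward: gradient passes through unchanged to both inputs.
    return x + y, (dout, dout)

def max_gate(x, y, dout):
    # Forward: max(x, y). Backward: gradient is routed only to the larger input.
    return max(x, y), (dout if x >= y else 0.0, dout if y > x else 0.0)

def mul_gate(x, y, dout):
    # Forward: x * y. Backward: each input receives the gradient scaled by the other.
    return x * y, (dout * y, dout * x)
```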

XOR Gate

An XOR gate is activated only if the inputs are (1,0) or (0,1). This is a bit tricky and needs to be modelled with a network of perceptrons:
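
One common construction (an assumption here, since the original figure is not shown) is XOR(x, y) = AND(OR(x, y), NAND(x, y)); a minimal sketch:

```python
def step(z, threshold):
    return int(z >= threshold)

def xor(a, b):
    # Hidden layer: OR and NAND, each a single threshold perceptron.
    h_or   = step(a + b, 1)      # fires if at least one input is 1
    h_nand = step(-a - b, -1)    # fires unless both inputs are 1
    # Output layer: AND of the two hidden units.
    return step(h_or + h_nand, 2)

print([xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```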

Therefore, by combining perceptrons in this manner, we can see that MLPs are universal boolean functions.

We can also claim that any boolean function can be modelled with just 1 hidden layer.

Reason: any boolean function can be written as an OR of ANDs (its disjunctive normal form), so one hidden perceptron per satisfying input pattern, combined by a single output perceptron, suffices. The catch is that this single hidden layer may need an exponential number of neurons.

Why do we need depth?

Let’s take a slightly more difficult case, say an XOR over many inputs (a parity function). With a single hidden layer, the number of perceptrons needed grows exponentially with the number of inputs.

However, if we model the same function depthwise, by chaining 2-input XORs, the number of gates grows only linearly:
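
A minimal sketch of the depthwise construction (the linear chain below is an assumption; a balanced tree of XORs works just as well):

```python
from functools import reduce

def xor2(a, b):
    # 2-input XOR built from threshold perceptrons (same construction as above).
    h_or, h_nand = int(a + b >= 1), int(-a - b >= -1)
    return int(h_or + h_nand >= 2)

def parity(bits):
    # Chain 2-input XOR sub-networks: depth grows with the number of inputs,
    # but the total number of gates grows only linearly.
    return reduce(xor2, bits)

print(parity([1, 0, 1, 1]))  # 1 -> odd number of ones
```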

Perceptrons as Linear Classifiers

If we have 2 boolean inputs, we can have 4 combinations:

  • (0,0)
  • (0,1)
  • (1,0)
  • (1,1)

Now, using an OR gate, a NOT-Y gate, and an XOR gate, we can model some basic classifiers:

Note: clearly the XOR needs two boundaries (we call these decision boundaries).
Therefore, we say that the XOR cannot be modelled with just one perceptron (see the sketch below).
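
A minimal sketch of a single perceptron as a linear classifier (the weights are chosen by hand as an assumption): it handles OR with one linear boundary, but no single choice of weights reproduces XOR.

```python
import numpy as np

def linear_classifier(x, w, b):
    # A single perceptron draws one linear decision boundary: w.x + b >= 0.
    return int(np.dot(w, x) + b >= 0)

points = [(0, 0), (0, 1), (1, 0), (1, 1)]

# OR is linearly separable: one boundary is enough.
print([linear_classifier(np.array(p), np.array([1, 1]), -0.5) for p in points])  # [0, 1, 1, 1]

# The XOR targets [0, 1, 1, 0] cannot be produced by any single (w, b):
# XOR needs two boundaries, hence a hidden layer.
```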

Complex Decision Boundaries

If we create multiple decision boundaries, we can do the following:

  • Find the output of each decision boundary (i.e. does my point lie to the left or right of that boundary?)
  • This step happens in the hidden layer
  • Then we can accumulate these decision-boundary outputs
  • The final neuron fires only if the sum equals 5, i.e. the point lies on the inner side of all five boundaries (see the sketch below)
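
A minimal sketch of this pentagon detector (the five half-plane normals below are hypothetical; the boundaries from the original figure are not given):

```python
import numpy as np

def inside_pentagon(point):
    # Hidden layer: five perceptrons, one per edge of a regular pentagon centred
    # at the origin; each fires if the point is on the inner side of that edge.
    angles = np.linspace(0, 2 * np.pi, 5, endpoint=False)
    normals = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    hidden = (normals @ point <= 1.0).astype(int)
    # Output layer: fires only when the sum of hidden activations equals 5,
    # i.e. the point is inside all five boundaries at once.
    return int(hidden.sum() >= 5)

print(inside_pentagon(np.array([0.0, 0.0])))  # 1 (centre is inside)
print(inside_pentagon(np.array([2.0, 2.0])))  # 0 (far outside)
```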

This way, we can model complex geometries, even intricate ones like:

Another case for depth

Now, consider the above double pentagon figure. What if we were to do it using just one layer?

We would have to approximate it using near-circular ("cylindrical") regions, essentially polygons with a very large number of sides (say 1000).

We can then tile many of these near-circular decision regions to approximately make up our double pentagon, as shown below:

But as seen above, the major drawback is that the first layer would need an enormous (in the limit, infinite) number of neurons!

Now, comparing our depth-wise vs width-wise (single-layer) solutions:

Sufficiency of Architecture

  • A network architecture is sufficient (i.e. sufficiently broad and sufficiently deep) if it can represent the function we are trying to model.

  • Conversely, if a network is not sufficient, early layers can lose information, and this loss is propagated to deeper layers, which cannot recover it.

    In the above image, if the red lines are the first layer, the information passed to the second layer is only which tiny diamond region we are in. However, we have no idea where we are within that diamond. (This is a loss of information to the next layer!)

    To mitigate this loss, instead of doing hard thresholding, we can use softer decision boundaries as shown below:
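
A minimal sketch of this idea: replacing the hard threshold with a sigmoid gives a graded output that still says roughly which side of the boundary we are on, but also how far from it (the signed distances below are a hypothetical example).

```python
import numpy as np

def hard_threshold(z):
    return (z >= 0).astype(float)      # only says which side of the boundary

def soft_threshold(z):
    return 1.0 / (1.0 + np.exp(-z))    # sigmoid: also says how far from it

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])   # signed distances to a boundary
print(hard_threshold(z))   # [0. 0. 1. 1. 1.]
print(soft_threshold(z))   # graded values between 0 and 1
```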

Further on MLPs

Include bias as an input for simplifying downstream computations

(Figures: bias as a separate term vs. bias included in the input.)

This also simplifies the affine equation (z = Wx + b) into the purely linear form (z = Wx).
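
A minimal sketch of the trick: append a constant 1 to the input and fold b into an extra column of W (the shapes below are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 4)), rng.normal(size=3)
x = rng.normal(size=4)

# Affine form: z = Wx + b
z_affine = W @ x + b

# Linear form: augment x with a 1 and absorb b as an extra column of W
W_aug = np.hstack([W, b[:, None]])
x_aug = np.append(x, 1.0)
z_linear = W_aug @ x_aug

print(np.allclose(z_affine, z_linear))  # True
```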

Proceeding from simple boolean functions

  • We cannot handcraft our network the way we did for the double pentagon
  • Therefore, we need a learnable method
  • Also, most real functions are very complex and don’t have nice visualizations like the double pentagon
  • Therefore, we also need a way of learning such complex functions from only a few samples, without relying on a continuous description of the function
  • We do this with a sampling approach, where we calculate the error for every sample in our training data

The Perceptron algorithm
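
The algorithm itself is not written out in these notes; as a reminder, the classic rule is: for each misclassified sample, nudge the weights by adding (or subtracting) that sample. A minimal sketch, assuming labels in {-1, +1} and the bias folded in as above:

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """Classic perceptron learning rule; assumes a bias column of ones in X."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:   # misclassified (or on the boundary)
                w += yi * xi              # move the boundary towards the sample
    return w

# Tiny linearly separable example (OR-like data), bias folded in as a 1.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1])
w = train_perceptron(X, y)
print((X @ w > 0).astype(int))  # [0 1 1 1]
```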

Why is the perceptron algorithm not good?

The primary issue is that the simple perceptron is flat and non-differentiable.

Data is never fully clean

We almost never have nicely linearly separable data, and on non-separable data the perceptron algorithm never converges.

The solution: Differentiable activations

Now, making this activation differentiable has two benefits:

  1. Lets us know whether our changes are having a positive or negative effect on the prediction
  2. It allows us to do backprop!

Thinking about Derivatives

  • Instead of thinking of the derivative as dy/dx (a division, which does not make much sense when x and y are vectors), we define it through the relation dy = α · dx, where α is a vector and α · dx is a dot product. This α is the vector which, when dotted with a small change in x, tells us how y changes, and it points in the direction of the fastest increase in y.

  • The advantage of writing it as dy = α · dx is that, for a multivariate function like the one above, the components of α are simply the partial derivatives of y with respect to each component of x (i.e. α is the gradient).

  • Now we can clearly see how the gradient gives the direction of fastest increase of the function. Therefore, if we want to minimize, we move in the direction exactly opposite to the gradient.
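
A minimal sketch of this on a toy function (the function and step size are hypothetical): repeatedly stepping against the gradient drives y down.

```python
import numpy as np

def y(x):
    return np.sum(x ** 2)      # toy multivariate function

def grad_y(x):
    return 2 * x               # alpha: partial derivatives of y w.r.t. each component of x

x = np.array([3.0, -2.0])
for _ in range(50):
    x -= 0.1 * grad_y(x)       # move opposite to the gradient to decrease y

print(x, y(x))                 # x close to [0, 0], y close to 0
```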