Tuesday, June 2, 2020

Multi Layer Perceptron

Introduction

We have already seen what a perceptron model is, its definition and an implementation using scikit-learn module of python. The implementation was done on the iris dataset. There we had also mentioned that there were certain assumptions that we needed to make for the success of the model. The most important being that the classes of the dataset need to be linearly separable. But what does it mean? Consider training a perceptron for the AND/OR gate. First let us make a truth table.



AND Gate


OR Gate

Notice that in both cases, we can draw a straight line to segregate the 1's and the 0's. That means we can create a perceptron that knows how to solve an AND and another that can solve an OR classification. Now consider the XOR gate.


XOR Gate

Here we cannot separate the 1's and the 0's with a single straight line.  So, what is the solution? Minsky and Papert(1969) offered a solution to XOR problem by combining perceptron unit responses using second layer of units. That is they took an approach of pieccewise linear classification. But then we need to be able to express this gate as a combination of two gates that are linearly separable and the gate combining the two gates should itself display same behaviour. 

An XOR can be expressed as an AND operation between NAND and OR gate. This is left as an exercise to the reader(Try it!). Since AND is linearly separable it automatically follows so will be NAND. Hence we now have two perceptrons that can be taught to handle NAND gate and an OR gate and a third to handle the AND between the results of the two. So we get something as follows:
So what we see here is that more complex problems where classes are not linearly separable can be solved using multiple perceptrons. That is the core idea behind Multilayer Perceptron. Here we ended up creating a three layered model, the input layer, the hidden layer and the output layer and eventually arrive at a function like the following:
Now what requires an explanation is what those extra layers achieve. Remember a single layer is capable of linear classification, whereas an extra hidden layer can create more than one line and segregate between classes where the former cannot be used and we end up creating a polygon. Thus adding more layers can help us in creating arbitrarily complex shape. 


With single layer
With one hidden layer


With Multiple layers

MLP is very efficient for function approximation in high-dimensional spaces. The error convergence rate of MLP is independent of the input dimensionality, while conventional linear regression methods suffer from the curse of dimensionality, which results in a decrease of the convergence rate with an increase of the input dimensionality [5]. The necessary number of MLP neurons for approximating a target function depends only upon the basic geometrical shape of the target function, and not on the dimensionality of the input space.

But then the question arises how to train the model. In a single layer perceptron, as we saw, there was back propagation of error which helped us in incrementing(recalculating) the weights in the next iteration. But here we need to calculate the weights of the internal layers. The question is how to achieve this?  From the previous two diagram and our knowledge of perceptron, it is evident that there is no direct way to know the inputs of the hidden units. The first layer's recalculation can be done because of backpropagation of error. But what about the hidden layers? 
Correct inputs of hidden layers are unknown

So as we see, unlike the perceptron model where it was straightforward to calculate the new weights, here we need a different approach to calculate the internal weights even during the first pass. This problem will certainly be addressed but for that it is imperative to first understand the architecture of the model.

Architecture of the model

  • There should not be any connections between neurons within the layer.
  • There should not be any direct connection between input and output layer, ie. there should be at least one hidden layer between input and output
  • All the neurons in layer n must be connected to all neurons in layer n+1, i.e it should be a many to many onto mapping
  • Actual implementation of the model often contains more than three layers.
  • The number of output units need not be equal to number of input unis irrespective of layers.
  • The number of hidden unis per layer can be more or less than input or output units and hence is independent of both
But that's not all. Besides the feedforward connections mentioned above, each unit of each hidden row receives an ‘error feedback’ connection from each of the units above it. However, these are not merely fanned out copies of a broadcast output (as the forward connections are), but are each separate connections, each carrying a different signal. So each unit is composed of a single sun processing element and several planet processing elements. Each planet produces an output signal that is distributed to both its sun and to the sun of the previous layer that supplied input to it. Each planet receives input from one of the suns of the previous layer as well as its own sun. As stated above, the hidden row suns receive input from one of the planets of each of the suns on the next higher row. The output row suns receive the ‘correct answer’ yfor their component of the output vector on each training trial. The network functions in two stages: a forward pass and d backward pass. A scheduling processing element sends signals to each of the processing elements of the network telling it when to apply its processing element transfer function and whether to apply the forward pass part of it or the backward pass part of it. After the transfer function is applied, the output signal is latched to the value determined during the update. This value is therefore constant until the next update. 

The scheduling of the network’s operation consists of two “sweeps” through the network. The first sweep (the forward pass) starts by inserting the vector xinto the network’s first row, the input (or fanout) layer. The processing elements of the first layer transmit all of the components of xto all of the units of the second row of the network. The outputs of the units of row two are then transmitted to all of the units of row three, and so on, until finally the m output units (the units of the top, Kth row) emit the components of the vector ŷk (the network’s estimate of the desired output yk). After the estimate ŷk is emitted, each of the output units is supplied with its component of the correct output vector yk, starting the second, backward, sweep through the network (the backward pass). The output suns compute their δkis and transmit these to their planets. The planets then update their Δkij values (or update their weights if enough batching has been carried out) and then transmit the values wkijoldδki, to the suns of the previous row. The superscript ‘‘old’’ indicates that the (non-updated) weight value used on the forward pass is used in this calculation. This process continues until the planets of row 2 (the first hidden layer) have been updated. The cycle can then be repeated. In short, each cycle consists of the inputs to the network “bubbling up” from the bottom to the top and then the errors “percolating down” from the top to the bottom.

Gradient Descent

Now that we are done discussing how the model works, let us get back to the unanswered question, i.e. how to calculate weights of the internal nodes. It is important to remember that our main target is to reduce the error, i.e. misclassification. A straightforward way of achieving this is gradient descent which tries to minimize any differentiable and continuous non linear function in an iterative fashion. So the function we define is the sum of squared error and this will be the function that we will try to minimise and then use gradient descent for error minimisation and in this way modify the weight using a learning rate

Deriving the rule:

But how do we extend it to two layers? The answer is switch to some smooth non linear unit which will act as activation function for the hidden layers



Notice the two functions. Neurons with logistic activation can only output values between 0 and 1. To enable output in a wider range of real number variants like tanh are used. Now that we have discussed the network inside out, it is time to formalise the algorithm. It is important to remember here that it has two parts, forward propagation of activity and backward propagation of error. 

The Algorithm

Part 1: Forward Propagation of activity

Step 1: Initialise weights at random, choose a learning rate η
Until network is trained:
For each training example i.e. input pattern and target output(s):
Step 2: Do forward pass through net (with fixed weights) to produce output(s)
–i.e., in Forward Direction, layer by layer:
•Inputs applied
•Multiplied by weights
•Summed
•‘Squashed’ by sigmoid activation function
•Output passed to each neuron in next layer
–Repeat above until network output(s) produced

Part 2: Backward Propagation of error:

Step 3:Compute error (delta or local gradient) for each output unit δk
Layer by layer, compute error (delta or local gradient) for each hidden unit δj
             by backpropagating errors 
Step 4: Next, update all the weights Δwij by gradient descent function and go back to step 2
             - The overall MLP learning algorithm, involving forward pass and backpropagation 
              of  error is also called Generalised Delta Rule(GDR) or more commonly the back 
              propagation(BP) algorithm.


References:
1. Artificial Neural Networks David S. Touretzky
2. MultiLayer Perceptrons: Architecture and Error backpropagation
3. Theory of the Backpropagation Neural Network by Robert Hecht-Nielsen by K. L. Du and 
M.N.S. Swamy
4. Perceptrons by Marvin L. Minsky and Seymour A. Papert