Data jabberwocky: June 2020

Introduction

We have already seen what a perceptron model is, its definition and an implementation using scikit-learn module of python. The implementation was done on the iris dataset. There we had also mentioned that there were certain assumptions that we needed to make for the success of the model. The most important being that the classes of the dataset need to be linearly separable. But what does it mean? Consider training a perceptron for the AND/OR gate. First let us make a truth table.

AND Gate

OR Gate

Notice that in both cases, we can draw a straight line to segregate the 1's and the 0's. That means we can create a perceptron that knows how to solve an AND and another that can solve an OR classification. Now consider the XOR gate.

XOR Gate

Here we cannot separate the 1's and the 0's with a single straight line. So, what is the solution? Minsky and Papert(1969) offered a solution to XOR problem by combining perceptron unit responses using second layer of units. That is they took an approach of pieccewise linear classification. But then we need to be able to express this gate as a combination of two gates that are linearly separable and the gate combining the two gates should itself display same behaviour.

An XOR can be expressed as an AND operation between NAND and OR gate. This is left as an exercise to the reader(Try it!). Since AND is linearly separable it automatically follows so will be NAND. Hence we now have two perceptrons that can be taught to handle NAND gate and an OR gate and a third to handle the AND between the results of the two. So we get something as follows:

So what we see here is that more complex problems where classes are not linearly separable can be solved using multiple perceptrons. That is the core idea behind Multilayer Perceptron. Here we ended up creating a three layered model, the input layer, the hidden layer and the output layer and eventually arrive at a function like the following:

Now what requires an explanation is what those extra layers achieve. Remember a single layer is capable of linear classification, whereas an extra hidden layer can create more than one line and segregate between classes where the former cannot be used and we end up creating a polygon. Thus adding more layers can help us in creating arbitrarily complex shape.

With single layer

With one hidden layer

With Multiple layers

MLP is very efficient for function approximation in high-dimensional spaces. The error convergence rate of MLP is independent of the input dimensionality, while conventional linear regression methods suffer from the curse of dimensionality, which results in a decrease of the convergence rate with an increase of the input dimensionality [5]. The necessary number of MLP neurons for approximating a target function depends only upon the basic geometrical shape of the target function, and not on the dimensionality of the input space.

But then the question arises how to train the model. In a single layer perceptron, as we saw, there was back propagation of error which helped us in incrementing(recalculating) the weights in the next iteration. But here we need to calculate the weights of the internal layers. The question is how to achieve this? From the previous two diagram and our knowledge of perceptron, it is evident that there is no direct way to know the inputs of the hidden units. The first layer's recalculation can be done because of backpropagation of error. But what about the hidden layers?

Correct inputs of hidden layers are unknown

So as we see, unlike the perceptron model where it was straightforward to calculate the new weights, here we need a different approach to calculate the internal weights even during the first pass. This problem will certainly be addressed but for that it is imperative to first understand the architecture of the model.

Architecture of the model

There should not be any connections between neurons within the layer.
There should not be any direct connection between input and output layer, ie. there should be at least one hidden layer between input and output
All the neurons in layer n must be connected to all neurons in layer n+1, i.e it should be a many to many onto mapping
Actual implementation of the model often contains more than three layers.
The number of output units need not be equal to number of input unis irrespective of layers.
The number of hidden unis per layer can be more or less than input or output units and hence is independent of both

But that's not all. Besides the feedforward connections mentioned above, each unit of each hidden row receives an ‘error feedback’ connection from each of the units above it. However, these are not merely fanned out copies of a broadcast output (as the forward connections are), but are each separate connections, each carrying a different signal. So each unit is composed of a single sun processing element and several planet processing elements. Each planet produces an output signal that is distributed to both its sun and to the sun of the previous layer that supplied input to it. Each planet receives input from one of the suns of the previous layer as well as its own sun. As stated above, the hidden row suns receive input from one of the planets of each of the suns on the next higher row. The output row suns receive the ‘correct answer’ y_ifor their component of the output vector on each training trial. The network functions in two stages: a forward pass and d backward pass. A scheduling processing element sends signals to each of the processing elements of the network telling it when to apply its processing element transfer function and whether to apply the forward pass part of it or the backward pass part of it. After the transfer function is applied, the output signal is latched to the value determined during the update. This value is therefore constant until the next update.

The scheduling of the network’s operation consists of two “sweeps” through the network. The first sweep (the forward pass) starts by inserting the vector x_kinto the network’s first row, the input (or fanout) layer. The processing elements of the first layer transmit all of the components of x_kto all of the units of the second row of the network. The outputs of the units of row two are then transmitted to all of the units of row three, and so on, until finally the m output units (the units of the top, K^throw) emit the components of the vector ŷ_k (the network’s estimate of the desired output y_k). After the estimate ŷ_k is emitted, each of the output units is supplied with its component of the correct output vector y_k, starting the second, backward, sweep through the network (the backward pass). The output suns compute their δ_kis and transmit these to their planets. The planets then update their Δ_kij values (or update their weights if enough batching has been carried out) and then transmit the values w_kij^oldδ_ki, to the suns of the previous row. The superscript ‘‘old’’ indicates that the (non-updated) weight value used on the forward pass is used in this calculation. This process continues until the planets of row 2 (the first hidden layer) have been updated. The cycle can then be repeated. In short, each cycle consists of the inputs to the network “bubbling up” from the bottom to the top and then the errors “percolating down” from the top to the bottom.

Gradient Descent

Now that we are done discussing how the model works, let us get back to the unanswered question, i.e. how to calculate weights of the internal nodes. It is important to remember that our main target is to reduce the error, i.e. misclassification. A straightforward way of achieving this is gradient descent which tries to minimize any differentiable and continuous non linear function in an iterative fashion. So the function we define is the sum of squared error and this will be the function that we will try to minimise and then use gradient descent for error minimisation and in this way modify the weight using a learning rate

Deriving the rule:

But how do we extend it to two layers? The answer is switch to some smooth non linear unit which will act as activation function for the hidden layers.

Notice the two functions. Neurons with logistic activation can only output values between 0 and 1. To enable output in a wider range of real number variants like tanh are used. Now that we have discussed the network inside out, it is time to formalise the algorithm. It is important to remember here that it has two parts, forward propagation of activity and backward propagation of error.

The Algorithm

Part 1: Forward Propagation of activity

•Step 1: Initialise weights at random, choose a learning rate η

•Until network is trained:

•For each training example i.e. input pattern and target output(s):

•Step 2: Do forward pass through net (with fixed weights) to produce output(s)

–i.e., in Forward Direction, layer by layer:

•Inputs applied

•Multiplied by weights

•Summed

•‘Squashed’ by sigmoid activation function

•Output passed to each neuron in next layer

–Repeat above until network output(s) produced

Part 2: Backward Propagation of error:

Step 3:Compute error (delta or local gradient) for each output unit δ_k

Layer by layer, compute error (delta or local gradient) for each hidden unit δ_j

by backpropagating errors

Step 4: Next, update all the weights Δw_ijby gradient descent function and go back to step 2

- The overall MLP learning algorithm, involving forward pass and backpropagation

of error is also called Generalised Delta Rule(GDR) or more commonly the back

propagation(BP) algorithm.

References:

1. Artificial Neural Networks David S. Touretzky

2. MultiLayer Perceptrons: Architecture and Error backpropagation

3. Theory of the Backpropagation Neural Network by Robert Hecht-Nielsen by K. L. Du and

M.N.S. Swamy

4. Perceptrons by Marvin L. Minsky and Seymour A. Papert

Data jabberwocky

Tuesday, June 2, 2020

Multi Layer Perceptron

Introduction

Architecture of the model

Followers

Popular Posts