Tuesday, June 2, 2020

Multi Layer Perceptron

Introduction

We have already seen what a perceptron model is, its definition and an implementation using the scikit-learn module of Python, applied to the iris dataset. There we also mentioned certain assumptions that we needed to make for the model to succeed, the most important being that the classes of the dataset must be linearly separable. But what does that mean? Consider training a perceptron for the AND and OR gates. First let us make a truth table.



AND Gate


OR Gate

Notice that in both cases we can draw a single straight line to segregate the 1's from the 0's. That means we can create one perceptron that can solve the AND classification and another that can solve the OR. Now consider the XOR gate.


XOR Gate

Here we cannot separate the 1's and the 0's with a single straight line. So, what is the solution? Minsky and Papert (1969) offered a solution to the XOR problem by combining perceptron unit responses using a second layer of units, i.e. they took a piecewise linear approach to classification. But then we need to be able to express this gate as a combination of gates that are linearly separable, and the gate combining them should itself display the same behaviour.

An XOR can be expressed as an AND operation between a NAND and an OR gate. This is left as an exercise to the reader (try it!). Since AND is linearly separable, it follows that NAND is too. Hence we now have two perceptrons that can be taught to handle a NAND gate and an OR gate, and a third to handle the AND between the results of the two. So we get something as follows:
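As a quick check of the decomposition itself, here is a minimal sketch in plain Python (the gate helpers are ordinary functions, not trained perceptrons):

def NAND(a, b): return 1 - (a & b)
def OR(a, b):   return a | b
def AND(a, b):  return a & b

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(NAND(a, b), OR(a, b)))   # prints the XOR truth table: 0, 1, 1, 0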
So what we see here is that more complex problems, where the classes are not linearly separable, can be solved using multiple perceptrons. That is the core idea behind the Multilayer Perceptron. Here we ended up creating a three-layered model, with an input layer, a hidden layer and an output layer, and eventually arrive at a function like the following:
Now what requires an explanation is what those extra layers achieve. Remember that a single layer is capable only of linear classification. An extra hidden layer can create more than one line and segregate classes where a single layer cannot, so that we end up carving out a polygon. Adding more layers can thus help us create arbitrarily complex decision regions.


With single layer
With one hidden layer


With Multiple layers

The MLP is very efficient for function approximation in high-dimensional spaces. The error convergence rate of an MLP is independent of the input dimensionality, while conventional linear regression methods suffer from the curse of dimensionality, where the convergence rate decreases as the input dimensionality increases [5]. The number of MLP neurons necessary to approximate a target function depends only on the basic geometrical shape of the target function, not on the dimensionality of the input space.

But then the question arises: how do we train the model? In the single-layer perceptron, as we saw, back propagation of the error helped us recalculate the weights for the next iteration. But here we also need to calculate the weights of the internal layers. How do we achieve this? From the previous two diagrams and our knowledge of the perceptron, it is evident that there is no direct way to know the inputs of the hidden units. The final layer's weights can be recalculated because the error at the output is directly observable. But what about the hidden layers?
Correct inputs of hidden layers are unknown

So, unlike the perceptron model, where it was straightforward to calculate the new weights, here we need a different approach to calculate the internal weights, even during the first pass. This problem will certainly be addressed, but first it is imperative to understand the architecture of the model.

Architecture of the model

  • There should not be any connections between neurons within the same layer.
  • There should not be any direct connection between the input and output layers, i.e. there should be at least one hidden layer between input and output.
  • All the neurons in layer n must be connected to all the neurons in layer n+1, i.e. consecutive layers are fully connected.
  • Actual implementations of the model often contain more than three layers.
  • The number of output units need not be equal to the number of input units, irrespective of the number of layers.
  • The number of hidden units per layer can be more or less than the number of input or output units, and hence is independent of both. (A minimal sketch of such an architecture follows this list.)
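To make the architecture concrete, here is a minimal sketch using scikit-learn's MLPClassifier (the same module used in the earlier perceptron post) to learn XOR with a single hidden layer; the hidden-layer size and other hyperparameters are illustrative choices, not prescriptions:

import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                        # XOR labels

# one hidden layer of 4 units between input and output, fully connected
clf = MLPClassifier(hidden_layer_sizes=(4,), activation='tanh',
                    solver='lbfgs', random_state=1, max_iter=1000)
clf.fit(X, y)
print(clf.predict(X))                             # expected: [0 1 1 0]

Note that a single hidden layer is exactly what the XOR construction above requires.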
But that's not all. Besides the feedforward connections mentioned above, each unit of each hidden row receives an 'error feedback' connection from each of the units above it. However, these are not merely fanned-out copies of a broadcast output (as the forward connections are); they are separate connections, each carrying a different signal. So each unit is composed of a single sun processing element and several planet processing elements. Each planet produces an output signal that is distributed both to its sun and to the sun of the previous layer that supplied input to it. Each planet receives input from one of the suns of the previous layer as well as from its own sun. As stated above, the hidden row suns receive input from one of the planets of each of the suns on the next higher row. The output row suns receive the 'correct answer' y for their component of the output vector on each training trial. The network functions in two stages: a forward pass and a backward pass. A scheduling processing element sends signals to each of the processing elements of the network, telling it when to apply its transfer function and whether to apply the forward-pass or the backward-pass part of it. After the transfer function is applied, the output signal is latched to the value determined during the update. This value therefore remains constant until the next update.

The scheduling of the network’s operation consists of two “sweeps” through the network. The first sweep (the forward pass) starts by inserting the vector x into the network’s first row, the input (or fanout) layer. The processing elements of the first layer transmit all of the components of x to all of the units of the second row of the network. The outputs of the units of row two are then transmitted to all of the units of row three, and so on, until finally the m output units (the units of the top, Kth row) emit the components of the vector ŷk (the network’s estimate of the desired output yk). After the estimate ŷk is emitted, each of the output units is supplied with its component of the correct output vector yk, starting the second, backward sweep through the network (the backward pass). The output suns compute their δki's and transmit these to their planets. The planets then update their Δkij values (or update their weights, if enough batching has been carried out) and then transmit the values wkij(old)·δki to the suns of the previous row. The superscript “old” indicates that the (non-updated) weight value used on the forward pass is used in this calculation. This process continues until the planets of row 2 (the first hidden layer) have been updated. The cycle can then be repeated. In short, each cycle consists of the inputs to the network “bubbling up” from the bottom to the top and then the errors “percolating down” from the top to the bottom.

Gradient Descent

Now that we are done discussing how the model works, let us get back to the unanswered question, i.e. how to calculate the weights of the internal nodes. It is important to remember that our main target is to reduce the error, i.e. the misclassification. A straightforward way of achieving this is gradient descent, which iteratively minimises any differentiable, continuous non-linear function. So we define our error function as the sum of squared errors, minimise it by gradient descent, and modify the weights using a learning rate.
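As an illustration, a minimal sketch of gradient descent for a one-weight model y = w·x under the sum-of-squared-error function (the data and learning rate here are arbitrary choices):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
t = np.array([2.0, 4.0, 6.0])          # targets generated by the true relation t = 2x

w, eta = 0.0, 0.05                     # initial weight and learning rate
for _ in range(50):
    y = w * x                          # model output
    grad = -2 * np.sum((t - y) * x)    # dE/dw for E = sum((t - y)^2)
    w -= eta * grad                    # step against the gradient
print(round(w, 4))                     # converges towards 2.0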

Deriving the rule:
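For a single unit with output y = Σ wi·xi and target t, the rule falls out of differentiating the squared error (a sketch, stated for completeness):

E = ½(t - y)²
∂E/∂wi = -(t - y)·xi
Δwi = -η·∂E/∂wi = η·(t - y)·xi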

But how do we extend it to two layers? The answer is to switch to a smooth non-linear unit that will act as the activation function for the hidden layers.
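The two standard choices are the logistic (sigmoid) function and its rescaled variant tanh:

σ(x) = 1/(1 + e^(-x)), with outputs in (0, 1)
tanh(x) = (e^x - e^(-x))/(e^x + e^(-x)) = 2σ(2x) - 1, with outputs in (-1, 1)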



Notice the two functions. Neurons with logistic activation can only output values between 0 and 1. To enable outputs over a wider range of real numbers, variants like tanh are used. Now that we have discussed the network inside out, it is time to formalise the algorithm. It is important to remember that it has two parts: forward propagation of activity and backward propagation of error.

The Algorithm

Part 1: Forward Propagation of activity

Step 1: Initialise the weights at random and choose a learning rate η
Until the network is trained:
For each training example, i.e. input pattern and target output(s):
Step 2: Do a forward pass through the net (with fixed weights) to produce the output(s)
– i.e., in the forward direction, layer by layer:
• Inputs applied
• Multiplied by weights
• Summed
• ‘Squashed’ by the sigmoid activation function
• Output passed to each neuron in the next layer
– Repeat the above until the network output(s) are produced

Part 2: Backward Propagation of error:

Step 3: Compute the error (delta, or local gradient) for each output unit δk. Then, layer by layer, compute the error (delta, or local gradient) for each hidden unit δj by backpropagating the errors.
Step 4: Next, update all the weights Δwij by gradient descent and go back to Step 2.
The overall MLP learning algorithm, involving the forward pass and backpropagation of error, is also called the Generalised Delta Rule (GDR) or, more commonly, the back propagation (BP) algorithm.
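Putting the two parts together, here is a compact from-scratch sketch of the algorithm for one hidden layer, trained on XOR (layer sizes, learning rate and iteration count are illustrative choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)    # XOR targets

eta = 0.5                                          # learning rate
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)      # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)      # hidden -> output

for _ in range(10000):
    # forward pass: multiply by weights, sum, squash with the sigmoid
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # backward pass: compute deltas layer by layer, output units first
    delta_out = (out - y) * out * (1 - out)        # deltas for the output units
    delta_h = (delta_out @ W2.T) * h * (1 - h)     # deltas backpropagated to hidden units

    # update all weights by gradient descent (the Generalised Delta Rule)
    W2 -= eta * h.T @ delta_out; b2 -= eta * delta_out.sum(axis=0)
    W1 -= eta * X.T @ delta_h;   b1 -= eta * delta_h.sum(axis=0)

print(out.round(2).ravel())                        # should approach [0. 1. 1. 0.]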


References:
1. Artificial Neural Networks by David S. Touretzky
2. MultiLayer Perceptrons: Architecture and Error Backpropagation
3. Theory of the Backpropagation Neural Network by Robert Hecht-Nielsen
4. K. L. Du and M. N. S. Swamy
5. Perceptrons by Marvin L. Minsky and Seymour A. Papert


Friday, January 3, 2020

Adaboost Example

DATA: CREDIT CARD DEFAULT DATA FROM HERE

Data Description:

Data Set Characteristics: Multivariate
Number of instances: 1000
Attribute Characteristics: Integer, Categorical
Number of Attributes: 20
Missing Values: N/A.


Dataset Description:
There are 20 predictor variables, followed by the class (1 or 2) as the final column

Attribute 1: (qualitative) Status of existing checking account
A11 : ... < 0 DM
A12 : 0 <= ... < 200 DM
A13 : ... >= 200 DM / salary assignments for at least 1 year
A14 : no checking account

Attribute 2: (numerical) Duration in month

Attribute 3: (qualitative) Credit history
A30 : no credits taken/ all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account/ other credits existing (not at this bank)

Attribute 4: (qualitative) Purpose
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
A44 : domestic appliances
A45 : repairs
A46 : education
A47 : (vacation - does not exist?)
A48 : retraining
A49 : business
A410 : others

Attribute 5: (numerical) Credit amount

Attribute 6: (qualitative) Savings account/bonds
A61 : ... < 100 DM
A62 : 100 <= ... < 500 DM
A63 : 500 <= ... < 1000 DM
A64 : ... >= 1000 DM
A65 : unknown/ no savings account

Attribute 7: (qualitative) Present employment since
A71 : unemployed
A72 : ... < 1 year
A73 : 1 <= ... < 4 years
A74 : 4 <= ... < 7 years
A75 : ... >= 7 years

Attribute 8: (numerical) Installment rate in percentage of disposable income

Attribute 9: (qualitative) Personal status and sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : female : single

Attribute 10: (qualitative) Other debtors / guarantors
A101 : none
A102 : co-applicant
A103 : guarantor

Attribute 11: (numerical) Present residence since

Attribute 12: (qualitative) Property
A121 : real estate
A122 : if not A121 : building society savings agreement/ life insurance
A123 : if not A121/A122 : car or other, not in attribute 6
A124 : unknown / no property

Attribute 13: (numerical) Age in years

Attribute 14: (qualitative) Other installment plans
A141 : bank
A142 : stores
A143 : none

Attribute 15: (qualitative) Housing
A151 : rent
A152 : own
A153 : for free

Attribute 16: (numerical) Number of existing credits at this bank

Attribute 17: (qualitative) Job
A171 : unemployed/ unskilled - non-resident
A172 : unskilled - resident
A173 : skilled employee / official
A174 : management/ self-employed/
highly qualified employee/ officer

Attribute 18: (numerical) Number of people being liable to provide maintenance for

Attribute 19: (qualitative) Telephone
A191 : none
A192 : yes, registered under the customer's name

Attribute 20: (qualitative) foreign worker
A201 : yes
A202 : no


The data uses a binary class variable: defaulted (2) or not (1). It is worse to classify a customer as good when they are bad (2) than it is to classify a customer as bad when they are good (1).

First, import the libraries:
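A sketch of this step (the exact imports in the original may differ slightly):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, confusion_matrix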



Next, we read the file into a dataframe, set the column names and start with EDA. The first step is to check whether the data is highly unbalanced, i.e. whether the ratio of defaulters to non-defaulters is too skewed.
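A sketch of this step, assuming the raw space-separated file and generic column labels (both assumptions, not the original's exact names):

cols = ['A%d' % i for i in range(1, 21)] + ['class']
df = pd.read_csv('german.data', sep=' ', header=None, names=cols)
print(df['class'].value_counts())      # 1 = paid, 2 = defaulted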


and the result as we see is:


So as is obvious, the data is not highly unbalanced, as the two classes are in a ratio of 7:3.
Next we try to figure out the effect of some of the attributes on whether a person defaults or not. We randomly select a few features and leave the rest as an exercise to the reader. It is important to note that these plots can be drawn one at a time, or with the help of subplots.
  

Visualisation:

Some interesting findings:
Feature 1. Property: Notice that category A124 (unknown/no property) can indicate a risky customer.


Feature 2. Account status: Here category A11 (< 0 DM) is found to contain the most defaulters, while A14 (no checking account) is the safest bet, the latter being a surprise.



Feature 3. Credit history: Here things become really interesting. A30 (no credits taken/all credits paid back duly) and A31 (all credits at this bank paid back duly) are surprisingly the worst performers.


The remaining features we tried did not yield any meaningful insights.

Next we can check, using scatter plots of pairs of variables, whether we can get any clearer results. For that we do a slight jugglery with the dataframe: we separate the ones from the twos and concatenate the two together, so that the two classes are clearly segregable, the first 700 rows being the people who did not default and the next 300 those who defaulted.
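A sketch of that reshuffling, using the dataframe from above:

ones = df[df['class'] == 1]            # the 700 who did not default
twos = df[df['class'] == 2]            # the 300 who defaulted
df_sorted = pd.concat([ones, twos]).reset_index(drop=True)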



Here we use an example scatter plot with features number 2 and 13, that is, duration in months and age in years, both numerical.
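A sketch of the plot, selecting the two features by position ([1, 12], zero-indexed) so that other combinations are a one-line change:

pair = df_sorted.iloc[:, [1, 12]].values           # duration in months, age in years
plt.scatter(pair[:700, 0], pair[:700, 1], c='b', marker='o', label='paid (1)')
plt.scatter(pair[700:, 0], pair[700:, 1], c='r', marker='x', label='defaulted (2)')
plt.xlabel('Duration in months'); plt.ylabel('Age in years')
plt.legend(); plt.show()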



The result, if one notices, does not reveal much. But that should not deter the reader from trying out other combinations, even with categorical data. The same syntax can be used to carry out the analysis: in the first line, instead of [1, 12], use any other feature combination.

And now for the actual implementation of the algorithm. As we saw, our analysis of the data, though by no means exhaustive, did not reveal much.

Implementation of the algorithm

Since not much information was revealed by the analysis, we can reasonably treat all the features as equally important. But the dataset has too many categorical features. One way to handle this is to create dummy variables. Fortunately, the dataset is also presented in a form where all the values are numeric. It can be downloaded from here.

So we read the numeric data and separate it into training and test sets.
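A sketch of this step, assuming the numeric file is whitespace-separated with the class label in the last column (the split ratio is an illustrative choice):

data = pd.read_csv('german.data-numeric', sep=r'\s+', header=None)
X, y = data.iloc[:, :-1].values, data.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)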


The next step is to standardise the data.
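For example:

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)   # fit the scaler on the training data only
X_test_std = sc.transform(X_test)         # reuse the training mean and variance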

The last step is to implement the algorithm and check the misclassification rate and the confusion matrix.
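A sketch of this step (the hyperparameters are illustrative, not the original's exact values):

ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.1, random_state=1)
ada.fit(X_train_std, y_train)
y_pred = ada.predict(X_test_std)
print('Accuracy: %.3f' % accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))   # rows: actual 1, 2; columns: predicted 1, 2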


Finally the result:

As is obvious, the class imbalance is causing severe problems: while we get about 80% accuracy on the paid class, the default class achieves only about 50%, exposing the bank to a lot of bad debt.


Saturday, August 10, 2019

Adaboost Step by Step

Example


We have discussed two types of boosting methods on this blog, namely Gradient Boosting and Adaboost. Here we see a step-by-step implementation of Adaboost. We start with five points A, B, C, D and E, where C is negative and the rest are positive. The target is to find a classifier that will classify C correctly, separating it from the other, positive points. We will use ID3 as our weak classifier.




Let's start with the following classifiers:

  1. X < 2, i.e. everything less than 2 is positive, all else negative. The points misclassified here are B & E, so the error is 2/5.
  2. X < 4, i.e. everything less than 4 is positive, all else negative. The points misclassified here are B, C & E, so the error is 3/5.
  3. X < 6, i.e. everything less than 6 is positive, all else negative. The only point misclassified here is C, making the error 1/5.
  4. X > 2, i.e. everything greater than 2 is positive, all else negative. The points misclassified here are A, C & D, so the error is 3/5 (1 - 2/5 from condition 1).
  5. X > 4, i.e. everything greater than 4 is positive, all else negative. The points misclassified here are A & D, so the error is 2/5 (1 - 3/5 from condition 2).
  6. X > 6, i.e. everything greater than 6 is positive, all else negative. The points misclassified here are A, B, D & E, so the error is 4/5 (1 - 1/5 from condition 3).
Step 1: Initialize all weights to 1/N, where N is the number of training samples. Since the total number of training points is five, the resulting weight for each point is 1/5. So at the first round the table looks like:
Table 1
Step 2: Compute the error rate for each of the weak classifiers h, where E = Σwi is the sum of the weights of the misclassified points.

Table 2

Step 3: Pick the classifier that has the smallest error rate; this implies X < 6 is the best.

Step 4: So we need to calculate the voting power for the best weak classifier, i.e. X < 6 and the formula is 
                                                        ½ln((1 – є)/є) 
                                                           = ½ln((1-0.2)/0.2)
                                                           = ½ln4

So H(x) = sign(½ln4 · (X < 6)), where (X < 6) is shorthand for the output of the classifier X < 6, which is ±1.

Step 5: The question to ask here is: "Are we done?". How do we decide? There can be several criteria:

  1. H is good enough, e.g. for a large dataset maybe 90% accuracy; but here we have only 5 points, so we need to achieve 100%
  2. Enough rounds have been done
  3. No good classifier is left
Apparently we are not done.

Step 6: Update the weight of each training point, giving more weight to the misclassified points. So,
                                            new weight = 1/2 * 1/(1 – є) * old weight      if correctly classified

                                            new weight = 1/2 * 1/є * old weight            otherwise

 Table 3
A few things to note here: C, the misclassified point, now has the maximum weight; the total weight of the correctly classified points equals the total weight of the misclassified points; and the sum of all the weights is 1.

Step 7: We now recalculate the error rates with the new weights to find the next best classifier. Here is the new calculation:
Table 4
and we see that the best classifier is X < 2, with a misclassification of 1/4.

Step 8: So we need to calculate the voting power for this round's best weak classifier, i.e. X < 2, and the formula is
                                                        ½ln((1 – є)/є)
                                                           = ½ln((1 - 0.25)/0.25)
                                                           = ½ln3
Note a pattern here: if the misclassification error is 1/n, the voting power is ½·ln(n - 1).
So H(x) = sign(½ln4 · (X < 6) + ½ln3 · (X < 2)). Here the classifier X < 6 has the higher voting power, meaning that if the two conflict, it wins. This implies C is still misclassified.

Step 9: Now we need to recalculate the weights. The two misclassified points are B & E, with weight ratio 1:1, and the correctly classified ones (A, C and D) are in a ratio of 1:4:1. With this information in hand, and the fact that the sum of the weights of the misclassified points should equal the sum of the weights of the correctly classified points, we can calculate the weights for round three.

Table 5

Step 10: Now we recalculate the error rates to find the next best classifier. Here goes the calculation:
Table 6
and we see the next best classifier is X > 4. Going by our pattern, its voting power should be ½ln5.
So the overall classifier is H(x) = sign(½ln4 · (X < 6) + ½ln3 · (X < 2) + ½ln5 · (X > 4)).

Notice now that the first classifier misclassifies C, the second one B & E, and the third one A & D. This means that whenever one of them misclassifies a point, the other two can outvote it, because:

ln4 + ln5 > ln3
ln3 + ln4 > ln5
ln3 + ln5 > ln4
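This can be verified numerically with a short sketch; the point coordinates below are assumptions consistent with the tables (A and D below 2, C between 2 and 4, B and E between 4 and 6):

import numpy as np

pts = {'A': (1.0, +1), 'B': (5.0, +1), 'C': (3.0, -1), 'D': (1.5, +1), 'E': (5.5, +1)}

def lt(x, t): return 1 if x < t else -1    # the classifier "X < t"
def gt(x, t): return 1 if x > t else -1    # the classifier "X > t"

for name, (x, label) in pts.items():
    H = 0.5*np.log(4)*lt(x, 6) + 0.5*np.log(3)*lt(x, 2) + 0.5*np.log(5)*gt(x, 4)
    print(name, int(np.sign(H)) == label)  # True for all five points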

Now a question: what would have happened if we had given the worst classifier a negative voting power? We leave that to the reader as an exercise.


Reference: https://www.youtube.com/watch?v=gmok1h8wG-Q&t=1431s

Tuesday, April 30, 2019

Adaptive Boosting a.k.a. Adaboost

INTRODUCTION

A horse-racing gambler, hoping to maximize his winnings, decides to create a computer program that will accurately predict the winner of a horse race based on the usual information (number of races recently won by each horse, betting odds for each horse, etc.). To create such a program, he asks a highly successful expert gambler to explain his betting strategy. Not surprisingly, the expert is unable to articulate a grand set of rules for selecting a horse. On the other hand, when presented with the data for a specific set of races, the expert has no trouble coming up with a "rule of thumb" for that set of races (such as, "Bet on the horse that has recently won the most races" or "Bet on the horse with the most favored odds"). Although such a rule of thumb, by itself, is obviously very rough and inaccurate, it is not unreasonable to expect it to provide predictions that are at least a little bit better than random guessing. Furthermore, by repeatedly asking the expert's opinion on different collections of races, the gambler is able to extract many rules of thumb.

In order to use these rules of thumb to maximum advantage, there are two problems faced by the gambler: First, how should he choose the collections of races presented to the expert so as to extract rules of thumb from the expert that will be the most useful? Second, once he has collected many rules of thumb, how can they be combined into a single, highly accurate prediction rule? 

Boosting refers to a general and provably effective method of producing a very accurate prediction rule by combining rough and moderately inaccurate rules of thumb in a manner similar to that suggested above.


OVERVIEW


In the previous blog post, on Gradient Boosting, the idea of ensemble learning was explained in great detail. Recall from that post the passing mention of two major concepts of ensemble learning: bagging and boosting. Adaptive Boosting, or Adaboost (as it is commonly known), the algorithm of Freund and Schapire, was the first practical boosting algorithm, and remains one of the most widely used and studied, with applications in numerous fields [1]. Many studies have been conducted, and a wide variety of attempts made, to understand AdaBoost as a learning algorithm, that is, to understand why, how and when it works (or fails). The focus here is to understand the fundamental nature of the learning it applies. The second and more important focus, just as in other posts, is to explain as lucidly as possible the mathematical jargon behind the algorithm for those who come from a non-mathematics background.

Just like other boosting algorithms, AdaBoost combines the outputs of the other learning algorithms ('weak learners') into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. The basic idea can be summarized in the following steps:
  1. The aim is to combine different weak learners or classifiers hi(x) to improve classification performance, where each hi(x) is a single classifier
  2. Each weak learner is trained using a subset of the training samples; these subsets can overlap and need not be disjoint
  3. Each sample has a weight that can be adjusted accordingly
  4. AdaBoost iteratively trains weak learners and calculates a weight for each one; this weight represents the robustness of the weak learner
EXAMPLE

A pictorial walkthrough of one boosting round (Steps 1 through 7) accompanies this section; all figures courtesy [3].


How it works: 
  • Given N training samples, X = {(x1, y1), (x2, y2), ..., (xN, yN)}, where xi is the ith sample and yi є {-1, 1} is the class label for sample i
  • The parameters of the AdaBoost classifier are initialized: the total number of iterations (T), the type of weak learner (decision trees or neural networks), and the weights wij of the training samples, where wi = [wi1, wi2, ..., wiN] represents the weights at the ith iteration, wij є [0, 1], and Σj=1..N wij = 1. Usually the weights are initialized to be equal: wij = 1/N, j = 1, ..., N
  • In AdaBoost, there are some rules:
    • AdaBoost must have a pool of weak learners, e.g. decision tree or neural network classifiers,
    • The error of each weak learner must be better than that of a random classifier, i.e. the error rate must be lower than 50%


ALGORITHM

Given: (x1, y1), ..., (xm, ym) where xi є X, yi є {-1, +1}.
Initialize: D1(i) = 1/m for i = 1, ..., m.
For t = 1, ..., T:
    Train the weak learner using distribution Dt.
    Get a weak hypothesis ht : X → {-1, +1}.
    Aim: select the ht with low weighted error:
        єt = Pr(i~Dt)[ht(xi) ≠ yi]
    Choose αt = 1/2 ln((1 - єt)/єt)
    Update, for i = 1, ..., m:
        Dt+1(i) = Dt(i) exp(-αt yi ht(xi)) / Zt
    where Zt is a normalization factor (chosen so that Dt+1 will be a distribution).
Output the final hypothesis:
    H(x) = sign(Σt=1..T αt ht(x))
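A compact sketch of this pseudocode in Python, with one-dimensional threshold stumps as the weak learners (the stump search, and the toy data borrowed from the step-by-step post above with assumed coordinates, are illustrative):

import numpy as np

def fit_stump(X, y, D):
    # exhaustively pick the threshold/polarity with the lowest weighted error
    best = (None, None, 1.0)                   # (threshold, polarity, error)
    for thresh in np.unique(X) + 0.5:          # candidate cut points
        for pol in (+1, -1):                   # pol = +1 means "x < thresh is positive"
            pred = np.where(X < thresh, pol, -pol)
            eps = D[pred != y].sum()
            if eps < best[2]:
                best = (thresh, pol, eps)
    return best

def adaboost(X, y, T=3):
    m = len(X)
    D = np.full(m, 1.0 / m)                    # initialize D1(i) = 1/m
    H = []
    for _ in range(T):
        thresh, pol, eps = fit_stump(X, y, D)  # weak hypothesis with low weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)  # voting power (assumes 0 < eps < 0.5)
        pred = np.where(X < thresh, pol, -pol)
        D = D * np.exp(-alpha * y * pred)      # up-weight mistakes, down-weight hits
        D = D / D.sum()                        # this normalization is the Zt factor
        H.append((alpha, thresh, pol))
    return H

def predict(H, x):
    return np.sign(sum(a * (p if x < t else -p) for a, t, p in H))

X = np.array([1.0, 5.0, 3.0, 1.5, 5.5])        # points A, B, C, D, E
y = np.array([1, 1, -1, 1, 1])
print([int(predict(adaboost(X, y), x)) for x in X])   # expect [1, 1, -1, 1, 1]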

Now to break down the whole thing step by step. 

Step 1:
To start with the first line: it means the training set contains m samples, where the xi are inputs belonging to the set X, and the yi are outputs (labels, targets, etc.) belonging to a set comprising only two values, -1 (the negative class) and +1 (the positive class).
Step 2:
Second, D denotes the weights of the samples and i the ith training sample. In other papers, D is written as W. Thus the next statement reads: initialize all the sample weights to 1 divided by the number of training samples, implying that they sum to 1.
Step 3:
Now for the main loop. First, the notation:
  •  Pr = probability
  •  ht = hypothesis/classifier 
  •  ε = minimum misclassification error for the model
  •  α = weight for the classifier
  •  Zt = normalization factor, used to ensure that weights represent a true distribution

So the loop reads: for t = 1 to T, fit a classifier to the training data (where each prediction is either -1 or 1) and select the classifier with the lowest weighted classification error. Now for the formula to formally compute the value of ε:
where [hj(xi) ≠ yi] equals 1 if the point is misclassified and 0 if correctly classified. Thus the formula reads: "the error equals the sum of the weights of the training samples i whose label yi does not equal our prediction hj(xi)."

It requires simple math to make sense of the formula. Consider having 4 different samples with weights 0.5, 0.2, 0.1 and 0.04. Imagine the classifier h predicted the values 1, 1, -1 and -1, but the actual output values y were -1, 1, -1, 1.
predicted: 1 1 -1 -1
actual: -1 1 -1 1
weights: 0.5 0.2 0.1 0.04
1 or 0: 1 0 0 1
This leads to the following calculation for the misclassification rate:
misclassification rate / error = (0.5*1 + 0.2*0 + 0.1*0 + 0.04*1) / (0.5 + 0.2 + 0.1 + 0.04)
error = 0.64285714285
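The same computation in numpy:

import numpy as np
w = np.array([0.5, 0.2, 0.1, 0.04])
miss = np.array([1, 0, 0, 1])            # 1 where the prediction differs from the actual
print(w @ miss / w.sum())                # 0.64285714...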


The next step is to choose the weight for the classifier, α, by the formula that reads 1/2 * ln((1 - error) / error). Simple math might explain better than words could here. Assume, for instance, that we have errors 0.30, 0.70 and 0.5. Our classifier weights would be calculated as follows:
ε = 0.3:  α = 1/2 * ln((1 - 0.3) / 0.3) = 0.42365
ε = 0.7:  α = 1/2 * ln((1 - 0.7) / 0.7) = -0.42365
ε = 0.5:  α = 1/2 * ln((1 - 0.5) / 0.5) = 0
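Or, numerically:

import numpy as np
for eps in (0.3, 0.7, 0.5):
    print(eps, 0.5 * np.log((1 - eps) / eps))   # 0.42365, -0.42365, 0.0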

Notice three interesting observations:
  1. A classifier with accuracy higher than 50% results in a positive weight for the classifier (in other words, α > 0 if ε < 0.5),
  2. A classifier with exactly 50% accuracy gets weight 0, and thus does not contribute to the final prediction, and
  3. Errors 0.3 and 0.7 lead to classifier weights with opposite signs.

Now comes the very important part of the equation: updating the weights of each sample. The point of the update is to force classifiers to concentrate on observations that are difficult to classify correctly. This is done by giving misclassified cases increased weights after an iteration, which makes the learning algorithm pay more attention to them in the next iteration. Conversely, correctly classified cases receive decreased weights, and reduced attention from the classifiers in the next iteration.

Again, with simple numbers as a demonstration, the information uptake is painless. We can start by plugging an error rate of 0.3 into the formula. Remember that the focus is on classifiers with low weighted error (i.e. below 0.5). With this low error rate, it is interesting to observe what happens when cases are misclassified and when they are correctly classified.



And there you have it! In the case of an incorrect classification, the exp term becomes larger than 1, and in the case of a correct classification it drops below 1. Incorrect classifications therefore receive higher weights, prompting our classifiers to pay more attention to them in the next iteration, while correct classifications result in the converse.
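With α = 0.42365 (the error-0.3 classifier from above), the two exp factors work out as follows:

import numpy as np
alpha = 0.42365
print(np.exp(-alpha * (+1) * (+1)))   # correct case (y and h agree): about 0.655, weight shrinks
print(np.exp(-alpha * (+1) * (-1)))   # misclassified case: about 1.527, weight grows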


We continue this iteration until (a) a low training error is achieved, or (b) a preset number of weak learners has been added (this is a parameter under our control). We then take the final prediction by adding up the weighted predictions of every classifier.



REFERENCES
  1. A Short Introduction to Boosting by Yoav Freund and Robert E. Schapire
  2. Explaining AdaBoost by Robert E. Schapire
  3. AdaBoost: an Overview by Alaa Tharwat
  4. https://towardsdatascience.com/adaboost-for-dummies-breaking-down-the-math-and-its-equations-into-simple-terms-87f439757dcf