Saturday, August 10, 2019

AdaBoost Step by Step

Example


We have discussed two types of boosting methods on this blog, namely Gradient Boosting and AdaBoost. Here we walk through a step-by-step run of AdaBoost. We start with five points A, B, C, D and E on a line, where C is negative and the rest are positive. The target is to find a classifier that correctly separates C from the other, positive points. We will use single-split decision stumps (one-level ID3 trees) as our weak classifiers.




Let's start with the following six candidate classifiers (a small code sketch follows the list):

  1. X < 2, i.e. everything less than 2 is positive, all else negative. The misclassified points are B and E, so the error is 2/5.
  2. X < 4, i.e. everything less than 4 is positive, all else negative. The misclassified points are B, C and E, so the error is 3/5.
  3. X < 6, i.e. everything less than 6 is positive, all else negative. The only misclassified point is C, so the error is 1/5.
  4. X > 2, i.e. everything greater than 2 is positive, all else negative. The misclassified points are A, C and D, so the error is 3/5 (i.e. 1 − 2/5, the complement of classifier 1).
  5. X > 4, i.e. everything greater than 4 is positive, all else negative. The misclassified points are A and D, so the error is 2/5 (1 − 3/5, the complement of classifier 2).
  6. X > 6, i.e. everything greater than 6 is positive, all else negative. The misclassified points are A, B, D and E, so the error is 4/5 (1 − 1/5, the complement of classifier 3).
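To make the arithmetic easy to verify, here is a minimal Python sketch. The coordinates of A through E are never stated in the example, so the positions below are assumptions, chosen only so that each stump misclassifies exactly the points listed above.

    from fractions import Fraction

    # Assumed coordinates (not given in the original example):
    # A, D < 2;  2 < C < 4;  4 < B, E < 6.
    points = {'A': 1.0, 'B': 5.0, 'C': 3.0, 'D': 1.5, 'E': 5.5}
    labels = {'A': +1, 'B': +1, 'C': -1, 'D': +1, 'E': +1}

    # The six decision stumps; each returns +1 or -1.
    stumps = {
        'X < 2': lambda x: +1 if x < 2 else -1,
        'X < 4': lambda x: +1 if x < 4 else -1,
        'X < 6': lambda x: +1 if x < 6 else -1,
        'X > 2': lambda x: +1 if x > 2 else -1,
        'X > 4': lambda x: +1 if x > 4 else -1,
        'X > 6': lambda x: +1 if x > 6 else -1,
    }

    def weighted_error(stump, weights):
        """Sum of the weights of the points the stump misclassifies."""
        return sum(w for p, w in weights.items()
                   if stump(points[p]) != labels[p])

    # With uniform weights 1/5 this prints the errors listed above:
    # 2/5, 3/5, 1/5, 3/5, 2/5, 4/5.
    uniform = {p: Fraction(1, 5) for p in points}
    for name, h in stumps.items():
        print(name, weighted_error(h, uniform))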
Step 1: Initialize all weights to 1/N, where N is the number of training samples. Since there are five training points, each starts with weight 1/5. So the first-round table looks like:
Table 1
Point:   A    B    C    D    E
Weight: 1/5  1/5  1/5  1/5  1/5
Step 2: Compute the error rate of each weak classifier h as E = Σwi, the sum of the weights of all the points it misclassifies.

Table 2
Classifier:  X < 2  X < 4  X < 6  X > 2  X > 4  X > 6
Error:        2/5    3/5    1/5    3/5    2/5    4/5

Step 3: Pick the classifier with the smallest error rate; here X < 6, with error 1/5, is the best.

Step 4: Now we calculate the voting power (α) of the best weak classifier, X < 6. The formula is
                                                        α = ½ln((1 − ε)/ε)
                                                           = ½ln((1 − 0.2)/0.2)
                                                           = ½ln4

So H(x) = sign(½ln4 · (X < 6)), where (X < 6) is shorthand for the classifier X < 6, which yields ±1.
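As a quick numeric check of the voting power (a sketch using only the standard library):

    import math

    eps = 1/5                                  # error of the best stump, X < 6
    alpha = 0.5 * math.log((1 - eps) / eps)    # voting power
    print(alpha)                               # 0.6931..., i.e. (1/2)ln4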

Step 5: The question to ask here is: "Are we done?" How do we decide? There can be several criteria:

  1. H is good enough, e.g. for a large dataset 90% accuracy may suffice, but here we have only 5 points, so we need 100%.
  2. Enough rounds have been done.
  3. No good weak classifier is left.
Apparently we are not done.

Step 6: Update the weight of each training point, giving more weight to the misclassified points:
                                            new weight = ½ · 1/(1 − ε) · old weight      if correctly classified

                                                        = ½ · 1/ε · old weight           otherwise

Table 3
Point:   A    B    C    D    E
Weight: 1/8  1/8  1/2  1/8  1/8
A few things to note here: C, the misclassified point, now has the maximum weight; the total weight of the correctly classified points equals the total weight of the misclassified points (½ each); and the sum of all weights is still 1.
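Continuing the earlier sketch (reusing points, labels, stumps and uniform from it), the update rule above reproduces Table 3:

    from fractions import Fraction

    def update_weights(weights, stump, eps):
        """Apply the update rule above: half the total weight ends up on
        the correctly classified points and half on the misclassified ones."""
        new = {}
        for p, w in weights.items():
            if stump(points[p]) == labels[p]:
                new[p] = Fraction(1, 2) * w / (1 - eps)   # correctly classified
            else:
                new[p] = Fraction(1, 2) * w / eps         # misclassified
        return new

    weights = update_weights(uniform, stumps['X < 6'], Fraction(1, 5))
    print(weights)   # A, B, D, E -> 1/8 each; C -> 1/2; total is 1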

Step 7: We now recompute the error rate of each classifier with the new weights to find the next best one. Here is the new calculation:
Table 4
Classifier:  X < 2  X < 4  X < 6  X > 2  X > 4  X > 6
Error:        1/4    3/4    1/2    3/4    1/4    1/2
and we see that the best classifier is X < 2, with an error of 1/4. (X > 4 also has error 1/4; we break the tie in favour of X < 2.)
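The same helpers from the sketch confirm this:

    # Recompute the weighted errors with the round-two weights.
    for name, h in stumps.items():
        print(name, weighted_error(h, weights))
    # X < 2 -> 1/4, X < 4 -> 3/4, X < 6 -> 1/2,
    # X > 2 -> 3/4, X > 4 -> 1/4, X > 6 -> 1/2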

Step 8: Again we calculate the voting power for the best weak classifier, which is now X < 2:
                                                        α = ½ln((1 − ε)/ε)
                                                           = ½ln((1 − 0.25)/0.25)
                                                           = ½ln3
Note a pattern here: if the error is 1/n, the voting power is ½ln(n − 1), because (1 − 1/n)/(1/n) = n − 1.
So H(x) = sign(½ln4 · (X < 6) + ½ln3 · (X < 2)). Classifier X < 6 has the higher voting power, meaning that when the two conflict, it wins. This implies C is still misclassified.

Step 9: Now we need to recalculate the weights. The two points misclassified by X < 2 are B and E, with weight ratio 1:1; the correctly classified points A, C and D are in the ratio 1:4:1. With this information, and the fact that the total weight of the misclassified points must equal the total weight of the correctly classified points (½ each), we can calculate the weights for round three.

Table 5
Point:    A     B    C     D     E
Weight: 1/12  1/4  1/3  1/12  1/4
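The update_weights helper from the earlier sketch reproduces Table 5:

    weights = update_weights(weights, stumps['X < 2'], Fraction(1, 4))
    print(weights)   # A -> 1/12, B -> 1/4, C -> 1/3, D -> 1/12, E -> 1/4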

Step 10: Now we recalculate the error rates to find the next best classifier. Here goes the calculation:
Table 6
Classifier:  X < 2  X < 4  X < 6  X > 2  X > 4  X > 6
Error:        1/2    5/6    1/3    1/2    1/6    2/3
and we see the next best classifier is X > 4, with error 1/6. Hence, by our pattern, its voting power is ½ln5.
So the overall classifier is H(x) = sign(½ln4 · (X < 6) + ½ln3 · (X < 2) + ½ln5 · (X > 4)).
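Putting the three rounds together (still reusing the earlier sketch), the final ensemble classifies all five points correctly:

    import math

    # The three chosen stumps with their voting powers.
    ensemble = [
        (0.5 * math.log(4), stumps['X < 6']),
        (0.5 * math.log(3), stumps['X < 2']),
        (0.5 * math.log(5), stumps['X > 4']),
    ]

    def H(x):
        """Sign of the weighted vote of the three stumps."""
        return 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1

    for p, x in points.items():
        print(p, H(x) == labels[p])   # True for every point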

Notice now that the first classifier misclassifies C, the second one B and E, and the third one A and D. This means that whenever one of them misclassifies a point, the other two can overpower it by voting power, because:

ln4 + ln5 > ln3
ln3 + ln4 > ln5
ln3 + ln5 > ln4
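A quick numeric check of these inequalities:

    import math

    print(math.log(4) + math.log(5) > math.log(3))   # True
    print(math.log(3) + math.log(4) > math.log(5))   # True
    print(math.log(3) + math.log(5) > math.log(4))   # True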

Now the question: what would have happened if we had given the worst classifier a negative voting power? We leave that to the reader as an exercise.


Reference: https://www.youtube.com/watch?v=gmok1h8wG-Q&t=1431s