Tuesday, April 30, 2019

Adaptive Boosting a.k.a. Adaboost

INTRODUCTION

A horse-racing gambler, hoping to maximize his winnings, decides to create a computer program that will accurately predict the winner of a horse race based on the usual information (number of races recently won by each horse, betting odds for each horse, etc.). To create such a program, he asks a highly successful expert gambler to explain his betting strategy. Not surprisingly, the expert is unable to articulate a grand set of rules for selecting a horse. On the other hand, when presented with the data for a specific set of races, the expert has no trouble coming up with a "rule of thumb" for that set of races (such as, "Bet on the horse that has recently won the most races" or "Bet on the horse with the most favored odds"). Although such a rule of thumb, by itself, is obviously very rough and inaccurate, it is not unreasonable to expect it to provide predictions that are at least a little bit better than random guessing. Furthermore, by repeatedly asking the expert's opinion on different collections of races, the gambler is able to extract many rules of thumb.

In order to use these rules of thumb to maximum advantage, there are two problems faced by the gambler: First, how should he choose the collections of races presented to the expert so as to extract rules of thumb from the expert that will be the most useful? Second, once he has collected many rules of thumb, how can they be combined into a single, highly accurate prediction rule? 

Boosting refers to a general and provably effective method of producing a very accurate prediction rule by combining rough and moderately inaccurate rules of thumb in a manner similar to that suggested above.


OVERVIEW


In the previous blog post on Gradient Boosting, the idea of ensemble learning was explained in detail. Recall from that post the passing mention of two major concepts of ensemble learning: bagging and boosting. Adaptive Boosting, or AdaBoost as it is commonly known, is due to Freund and Schapire; it was the first practical boosting algorithm, and it remains one of the most widely used and studied, with applications in numerous fields [1]. Many studies, and an equally large variety of attempts, have been made to understand AdaBoost as a learning algorithm, that is, to understand why, how, and when it works (or fails). The focus here is first to understand the fundamental nature of the learning it applies, and second, and more importantly, to explain as lucidly as possible (as in the other posts) the mathematical jargon behind the algorithm for readers from a non-mathematical background.

Like other boosting algorithms, AdaBoost combines the outputs of other learning algorithms ('weak learners') into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. The basic idea can be summarized in the following steps:
  1. The aim is to combine different weak learners or classifiers hi(x) to improve classification performance, where hi(x) is a single classifier
  2. Each weak learner is trained on a simple subset of the training samples; these subsets may overlap and need not be disjoint
  3. Each sample carries a weight that can be adjusted accordingly
  4. AdaBoost iteratively trains weak learners and calculates a weight for each one; this weight represents the reliability of that weak learner
EXAMPLE

[Seven step-by-step figures walking through a worked AdaBoost example originally appeared here (Steps 1–7); all figures courtesy of [3].]


How it works: 
  • Given N training samples, X = {(x1, y1), (x2, y2), …, (xN, yN)}, where xi is the ith sample and yi ∈ {-1, +1} is the class label for sample i
  • The parameters of the AdaBoost classifier are initialized: the total number of iterations T; the type of weak learner (decision trees or neural networks); and the weights wij of each training sample, where wi = [wi1, wi2, …, wiN] represents the weights at the ith iteration, with wij ∈ [0, 1] and Σj=1..N wij = 1. Usually the weights are initialized to be equal: wij = 1/N for j = 1, …, N.
  • In AdaBoost, there are two rules:
    • AdaBoost must have a pool of weak learners, e.g., decision tree or neural network classifiers
    • The error of each weak learner must be better than that of a random classifier, i.e., the error rate must be lower than 50%


ALGORITHM

Given: (x1, y1), …, (xm, ym) where xi ∈ X, yi ∈ {-1, +1}.
Initialize: D1(i) = 1/m for i = 1, …, m.
For t = 1, …, T:
 Train a weak learner using distribution Dt.
 Get weak hypothesis ht : X → {-1, +1}.
 Aim: select the ht with low weighted error:
                                                εt = Pr i~Dt [ht(xi) ≠ yi]
 Choose αt = 1/2 ln((1 - εt) / εt)
 Update, for i = 1, …, m:
                        Dt+1(i) = Dt(i) exp(-αt yi ht(xi)) / Zt
 where Zt is a normalization factor (chosen so that Dt+1 will be a distribution).
Output the final hypothesis:
                                                      H(x) = sign(Σt=1..T αt ht(x))

Now to break down the whole thing step by step. 

Step 1:
To start with the first line: the training set contains m samples, where the xi are inputs belonging to the set X, and the yi are outputs (labels, targets, etc.) belonging to a set comprising only two values, -1 (the negative class) and +1 (the positive class).
Step 2:
Second, D denotes the weights of the samples and i indexes the ith training sample. In other papers, D may be written as W. The statement thus reads: initialize every sample weight to 1 divided by the number of training samples, which implies that the weights sum to 1.
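As a tiny illustration (with a hypothetical m = 5), the initialization can be written in two lines:

```python
import numpy as np

m = 5                     # hypothetical number of training samples
D = np.full(m, 1.0 / m)   # D1(i) = 1/m for every sample
print(D.sum())            # the weights form a distribution: they sum to 1
```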
Step 3:
 Now for the main loop. First the notations. 
  •  Pr = probability
  •  ht = hypothesis/classifier 
  •  ε = minimum misclassification error for the model
  •  α = weight for the classifier
  •  Zt = normalization factor, used to ensure that weights represent a true distribution

 So the loop reads: for t = 1 to T, fit a classifier to the training data (where each prediction is either -1 or +1) and select the classifier with the lowest weighted classification error. Formally, the error is computed as

    εt = Σi wi · [ht(xi) ≠ yi] / Σi wi

where the indicator [ht(xi) ≠ yi] equals 1 if sample i is misclassified and 0 if it is correctly classified. Thus the formula reads: "The error equals the sum of the weights of the misclassified samples, divided by the total weight."

Simple arithmetic makes sense of the formula. Consider four different samples with weights 0.5, 0.2, 0.1, and 0.04. Imagine the classifier h predicted the values 1, 1, -1, and -1, but the actual output values y were -1, 1, -1, 1.
predicted: 1 1 -1 -1
actual: -1 1 -1 1
weights: 0.5 0.2 0.1 0.04
1 or 0: 1 0 0 1
This leads to the following calculation for the misclassification rate:
misclassification rate / error = (0.5*1 + 0.2*0 + 0.1*0 + 0.04*1) / (0.5 + 0.2 + 0.1 + 0.04)
error = 0.64285714285
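The same arithmetic can be checked in a few lines of Python, reusing the hypothetical weights, predictions, and labels from the example above:

```python
import numpy as np

weights = np.array([0.5, 0.2, 0.1, 0.04])
predicted = np.array([1, 1, -1, -1])
actual = np.array([-1, 1, -1, 1])

# indicator: 1 where misclassified, 0 where correctly classified
misclassified = (predicted != actual).astype(float)

# weighted misclassification rate, normalized by the total weight
error = np.dot(weights, misclassified) / weights.sum()
print(error)  # ≈ 0.642857
```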


The next step is to choose the weight for the classifier, α, by the formula α = 1/2 * ln((1 - ε) / ε). Simple math might explain this better than words could. Assume, for instance, that we have errors 0.30, 0.70, and 0.50. Our classifier weights would be calculated as follows:
ε = 0.3
α = 1/2 * ln((1 - 0.3) / 0.3) = 0.42365
ε = 0.7
α = 1/2 * ln((1 - 0.7) / 0.7) = -0.42365
ε = 0.5
α = 1/2 * ln((1 - 0.5) / 0.5) = 0
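The same three values can be reproduced with a small helper function (a sketch; the five-decimal rounding matches the numbers above):

```python
import math

def classifier_weight(eps):
    # alpha = 1/2 * ln((1 - eps) / eps)
    return 0.5 * math.log((1 - eps) / eps)

for eps in (0.3, 0.7, 0.5):
    print(eps, round(classifier_weight(eps), 5))
# 0.3 -> 0.42365, 0.7 -> -0.42365, 0.5 -> 0.0
```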

Notice three interesting observations:
  1. A classifier with accuracy higher than 50% results in a positive classifier weight (in other words, α > 0 when ε < 0.5)
  2. A classifier with exactly 50% accuracy gets a weight of 0 and thus does not contribute to the final prediction
  3. Errors of 0.3 and 0.7 lead to classifier weights of equal magnitude but opposite sign

Now comes a very important part of the equation: updating the weights of the samples. The point of the update is to force the classifiers to concentrate on observations that are difficult to classify correctly. This is done by giving misclassified cases increased weights after each iteration; increased weights make the learning algorithm pay more attention to those observations in the next iteration. Conversely, correctly classified cases receive decreased weights, and thus less attention from the classifiers in the next iteration.

Again, with simple numbers as a demonstration, information uptake is painless. Start by plugging the error rate of 0.3 (and its classifier weight α = 0.42365 from above) into the update formula. Remember that the focus is on a low weighted error (i.e., below 0.5). With this low error rate, observe what happens when a case is misclassified versus correctly classified:

misclassified (yi ht(xi) = -1): Dt+1(i) ∝ Dt(i) · exp(+0.42365) ≈ Dt(i) · 1.5275
correctly classified (yi ht(xi) = +1): Dt+1(i) ∝ Dt(i) · exp(-0.42365) ≈ Dt(i) · 0.6547

And there you have it! In the case of incorrect classification the exp term is larger than 1, and in the case of correct classification the exp term is below 1. Misclassified cases therefore receive higher weights, prompting our classifiers to pay more attention to them in the next iteration, while correctly classified cases receive the converse treatment.
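As a quick check, the full update (including the normalization by Zt) can be sketched in a few lines, reusing the hypothetical four-sample example from earlier with α = 0.42365:

```python
import numpy as np

alpha = 0.42365                       # classifier weight for eps = 0.3
D = np.array([0.5, 0.2, 0.1, 0.04])   # current sample weights (hypothetical)
y = np.array([-1, 1, -1, 1])          # actual labels
h = np.array([1, 1, -1, -1])          # predictions

D_new = D * np.exp(-alpha * y * h)    # misclassified samples grow, correct ones shrink
D_new /= D_new.sum()                  # normalize by Z_t so the weights sum to 1
print(D_new)
```

After the update, the two misclassified samples (the first and last) hold a larger share of the total weight than before, and the correctly classified ones a smaller share.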


We continue this iteration until either (a) a low training error is achieved, or (b) the preset number of weak learners has been added (a parameter under our control). The final prediction is then obtained by adding up the weighted prediction of every classifier.
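Putting the pieces together, here is a minimal sketch of the whole algorithm in Python, using single-feature decision stumps as the weak learners. The stump search, function names, and toy data are illustrative choices for this post, not part of any particular library:

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    # A decision stump: predict -1/+1 by thresholding one feature
    return np.where(polarity * X[:, feature] < polarity * threshold, -1, 1)

def adaboost_fit(X, y, T=10):
    m = X.shape[0]
    D = np.full(m, 1.0 / m)              # D1(i) = 1/m
    ensemble = []
    for _ in range(T):
        best, best_err = None, float("inf")
        # exhaustive search for the stump with the lowest weighted error
        for f in range(X.shape[1]):
            for thr in np.unique(X[:, f]):
                for pol in (1, -1):
                    pred = stump_predict(X, f, thr, pol)
                    err = np.sum(D[pred != y])
                    if err < best_err:
                        best_err, best = err, (f, thr, pol)
        eps = max(best_err, 1e-10)        # guard against log(0)
        alpha = 0.5 * np.log((1 - eps) / eps)
        pred = stump_predict(X, *best)
        D *= np.exp(-alpha * y * pred)    # up-weight the mistakes
        D /= D.sum()                      # normalize (Z_t)
        ensemble.append((alpha, best))
    return ensemble

def adaboost_predict(ensemble, X):
    # H(x) = sign(sum_t alpha_t * h_t(x))
    score = sum(a * stump_predict(X, *params) for a, params in ensemble)
    return np.sign(score)
```

On a toy one-dimensional dataset, e.g. X = [[0], [1], [2], [3], [4], [5]] with labels [-1, -1, -1, 1, 1, 1], a single stump already separates the classes, and the ensemble reproduces the labels exactly.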



REFERENCES
  1. Yoav Freund and Robert E. Schapire, "A Short Introduction to Boosting"
  2. Robert E. Schapire, "Explaining AdaBoost"
  3. Alaa Tharwat, "AdaBoost: An Overview"
  4. https://towardsdatascience.com/adaboost-for-dummies-breaking-down-the-math-and-its-equations-into-simple-terms-87f439757dcf