INTRODUCTION
A horse-racing gambler, hoping to maximize his winnings, decides to create a computer program that will accurately predict the winner of a horse race based on the usual information (number of races recently won by each horse, betting odds for each horse, etc.). To create such a program, he asks a highly successful expert gambler to explain his betting strategy. Not surprisingly, the expert is unable to articulate a grand set of rules for selecting a horse. On the other hand, when presented with the data for a specific set of races, the expert has no trouble coming up with a "rule of thumb" for that set of races (such as, "Bet on the horse that has recently won the most races" or "Bet on the horse with the most favored odds"). Although such a rule of thumb, by itself, is obviously very rough and inaccurate, it is not unreasonable to expect it to provide predictions that are at least a little bit better than random guessing. Furthermore, by repeatedly asking the expert's opinion on different collections of races, the gambler is able to extract many rules of thumb.
In order to use these rules of thumb to maximum advantage, there are two problems faced by the gambler: First, how should he choose the collections of races presented to the expert so as to extract rules of thumb from the expert that will be the most useful? Second, once he has collected many rules of thumb, how can they be combined into a single, highly accurate prediction rule?
Boosting refers to a general and provably effective method of producing a very accurate prediction rule by combining rough and moderately inaccurate rules of thumb in a manner similar to that suggested above.
OVERVIEW
In the previous blog post on Gradient Boosting, the idea of ensemble learning was explained in great detail. Recall from that post the passing mention of two major concepts of ensemble learning: bagging and boosting. Adaptive Boosting, or AdaBoost as it is commonly known, the algorithm of Freund and Schapire, was the first practical boosting algorithm, and it remains one of the most widely used and studied, with applications in numerous fields [1]. Many studies have been conducted, and an equally large variety of attempts made, to understand AdaBoost as a learning algorithm, that is, to understand why, how, and when it works (or fails). The focus here is to understand the fundamental nature of the learning it applies. The second and more important focus, just like in other posts, is to explain as lucidly as possible the mathematical jargon behind the algorithm to those who come from a non-mathematics background.
Like other boosting algorithms, AdaBoost combines the outputs of other learning algorithms ('weak learners') into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. The basic idea can be summarized in the following steps:
- The aim is to combine different weak learners or classifiers hi(x) to improve classification performance, where hi(x) is a single classifier
- Each weak learner is trained on a simple set of training samples; these sets can overlap and need not be disjoint
- Each sample has a weight that can be adjusted accordingly
- AdaBoost iteratively trains weak learners and calculates a weight for each one; this weight represents the robustness of the weak learner
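The weighted-sum idea in the steps above can be sketched in a few lines. The weak learners and their weights below are made up purely for illustration; in real AdaBoost both are learned from data, as shown later in the post.

```python
# Hypothetical example: three weak learners as simple threshold rules
# ("stumps") on a single feature, combined by a weighted vote.
weak_learners = [
    (lambda x: 1 if x > 2 else -1, 0.9),   # most reliable rule, highest weight
    (lambda x: 1 if x > 5 else -1, 0.4),
    (lambda x: 1 if x > 1 else -1, 0.2),
]

def boosted_predict(x):
    # Final output = sign of the weighted sum of weak-learner outputs.
    total = sum(alpha * h(x) for h, alpha in weak_learners)
    return 1 if total >= 0 else -1

print(boosted_predict(3))   # rules disagree here; the weights decide -> 1
```

Note that for x = 3 the individual learners vote +1, -1, +1; the weighted sum 0.9 - 0.4 + 0.2 = 0.7 is positive, so the ensemble predicts +1.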
EXAMPLE
All figures courtesy [3]
[Figures illustrating Steps 1 through 7 omitted.]
How it works:
- Given N training samples, X = {(x1, y1), (x2, y2), …, (xN, yN)}, where xi is the ith sample and yi є {-1, +1} is the class label for sample i
- The parameters of the AdaBoost classifier are initialized: the total number of iterations T, the type of weak learner (decision trees or neural networks), and the weights wij of the training samples, where wi = [wi1, wi2, …, wiN] represents the weights at the ith iteration, wij є [0, 1], and Σj wij = 1. Usually the weights are initialized to be equal: wij = 1/N for j = 1, …, N
- In AdaBoost, there are some rules:
- AdaBoost must use some kind of weak learner, e.g., decision tree or neural network classifiers
- The error of each weak learner must be better than that of a random classifier, i.e., the error rate must be lower than 50%
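As a tiny illustration of these two setup rules (the numbers and helper name here are mine, not from the source):

```python
N = 8                                    # number of training samples (arbitrary)
weights = [1.0 / N] * N                  # equal initial weights, wij = 1/N
assert abs(sum(weights) - 1.0) < 1e-12   # the weights form a distribution

def beats_random(error_rate):
    # A weak learner is only admissible if it beats random guessing,
    # i.e., its weighted error rate is below 50%.
    return error_rate < 0.5

print(beats_random(0.35))  # True
print(beats_random(0.55))  # False
```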
ALGORITHM
Given: (x1, y1), …, (xm, ym) where xi є X, yi є {-1, +1}.
Initialize: D1(i) = 1/m for i = 1, …, m.
For t = 1, …, T:
  Train a weak learner using distribution Dt.
  Get a weak hypothesis ht : X ─› {-1, +1}.
  Aim: select ht with low weighted error: εt = Pr i~Dt [ht(xi) ≠ yi].
  Choose αt = 1/2 ln((1 - εt) / εt).
  Update, for i = 1, …, m: Dt+1(i) = Dt(i) exp(-αt yi ht(xi)) / Zt,
  where Zt is a normalization factor (chosen so that Dt+1 will be a distribution).
Output the final hypothesis: H(x) = sign(ΣTt=1 αt ht(x))
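The pseudocode can be turned into a small, self-contained sketch. What follows is my own illustrative implementation, assuming 1-D inputs and exhaustive-search decision stumps as the weak learners; it is not the canonical version from the papers cited below, just the same update rules in runnable form.

```python
import math

def stump_predict(x, threshold, polarity):
    # A decision stump: predict `polarity` when x > threshold,
    # otherwise the opposite class.
    return polarity if x > threshold else -polarity

def train_stump(X, y, D):
    # Exhaustive search for the (threshold, polarity) pair with the
    # lowest weighted error under the current distribution D.
    best = None
    for threshold in X:
        for polarity in (1, -1):
            err = sum(D[i] for i in range(len(X))
                      if stump_predict(X[i], threshold, polarity) != y[i])
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(X, y, T):
    m = len(X)
    D = [1.0 / m] * m                    # Initialize: D1(i) = 1/m
    ensemble = []
    for _ in range(T):
        eps, threshold, polarity = train_stump(X, y, D)   # weighted error εt
        eps = max(eps, 1e-10)            # guard against division by zero / log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)           # αt
        ensemble.append((alpha, threshold, polarity))
        # Update: Dt+1(i) = Dt(i) exp(-αt yi ht(xi)) / Zt
        D = [D[i] * math.exp(-alpha * y[i] *
                             stump_predict(X[i], threshold, polarity))
             for i in range(m)]
        Z = sum(D)                       # normalization factor Zt
        D = [d / Z for d in D]
    return ensemble

def predict(ensemble, x):
    # H(x) = sign(Σt αt ht(x))
    s = sum(a * stump_predict(x, th, p) for a, th, p in ensemble)
    return 1 if s >= 0 else -1

# Toy 1-D data that no single stump classifies perfectly.
X = [1, 2, 3, 4, 5, 6, 7, 8]
y = [-1, -1, 1, 1, -1, 1, 1, 1]
model = adaboost(X, y, T=5)
print([predict(model, x) for x in X])   # -> [-1, -1, 1, 1, -1, 1, 1, 1]
```

After a handful of rounds the weighted vote classifies every sample correctly, even though the best single stump must get at least one sample wrong.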
Now to break down the whole thing step by step.
Step 1:
To start with the first line: the training set contains m samples, where the xi are inputs belonging to the set X, and the yi are outputs (labels, targets, etc.) belonging to a set comprising only two values, -1 (the negative class) and +1 (the positive class).
Step 2:
Next, D = the weights of the samples and i = the ith training sample. In other papers, D may be written as W. The statement thus reads: initialize all sample weights to 1 divided by the number of training samples, implying that they sum to 1.
Step 3:
Now for the main loop. First, the notation:
- Pr = probability
- ht = the hypothesis/classifier at iteration t
- εt = the weighted misclassification error of ht
- αt = the weight assigned to classifier ht
- Zt = a normalization factor, used to ensure that the weights represent a true distribution
The loop thus reads: for t = 1 to T, fit a classifier to the training data (where each prediction is either -1 or +1) and select the classifier with the lowest weighted classification error. Now for the formula to formally compute the value of εt.
εt = Σi Dt(i) · 1[ht(xi) ≠ yi], where the indicator 1[ht(xi) ≠ yi] equals 1 if sample i is misclassified and 0 if it is correctly classified. Thus the formula reads: "the error equals the sum of the weights of the misclassified training samples, i.e., those samples i for which the prediction ht(xi) is not equal to yi."
Simple math will make sense of the formula. Consider four samples with weights 0.5, 0.2, 0.1, and 0.04. Imagine the classifier h predicted the values 1, 1, -1, and -1, but the actual output values y were -1, 1, -1, 1.
predicted: 1 1 -1 -1
actual: -1 1 -1 1
weights: 0.5 0.2 0.1 0.04
1 or 0: 1 0 0 1
This leads to the following calculation for the misclassification rate (here the weights do not sum to 1, so we normalize by their total):
misclassification rate / error = (0.5*1 + 0.2*0 + 0.1*0 + 0.04*1) / (0.5 + 0.2 + 0.1 + 0.04)
error ≈ 0.642857
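The same arithmetic can be checked with a short snippet (variable names are mine):

```python
# Recomputing the worked example above: four samples, classifier h's
# predictions vs. the actual labels, with the given sample weights.
weights   = [0.5, 0.2, 0.1, 0.04]
predicted = [1, 1, -1, -1]
actual    = [-1, 1, -1, 1]

# Indicator: 1 where misclassified, 0 where correct.
miss = [1 if p != a else 0 for p, a in zip(predicted, actual)]

# Weighted error = weighted sum of misclassifications, normalized by
# the total weight (the weights here do not sum to 1).
error = sum(w * m for w, m in zip(weights, miss)) / sum(weights)
print(error)  # ≈ 0.642857
```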
The next step is to choose the weight for the classifier, α, by the formula α = 1/2 * ln((1 - error) / error). Simple math might explain this better than words. Assume, for instance, that we have errors 0.30, 0.70, and 0.50. Our classifier weights would be calculated as follows:
ε = 0.3
α = 1/2 * ln((1 - 0.3) / 0.3) = 0.42365
ε = 0.7
α = 1/2 * ln((1 - 0.7) / 0.7) = -0.42365
ε = 0.5
α = 1/2 * ln((1 - 0.5) / 0.5) = 0
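These three values can be reproduced directly (the function name is mine):

```python
import math

def classifier_weight(error):
    # α = 1/2 * ln((1 - error) / error)
    return 0.5 * math.log((1 - error) / error)

for eps in (0.3, 0.7, 0.5):
    print(eps, round(classifier_weight(eps), 5))
# prints 0.42365, -0.42365, and 0.0 for the three error rates above
```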
Notice three interesting observations:
- A classifier with accuracy higher than 50% results in a positive classifier weight (in other words, α > 0 if ε < 0.5)
- A classifier with exactly 50% accuracy gets weight 0 and thus does not contribute to the final prediction, and
- Errors of 0.3 and 0.7 lead to classifier weights with opposite signs.
Now comes the very important part of the equation: updating the weights for each sample. The point of the update is to force the classifiers to concentrate on observations that are difficult to classify correctly. This is done by giving misclassified cases increased weights after an iteration; increased weights make the learning algorithm pay more attention to these observations in the next round. Conversely, correctly classified cases receive decreased weights, and reduced attention from the classifiers in the next iteration.
Again, simple numbers make the idea painless. Start by plugging an error rate of 0.3 into the formula; remember that the focus is on a low weighted error (i.e., below 0.5). With this low error rate, it is interesting to observe what happens when cases are misclassified and when cases are correctly classified.
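A quick sketch of that demonstration (my own reconstruction, using ε = 0.3 as suggested): the update multiplies each sample weight by exp(-α · yi · ht(xi)), and yi · ht(xi) is +1 when the prediction is correct and -1 when it is wrong.

```python
import math

alpha = 0.5 * math.log((1 - 0.3) / 0.3)   # classifier weight for ε = 0.3

factor_misclassified = math.exp(alpha)    # yi * ht(xi) = -1
factor_correct       = math.exp(-alpha)   # yi * ht(xi) = +1
print(factor_misclassified)  # ≈ 1.528 > 1: weight grows
print(factor_correct)        # ≈ 0.655 < 1: weight shrinks
```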
And there you have it! In the case of incorrect classification, the exp term becomes larger than 1, and in the case of correct classification, the exp term falls below 1. Therefore, incorrectly classified samples receive higher weights, prompting our classifiers to pay more attention to them in the next iteration, while correctly classified samples see the converse.
We continue this iteration until a) a low training error is achieved, or b) a preset number of weak learners has been added (a parameter under our control). We then take the final prediction by adding up the weighted prediction of every classifier.
REFERENCES
[1] A Short Introduction to Boosting by Yoav Freund and Robert E. Schapire
[2] Explaining AdaBoost by Robert E. Schapire
[3] AdaBoost: an Overview by Alaa Tharwat
[4] https://towardsdatascience.com/adaboost-for-dummies-breaking-down-the-math-and-its-equations-into-simple-terms-87f439757dcf