Tuesday, November 20, 2018

Gradient Boosting

“An efficient algorithm for converting relatively poor hypotheses into very good hypotheses” – Michael Kearns
INTRODUCTION

            Let’s go back to the decision tree example of deciding whether to play golf or not. Before applying a decision tree algorithm to the data, we ask a series of questions, and there are several factors we must take into consideration when classifying each answer as yes or no. This is where Ensemble Methods come in handy! Rather than relying on a single decision tree and hoping we made the right decision at each split, ensemble methods allow us to take a sample of decision trees into account, calculate which features to use or questions to ask at each split, and make a final prediction based on the aggregated results of the sampled trees. When we try to predict the target variable using any machine learning technique, the main causes of the difference between actual and predicted values are noise, variance, and bias. Ensembling helps to reduce these factors (except noise, which is an irreducible error). An ensemble is simply a collection of predictors that are combined (e.g., by taking the mean of all predictions) to give a final prediction.

            Ensembling techniques are further classified into Bagging and Boosting. Another way to think about ensemble learning is the fable of the blind men and the elephant. Each blind man had his own description of the elephant; even though each description was partly true, it would have been better for them to come together and discuss their understanding before reaching a final conclusion. This story captures the spirit of ensemble learning.

Fig. 1 picture courtesy
https://cdn-images1.medium.com/max/1200/1*10t9S7xvWE5Z3NEZrmHG2w.jpeg

A weak hypothesis or weak learner is defined as one whose performance is at least slightly better than random chance. The idea of boosting came out of the question of whether a weak learner can be modified to become a strong learner. A strong learner is a classifier that is arbitrarily well-correlated with the true classification. Using techniques like bagging and boosting helps to decrease the variance and increase the robustness of the model. Combining multiple classifiers decreases variance, especially in the case of unstable classifiers, and may produce a more reliable classification than a single classifier. The accuracy of a predictive model can be boosted in two ways:
a. Either by embracing feature engineering or
b. By applying boosting algorithms straight away.
There are many boosting algorithms like
·         Gradient Boosting
·         XGBoost
·         AdaBoost
·         Gentle Boost etc.
Every boosting algorithm has its own underlying mathematics, and slight variations are observed when applying them. Boosted algorithms are used where we have plenty of data to make a prediction and we seek exceptionally high predictive power; they are used for reducing bias and variance in supervised learning. In this blog we will cover only boosting algorithms; bagging will be covered in the next blog.

HISTORY


Fig. 2 Picture courtesy
http://www-stat.stanford.edu/hastie/TALKS/boost

            Kearns [1] and Kearns and Valiant [2] asked whether a learning algorithm that outputs a hypothesis whose performance is only slightly better than random guessing (i.e., a weak learner) implies the existence of an efficient algorithm that outputs a hypothesis of arbitrary accuracy (i.e., a strong learner). Schapire [3] answered this in the affirmative, and algorithms that achieve such hypothesis boosting quickly became known simply as "boosting". Freund and Schapire's arcing [5], as a general technique, is more or less synonymous with boosting [6]. Explicit gradient boosting algorithms were later developed by Friedman (2001), simultaneously with the more general functional gradient boosting perspective of Llew Mason, Jonathan Baxter, Peter Bartlett and Marcus Frean [7, 8]; the latter two papers introduced the view of boosting algorithms as iterative functional gradient descent algorithms.

Mathematical formulation


GRADIENT BOOSTING: FUNCTION ESTIMATION

Consider the problem of function estimation in the classical supervised learning setting. The fact that the learning is supervised places a stronger restriction on the researcher, as the data has to be provided with a sufficient set of proper targets (which can be very costly to obtain, e.g., if they come from an expensive experiment). We are given the dataset $\{(x_i, y_i)\}_{i=1}^{N}$, where $x = (x_1, \dots, x_d)$ refers to the explanatory input variables and $y$ to the corresponding response variable. The goal is to reconstruct the unknown functional dependence $x \xrightarrow{f} y$ with an estimate $\hat{f}(x)$, such that some specified loss function $\Psi(y, f)$ is minimized [9].
Note that at this stage we make no assumptions about the form of either the true functional dependence $f(x)$ or the function estimate $\hat{f}(x)$. If we rewrite the estimation problem in terms of expectations, the equivalent formulation is to minimize the expected loss function over the response variable, $\mathbb{E}_y[\Psi(y, f(x))]$, conditioned on the observed explanatory data $x$:

$$\hat{f}(x) = \underset{f(x)}{\arg\min}\; \mathbb{E}_x\Big[\,\mathbb{E}_y\big[\Psi(y, f(x))\big]\,\Big|\,x\Big]$$
The response variable $y$ can come from different distributions, which naturally leads to the specification of different loss functions: one can specify the loss function $\Psi(y, f)$ and the base-learner model $h(x, \theta)$ on demand. In practice, given a specific loss function $\Psi(y, f)$ and/or a custom base-learner $h(x, \theta)$, the solution to the parameter estimation problem can be difficult to obtain. To deal with this, it was proposed to choose the new function $h(x, \theta_t)$ to be the most parallel to the negative gradient $\{-g_t(x_i)\}_{i=1}^{N}$ along the observed data, where

$$g_t(x) = \mathbb{E}_y\!\left[\frac{\partial \Psi(y, f(x))}{\partial f(x)} \,\middle|\, x\right]_{f(x) = \hat{f}_{t-1}(x)}$$
Instead of looking for the general solution for the boosting increment in function space, one can simply choose the new function increment to be the one most correlated with $-g_t(x)$. This permits the replacement of a potentially very hard optimization task with a classic least-squares minimization.

To summarize, we can formulate the complete form of the gradient boosting algorithm as originally proposed by Friedman (2001). The exact form of the derived algorithm, with all the corresponding formulas, depends heavily on the design choices of $\Psi(y, f)$ and $h(x, \theta)$. Common examples of these algorithms can be found in Friedman (2001).
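Written out from the definitions above (notation as in [9]; this is only a compact sketch, and the exact initialization and step-size rules depend on the chosen $\Psi(y, f)$ and $h(x, \theta)$), the algorithm proceeds as follows:
1.       Initialize the estimate with a constant: $\hat{f}_0(x) = \underset{\rho}{\arg\min} \sum_{i=1}^{N} \Psi(y_i, \rho)$.
2.      For $t = 1, \dots, M$:
a.      Compute the negative gradient $-g_t(x_i)$ for $i = 1, \dots, N$, as defined above.
b.      Fit the base-learner $h(x, \theta_t)$ to $\{(x_i, -g_t(x_i))\}_{i=1}^{N}$ by least squares.
c.       Find the step size $\rho_t = \underset{\rho}{\arg\min} \sum_{i=1}^{N} \Psi\big(y_i, \hat{f}_{t-1}(x_i) + \rho\, h(x_i, \theta_t)\big)$.
d.      Update the estimate: $\hat{f}_t(x) = \hat{f}_{t-1}(x) + \rho_t\, h(x, \theta_t)$.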

LOSS-FUNCTION FAMILIES

Given a particular learning task, one can consider different loss functions $\Psi(y, f)$ to exploit. This choice is often driven by the demand for specific characteristics of the conditional distribution; the most frequent example of such a characteristic is robustness to outliers, but other requirements can also be considered. To use an arbitrary loss function, one has to specify both the loss function and the function that calculates the corresponding negative gradient. Given these two functions, they can be directly substituted into the GBM algorithm. In practice, many loss functions have already been derived for the GBM algorithm (Friedman, 2001; Schmid and Hothorn, 2008; Schmid et al., 2011). Loss functions can be classified according to the type of response variable $y$. Specific boosting algorithms have been derived for various families of the response, among which are the regression, classification and time-to-event analysis tasks. Depending on the family of the response variable $y$, the most frequently used loss functions can be systemized as follows:
1.       Continuous response, $y \in \mathbb{R}$:
a.      Gaussian L2 loss function
b.      Laplace L1 loss function
c.       Huber loss function, δ specified
d.      Quantile loss function, α specified
2.      Categorical response, $y \in \{0, 1\}$:
a.      Binomial loss function
b.      AdaBoost loss function
3.      Other families of response variable:
a.      Loss functions for survival models
b.      Loss functions for counts data
All of these loss functions are described in Natekin and Knoll [9]. From the computational perspective, this type of modelling only increases the number of different GBM models built, by the number of desired statistics of the conditional distribution. However, it must be kept in mind that the resulting confidence intervals are a model approximation rather than true statistics. It is also important to note that the learned quantile models do not have to be consistent with each other, as they are all learned separately. It is true that at some point in the learning process we can overestimate the model complexity and thus overfit the data with both types of loss functions. However, due to the nearly linear impact of outliers on the Bernoulli loss, the Bernoulli model is typically less sensitive to this type of erroneous labelling in the data.
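As an illustration (not part of the original tutorial), scikit-learn's GradientBoostingRegressor exposes several of these loss families directly; the loss-option names below follow recent scikit-learn releases and may differ in older versions (which use 'ls' and 'lad' instead):

```python
# A minimal sketch of choosing different loss families with scikit-learn's
# GradientBoostingRegressor on synthetic data.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Gaussian L2 loss (the default)
gbm_l2 = GradientBoostingRegressor(loss='squared_error', random_state=0).fit(X, y)

# Huber loss; alpha sets the quantile at which the loss switches from L2 to L1
gbm_huber = GradientBoostingRegressor(loss='huber', alpha=0.9, random_state=0).fit(X, y)

# Quantile loss: two separately learned quantile models give an approximate
# prediction interval, which (as noted above) need not be mutually consistent
gbm_q10 = GradientBoostingRegressor(loss='quantile', alpha=0.10, random_state=0).fit(X, y)
gbm_q90 = GradientBoostingRegressor(loss='quantile', alpha=0.90, random_state=0).fit(X, y)

lower, upper = gbm_q10.predict(X[:5]), gbm_q90.predict(X[:5])
print(list(zip(lower.round(1), upper.round(1))))
```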

SPECIFYING THE BASE-LEARNERS

A particular GBM can be designed with different base-learner models on board. A diverse set of base-learners has been introduced in the literature thus far. In this subsection, we briefly describe the base-learner models that are most frequently used in practice. The commonly used base-learner models can be classified into three distinct categories: linear models, smooth models and decision trees. There are also a number of other models, such as Markov random fields (Dietterich et al., 2004) or wavelets (Viola and Jones, 2001), but their application arises in relatively specific practical tasks. The base-learner models, with corresponding examples of functions, can be organized as follows:
1.       Linear models:
a.      Ordinary linear regression
b.      Ridge penalized linear regression
c.       Random effects
2.      Smooth models:
a.      P-splines
b.      Radial basis functions
3.      Decision trees:
a.      Decision tree stumps
b.      Decision trees with arbitrary interaction depth
4.      Other models:
a.      Markov Random Fields
b.      Wavelets
c.       Custom base-learner functions
An important design opportunity is that nothing prevents the practitioner from specifying a complex model that utilizes several classes of base-learner models in one GBM. This means that the same functional formula can include both smooth additive components and decision tree components at the same time. Another example would be to split the explanatory variables into categorical and smooth subspaces and fit different boosted base-learner models to each of the subspaces simultaneously. A further important feature of the base-learner specification is that base-learners can be designed for different models of variable interactions. If we consider the ANOVA decomposition of the function estimate $\hat{f}$, different interaction terms correspond to different interrelationships between the explanatory variables:

$$\hat{f}(x) = \sum_{j} \hat{f}_j(x_j) + \sum_{j,k} \hat{f}_{jk}(x_j, x_k) + \sum_{j,k,l} \hat{f}_{jkl}(x_j, x_k, x_l) + \dots$$
GBM base-learners are explained in more detail in [9, 10].
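As a small, hedged illustration of the decision-tree base-learner category: in scikit-learn's gradient boosting the interaction depth is controlled through the tree depth (stumps correspond to max_depth=1, giving an additive model). Smooth base-learners such as P-splines or random effects are not available there but can be found in packages such as R's mboost.

```python
# Sketch: tree stumps (no variable interactions) versus deeper trees
# (higher-order interactions) as base-learners, on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

stumps = GradientBoostingClassifier(max_depth=1, n_estimators=300, random_state=0)
deeper = GradientBoostingClassifier(max_depth=4, n_estimators=300, random_state=0)

print("stumps (additive model):", cross_val_score(stumps, X, y, cv=5).mean())
print("depth-4 trees          :", cross_val_score(deeper, X, y, cv=5).mean())
```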

Steps of the Gradient Boosting algorithm
Boosting is an ensemble technique in which the predictors are not built independently, but sequentially.
1.       Start with a simple initial prediction, e.g., the mean of the target variable.
2.      Calculate the error (residual) of each observation from the latest prediction.
3.      Find the variable and split value that best fit these residuals; the fitted weak learner becomes the latest increment to the prediction.
4.      Recalculate the error of each observation from the updated prediction (the mean of the residuals on each side of the split).
5.      Repeat steps 3 and 4 until the objective function is sufficiently minimized (or maximized, depending on how it is defined).
6.      Combine the initial prediction with a weighted sum of all the fitted weak learners to obtain the final model (see the sketch below).
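A minimal from-scratch sketch of these steps for squared-error loss (the tree depth, number of rounds and learning rate are illustrative choices, not prescriptions):

```python
# Start from the mean, repeatedly fit a small regression tree to the current
# residuals, and add it (scaled by a learning rate) to the running prediction.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)

n_rounds, learning_rate = 100, 0.1
prediction = np.full(len(y), y.mean())         # step 1: initial prediction = mean
trees = []
for _ in range(n_rounds):
    residuals = y - prediction                 # step 2: errors from the latest prediction
    tree = DecisionTreeRegressor(max_depth=2)  # step 3: find the splits that best fit the errors
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # step 4: update the prediction
    trees.append(tree)                         # step 5: repeat until satisfied

# step 6: final model = initial mean + (learning-rate-weighted) sum of all trees
def predict(X_new, base=y.mean()):
    return base + learning_rate * sum(t.predict(X_new) for t in trees)

print("training MSE:", np.mean((y - predict(X)) ** 2))
```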

APPLICATION
  •           Gradient boosting can be used in the field of learning to rank.
  •           The commercial web search engines Yahoo and Yandex use variants of gradient boosting in their machine-learned ranking engines.



REFERENCES
  1. Michael Kearns. Thoughts on Hypothesis Boosting. Unpublished manuscript (Machine Learning class project), December 1988.
  2. Michael Kearns and Leslie Valiant (1989). "Cryptographic limitations on learning Boolean formulae and finite automata". Symposium on Theory of Computing. ACM. pp. 433-444.
  3. Robert E. Schapire (1990). "The Strength of Weak Learnability". Machine Learning. Boston, MA: Kluwer Academic Publishers. pp. 197-227.
  4. Leo Breiman (1998). "Arcing classifier (with discussion and a rejoinder by the author)". Annals of Statistics. pp. 801-849.
  5. Yoav Freund and Robert E. Schapire. "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting". Journal of Computer and System Sciences. pp. 119-139.
  6. Leo Breiman. "Arcing Classifier (with Discussion and a Rejoinder by the Author)". Annals of Statistics. pp. 801-849.
  7. Llew Mason, Jonathan Baxter, Peter L. Bartlett and Marcus Frean. "Boosting Algorithms as Gradient Descent". In S. A. Solla, T. K. Leen and K. Müller (eds.), Advances in Neural Information Processing Systems 12. MIT Press. pp. 512-518.
  8. Llew Mason, Jonathan Baxter, Peter L. Bartlett and Marcus Frean. "Boosting Algorithms as Gradient Descent in Function Space".
  9. Alexey Natekin and Alois Knoll. "Gradient boosting machines, a tutorial". Frontiers in Neurorobotics, published 04 December 2013.
  10. Schmid, M. and Hothorn, T. (2007). "Boosting Additive Models Using Component-Wise P-Splines". Technical Report, Department of Statistics, University of Munich.

Monday, November 5, 2018

Random Forest: Breast Cancer

DATA: Breast cancer dataset from here

Data Set Characteristics: Multivariate
Number of instances: 699
Attribute Characteristics: Integer
Number of Attributes: 10
Missing Values: Yes.

Dataset description

  1. Sample Code Number: ID
  2. Clump Thickness: 1-10
  3. Uniformity of cell size: 1-10
  4. Uniformity of cell shape: 1-10
  5. Marginal Adhesion: 1- 10
  6. Single Epithelial Cell Size: 1-10
  7. Bare Nuclei: 1-10
  8. Bland Chromatin: 1-10
  9. Normal Nucleoli: 1-10
  10. Mitosis: 1-10
  11. Class:(2 for benign, 4 for malignant)


This dataset presents us with a new challenge in that it contains missing data. So far in this blog we have never worked with a dataset where data is missing, so this calls for some data exploration. We need to gather some information about the data, that is, it calls for some EDA (Exploratory Data Analysis). But first, import the libraries:

Step 1: 
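The original code is shown as a screenshot; a minimal sketch of the imports that the later steps assume (the library choices are mine) could look like this:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
```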

Step 2: 
Read the data and check for interesting information, i.e., whether any data is missing and which columns contain it.
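A sketch of this step, reusing the imports from Step 1; the file name 'breast-cancer-wisconsin.data' and the column names are assumptions, since the UCI file ships without a header:

```python
columns = ['Sample_code_number', 'Clump_Thickness', 'Uniformity_Cell_Size',
           'Uniformity_Cell_Shape', 'Marginal_Adhesion', 'Single_Epithelial_Cell_Size',
           'Bare_Nuclei', 'Bland_Chromatin', 'Normal_Nucleoli', 'Mitosis', 'Class']
df = pd.read_csv('breast-cancer-wisconsin.data', names=columns)
df.info()   # data types and non-null counts per column
```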



Step 3:
Check the output. Here we find that while all other fields show up as integers, which should be the case, Bare_Nuclei shows up as an object, thus identifying the culprit.
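For reference, a quick way to see the column types (the original screenshot may show the df.info() output instead):

```python
print(df.dtypes)   # every column is int64 except Bare_Nuclei, which shows up as object
```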



Step 4: 
Here we need some juggling with a regular expression to find out what exactly is stored in place of the missing data. It may be tempting to just open the file and check manually, but that can be messy if the dataset is large. So here we go:
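A possible version of that check, using a regular expression on the Bare_Nuclei column defined in Step 2 (the original screenshot may differ):

```python
# Flag every entry that contains a non-digit character
mask = df['Bare_Nuclei'].astype(str).str.contains(r'\D', regex=True)
print(df.loc[mask, 'Bare_Nuclei'].value_counts())   # shows the '?' placeholder
print("total missing:", mask.sum())                 # 16 for this dataset
```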


and here is the output, showing that the missing entries have been populated with '?' and that there are 16 of them in total.

Step 5: 
Here we have two options: either drop the rows with missing data, or use the Imputer class from sklearn.preprocessing. But there is a problem with the second option: if the dataset is skewed in favor of either class and this feature happens to be the most influential one, imputation may lead to too many misclassifications. Also notice that only 16 rows are affected out of 699, so the percentage is quite low. Hence we go for the first option, and then separate the target from the features, i.e., decide on X and y.
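A sketch of this step under the column names assumed earlier:

```python
# Drop the rows containing '?' and separate the features from the target
df = df[df['Bare_Nuclei'] != '?'].copy()
X = df.drop(['Sample_code_number', 'Class'], axis=1)
y = df['Class']
```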


Step 6: 
Now just to satisfy ourselves, let's take a look at the correlation between the class and the different features. But there is a problem, and that is Bare_Nuclei is still not an integer, but an object. So we need to convert it into an integer, or else it will never appear in our heatmap. So that's what we do in this part of the code. 
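A sketch of this step; the figure size and color map are illustrative choices:

```python
# Convert Bare_Nuclei to integer so it appears in the correlation heatmap,
# and keep the feature matrix in sync
df['Bare_Nuclei'] = df['Bare_Nuclei'].astype(int)
X['Bare_Nuclei'] = df['Bare_Nuclei']

plt.figure(figsize=(10, 8))
sns.heatmap(df.drop('Sample_code_number', axis=1).corr(), annot=True, cmap='coolwarm')
plt.show()
```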


As the result shows, our suspicion about Bare_Nuclei turns out to be true: it does play a major role in classification.


Step 7: 
Now we can safely put the rest of the work into our template: split the data, standardize it, fit the model on the training data, and finally check the misclassifications.
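A sketch of that template (hyperparameters are illustrative; note that scaling is not strictly required for tree-based models, but it follows the template used in this blog):

```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("misclassified:", (y_test != y_pred).sum(), "out of", len(y_test))
print("accuracy:", accuracy_score(y_test, y_pred))
```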



Result