Tuesday, March 12, 2019

Example on Gradient Boosting Method (GBM)

DATA: CREDIT CARD DEFAULT DATA FROM HERE

Data Set Characteristics: Multivariate
Number of instances: 30000
Attribute Characteristics: Integer
Number of Attributes: 23
Missing Values: N/A.



Dataset Description:

There are 25 variables:
  • ID: ID of each client
  • LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
  • SEX: Gender (1=male, 2=female)
  • EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
  • MARRIAGE: Marital status (1=married, 2=single, 3=others)
  • AGE: Age in years
  • PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above)
  • PAY_2: Repayment status in August, 2005 (scale same as above)
  • PAY_3: Repayment status in July, 2005 (scale same as above)
  • PAY_4: Repayment status in June, 2005 (scale same as above)
  • PAY_5: Repayment status in May, 2005 (scale same as above)
  • PAY_6: Repayment status in April, 2005 (scale same as above)
  • BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
  • BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
  • BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
  • BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
  • BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
  • BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
  • PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
  • PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
  • PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
  • PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
  • PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
  • PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
  • default.payment.next.month: Default payment (1=yes, 0=no)
This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable.

First import the libraries:
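A minimal import cell covering the steps below might look like the following sketch (the exact library choices are assumptions):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import confusion_matrix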

Data Exploration:

Let us start by reading the file and exploring the imbalance in the target, that is, the number of clients who defaulted versus those who did not.
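A sketch of this step, assuming the dataset has been saved locally as UCI_Credit_Card.csv (the file name is an assumption) and that the target column is named default.payment.next.month, as in the description above:

    # Load the data; the file name here is an assumed local path
    df = pd.read_csv('UCI_Credit_Card.csv')

    # Count clients who defaulted next month vs. those who did not
    counts = df['default.payment.next.month'].value_counts()
    print(counts)
    print(counts / len(df))   # proportions, to gauge the class imbalance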

The result:

6,636 out of 30,000 clients (about 22%) will default next month. We will proceed on the assumption that the data is not too unbalanced (for now).
Carrying on with our data exploration, let's check the effect of education, marital status and sex on the default payment.
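One way to look at this, continuing from the frame loaded above, is to compute the default rate per category (a sketch, not necessarily the exact plots used here):

    # Mean of the binary target per category gives the default rate per group
    for col in ['EDUCATION', 'MARRIAGE', 'SEX']:
        print(df.groupby(col)['default.payment.next.month'].mean(), '\n')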

And the results, which, as is evident, do not shed much light on the problem:

Next, let us check whether there is any correlation between the features. We start with the repayment status, i.e. the PAY_X columns.
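A sketch of the correlation heatmap for the repayment-status columns (the column names follow the description above):

    # Correlation among the repayment-status columns
    pay_cols = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
    plt.figure(figsize=(8, 6))
    sns.heatmap(df[pay_cols].corr(), annot=True, cmap='coolwarm')
    plt.show()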

And here is the result:

If one looks carefully, there is a correlation between the repayment-status columns. The greater the temporal distance between two months, the lower the correlation, with the strongest correlations between neighbouring months. Similar correlations are likely to be found among the other twelve features comprising the bill statements and the previous payments. This actually calls for something called PCA (Principal Component Analysis) or Kernel PCA, but we have not covered it yet, so let us leave it for a future post.

But what would be the result if we include the target column? Let's have a look. We add that column before drawing the heatmap.
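Continuing the sketch above, the target column can simply be appended to the list before drawing the heatmap:

    # Same heatmap, with the target column added to the PAY_X block
    cols = pay_cols + ['default.payment.next.month']
    plt.figure(figsize=(8, 6))
    sns.heatmap(df[cols].corr(), annot=True, cmap='coolwarm')
    plt.show()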
The result looks like this:

As is obvious, there is a direct correlation between the default and the PAY_X columns. The reader may experiment with other features too, but for now we go ahead with these columns. Another way, of course, is to use a random forest to extract the important features. The key point is that when there is a strong correlation between features, it is important to apply PCA, which we are skipping for now.

Next, we segregate the features and the target and do the train/test split.
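A sketch of this step, assuming we keep only the PAY_X columns as features; the test size and random seed are assumptions:

    # Features: the repayment-status columns discussed above; target: default flag
    X = df[pay_cols].values
    y = df['default.payment.next.month'].values

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)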


We will not delve deeper into this here. Next we fit the model, predict on the test set, and check the misclassifications using a confusion matrix.
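One way to do this with scikit-learn's GradientBoostingClassifier; the hyperparameters shown below are the library defaults and may differ from those used originally:

    # Fit a gradient boosting classifier (hyperparameters assumed, not tuned)
    gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                     max_depth=3, random_state=0)
    gbm.fit(X_train, y_train)

    # Predict on the held-out set and inspect the misclassifications
    y_pred = gbm.predict(X_test)
    print(confusion_matrix(y_test, y_pred))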


And the final result as it stands: