Friday, January 3, 2020

Adaboost Example

DATA: CREDIT CARD DEFAULT DATA FROM HERE

Data Description:

Data Set Characteristics: Multivariate
Number of instances: 1000
Attribute Characteristics: Integer, Categorical
Number of Attributes: 20
Missing Values: N/A.


Dataset Description:
There are 20 attributes, followed by the class label (1 or 2) as the last column.

Attribute 1: (qualitative) Status of existing checking account
A11 : ... < 0 DM
A12 : 0 <= ... < 200 DM
A13 : ... >= 200 DM / salary assignments for at least 1 year
A14 : no checking account

Attribute 2: (numerical) Duration in month

Attribute 3: (qualitative) Credit history
A30 : no credits taken/ all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account/ other credits existing (not at this bank)

Attribute 4: (qualitative) Purpose
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
A44 : domestic appliances
A45 : repairs
A46 : education
A47 : (vacation - does not exist?)
A48 : retraining
A49 : business
A410 : others

Attribute 5: (numerical) Credit amount

Attribute 6: (qualitative) Savings account/bonds
A61 : ... < 100 DM
A62 : 100 <= ... < 500 DM
A63 : 500 <= ... < 1000 DM
A64 : ... >= 1000 DM
A65 : unknown/ no savings account

Attribute 7: (qualitative) Present employment since
A71 : unemployed
A72 : ... < 1 year
A73 : 1 <= ... < 4 years
A74 : 4 <= ... < 7 years
A75 : ... >= 7 years

Attribute 8: (numerical) Installment rate in percentage of disposable income

Attribute 9: (qualitative) Personal status and sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : female : single

Attribute 10: (qualitative) Other debtors / guarantors
A101 : none
A102 : co-applicant
A103 : guarantor

Attribute 11: (numerical) Present residence since

Attribute 12: (qualitative) Property
A121 : real estate
A122 : if not A121 : building society savings agreement/ life insurance
A123 : if not A121/A122 : car or other, not in attribute 6
A124 : unknown / no property

Attribute 13: (numerical) Age in years

Attribute 14: (qualitative) Other installment plans
A141 : bank
A142 : stores
A143 : none

Attribute 15: (qualitative) Housing
A151 : rent
A152 : own
A153 : for free

Attribute 16: (numerical) Number of existing credits at this bank

Attribute 17: (qualitative) Job
A171 : unemployed/ unskilled - non-resident
A172 : unskilled - resident
A173 : skilled employee / official
A174 : management/ self-employed/ highly qualified employee/ officer

Attribute 18: (numerical) Number of people being liable to provide maintenance for

Attribute 19: (qualitative) Telephone
A191 : none
A192 : yes, registered under the customer's name

Attribute 20: (qualitative) foreign worker
A201 : yes
A202 : no


The target is a binary variable: defaulted (2) or not (1). It is worse to classify a customer as good (1) when they are actually bad (2) than to classify a customer as bad (2) when they are actually good (1).

First, import the libraries:
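The original import cell is not reproduced here, so the following is only a plausible set of imports for the steps in this post, assuming pandas, NumPy, matplotlib and scikit-learn are installed.

```python
# Assumed imports for the walkthrough (the original cell is not shown).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix
```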



Next, we read the file into a dataframe, set the column names and start with the EDA. The first step is to check whether the data is highly unbalanced, i.e. whether the ratio of defaulters to non-defaulters is too skewed.
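A minimal sketch of this step, assuming the file is whitespace-separated with no header row. The column names and the helper names load_credit_data and class_balance are our own choices for illustration, and the three sample rows only mimic the file's layout so that the snippet runs without the download.

```python
import io
import pandas as pd

# Hypothetical column names: the raw file has no header row.
COLUMNS = [
    "checking_status", "duration_months", "credit_history", "purpose",
    "credit_amount", "savings", "employment_since", "installment_rate",
    "personal_status_sex", "other_debtors", "residence_since", "property",
    "age_years", "other_installment_plans", "housing", "existing_credits",
    "job", "people_liable", "telephone", "foreign_worker", "target",
]

def load_credit_data(source):
    """Read a whitespace-separated file (path or file-like) and name the columns."""
    return pd.read_csv(source, sep=r"\s+", header=None, names=COLUMNS)

def class_balance(df):
    """Count rows per class label (1 = did not default, 2 = defaulted)."""
    return df["target"].value_counts().sort_index()

# Sample rows in the file's layout; pass the real file path instead.
sample = io.StringIO(
    "A11 6 A34 A43 1169 A65 A75 4 A93 A101 4 A121 67 A143 A152 2 A173 1 A192 A201 1\n"
    "A12 48 A32 A43 5951 A61 A73 2 A92 A101 2 A121 22 A143 A152 1 A173 1 A191 A201 2\n"
    "A14 12 A34 A46 2096 A61 A74 2 A93 A101 3 A121 49 A143 A152 1 A172 2 A191 A201 1\n"
)
df = load_credit_data(sample)
print(class_balance(df))
```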


The result shows 700 non-defaulters (class 1) and 300 defaulters (class 2).


So the data is not highly unbalanced, as the two classes are in a ratio of 7:3.
Next we try to figure out the effect of some of the attributes on whether a person defaults. We pick a few features and leave the rest as an exercise to the reader. Note that these plots can be drawn one at a time or together with the help of subplots.
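One way to draw such plots, sketched under our own assumptions: a row-normalised cross-tabulation gives the share of each class within every category, and a stacked bar chart per feature goes into its own subplot. The helper names and the toy dataframe are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

def default_share(df, feature, target="target"):
    """Share of class 1 and class 2 within each category of `feature`."""
    return pd.crosstab(df[feature], df[target], normalize="index")

def plot_default_share(df, features, target="target"):
    """One stacked bar chart per feature, laid out side by side as subplots."""
    fig, axes = plt.subplots(1, len(features),
                             figsize=(5 * len(features), 4), squeeze=False)
    for ax, feature in zip(axes[0], features):
        default_share(df, feature, target).plot(kind="bar", stacked=True, ax=ax)
        ax.set_title(feature)
        ax.set_ylabel("share of customers")
    fig.tight_layout()
    return fig

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "checking_status": ["A11", "A11", "A14", "A14"],
    "target": [2, 1, 1, 1],
})
shares = default_share(toy, "checking_status")
print(shares)
```

Calling plot_default_share(df, ["checking_status", "credit_history"]) on the full dataframe would then give one panel per feature.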
  

Visualisation:

Some interesting findings:
Feature 1. Property: Notice that customers in category A124 (unknown / no property) can be risky.


Feature 2. Account status: Here category A11 (< 0 DM) has the most defaulters, while A14 (no checking account) is the safest bet, the latter being a surprise.



Feature 3. Credit history: Here things become really interesting. A30 (no credits taken/ all credits paid back duly) and A31 (all credits at this bank paid back duly) are surprisingly the worst performers.


The other two features we looked at did not yield any meaningful insights.

Next we can use scatter plots of pairs of variables to check whether any clear pattern shows up. For that we rearrange the dataframe slightly: we separate the ones from the twos and concatenate the two parts, so that the classes are cleanly segregated, with the first 700 rows being the people who did not default and the next 300 those who defaulted.
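The rearrangement can be sketched as follows. The column name target and the helper name group_by_class are assumptions, and the toy dataframe is made up.

```python
import pandas as pd

def group_by_class(df, target="target"):
    """Put all class-1 rows first, then all class-2 rows, with a fresh index."""
    good = df[df[target] == 1]
    bad = df[df[target] == 2]
    return pd.concat([good, bad], ignore_index=True)

# Toy data, made up for illustration only.
toy = pd.DataFrame({"age_years": [30, 45, 22, 51], "target": [2, 1, 2, 1]})
grouped = group_by_class(toy)
print(grouped)
```

On the full data this leaves rows 0 to 699 as class 1 and rows 700 to 999 as class 2.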



As an example, we draw a scatter plot of features 2 and 13, that is, duration in months and age in years, both of which are numerical.
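A possible version of this plot, not necessarily the original code: the two features are picked by zero-based column position, (1, 12) for attributes 2 and 13, and each class gets its own colour. The helper name, the colours and the random toy data are all assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def scatter_by_class(df, cols=(1, 12), target="target"):
    """Scatter two columns (zero-based positions) with one colour per class."""
    x_col, y_col = df.columns[list(cols)]
    fig, ax = plt.subplots()
    for label, colour, name in [(1, "tab:blue", "paid (1)"),
                                (2, "tab:red", "defaulted (2)")]:
        part = df[df[target] == label]
        ax.scatter(part[x_col], part[y_col], c=colour, label=name, alpha=0.6)
    ax.set_xlabel(str(x_col))
    ax.set_ylabel(str(y_col))
    ax.legend()
    return ax

# Random stand-in data (13 feature columns plus a target), for illustration only.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(1, 76, size=(20, 13)))
toy["target"] = [1] * 14 + [2] * 6
ax = scatter_by_class(toy)
```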



The result, if one looks closely, does not reveal much. That should not deter the reader from trying other combinations, even with categorical data. The same syntax can be reused for this analysis: on the first line, replace [1,12] with any other pair of features.

And now for the actual implementation of the algorithm. As we saw, the data analysis we did revealed little, though it was by no means exhaustive.

Implementation of the algorithm

Since the analysis revealed little, we have no grounds for singling out any feature, so we treat all of them as equally important. However, the dataset has many categorical features. One way to handle them is to create dummy variables. Fortunately, the dataset is also available in a version where all the values are numeric. It can be downloaded from here

So we read the data and complete the separation of train and test data.
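A sketch of the read-and-split step, under assumptions the post does not confirm: the numeric file is whitespace-separated with the class in the last column, 30% of the rows are held out, and the split is stratified. The file name, helper names, test size and seed are illustrative, and the toy frame stands in for the real file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def load_numeric(path="german.data-numeric"):
    """Read the all-numeric version (assumed file name, no header row)."""
    return pd.read_csv(path, sep=r"\s+", header=None)

def make_split(df, test_size=0.3, seed=42):
    """Features are all columns but the last; the last column is the class."""
    X = df.iloc[:, :-1].to_numpy(dtype=float)
    y = df.iloc[:, -1].to_numpy()
    # stratify=y keeps the 7:3 class ratio in both parts (our assumption).
    return train_test_split(X, y, test_size=test_size,
                            random_state=seed, stratify=y)

# Random stand-in for the real file, for illustration only.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(0, 5, size=(100, 24)))
toy[24] = [1] * 70 + [2] * 30
X_train, X_test, y_train, y_test = make_split(toy)
print(X_train.shape, X_test.shape)
```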


The next step is to standardise the data.
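Standardisation could look like the sketch below, assuming scikit-learn's StandardScaler. The scaler is fitted on the training data only and then applied to the test data, so no test-set statistics leak into training. The helper name standardise and the tiny arrays are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def standardise(X_train, X_test):
    """Fit mean and std on the training set, then scale both sets with them."""
    scaler = StandardScaler().fit(X_train)
    return scaler.transform(X_train), scaler.transform(X_test), scaler

# Tiny made-up arrays so the arithmetic is easy to follow.
X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
X_test = np.array([[2.0, 20.0]])
X_train_s, X_test_s, scaler = standardise(X_train, X_test)
print(X_train_s)
print(X_test_s)
```

Tree-based base learners such as decision stumps are insensitive to this rescaling, so the step matters mainly if a scale-sensitive base estimator is used.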

The last step is to fit the algorithm and check the misclassification rate and the confusion matrix.
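A minimal sketch of this step, assuming scikit-learn's AdaBoostClassifier with its default base estimator. The number of estimators, the seeds and the synthetic data from make_classification are our assumptions, so the numbers it prints are not the post's results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

def fit_and_evaluate(X_train, y_train, X_test, y_test, n_estimators=100, seed=0):
    """Fit AdaBoost, then return the model, confusion matrix and error rate."""
    model = AdaBoostClassifier(n_estimators=n_estimators, random_state=seed)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Rows are true classes, columns are predicted classes, in the order [1, 2].
    cm = confusion_matrix(y_test, y_pred, labels=[1, 2])
    error = float(np.mean(y_pred != y_test))
    return model, cm, error

# Synthetic stand-in with a 7:3 class ratio; labels shifted to 1 and 2.
X, y01 = make_classification(n_samples=1000, n_features=24,
                             weights=[0.7, 0.3], random_state=0)
y = y01 + 1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
model, cm, error = fit_and_evaluate(X_train, y_train, X_test, y_test)
print(cm)
print("misclassification rate:", error)
```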


Finally the result:

As is obvious, data imbalance is causing severe problems. For customers who paid we get about 80% accuracy, but for customers who failed to pay we get only about 50%, which exposes the bank to a lot of bad debt.
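The per-class figures come from the rows of the confusion matrix: each diagonal entry divided by its row total. The sketch below shows the arithmetic. The counts are hypothetical, chosen only so that they reproduce the 80% and 50% quoted above, and are not the actual confusion matrix from this run.

```python
import numpy as np

def per_class_accuracy(cm):
    """Correct predictions per true class divided by the size of that class."""
    cm = np.asarray(cm, dtype=float)
    return np.diag(cm) / cm.sum(axis=1)

# Hypothetical counts (rows: true class 1, true class 2), NOT the real results.
cm_example = np.array([[168, 42],
                       [45, 45]])
paid_acc, default_acc = per_class_accuracy(cm_example)
overall = np.trace(cm_example) / cm_example.sum()
print(paid_acc, default_acc, overall)
```

With counts like these the overall accuracy looks acceptable even though half of the defaulters are missed, which is why the per-class view matters here.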