Monday, June 11, 2018

Decision Tree Example

Example 1


DATA - Titanic dataset from www.kaggle.com. The data has been split into two groups:
  • training set (train.csv)
  • test set (test.csv)
Data Dictionary
Variable: Definition (Key)
survival: Survival (0 = No, 1 = Yes)
pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
sex: Sex
age: Age in years
sibsp: # of siblings / spouses aboard the Titanic
parch: # of parents / children aboard the Titanic
ticket: Ticket number
fare: Passenger fare
cabin: Cabin number
embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Variable Notes
pclass: A proxy for socio-economic status (SES)

1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson

Some children traveled only with a nanny, therefore parch=0 for them.

Predict Survival in the Titanic Data Set

Predictions are made using a decision tree for the Titanic dataset downloaded from Kaggle. The dataset provides information on the Titanic passengers and can be used to predict whether a passenger survived or not.

STEP 1

Load the Titanic dataset file downloaded from Kaggle.
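The post does not show its code, so here is a minimal sketch of this step, assuming the training file is train.csv in the working directory:

import pandas as pd

# Read the Kaggle training file into a DataFrame.
df = pd.read_csv('train.csv')
print(df.shape)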


 OUTPUT



 STEP 2 
Now label the data.
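One plausible reading of "label the data", sketched here as an assumption since the original code is not shown, is inspecting the column labels and the first rows of the frame:

# Column labels of the training frame and a peek at the data.
print(df.columns.tolist())
print(df.head())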


 STEP 3
Convert the Sex column, which is a string, into binary values, i.e. 0 and 1, using the map() method.
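A sketch of the mapping; the particular encoding ('male' to 0, 'female' to 1) is an assumption:

# Map the Sex strings to binary values in place.
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})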
 
STEP 4
Now drop the empty column and the rows with missing values.
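A sketch under the assumption that the mostly empty Cabin column is dropped first and the remaining rows with missing values removed afterwards:

# Drop the largely empty Cabin column, then rows with remaining NaNs.
df = df.drop(columns=['Cabin'])
df = df.dropna()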


STEP 5

Now split the data into train and test datasets using the sklearn module.
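A sketch of the split; the chosen feature columns, test_size, and random_state are assumptions:

from sklearn.model_selection import train_test_split

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
X = df[features]
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)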



STEP 6
To apply a decision tree to the Titanic dataset, import tree from sklearn and name the classifier "model".
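A minimal sketch:

from sklearn import tree

# An unparameterized classifier; the defaults are discussed in STEP 7.
model = tree.DecisionTreeClassifier()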



STEP 7
Now take a look at the model (decision tree) attributes and fit the data with the model.
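A sketch; printing the fitted model shows the attributes discussed in the output below:

# Fit the tree to the training data and inspect its attributes.
model.fit(X_train, y_train)
print(model)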

 OUTPUT
The decision tree model uses the Gini index by default to measure the quality of a split; it can be changed to entropy (information gain) as per the requirement, and parameters such as the max_depth of the tree can also be chosen.


STEP 8
Now predict the values on the test dataset.
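A minimal sketch:

# Predict survival for the held-out passengers.
predictions = model.predict(X_test)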



STEP 9
After predicting on the test dataset, find the accuracy score using the sklearn.metrics module and print the accuracy.
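A minimal sketch:

from sklearn.metrics import accuracy_score

# Fraction of test passengers classified correctly.
print(accuracy_score(y_test, predictions))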



OUTPUT

STEP 10
Now print the confusion matrix 
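A minimal sketch:

from sklearn.metrics import confusion_matrix

# Rows are actual classes (0, 1); columns are predicted classes.
print(confusion_matrix(y_test, predictions))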



OUTPUT



STEP 11
To get a graphical form of the decision tree, install Graphviz (http://www.graphviz.org/) and explore the decision tree depth. Save the tree as 'survival_tree.dot'; the rendered file will be saved in the same path as the main file with a .png extension.
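A sketch of the export; it assumes the fitted model from STEP 7, the feature list from STEP 5, and that the pydotplus package is installed alongside Graphviz (the post only names Graphviz):

from sklearn import tree
import pydotplus

# Write the tree structure to survival_tree.dot ...
with open('survival_tree.dot', 'w') as f:
    tree.export_graphviz(model, out_file=f, feature_names=features)

# ... then render the .dot file to a .png next to the main file.
graph = pydotplus.graph_from_dot_file('survival_tree.dot')
graph.write_png('survival_tree.png')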



OUTPUT



Let's have a look at a zoomed part of the .png file.



Let’s follow this part of the tree down; the nodes to the left are True and the nodes to the right are False:
  1. To start, there are 19 observations left to classify: 9 did not survive and 10 did.
  2. From this point, the split with the most information gain is how many siblings/spouses (SibSp) were aboard.
    A. 9 out of the 10 samples with fewer than 2.5 siblings survived.
    B. This leaves 10 observations: 9 did not survive and 1 did.
  3. 6 of these children who had only one parent (Parch) aboard did not survive.
  4. None of the children aged > 3.5 survived.
  5. Of the 2 remaining children, the one with > 4.5 siblings did not survive.



Example 2


Contributed by Minal Khandare (minalkamble327@gmail.com)
DATA:
Dataset comes from the UCI Machine Learning Repository and was generated to model psychological experimental results. Each example is classified as having the balance scale tip to the right, tip to the left, or be balanced. The attributes are the left weight, the left distance, the right weight, and the right distance. The correct way to find the class is the greater of (left-distance * left-weight) and (right-distance * right-weight). If they are equal, it is balanced.
The Balance Scale dataset consists of 5 attributes: 4 feature attributes and 1 target attribute. Let us try to build a classifier for predicting the Class attribute. The target attribute is in the 1st column.
1. Class Name: 3 (L, B, R)
2. Left-Weight: 5 (1, 2, 3, 4, 5)
3. Left-Distance: 5 (1, 2, 3, 4, 5)
4. Right-Weight: 5 (1, 2, 3, 4, 5)
5. Right-Distance: 5 (1, 2, 3, 4, 5)

Step 1:
Import all the necessary libraries  
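A plausible minimal set, inferred from the later steps:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier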



Step 2:
For importing the data and manipulating it, use pandas DataFrames. First, download the dataset. After downloading the data file, read it with the pd.read_csv() method to import the data into a pandas DataFrame. There are a total of 624 instances in the balance_scale data file.
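A sketch of the import, with the local filename assumed:

# Import the downloaded file into a pandas DataFrame.
balance_data = pd.read_csv('balance-scale.data')
print('Dataset length:', len(balance_data))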


Output:


There is no header row in the data file, so column labels can be assigned manually; the head() function then shows the labeled first rows.
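A sketch of the labeling, using the attribute names from the data description above:

# Assign column labels, then inspect the first rows.
balance_data.columns = ['Class', 'Left-Weight', 'Left-Distance', 'Right-Weight', 'Right-Distance']
print(balance_data.head())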

 
Output:


Step 3:

Here, the ‘X’ set represents the predictor variables and consists of the data from the 2nd to 5th columns. The ‘Y’ set represents the target variable and consists of the data in the 1st column.


The “.values” attribute is used to convert the DataFrames into NumPy arrays.
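A sketch of the X/Y extraction:

# .values yields a NumPy array; column 0 is the Class target,
# columns 1 to 4 are the four features.
X = balance_data.values[:, 1:5]
Y = balance_data.values[:, 0]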


Sklearn’s train_test_split() method is used to split the data into training and test sets. X_train, y_train are the training data and X_test, y_test belong to the test dataset. The test_size of 0.3 indicates that the test data will be 30% of the whole dataset and the training data 70%.
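A sketch of the split; the random_state value is an assumption:

# Hold out 30% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=100)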



Step 4:



Now implement the decision tree algorithm on the training dataset, as sketched after the parameter notes below.

“criterion” defines the function to decide the quality of a split. Use “entropy” for measuring the information gain.
“max_depth” indicates the maximum depth of the tree.
“min_samples_leaf” indicates the minimum number of samples required at a leaf node.
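A sketch with the three parameters described above; the specific max_depth and min_samples_leaf values are assumptions:

# Entropy criterion for information gain; depth and leaf size capped.
clf_entropy = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, y_train)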



The decision tree is visualized using pydotplus, and the visualization is saved to an image.
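A sketch of the pydotplus export; the output filename is hypothetical, and Graphviz must be installed for the rendering step:

from sklearn.tree import export_graphviz
import pydotplus

# Build the DOT description of the fitted tree ...
dot_data = export_graphviz(clf_entropy, out_file=None,
                           feature_names=['Left-Weight', 'Left-Distance', 'Right-Weight', 'Right-Distance'],
                           filled=True)
# ... and save it as an image.
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_png('balance_tree.png')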

OUTPUT 



Zoomed part of the balance scale decision tree





