“Use decision tree!”. Be it in the industry or be it in kaggle competitions, it is seen that decision tree or at least the algorithms that are evolved from it are practiced religiously. Decision tree is a versatile machine learning method capable of performing both regression and classification tasks. Almost all real world problems are non-linear in nature and decision trees helps to get rid of non-linearity in the data. This algorithm is very intuitive, simple to understand and can be visually interpretable — something every business would want in first place.
HISTORY
Decision Tree Analysis is a general, predictive modelling tool that has applications spanning a number of different areas. In general, decision trees are constructed via an algorithmic approach that identifies ways to split a data set based on different conditions. In a publication produced by SAS, the utility of decision tree analysis was described as follows:
·         Decision trees are a form of multiple variable (or multiple effect) analysis.
·         All forms of multiple variable analysis allow us to predict, explain, describe, or classify an outcome (or target)
·         Multiple variable analysis capability of decision trees enables one to go beyond simple one-cause, one-effect relationships and to discover things in the context of multiple influences
·         Multiple variable analysis is particularly important in current problem-solving because almost all critical outcomes that determine success are based on multiple factors
One model for performing decision tree analysis was created by J. Ross Quinlan at the University of Sydney and presented in his book Machine Learning, vol.1, no. 1, in 1975 (2). His first algorithm for decision tree creation was called the Iterative Dichotomiser 3 (ID3). This algorithm was created based on the principles of Occam's razor, with the idea of creating the smallest, most efficient decision tree possible. Quinlan went on to further develop this model with his creation of the C4.5 algorithm, and finally the C5.0 algorithm. 
Lohand and Strobl [3, 4] provided a comprehensive review of the statistical literature of classification tree methods that may be useful to readers interested more about the statistical theories behind this method. Several authors have developed other decision tree methods to be employed when the endpoint is the prediction of survival. [5-12] Our discussion will be limited to cases in which the selection of input variables are based on statistical properties, but in the real world selection of input variables may be based on the relative cost of collecting the variables or on the clinical meaningfulness of the variables. Jin and colleagues [13, 14] introduced an alternative classification tree method that allows for the selection of input variables based on a combination of preference (Example based on cost) and non-inferiority to the statistically optimal split. Another extension of the decision tree method is to develop a decision tree that identifies subgroups of patients who should have different diagnostic tests or treatment strategies to achieve optimal medical outcomes.[15]
BASIC CONCEPT
A decision tree is a tree where each node represents a feature (attribute), each link (branch) represents a decision (rule) and each leaf represents an outcome (categorical or continues value). The whole idea is to create a tree like Fig (2) for the entire data and process a single outcome at every leaf (or minimize the error in every leaf).
Figure 2 illustrates a simple decision tree model. It consists of nodes which have parent-child relationships
Fig. 2- Sample decision tree outline
Picture courtesy - https://medium.com/data-science-group-iitr/
decision-trees-decoded-c70b4f7ff542
Nodes. 
There are three types of nodes. (a) A root node, also called a decision node, represents a choice that will result in the subdivision of all records into two or more mutually exclusive subsets. (b) Internal nodes, also called chance nodes, represent one of the possible choices available at that point in the tree structure, the top edge of the node is connected to its parent node and the bottom edge is connected to its child nodes or leaf nodes. (c) Leaf nodes, also called end nodes, represent the final result of a combination of decisions or events.
Branches
Branches represent chance outcomes or occurrences that emanate from root nodes and internal nodes. A decision tree model is formed using a hierarchy of branches. Each path from the root node through internal nodes to a leaf node represents a classification decision rule.
Pseudo-code for a function called createBranch() would look like this:
Check if every item in the dataset is in the same class:
If so return the class label
Else
find the best feature to split the data
split the dataset
create a branch node
for each split
call createBranch and add the result to the branch node
return branch node
Splitting
Only input variables related to the target variable are used to split parent nodes into pure child nodes of the target variable. Both discrete input variables and continuous input variables (which are collapsed into two or more categories) can be used. When building the model one must first identify the most important input variables, and then split records at the root node and at subsequent internal nodes into two or more categories or ‘bins’ based on the status of these variables. Characteristics that are related to the degree of ‘purity’ of the resultant child nodes (i. e. , the proportion with the target condition) are used to choose between different potential input variables, these characteristics include entropy, Gini index, classification error, information gain, gain ratio, and twoing criteria [16] This splitting procedure continues until pre-determined homogeneity or stopping criteria are met. In most cases, not all potential input variables will be used to build the decision tree model and in some cases a specific input variable may be used multiple times at different levels of the decision tree.
Stopping
Complexity and robustness are competing characteristics of models that need to be simultaneously considered whenever building a statistical model. An extreme situation is to build a very complex decision tree model that spreads wide enough to make the records in each leaf node 100% pure (i. e. all records have the target outcome). Such a decision tree would be overly fitted to the existing observations and have few records in each leaf, so it could not reliably predict future cases and, thus, would have poor generalizability (i. e. lack robustness). To prevent this from happening, stopping rules must be applied when building a decision tree to prevent the model from becoming overly complex. Common parameters used in stopping rules include: (a) the minimum number of records in a leaf. (b) The minimum number of records in a node prior to splitting; and (c) the depth (i. e.  number of steps) of any leaf from the root node. Stopping parameters must be selected based on the goal of the analysis and the characteristics of the dataset being used. As a rule-of-thumb, Berry and Linoff 
[4] recommend avoiding overfitting and underfitting by setting the target proportion of records in a leaf node to be between 0. 25 and 1. 00% of the full training data set.
Pruning.
In some situations, stopping rules do not work well. An alternative way to build a decision tree model is to grow a large tree first, and then prune it to optimal size by removing nodes that provide less additional information. A common method of selecting the best possible sub-tree from several candidates is to consider the proportion of records with error prediction (i. e. the proportion in which the predicted occurrence of the target is incorrect). Other methods of selecting the best alternative is to use a validation dataset (i. e. dividing the sample in two and testing the model developed on the training dataset on the validation dataset), or, for small samples, cross validation (i.e. dividing the sample in 10 groups or ‘folds’, and testing the model developed from 9 folds on the 10th fold, repeated for all ten combinations, and averaging the rates or erroneous predictions). There are two types of pruning, pre-pruning (forward pruning) and post-pruning (backward pruning). Pre-pruning uses Chi-square tests or multiple-comparison adjustment methods to prevent the generation of non-significant branches. Post-pruning is used after generating a full decision tree to remove branches in a manner that improves the accuracy of the overall classification when applied to the validation dataset.
Decision trees can also be illustrated as segmented space, as shown in Figure 2. The sample space is subdivided into mutually exclusive (and collectively exhaustive) segments, where each segment corresponds to a leaf node (that is, the final outcome of the serial decision rules). Each record is allocated to a single segment (leaf node). Decision tree analysis aims to identify the best model for subdividing all records into different segments.
TYPES OF TREE
Decision trees used are of two main types:
 
·         Classification tree analysis is when the predicted outcome is the class to which the data belongs. 
 
·         Regression tree analysis is when the predicted outcome can be considered a real number (e.g. the price of a house, or a patient's length of stay in a hospital).
 
The term Classification And Regression Tree (CART) analysis is an umbrella term 
used to refer to both of the above procedures, first introduced by Breiman et al [17]. Trees used for regression and trees used for classification have some similarities but also some differences, such as the procedure used to determine where to split. 
 
Some techniques, often called ensemble methods, construct more than one decision tree. Examples of such decision trees are:
·         Gradient Boosting tree: They incrementally build a sequence of (very) simple trees, where each successive tree is built for the prediction residuals of the preceding tree this method is mostly used in predictive data mining
·         Alternating decision tree. It generalizes decision trees and has connections to boosting. An ADTree consists of an alternation of decision nodes, which specify a predicate condition, and prediction nodes, which contain a single number. An instance is classified by an ADTree by following all paths for which all decision nodes are true, and summing any prediction nodes that are traversed. 
·         Bootstrap aggregated (or bagged) decision trees is an early ensemble method  that builds multiple decision trees by repeatedly resampling training data with replacement, and voting the trees for a consensus prediction A random forest classifier is a specific type of bootstrap aggregating 
·         Rotation forest in which every decision tree is trained by first applying principal component analysis (PCA) on a random subset of the input features 
·         Decision list is a one-sided decision tree. Here every internal node has exactly 1 leaf node and exactly 1 internal node as a child (except for the bottommost node, whose only child is a single leaf node). While less expressive, decision lists are arguably easier to understand than general decision trees 
·         Weight-Based this Decision Tree operator is a nested operator i.e. it has a subprocess. The subprocess must have an attribute weighting scheme i.e. an operator that expects an Example Set and generates attribute weights. If the Weight by Chi Squared Statistic operator is supplied for attribute weighting, this operator acts as the CHAID operator. CHAID stands for CHi-squared Automatic Interaction Detection. The chi-square statistic is a nonparametric statistical technique used to determine if a distribution of observed frequencies differs from the theoretical expected frequencies. Chi-square statistics use nominal data, thus instead of using means and variances, this test uses frequencies. CHAID's advantages are that its output is highly visual and easy to interpret. Because it uses multiway splits by default, it needs rather large sample sizes to work effectively, since with small sample sizes the respondent groups can quickly become too small for reliable analysis. The representation of the data as Tree has the advantage compared with other approaches of being meaningful and easy to interpret
ADVANTAGES
The decision tree method is a powerful statistical tool for classification, prediction, interpretation, and data manipulation that has several potential applications in medical and other research. Using decision tree models to describe research findings has the following advantages:
- Easy to Understand: Decision tree output is very easy to understand even for people from non-analytical background. It does not require any statistical knowledge to read and interpret them. Its graphical representation is very intuitive and users can easily relate their hypothesis.
- Useful in Data exploration: Decision tree is one of the fastest way to identify most significant variables and relation between two or more variables. With the help of decision trees, we can create new variables / features that has better power to predict target variable. It can also be used in data exploration stage. For example, we are working on a problem where we have information available in hundreds of variables, there decision tree will help to identify most significant variable.
- Less data cleaning required: It requires less data cleaning compared to some other modelling techniques. It is not influenced by outliers and missing values to a fair degree.
- Data type is not a constraint: It can handle both numerical and categorical variables.
- Non Parametric Method: Decision tree is considered to be a non-parametric method. This means that decision trees have no assumptions about the space distribution and the classifier structure.
DISADVANTAGES
As with all analytic methods, there are also limitations of the decision tree method that users must be aware of. The main disadvantages are as follows:
1.       It can be subject to overfitting and underfitting, particularly when using a small data set. This problem can limit the generalizability and robustness of the resultant models
2.       Another potential problem is that strong correlation between different potential input variables may result in the selection of variables that improve the model statistics but are not causally related to the outcome of interest. Thus, one must be cautious when interpreting decision tree models and when using the results of these models to develop causal hypotheses
APPROPRIATE PROBLEMS FOR DECISION TREE LEARNING
Although a variety of decision tree learning methods have been developed with somewhat differing capabilities and requirements, decision tree learning is generally best suited to problems with the following characteristics:
Fig 3-Example Decision tree
Picture courtesy - www.medium.com
The above picture shows a decision tree for the concept Play Tennis. An example is classified by sorting it through the tree to the appropriate leaf node, then returning the classification associated with this leaf (in this case, Yes or No). This tree classifies Saturday mornings according to whether or not they are suitable for playing tennis.
Instances are represented by attribute-value pairs. Instances are described by a fixed set of attributes (e.g. Temperature) and their values (e.g. Hot). The easiest situation for decision tree learning is when each attribute takes on a small number of disjoint possible values (e.g., Hot, Mild, Cold).
The target function has discrete output values. The decision tree in Figure 3 assigns a boolean classification (e.g., yes or no) to each example. Decision tree methods easily extend to learning functions with more than two possible output values. A more substantial extension allows learning target functions with real-valued outputs, though the application of decision trees in this setting is less common.
Disjunctive descriptions may be required. As noted above, decision trees naturally represent disjunctive expressions(What is disjunctive description?).
The training data may contain errors. Decision tree learning methods are robust to errors, both errors in classifications of the training examples and errors in the attribute values that describe these examples.
The training data may contain missing attribute values. Decision tree methods can be used even when some training examples have unknown values (e.g., if the Humidity of the day is known for only some of the training examples).Many practical problems have been found to fit these characteristics. Decision tree learning has therefore been applied to problems such as learning to classify medical patients by their disease, equipment malfunctions by their cause, and loan applicants by their likelihood of defaulting on payments. Such problems, in which the task is to classify examples into one of a discrete set of possible categories, are often referred to as classification problems.
SOME IMPORTANT TERMINOLOGIES
  
•         We would like to select the attribute that is most useful for classifying examples.
- Information gain measures how well a given attribute separates the training examples according to their target classification.
-  ID3 uses this information gain measure to select among the candidate attributes at each step while growing the tree.
- In order to define information gain precisely, we use a measure commonly used in information theory, called entropy
-  Entropy characterizes the impurity of an arbitrary collection of examples.
Mathematical formulation is explained in next blog
REFERENCES
2.  Quinlan RJ. C4.5 Programs for Machine Learning. San Mateo California Morgan Kaufmann Publishers, Inc. 1993
3.      Loh WY. Fifty years of classification and regression trees. Int Stat Rev. 2014;82(3): 329–348. doi: 10.1111/insr.12016. 
4.      Strobl C. Discussions. Int Stat Rev. 2014;82(3): 349–352. doi: 10.1111/insr.12059.
5.      Segal M. Regression trees for censored data. Biometrics. 1988;44: 35–47.
6.   Segal M, Bloch D. A comparison of estimated proportional hazards models and regression trees. Stat Med. 1989 8  539–550.
7.      Segal M. Features of tree-structured survival analysis. Epidemiol. 1997;8: 344–346.
8. Therneau T, Grambsch P, Fleming T. Martingale based residuals for survival models. Biometrika. 1990 77: 147–160.
9.   LeBlanc M, Crowley J. Relative risk trees for censored survival data. Biometrics. 1992 48: 411–425.
10.  Keles S, Segal M. Residual-based tree-structured survival analysis. Stat Med. 2002;21: 313–326.
11.   Zhang HP. Splitting criteria in survival trees. 10th International Workshop on Statistical Modeling.Innsbruck (Austria):Springer-Verlag;1995. p.305-314
12.  Jin H, Lu Y, Stone K, Black DM. Alternative tree structured survival analysis based on variance of survival time. Med Decis Making. 2004;24(6): 670–680. doi: 10.1177/0272989X10377117. 
13.  Jin H, Lu Y, Harris ST, Black DM, Stone K, Hochberg MC, Genant HK. Classification algorithm for hip fracture prediction based on recursive partitioning methods. Med Decis Making. 2004;24(4): 386–398. doi: 10.1177/0272989X04267009.
14.  Jin H, Lu Y. A procedure for determining whether a simple combination of diagnostic tests may be noninferior to the theoretical optimum combination. Med Decis Making. 2008;28(6): 909–916. doi: 10.1177/0272989X08318462. 
15.   Li C, Gluer CC, Eastell R, Felsenberg D, Reid DM, Rox DM, Lu Y. Tree-structured subgroup analysis of receiver operating characteristic curves for diagnostic tests. Acad Radiol. 2012;19(12): 1529–1536. doi: 10.1016/j.acra.2012.09.007.
16.  Breiman, Leo; Friedman, J. H.; Olshen, R. A.; Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software. ISBN 978-0-412-04841-8. 
17.   Patel N, Upadhyay S. Study of various decision tree pruning methods with their empirical comparison in WEKA. Int J Comp Appl. 60(12):20–25.