NOTE: more details about all of the above steps can be found in [1].
A sketch of the dependencies between the modules is presented in Fig. 1. It can be viewed in terms of classes and interfaces: each box represents an interface and enumerates some classes implementing that interface. In Fig. 1, the solid arrows represent the submodule relation, while the dotted ones show the modules used by the search process. The set of extracted interfaces and their implementations depicted there is sufficient for building a large number of DT induction algorithms, including the most popular ones and many others.
A decision tree is a classifier. Decision trees are a widely used technique in statistical learning, where they are constructed to fit an existing set of data and then used to predict outcomes on new data. We start with N labeled "training records" of the form (X, Y), where X is a d-dimensional vector of features describing the data we have and Y is the label we give this record. For example, our data may be characteristics about humans and our label may be their shoe size, so an example dataset could look like this:
Each component of X is called an "input variable", Y is called the "dependent variable" or "target variable", and each row in such a table is called a "training example". Let us consider two input variables, so that X = (X1, X2). Further, let us assume there is one interesting value of X1 that we can split the dataset around and a few interesting values of X2. An example partitioning of our space of (X1, X2) values is depicted in the left side of Fig. 2, and a decision tree corresponding to such a partitioning is shown in the right side of the figure. Given an unlabelled vector X = (X1, X2), we first test whether X1 > a. If that turns out to be true, we test whether X2 > b, which allows us to classify X into the region R4 or the region R5. If instead X1 ≤ a, then we test X2 against two further thresholds c and d, which allows us to classify X into one of the regions R1, R2, or R3.
Fig. 2 (a) Partition of the 2D space into five regions; (b) a decision tree classifying points into one of the five regions [1]
Next, let Y take on a single constant value for each of the regions R1, ..., R5. Let c_m be the value chosen for the region R_m, and let I_m(X) be an indicator function that equals 1 when X ∈ R_m. This allows us to obtain a model that can predict Y based on X:

f(X) = Σ_{m=1..5} c_m · I_m(X)
Obtaining such a model is the ultimate goal of training a decision tree. This is the same model that is represented in Fig. 2, both as a partition of the 2D space and as a decision tree.
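As a rough illustration, the model f(X) = Σ_m c_m · I_m(X) can be evaluated by following the tests of the tree in Fig. 2. The sketch below is a Python rendering of one possible reading of that figure: the thresholds a, b, c, d and the assignment of regions to branches are placeholder assumptions, not taken from the original text.

# A minimal sketch of evaluating f(X) for the partition in Fig. 2.
# Thresholds a, b, c, d and the region-to-branch mapping are assumed.
def predict(x1, x2, a, b, c, d, region_values):
    """Return the constant c_m of the region containing (x1, x2).

    region_values maps region names 'R1'..'R5' to the constants c_m.
    Assumes c > d on the left branch, which is one possible reading
    of the figure.
    """
    if x1 > a:
        region = "R5" if x2 > b else "R4"
    else:
        if x2 > c:
            region = "R3"
        elif x2 > d:
            region = "R2"
        else:
            region = "R1"
    return region_values[region]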
The most basic process of training a decision tree on a dataset involves the following elements:
1. Selecting which attribute to split on
2. Splitting the tree on that attribute
3. Deciding when to stop splitting a node and mark it terminal
4. Assigning a label to each terminal node
Some algorithms add an element called pruning. There are many different ways of implementing splitting criteria, stopping criteria, and pruning methods. Splitting criteria are algorithmic rules that decide which input variable to split the dataset on next. Stopping criteria are rules that determine when to stop splitting the dataset and instead output a classification. Stopping criteria are technically optional, but in their absence a trained tree would have a separate region for each training record. As this is undesirable, stopping criteria are used as a method of deciding when to stop growing the tree. Lastly, pruning methods reduce the size and complexity of an already trained tree by combining or removing rules that do not significantly increase classification accuracy. All three of these elements directly affect the complexity of a tree, which can be measured according to various metrics such as tree height, tree width, and the number of nodes. It is desirable to train trees that are not overly complex, because simpler trees require less storage.
We now begin a description of several splitting criteria. In particular, we will discuss splitting criteria that only consider the values of a single input variable; such criteria are called univariate splitting criteria. For a given dataset D of training records of the form (X, Y), we let T_i be a test on a record that considers the attribute X_i and has n outcomes O_1, O_2, ..., O_n. An example of such a test would be X1 > a, which has two outcomes. Many different tests could exist for each input attribute. A given test will split the dataset D into n different subsets D_1, D_2, ..., D_n. The following is a generic tree training algorithm.
Algorithm: TrainTree
Input: D, a dataset of training records of the form (X, Y).
Output: Root node R of a trained decision tree.
1) Create a root node R.
2) If a stopping criterion has been reached, label R with the most common value of Y in D and output R.
3) For each input variable X_i in X:
   a. Find the test T_i whose partition D_1, D_2, ..., D_n performs best according to the chosen splitting metric.
   b. Record this test and the value of the splitting metric.
4) Let T_b be the best test according to the splitting metric, let s be its value of the splitting metric, and let D_1, D_2, ..., D_n be the corresponding partition.
5) If s < threshold:
   a. Label R with the most common value of Y in D and output R.
6) Otherwise, label R with T_b.
7) For each outcome O_i of T_b:
   a. Create a child node C_i of R, and label the edge from R to C_i with O_i.
   b. Set C_i = TrainTree(D_i).
8) Output R.
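The following is a minimal Python sketch of the generic TrainTree procedure above. The helper names best_split, majority_label and stopping_criterion, the threshold parameter, and the dictionary-based node representation are assumptions made for illustration; they are not part of the original algorithm.

# A minimal sketch of the generic TrainTree algorithm. The helpers
# best_split, majority_label and stopping_criterion are hypothetical
# and must be supplied by the caller (best_split could, for example,
# use information gain as the splitting metric).
def train_tree(D, best_split, majority_label, stopping_criterion, threshold):
    """D is a list of (X, Y) records; returns the root node as a dict."""
    # Step 2: stop early if a stopping criterion is met.
    if stopping_criterion(D):
        return {"label": majority_label(D)}

    # Steps 3-4: find the best test and its partition of D.
    test, score, partition = best_split(D)  # partition: {outcome: subset of D}

    # Step 5: if the best split is not good enough, make a leaf.
    if score < threshold:
        return {"label": majority_label(D)}

    # Steps 6-7: label this node with the test and recurse on each subset.
    node = {"test": test, "children": {}}
    for outcome, D_i in partition.items():
        node["children"][outcome] = train_tree(
            D_i, best_split, majority_label, stopping_criterion, threshold
        )
    return node  # Step 8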
Before we consider any individual splitting criterion, we need to define the notion of an impurity function, which is a start to understanding how well a given split organizes the dataset. An impurity function of a probability distribution P = (p_1, ..., p_J) is a real-valued function φ(P) that satisfies a few constraints:
1. φ attains its maximum if and only if P is the uniform distribution.
2. φ attains its minimum if and only if p_i = 1 for some component p_i of P.
3. φ is symmetric in the components of P.
4. φ(P) ≥ 0 for all P.
The application to decision trees arises from the fact that at each node, when considering a split on a given attribute, we have a probability distribution P with a component p_j for each class j of the target variable Y. Hence a split on an attribute is most impure if P is uniform, and is pure if some p_j = 1, meaning all records that pass this split are definitely of class j. Once we have an impurity function, we can define an impurity measure of a dataset D as follows. If there are J possible values y_1, y_2, ..., y_J of the target variable Y, and σ is the selection operator from relational algebra, then the probability distribution of the target variable Y over the dataset D is

P_Y(D) = ( |σ_{Y=y_1}(D)| / |D| , ..., |σ_{Y=y_J}(D)| / |D| )
The impurity measure of a dataset D is then denoted

Φ(D) = φ( P_Y(D) )
Lastly, we define the goodness-of-split (or change in purity) with respect to an input variable X_i that has m possible values v_1, ..., v_m and a dataset D as

ΔΦ(X_i, D) = φ( P_Y(D) ) − Σ_{j=1..m} ( |σ_{X_i=v_j}(D)| / |D| ) · φ( P_Y( σ_{X_i=v_j}(D) ) )
Impurity-based splitting criteria use an impurity function φ plugged into the general goodness-of-split equation defined above.
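As a hedged illustration, the goodness-of-split computation can be sketched in Python with the impurity function left as a parameter. The function names and data layout below (records as (X, y) pairs, a partition given as a list of subsets) are assumptions made for this sketch, not part of the original text.

from collections import Counter

def class_distribution(D):
    """P_Y(D): the empirical distribution of the target variable Y,
    assuming each record in D is a pair (X, y)."""
    counts = Counter(y for _, y in D)
    total = len(D)
    return [c / total for c in counts.values()]

def goodness_of_split(D, partition, impurity):
    """Goodness-of-split: impurity of D minus the weighted impurity of
    the subsets in the partition (a list of subsets of D)."""
    parent = impurity(class_distribution(D))
    weighted = sum(len(D_j) / len(D) * impurity(class_distribution(D_j))
                   for D_j in partition)
    return parent - weighted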
Information Gain
Information gain is a splitting criterion that comes from information theory. It uses information entropy as the impurity function. Instead of the usual definition of information entropy as a function of a discrete random variable, we use a very similar definition of information entropy as a function of a probability distribution, which allows us to use it as our impurity function. Given a probability distribution P = (p_1, p_2, ..., p_n), where p_i is the probability that a point is in the subset D_i of a dataset D, we define the entropy H:

Entropy(P) = − Σ_{i=1..n} p_i · log2(p_i)
Plugging in Entropy as our function φ gives us InformationGain(X_i, D):

InformationGain(X_i, D) = Entropy( P_Y(D) ) − Σ_{j=1..m} ( |σ_{X_i=v_j}(D)| / |D| ) · Entropy( P_Y( σ_{X_i=v_j}(D) ) )
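Continuing the earlier sketch, entropy can be plugged in as the impurity function. Again, this is an illustrative snippet that reuses the hypothetical goodness_of_split and class_distribution helpers from above rather than a reference implementation.

import math

def entropy(P):
    """Information entropy of a probability distribution P, in bits."""
    return -sum(p * math.log2(p) for p in P if p > 0)

def information_gain(D, partition):
    """Information gain of splitting D into the given partition,
    reusing the goodness_of_split helper sketched above."""
    return goodness_of_split(D, partition, impurity=entropy)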
We can also define information gain as a function of conditional entropy. Suppose X and Y are two distributions with n and m elements, respectively. The outcomes of the splits on X and Y form a table T, where the entry T_ij contains the number of records in the dataset that survive both the X_i split outcome and the Y_j split outcome. We let P_ij = T_ij / |D|.
Conditional entropy is then the entropy of a record surviving the split on Y given that it survived the split on X: we restrict the Y distribution to the column of T that contains the result of the split on X, and then average over all possible splits on X,

H(Y | X) = Σ_{i=1..n} P(X_i) · Entropy(Y | X_i),   where P(X_i) = Σ_{j} P_ij,

so that InformationGain(X, D) = Entropy(Y) − H(Y | X).
Plugging the definition of conditional entropy into the expression above immediately yields the equation for information gain presented earlier. We can now prove an interesting property of information gain, showing that when using this splitting criterion we cannot make negative progress in classifying data under our target attribute.
PROPOSITION 1: Information Gain is non-negative.
PROOF: This follows pretty easily from Jensen's inequality, since −log(x) is convex.
Jensen's inequality states that f(E[X]) ≤ E[f(X)], where X is a random variable, E is the expectation, and f is a convex function.
Applying this with f(x) = −log(x) to the distributions P_Y(D) and P_{X_i}(D), we get that splitting a dataset D on an input variable X_i yields non-negative information gain with respect to a target variable Y. The result of this proof seems to say that splitting on any variable should not make the model any worse, since information gain is non-negative. However, this is not entirely true because overfitting may occur.
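One standard way to make this step precise (a sketch, not a reproduction of the original proof) is to write the information gain in its mutual-information form, using the joint probabilities P_ij defined above with marginals P_i· and P_·j, and then apply Jensen's inequality to the convex function −log2(x):

InformationGain(X, D) = Σ_{i,j} P_ij · log2( P_ij / (P_i· · P_·j) )
                      = Σ_{i,j} P_ij · ( −log2( (P_i· · P_·j) / P_ij ) )
                      ≥ −log2( Σ_{i,j} P_ij · (P_i· · P_·j) / P_ij )     (Jensen's inequality)
                      = −log2( Σ_{i,j} P_i· · P_·j )
                      = −log2(1) = 0.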
PROPOSITION 2: Information Gain is symmetric, meaning that if we switch the split variable and the target variable, we still get the same amount of information gain. Expressed in our notation:

Entropy(Y) − H(Y | X) = Entropy(X) − H(X | Y)
PROOF: The proof of Proposition 1 showed that information gain can be written as Σ_{i,j} P_ij · log2( P_ij / (P_i· · P_·j) ). Since this expression is symmetric in the split variable and the target variable, it follows that the information gain is unchanged when the two are exchanged.
One issue with using information gain as a splitting criterion is that it is biased towards tests that have many outcomes [9]. For example, suppose we are building a decision tree on a dataset of customers, to be able to predict some form of customer behavior in the future. If one of the input attributes is something unique to each customer, such as a credit card number, phone number, social security number, or any other form of ID, then the information gain for such an attribute will be very high. As a result, a test on this attribute may be placed near the root of the tree. However, testing on such attributes will not extend well to records outside of the training set, which in this case are new customers, since their unique identifiers will not be an outcome of the test.
Gain Ratio
As a result, Quinlan proposes a related splitting criterion, called Gain Ratio or Uncertainty Coefficient [9]. This serves to normalize the information gain on an attribute X_i relative to how much entropy that attribute itself has:

GainRatio(X_i, D) = InformationGain(X_i, D) / Entropy( P_{X_i}(D) )
We can clearly see that in the preceding example with customer data, the denominator will be very large. Since the identifiers are unique, the denominator would be on the order of the logarithm of the number of customers in the dataset, whereas the numerator is at most the logarithm of the number of possible values for Y. However, it must be noted that if Entropy( P_{X_i}(D) ) is very small, then this gain ratio will be large but X_i may still not be the best attribute to split on. Quinlan suggests the following procedure when using gain ratio as the criterion (a code sketch follows the list):
1. Calculate information gain for all attributes, and compute the average information gain
2. Calculate the gain ratio for all attributes whose information gain is greater than or equal to the average information gain, and select the attribute with the largest gain ratio
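A hedged sketch of this two-step heuristic, reusing the hypothetical helpers from earlier (entropy, goodness_of_split); the attribute_partitions argument, which maps each attribute to its partition of D, is likewise an assumption of this sketch.

def gain_ratio_select(D, attribute_partitions):
    """Quinlan-style selection: restrict to attributes with at least
    average information gain, then pick the largest gain ratio.
    attribute_partitions maps attribute name -> list of subsets of D.
    Assumes at least one attribute clears the average-gain bar."""
    gains = {a: goodness_of_split(D, part, impurity=entropy)
             for a, part in attribute_partitions.items()}
    avg_gain = sum(gains.values()) / len(gains)

    def split_entropy(part):
        # Entropy of the attribute itself (the split proportions).
        return entropy([len(D_j) / len(D) for D_j in part])

    candidates = {a: gains[a] / split_entropy(attribute_partitions[a])
                  for a in gains
                  if gains[a] >= avg_gain
                  and split_entropy(attribute_partitions[a]) > 0}
    return max(candidates, key=candidates.get)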
Quinlan claims that this gain ratio criterion tends to perform better than the plain information gain criterion, creating smaller decision trees. He also notes that it may tend to favor attributes that create unevenly sized splits, where some subset D_j of D in the resulting partition is much smaller than the other subsets.
Gini Index
The Gini Index is another function that can be used as an impurity function. It is a variation of the Gini Coefficient, which is used in economics to measure the dispersion of wealth in a population [10]. It was introduced as an impurity measure for decision tree learning by Breiman in 1984. It is given by this equation, where (p_1, ..., p_J) = P_Y(D):

Gini(D) = Σ_{j=1..J} p_j · (1 − p_j) = 1 − Σ_{j=1..J} p_j²
Gini(D) will be zero if some p_j = 1, and since each p_j ≤ 1, it is maximized when all p_j are equal. Hence the Gini Index is a suitable impurity function. The Gini Index measures the expected error if we randomly choose a single record and use its value of Y as the predictor, which can be seen by looking at the second formulation of the Gini Index, 1 − Σ_{j=1..J} p_j². This can be interpreted as a difference between the norms of the vectors (1, ..., 1) and P_Y(D). Also, in the first formulation, Σ_{j=1..J} p_j · (1 − p_j), it is simply the sum of J variances of Bernoulli random variables, one for each component of the probability distribution P_Y(D). We define the goodness-of-split due to the Gini Index similarly to what we did with information gain, by simply plugging φ(D) = Gini(D) into the goodness-of-split equation to obtain

GiniGain(X_i, D) = Gini(D) − Σ_{j=1..m} ( |σ_{X_i=v_j}(D)| / |D| ) · Gini( σ_{X_i=v_j}(D) )
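As with entropy, the Gini Index can be plugged into the generic goodness-of-split sketch from earlier; again, this is an illustrative snippet built on the assumed helper names, not a reference implementation.

def gini(P):
    """Gini impurity of a probability distribution P: 1 - sum(p^2)."""
    return 1.0 - sum(p * p for p in P)

def gini_gain(D, partition):
    """Goodness-of-split under the Gini Index, reusing goodness_of_split."""
    return goodness_of_split(D, partition, impurity=gini)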
It turns out that the Gini Index and information entropy are related concepts. Fig. 3 shows that they graphically look similar:
Fig. 3 Entropy and Gini Index graphed for two different values of J, which is the total number of classes of the target variable [3].
Misclassification Rate
Given a probability distribution P = (p_1, ..., p_J), we define the misclassification rate to be MisclassificationRate(P) = 1 − max_i p_i. Again, it is easy to see that the misclassification rate is indeed an impurity function: it is maximized when P is uniform and it is zero when some p_i = 1. We then define MisclassificationGain by plugging φ(P) = MisclassificationRate(P) into the goodness-of-split equation.
An interesting fact is that the Gini Index and the Misclassification Rate have the same maximum. Given that P has J elements, meaning that there are J possible values of the target variable, the maximum Misclassification Rate is 1 − 1/J and the maximum Gini Index is 1 − J · (1/J)² = 1 − 1/J.
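A quick numerical check of this claim, using the entropy and gini helpers sketched earlier together with a misclassification-rate function (the value J = 4 is an arbitrary example):

def misclassification_rate(P):
    return 1.0 - max(P)

J = 4
uniform = [1.0 / J] * J
print(gini(uniform))                    # 1 - 1/J = 0.75
print(misclassification_rate(uniform))  # 1 - 1/J = 0.75
print(entropy(uniform))                 # log2(J) = 2.0 bits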
Since the Gini Index and Entropy impurity functions are both differentiable, they are preferred over the Misclassification Rate; differentiability helps with numerical optimization techniques.
Other Splitting Criteria
There are many other splitting criteria that one can use, and a good overview of these is given in chapter 9 of the Data Mining and Knowledge Discovery Handbook. While we have only considered univariate splitting criteria, that book also has a brief description of multivariate splitting criteria.
Stopping Criteria
Stopping criteria are usually not as complicated as splitting criteria. Common stopping criteria include:
1. Tree depth exceeds a predetermined threshold
2. Goodness-of-split is below a predetermined threshold
3. Each terminal node has fewer than some predetermined number of records
Generally stopping criteria are used as a heuristic to prevent overfitting. Overfitting is when a decision tree begins to learn noise in the dataset rather than structural relationships present in the data. An over-fit model still performs very well in classifying the dataset it was trained on, but would not generalize well to new data, just like the example with credit card numbers or other unique identifiers. If we did not use stopping criteria, the algorithm would continue growing the tree until each terminal node would correspond to exactly one record.
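A hedged sketch of how such criteria could be packaged as the stopping_criterion parameter of the earlier train_tree sketch; the default record count and the pure-node check are placeholder assumptions, and a tree-depth limit would require threading a depth counter through train_tree.

def make_stopping_criterion(min_records=5):
    """Return a stopping check usable with the train_tree sketch above:
    stop when the node has too few records or when all records already
    share the same label."""
    def stopping_criterion(D):
        labels = {y for _, y in D}
        return len(D) < min_records or len(labels) <= 1
    return stopping_criterion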
Conclusion
We have studied a general method of growing a decision tree using a greedy algorithm, with interesting analyses of several splitting criteria that can be used with this algorithm.
References
- J. Kacprzyk (Warsaw, Poland), Meta-Learning in Decision Tree Induction, Springer, Studies in Computational Intelligence, vol. 498.
- L. Dubinets, "Top-Down Decision Tree Inducers."
- J. Kalita, Decision and Regression Tree Learning, slides (used for Fig. 3). http://www.cs.uccs.edu/~jkalita/work/cs586/2013/DecisionTrees.pdf